Zero-inflated model

{{short description|Statistical model allowing for frequent zero values}}

In statistics, a zero-inflated model is a statistical model based on a zero-inflated probability distribution, i.e. a distribution that allows for frequent zero-valued observations.

Introduction to zero-inflated models

Zero-inflated models are commonly used in the analysis of count data, such as the number of visits a patient makes to the emergency room in one year, or the number of fish caught in one day in one lake.{{Citation

|last1= Bilder

|first1= Christopher

|last2= Loughin

|first2= Thomas

|title= Analysis of Categorical Data with R

|edition=First

|year=2015

|publisher= CRC Press / Chapman & Hall

|isbn=978-1439855676

}}

Count data can take values of 0, 1, 2, … (non-negative integer values).{{Citation

|last1= Hilbe

|first1= Joseph M.

|title= Modeling Count Data

|edition=First

|year=2014

|publisher= Cambridge University Press

|isbn= 978-1107611252 }}

Other examples of count data are the number of hits recorded by a Geiger counter in one minute, patient days in the hospital, goals scored in a soccer game,{{Citation

|last1= Hilbe

|first1= Joseph M.

|title= Negative Binomial Regression

|edition=Second

|year=2011

|publisher= Cambridge University Press

|isbn= 978-0521198158

}}

and the number of episodes of hypoglycemia per year for a patient with diabetes.{{Citation

|last1= Lachin

|first1= John M.

|title= Biostatistical Methods: The Assessment of Relative Risks

|edition= Second

|year=2011

|publisher= Wiley

|isbn= 978-0470508220

}}

For statistical analysis, the distribution of the counts is often represented using a Poisson distribution or a negative binomial distribution. Hilbe notes that "Poisson regression is traditionally conceived of as the basic count model upon which a variety of other count models are based." In a Poisson model, "… the random variable y is the count response and parameter \lambda (lambda) is the mean. Often, \lambda is also called the rate or intensity parameter… In statistical literature, \lambda is also expressed as \mu (mu) when referring to Poisson and traditional negative binomial models."

In some data, the number of zeros is greater than would be expected under a Poisson distribution or a negative binomial distribution. Data with such an excess of zero counts are described as zero-inflated.

Example histograms of zero-inflated Poisson distributions with mean \mu of 5 or 10 and proportion of zero inflation \pi of 0.2 or 0.5 are shown below, based on the R program ZeroInflPoiDistPlots.R from Bilder and Loughin.

File:Histograms_of_ZIP_distributions.jpg

=Examples of zero-inflated count data=

  • Fish counts. "… suppose we recorded the number of fish caught on various lakes in 4-hour fishing trips to Minnesota. Some lakes in Minnesota are too shallow for fish to survive the winter, so fishing in those lakes will yield no catch. On the other hand, even on a lake where fish are plentiful, we may or may not catch any fish due to conditions or our own competence. Thus, the number of fish caught will be zero if the lake does not support fish, and will be zero, one or more if it does."
  • Number of wisdom teeth extracted.{{cite web|url=https://www.youtube.com/watch?v=14B5QUUmqts|title= Biostatistics II. 1.3 - Zero-inflated Models |website= YouTube | access-date=July 1, 2022}} The number of wisdom teeth that a person has had extracted can range from 0 to 4. Some individuals, about one-third of the population, do not have any wisdom teeth. For these individuals, the number of wisdom teeth extracted will always be zero. For other individuals, the number extracted will be between 0 and 4, where a 0 indicates that the subject has not yet, and may never, have any of their 4 wisdom teeth extracted.
  • Publications by PhD candidates.{{Citation

|last1= Long

|first1= J. Scott

|title= Regression Models for Categorical and Limited Dependent Variables

|edition=First

|year=1997

|publisher= Sage Publications

|isbn= 978-0803973749

}}

Long examined the number of publications by 915 doctoral candidates in biochemistry in the last three years of their PhD studies. The proportion of candidates with zero publications exceeded the proportion predicted by a Poisson model. "Long argued that the PhD candidates might fall into two distinct groups: 'publishers' (perhaps striving for an academic career) and 'non-publishers' (seeking other career paths). One reasonable form of explanation is that the observed zero counts reflect a mixture of the two latent classes – those who simply have not yet published and those who will likely never publish."{{Citation

|last1= Friendly

|first1= Michael

|last2= Meyer

|first2= David

|title= Discrete Data Analysis with R

|edition=First

|year=2016

|publisher= CRC Press / Chapman & Hall

|isbn= 978-1498725835

}}

=Zero-inflated data as a mixture of two distributions=

As the examples above show, zero-inflated data can arise as a mixture of two distributions. The first distribution generates zeros. The second distribution, which may be a Poisson distribution, a negative binomial distribution or other count distribution, generates counts, some of which may be zeros.

In the statistical literature, different authors may use different names to distinguish zeros from the two distributions. Some authors describe zeros generated by the first (binary) distribution as "structural" and zeros generated by the second (count) distribution as "random". Other authors use the terminology "immune" and "susceptible" for the binary and count zeros, respectively.
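
The two-part mechanism can be illustrated with a small simulation. The following Python sketch is not taken from the cited sources: the function names `sample_poisson` and `sample_zip`, the parameter values (\pi = 0.3, \lambda = 2), the seed and the labels attached to each draw are all assumptions chosen for illustration. It consults the binary distribution first and draws from the count distribution only when no structural zero occurs.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_zip(pi, lam, rng):
    """Draw one zero-inflated Poisson variate and label where it came from."""
    if rng.random() < pi:
        # Zero produced by the first (binary) distribution.
        return 0, "structural"
    # Otherwise the count distribution is used; it may still return zero.
    y = sample_poisson(lam, rng)
    return y, ("random" if y == 0 else "count")


# Illustrative parameters (assumed, not from the article).
rng = random.Random(0)
draws = [sample_zip(0.3, 2.0, rng) for _ in range(20000)]

n_structural = sum(1 for _, kind in draws if kind == "structural")
n_random = sum(1 for _, kind in draws if kind == "random")
frac_zero = (n_structural + n_random) / len(draws)
print(n_structural, n_random, round(frac_zero, 3))
```

In real data only the observed count is available, so the two kinds of zero cannot be told apart observation by observation; the labels exist here only because the simulation controls both distributions.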

Zero-inflated Poisson

File:Zero-inflated-poisson-distribution.png

One well-known zero-inflated model is Diane Lambert's zero-inflated Poisson model, which concerns a random event containing excess zero-count data in unit time.{{cite journal | title = Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing | first =Diane | last = Lambert | author-link = Diane Lambert | journal = Technometrics | year = 1992 | volume = 34 | issue = 1 | pages = 1–14 | jstor=1269547|doi=10.2307/1269547 }} For example, the number of insurance claims within a population for a certain type of risk would be zero-inflated by those people who have not taken out insurance against the risk and thus are unable to claim. The zero-inflated Poisson (ZIP) model mixes two processes. The first generates only zeros. The second is governed by a Poisson distribution and generates counts, some of which may be zero. The mixture distribution is described as follows:

: \Pr (Y = 0) = \pi + (1 - \pi) e^{-\lambda}

:\Pr (Y = y_i) = (1 - \pi) \frac{\lambda^{y_i} e^{-\lambda}} {y_i!},\qquad y_i = 1,2,3,...

where the outcome variable y_i takes any non-negative integer value, \lambda is the expected Poisson count for the ith individual, and \pi is the probability of extra zeros.

The mean is (1-\pi) \lambda and the variance is \lambda (1-\pi) (1+\pi \lambda) .
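
The stated mean and variance can be checked numerically against the probability mass function. The sketch below is a minimal illustration rather than a reference implementation: the function names `zip_pmf` and `zip_moments`, the truncation point `upper` and the parameter values (\pi = 0.2, \lambda = 5) are assumptions. The Poisson term is evaluated on the log scale to avoid overflow for large counts.

```python
import math


def zip_pmf(y, pi, lam):
    """Probability that a zero-inflated Poisson variable equals y (y = 0, 1, 2, ...)."""
    # Poisson(lam) mass at y, computed as exp(y*log(lam) - lam - log(y!)).
    poisson = math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))
    if y == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


def zip_moments(pi, lam, upper=200):
    """Total mass, mean and variance by summing the pmf over 0..upper."""
    probs = [zip_pmf(y, pi, lam) for y in range(upper + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var


# Illustrative parameters (assumed).
pi, lam = 0.2, 5.0
total, mean, var = zip_moments(pi, lam)

# Compare with the closed-form expressions given in the text.
print(total, mean, (1 - pi) * lam)
print(var, lam * (1 - pi) * (1 + pi * lam))
```

The truncated sums agree with the closed forms as long as `upper` is far enough into the tail for the omitted mass to be negligible, which is the case for the values assumed here.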

Estimators of ZIP parameters

The method of moments estimators are given by{{cite journal |last1=Beckett |first1=Sadie |last2=Jee |first2=Joshua |last3=Ncube |first3=Thalepo |last4=Washington |first4=Quintel |last5=Singh |first5=Anshuman |last6=Pal |first6=Nabendu |title=Zero-inflated Poisson (ZIP) distribution: parameter estimation and applications to model data from natural calamities |journal=Involve |date=2014 |volume=7 |issue=6 |pages=751–767 |doi=10.2140/involve.2014.7.751|doi-access=free }}

: \hat{\lambda}_{mo} = \frac{s^2+m^2}{m} - 1,

: \hat{\pi}_{mo} = \frac{s^2-m}{s^2 + m^2-m},

where m is the sample mean and s^2 is the sample variance.
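
As a minimal sketch, the two method of moments formulas translate directly into code. The function name `zip_method_of_moments` and the example data are hypothetical, and the use of the unbiased (n − 1) sample variance from Python's `statistics` module is an assumption, since the text does not say which denominator is intended.

```python
from statistics import mean, variance


def zip_method_of_moments(counts):
    """Method of moments estimates (lambda_hat, pi_hat) for a ZIP sample."""
    m = mean(counts)        # sample mean
    s2 = variance(counts)   # sample variance, n - 1 denominator (assumed)
    lam_hat = (s2 + m * m) / m - 1
    pi_hat = (s2 - m) / (s2 + m * m - m)
    return lam_hat, pi_hat


# Hypothetical sample with more zeros than a Poisson fit would suggest.
counts = [0, 0, 0, 0, 1, 2, 3, 2, 4, 0]
lam_hat, pi_hat = zip_method_of_moments(counts)
print(lam_hat, pi_hat)
```

When the sample variance is smaller than the sample mean, the formula for \hat{\pi}_{mo} gives a negative value, so the estimates are only sensible for overdispersed samples.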

The maximum likelihood estimator{{cite book |last1=Johnson |first1=Norman L.| last2=Kotz | first2=Samuel |last3=Kemp |first3=Adrienne W.|author3-link=Adrienne W. Kemp |year=1992 |title=Univariate Discrete Distributions |edition=2nd |publisher=Wiley |pages=312–314 |isbn=978-0-471-54897-3 }} can be found by solving the following equation

: m(1- e^{-\hat{\lambda}_{ml}}) = \hat{\lambda}_{ml} \left( 1 - \frac{n_0}{n} \right),

where \frac{n_0}{n} is the observed proportion of zeros.

A closed form solution of this equation is given by{{cite journal |last=Dencks |first=Stefanie |last2=Piepenbrock |first2=Marion |last3=Schmitz |first3=Georg |year=2020 |title= Assessing Vessel Reconstruction in Ultrasound Localization Microscopy by Maximum-Likelihood Estimation of a Zero-Inflated Poisson Model|journal=IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control |doi=10.1109/TUFFC.2020.2980063|doi-access=free }}

: \hat{\lambda}_{ml} = W_{0}(-s e^{-s})+s

with W_0 being the main branch of Lambert's W-function{{cite journal |last=Corless |first=R. M. |last2=Gonnet |first2=G. H. |last3=Hare |first3=D. E. G. |last4=Jeffrey |first4=D. J. |last5=Knuth |first5=D. E. |year=1996 |title=On the Lambert W Function |journal=Advances in Computational Mathematics |volume=5 |issue=1 |pages=329–359 |doi=10.1007/BF02124750|arxiv=1809.07369 }} and

: s = \frac{ m }{1 - \frac{n_0}{n}} .

Alternatively, the equation can be solved by iteration.{{cite journal |last=Böhning |first=Dankmar |last2=Dietz |first2=Ekkehart |last3=Schlattmann |first3=Peter |last4=Mendonca |first4=Lisette |last5=Kirchner |first5=Ursula |year=1999 |title=The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology |journal=Journal of the Royal Statistical Society, Series A |volume=162 |issue=2 |pages=195–209 |doi=10.1111/1467-985x.00130}}

The maximum likelihood estimator for \pi is given by

: \hat{\pi}_{ml} = 1 - \frac{m}{\hat{\lambda}_{ml}}.
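
The iterative route can be sketched as follows. Dividing the likelihood equation by 1 − n_0/n gives \hat{\lambda} = s(1 − e^{-\hat{\lambda}}), with s the mean of the non-zero observations, and this sketch simply applies that map repeatedly starting from s. This plain fixed-point iteration is an assumption made for illustration and is not necessarily the scheme used in the cited papers; the function name `zip_mle`, the tolerance, the iteration cap and the example data are likewise hypothetical.

```python
import math


def zip_mle(counts, tol=1e-12, max_iter=10000):
    """Maximum likelihood estimates (lambda_hat, pi_hat) for a ZIP sample."""
    n = len(counts)
    n0 = sum(1 for y in counts if y == 0)
    if n0 == n:
        raise ValueError("all observations are zero; lambda is not identifiable")
    m = sum(counts) / n
    # s = m / (1 - n0/n), i.e. the mean of the non-zero observations.
    s = sum(counts) / (n - n0)
    if s <= 1.0:
        raise ValueError("every non-zero count equals 1; no positive solution")
    lam = s  # start above the root; the iterates decrease towards it
    for _ in range(max_iter):
        new_lam = s * (1.0 - math.exp(-lam))
        if abs(new_lam - lam) < tol:
            lam = new_lam
            break
        lam = new_lam
    pi_hat = 1.0 - m / lam
    return lam, pi_hat


# Hypothetical sample: 5 zeros out of 10 observations, mean 1.2.
counts = [0, 0, 0, 0, 1, 2, 3, 2, 4, 0]
lam_hat, pi_hat = zip_mle(counts)
print(lam_hat, pi_hat)
```

Convergence slows as s approaches 1, so a root finder with guaranteed bracketing, or the Lambert W expression above, may be preferable in that regime.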

Related models

In 1994, Greene considered the zero-inflated negative binomial (ZINB) model.{{Cite journal |last=Greene | first=William H. | title=Some Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models |journal=Working Paper EC-94-10: Department of Economics, New York University |year=1994|ssrn=1293115}} Daniel B. Hall adapted Lambert's methodology to an upper-bounded count situation, thereby obtaining a zero-inflated binomial (ZIB) model.{{cite journal | title = Zero-Inflated Poisson and Binomial Regression with Random Effects: A Case Study | first =Daniel B. | last = Hall | journal =Biometrics| year = 2000 | volume = 56 | issue = 4| pages = 1030–1039| doi=10.1111/j.0006-341X.2000.01030.x}}

Discrete pseudo compound Poisson model

If count data Y are such that the probability of zero is larger than the probability of a nonzero value, namely

: \Pr (Y = 0) > 0.5

then Y follows a discrete pseudo compound Poisson distribution.{{Cite journal |first =Zhang | last = Huiming |author2=Yunxiao Liu |author3=Bo Li |title=Notes on discrete compound Poisson model with applications to risk theory |journal=Insurance: Mathematics and Economics |volume=59 |year=2014|pages=325–336|doi=10.1016/j.insmatheco.2014.09.012}}

In fact, let G(z) = \sum\limits_{n = 0}^\infty P(Y = n)z^n be the probability generating function of Y, and write p_i = \Pr (Y = i). If p_0=\Pr (Y = 0) > 0.5 , then for |z| \le 1, |G(z)| \geqslant p_0 - \sum\limits_{i = 1}^\infty p_i = 2p_0-1 > 0. It then follows from the Wiener–Lévy theorem{{cite book |last=Zygmund |first=A. |year=2002 |title= Trigonometric Series|title-link= Trigonometric Series |publisher=Cambridge University Press |location=Cambridge |page=245 }} that G(z) is the probability generating function of a discrete pseudo compound Poisson distribution.
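
The lower bound on |G(z)| can be checked numerically for a concrete case. The sketch below uses a zero-inflated Poisson distribution, whose probability generating function is \pi + (1-\pi) e^{\lambda (z-1)}; the parameter values (\pi = 0.6, \lambda = 2), the function name `zip_pgf` and the 360-point grid on the unit circle are assumptions chosen for illustration, and a grid check is of course not a proof.

```python
import cmath
import math


def zip_pgf(z, pi, lam):
    """Probability generating function of a zero-inflated Poisson variable."""
    return pi + (1 - pi) * cmath.exp(lam * (z - 1))


# Illustrative parameters (assumed) giving Pr(Y = 0) > 0.5.
pi, lam = 0.6, 2.0
p0 = pi + (1 - pi) * math.exp(-lam)
bound = 2 * p0 - 1

# Evaluate |G(z)| at evenly spaced points on the unit circle |z| = 1.
grid = [cmath.exp(2j * math.pi * k / 360) for k in range(360)]
min_modulus = min(abs(zip_pgf(z, pi, lam)) for z in grid)

print(p0, bound, min_modulus)
```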

We say that a discrete random variable Y whose probability generating function satisfies

: G_Y(z) = \sum\limits_{n = 0}^\infty P(Y = n)z^n = \exp\left(\sum_{k=1}^\infty \alpha_k \lambda (z^k - 1)\right), \quad (|z| \le 1)

has a discrete pseudo compound Poisson distribution with parameters

: (\lambda_1, \lambda_2, \ldots ) = (\alpha_1 \lambda,\alpha_2 \lambda, \ldots ) \in \mathbb{R}^\infty \left( \sum_{k = 1}^\infty \alpha _k = 1, \sum\limits_{k = 1}^\infty |\alpha_k| < \infty, \alpha_k \in \mathbb{R},\lambda > 0 \right).

When all the \alpha_k are non-negative, this is the discrete compound Poisson distribution, which (in the non-Poisson case) has the overdispersion property.

Software

  • [https://cran.r-project.org/web/packages/pscl/index.html pscl], [https://cran.r-project.org/web/packages/glmmTMB/index.html glmmTMB] and [https://paul-buerkner.github.io/brms/ brms] R packages

References

{{Reflist}}
