Dirichlet process
{{Short description|Family of stochastic processes}}
File:Dirichlet process draws.svg (draws from the Dirichlet process, generated using the stick-breaking process view)
In probability theory, Dirichlet processes (after the distribution associated with Peter Gustav Lejeune Dirichlet) are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables—how likely it is that the random variables are distributed according to one or another particular distribution.
As an example, a bag of 100 real-world dice is a random probability mass function (random pmf)—to sample this random pmf you put your hand in the bag and draw out a die, that is, you draw a pmf. A bag of dice manufactured using a crude process 100 years ago will likely have probabilities that deviate wildly from the uniform pmf, whereas a bag of state-of-the-art dice used by Las Vegas casinos may have barely perceptible imperfections. We can model the randomness of pmfs with the Dirichlet distribution.{{cite web |last1=Frigyik |first1=Bela A. |last2=Kapila |first2=Amol |last3=Gupta |first3=Maya R. |title=Introduction to the Dirichlet Distribution and Related Processes |url=http://mayagupta.org/publications/FrigyikKapilaGuptaIntroToDirichlet.pdf |access-date=2 September 2021}}
The Dirichlet process is specified by a base distribution and a positive real number called the concentration parameter (also known as scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit of \alpha \to 0, the realizations are all concentrated at a single value, while in the limit of \alpha \to \infty the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as \alpha increases.
The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.
The Dirichlet process was formally introduced by Thomas S. Ferguson in 1973.{{cite journal
| last = Ferguson
| first = Thomas
| title = Bayesian analysis of some nonparametric problems
| journal = Annals of Statistics
| volume = 1
| issue=2
| pages = 209–230
| year = 1973
| doi = 10.1214/aos/1176342360
| mr = 350949
| doi-access = free
}}
It has since been applied in data mining and machine learning, among others for natural language processing, computer vision and bioinformatics.
Introduction
Dirichlet processes are usually used when modelling data that tends to repeat previous values in a so-called "rich get richer" fashion. Specifically, suppose that the generation of values X_1, X_2, \dots can be simulated by the following algorithm.
:Input: H (a probability distribution called base distribution), \alpha (a positive real number called scaling parameter)
:For n \ge 1:
:::: a) With probability \frac{\alpha}{\alpha+n-1} draw X_n from H.
:::: b) With probability \frac{n_x}{\alpha+n-1} set X_n = x, where n_x is the number of previous observations X_j, j<n, such that X_j = x.
::::: (Formally, n_x := |\{j \colon X_j = x \text{ and } j < n\}|, where |\cdot| denotes the number of elements in the set.)
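A minimal Python sketch of this sampling scheme (assuming, purely for illustration, a standard normal base distribution H and using NumPy; none of the names come from a particular library):
<syntaxhighlight lang="python">
import numpy as np

def sample_sequence(n, alpha, base_sampler, rng=np.random.default_rng(0)):
    """Simulate n values by the 'rich get richer' scheme above.

    With probability alpha/(alpha + k) the next value is a fresh draw from
    the base distribution H (here k is the number of previous observations);
    otherwise a previous observation is repeated, each value x with
    probability n_x/(alpha + k).
    """
    draws = []
    for k in range(n):
        if rng.random() < alpha / (alpha + k):
            draws.append(base_sampler(rng))       # case a): new value from H
        else:
            draws.append(draws[rng.integers(k)])  # case b): repeat an old value
    return draws

# Example: H = N(0, 1), scaling parameter alpha = 2
values = sample_sequence(1000, alpha=2.0, base_sampler=lambda rng: rng.normal())
print(len(set(values)), "distinct values among 1000 draws")
</syntaxhighlight>
Repeating a uniformly chosen previous observation is equivalent to step b), since a value x that has occurred n_x times is then repeated with probability n_x/(\alpha + k).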
At the same time, another common model for data is that the observations X_1, X_2, \dots are assumed to be independent and identically distributed (i.i.d.) according to some (random) distribution P. The goal of introducing Dirichlet processes is to be able to describe the procedure above in this i.i.d. model.
The observations X_1, X_2, \dots in the algorithm are not independent, since we have to consider the previous results when generating the next value. They are, however, exchangeable: the joint probability distribution of the observations depends only on which values occur among them and how many repetitions each has. Because of this exchangeability, de Finetti's representation theorem applies and implies that the observations are conditionally independent given a (latent) distribution P. This P is itself a random variable, and its distribution (a distribution over distributions) is called a Dirichlet process (\operatorname{DP}). In summary, the algorithm above is equivalent to the following two-step procedure:
- Draw a distribution P from \operatorname{DP}\left(H,\alpha\right).
- Draw observations X_1, X_2, \dots independently from P.
In practice, however, drawing a concrete distribution P is impossible, since its specification requires an infinite amount of information. This is a common phenomenon in Bayesian nonparametric statistics, where a typical task is to learn distributions on function spaces, which effectively involve infinitely many parameters. The key insight is that in many applications the infinite-dimensional distribution appears only as an intermediary computational device and is not required for either the initial specification of prior beliefs or for the statement of the final inference.
Formal definition
Given a measurable set S, a base probability distribution H and a positive real number \alpha, the Dirichlet process \operatorname{DP}(H, \alpha) is a stochastic process whose sample path (i.e. a realization, which is itself a probability distribution over S) satisfies the following. For any measurable finite partition of S, denoted \{B_i\}_{i=1}^{n},
:
\text{if } X \sim \operatorname{DP}(H, \alpha),
:
\text{then } \left(X(B_1), \dots, X(B_n)\right) \sim \operatorname{Dir}\left(\alpha H(B_1), \dots, \alpha H(B_n)\right),
where \operatorname{Dir} denotes the Dirichlet distribution and the notation X \sim D means that the random variable X is distributed according to the distribution D.
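The defining finite-partition property can be checked numerically for a particular partition. The sketch below (assuming H is the standard normal distribution, \alpha = 5, and a partition of the real line into three intervals; all values are illustrative) samples the vector of masses that realizations of the process assign to the partition, which by the definition is Dirichlet distributed:
<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

alpha = 5.0
# Base distribution H = N(0, 1) and the partition (-inf, -1], (-1, 1], (1, inf)
H_masses = np.array([norm.cdf(-1.0), norm.cdf(1.0) - norm.cdf(-1.0), norm.sf(1.0)])

# By the definition, (X(B_1), X(B_2), X(B_3)) ~ Dir(alpha H(B_1), alpha H(B_2), alpha H(B_3))
rng = np.random.default_rng(0)
partition_masses = rng.dirichlet(alpha * H_masses, size=5)

print(partition_masses)   # each row is one realization restricted to the partition; rows sum to 1
print(H_masses)           # the expected value of each column is H(B_i)
</syntaxhighlight>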
Alternative views
There are several equivalent views of the Dirichlet process. Besides the formal definition above, the Dirichlet process can be defined implicitly through de Finetti's theorem as described in the first section; this is often called the Chinese restaurant process. A third alternative is the stick-breaking process, which defines the Dirichlet process constructively by writing a distribution sampled from the process as
=The Chinese restaurant process=
{{main|Chinese restaurant process}}
File:Chinese Restaurant Process for DP(0.5,H).webm
A widely employed metaphor for the Dirichlet process is based on the so-called Chinese restaurant process. The metaphor is as follows:
Imagine a Chinese restaurant with infinitely many tables, into which customers enter one by one. A new customer sits down at a table with a probability proportional to the number of customers already sitting there. Additionally, a customer opens a new table with a probability proportional to the scaling parameter \alpha. After infinitely many customers have entered, one obtains a probability distribution over infinitely many tables.
This probability distribution over the tables is a random sample of the probabilities of observations drawn from a Dirichlet process with scaling parameter \alpha.
If one associates draws from the base measure H with every table, the resulting distribution over the sample space S is a random sample of a Dirichlet process.
The Chinese restaurant process is related to the Pólya urn sampling scheme which yields samples from finite Dirichlet distributions.
Because customers sit at a table with a probability proportional to the number of customers already sitting at the table, two properties of the DP can be deduced:
- The Dirichlet process exhibits a self-reinforcing property: The more often a given value has been sampled in the past, the more likely it is to be sampled again.
- Even if H is a distribution over an uncountable set, there is a nonzero probability that two samples will have exactly the same value, because the probability mass will concentrate on a small number of tables.
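A short simulation of these seating dynamics (an illustrative sketch; the table indices are arbitrary integers and \alpha = 1 is chosen only for demonstration):
<syntaxhighlight lang="python">
import numpy as np

def chinese_restaurant(n_customers, alpha, rng=np.random.default_rng(1)):
    """Seat customers one by one; return each customer's table and the table sizes."""
    tables = []    # tables[t] = number of customers currently at table t
    seating = []
    for _ in range(n_customers):
        # An occupied table is chosen proportionally to its size,
        # a new table proportionally to alpha.
        weights = np.array(tables + [alpha], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(tables):
            tables.append(1)       # the customer opens a new table
        else:
            tables[choice] += 1    # the customer joins an existing table
        seating.append(choice)
    return seating, tables

seating, tables = chinese_restaurant(200, alpha=1.0)
print(len(tables), "occupied tables with sizes", tables)
</syntaxhighlight>
The self-reinforcing property is visible in the output: a few early tables typically accumulate most of the customers.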
=The stick-breaking process=
A third approach to the Dirichlet process is the so-called stick-breaking process view. Conceptually, this involves repeatedly breaking off and discarding a random fraction (sampled from a Beta distribution) of a "stick" that is initially of length 1. Remember that draws from a Dirichlet process are distributions over a set S. As noted previously, the distribution drawn is discrete with probability 1. In the stick-breaking process view, we explicitly use the discreteness and give the probability mass function of this (random) discrete distribution as:
:
f(\theta) = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}(\theta),
where \delta_{\theta_k} is the indicator function which evaluates to zero everywhere, except for \delta_{\theta_k}(\theta_k) = 1. Since this distribution is itself random, its mass function is parameterized by two sets of random variables: the locations \theta_k and the corresponding probabilities \beta_k.
The locations \theta_k are independent and identically distributed according to H, the base distribution of the Dirichlet process. The probabilities \beta_k are given by a procedure resembling the breaking of a unit-length stick (hence the name):
:
\beta_k = \beta'_k \cdot \prod_{i=1}^{k-1} \left(1 - \beta'_i\right),
where the \beta'_k are independent random variables with the beta distribution \operatorname{Beta}(1, \alpha). The resemblance to stick-breaking can be seen by considering \beta_k as the length of a piece of the stick: we start with a unit-length stick, and in each step we break off a portion of the remaining stick according to \beta'_k and assign this broken-off piece to \beta_k. After the first k-1 values have their portions assigned, the length of the remaining stick is \prod_{i=1}^{k-1}(1-\beta'_i), and this piece is broken according to \beta'_k.
The smaller \alpha is, the less of the stick will be left for subsequent values (on average), yielding more concentrated distributions.
The stick-breaking process is similar to the construction where one samples sequentially from marginal beta distributions in order to generate a sample from a Dirichlet distribution.For the proof, see {{cite web |last=Paisley |first=John |title=A simple proof of the stick-breaking construction of the Dirichlet Process |website=Columbia University |date=August 2010 |url=http://www.columbia.edu/~jwp2128/Teaching/E6892/papers/SimpleProof.pdf |archive-date=January 22, 2015 |archive-url=https://web.archive.org/web/20150122033202/http://www.columbia.edu/~jwp2128/Teaching/E6892/papers/SimpleProof.pdf }}
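A truncated stick-breaking sampler (a sketch assuming a standard normal base distribution; the finite truncation level only approximates the infinite sum, and the names are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def stick_breaking(alpha, base_sampler, truncation, rng=np.random.default_rng(2)):
    """Return the atoms theta_k and weights beta_k of a truncated draw from DP(H, alpha)."""
    beta_primes = rng.beta(1.0, alpha, size=truncation)          # beta'_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta_primes)[:-1]))
    weights = beta_primes * remaining                            # beta_k = beta'_k prod_{i<k}(1 - beta'_i)
    atoms = base_sampler(rng, truncation)                        # theta_k ~ H, i.i.d.
    return atoms, weights

atoms, weights = stick_breaking(alpha=3.0, truncation=1000,
                                base_sampler=lambda rng, size: rng.normal(size=size))
print(weights.sum())   # close to 1 when the truncation level is large enough
</syntaxhighlight>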
=The Pólya urn scheme=
Yet another way to visualize the Dirichlet process and Chinese restaurant process is as a modified Pólya urn scheme sometimes called the Blackwell–MacQueen sampling scheme. Imagine that we start with an urn filled with \alpha black balls. Then we proceed as follows:
- Each time we need an observation, we draw a ball from the urn.
- If the ball is black, we generate a new (non-black) colour uniformly, label a new ball this colour, drop the new ball into the urn along with the ball we drew, and return the colour we generated.
- Otherwise, label a new ball with the colour of the ball we drew, drop the new ball into the urn along with the ball we drew, and return the colour we observed.
The resulting distribution over colours is the same as the distribution over tables in the Chinese restaurant process. Furthermore, when we draw a black ball, if rather than generating a new colour we instead pick a random value from a base distribution H and use that value to label the new ball, the resulting distribution over labels is the same as the distribution over values drawn from a Dirichlet process.
Use as a prior distribution
The Dirichlet Process can be used as a prior distribution to estimate the probability distribution that generates the data. In this section, we consider the model
:
\begin{align}
P & \sim \textrm{DP}(H, \alpha )\\
X_1, \ldots, X_n \mid P & \, \overset{\textrm{i.i.d.}}{\sim} \, P.
\end{align}
The Dirichlet Process distribution satisfies prior conjugacy, posterior consistency, and the Bernstein–von Mises theorem.{{cite book
| title = Fundamentals of Bayesian Nonparametric Inference
| isbn = 978-0-521-87826-5
| publisher = Cambridge University Press
| author = Aad van der Vaart, Subhashis Ghosal
| year = 2017
}}
=Prior conjugacy=
In this model, the posterior distribution is again a Dirichlet process. This means that the Dirichlet process is a conjugate prior for this model. The posterior distribution is given by
:
\begin{align}
P \mid X_1, \ldots, X_n
&\sim \textrm{DP}\left(\frac{\alpha}{\alpha + n} H + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{X_i}, \; \alpha + n \right) \\
&= \textrm{DP}\left(\frac{\alpha}{\alpha + n} H + \frac{n}{\alpha + n} \mathbb{P}_n, \; \alpha + n \right)
\end{align}
where \mathbb{P}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i} denotes the empirical distribution of the observations. Intuitively, the posterior base measure is a weighted average of the prior base distribution H and the empirical distribution \mathbb{P}_n; the weight on the prior tends to zero as the number of observations grows, while the concentration parameter increases to \alpha + n.
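The posterior update can be simulated directly, for example with a truncated stick-breaking draw whose atoms come from the posterior base measure. The following sketch assumes a standard normal prior base measure; the data, parameter values and function names are purely illustrative:
<syntaxhighlight lang="python">
import numpy as np

def posterior_dp_draw(data, alpha, base_sampler, truncation=1000,
                      rng=np.random.default_rng(3)):
    """Truncated stick-breaking draw from DP((alpha H + sum_i delta_{X_i})/(alpha+n), alpha+n)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    alpha_post = alpha + n
    # Stick-breaking weights with the posterior concentration parameter alpha + n.
    beta_primes = rng.beta(1.0, alpha_post, size=truncation)
    weights = beta_primes * np.concatenate(([1.0], np.cumprod(1.0 - beta_primes)[:-1]))
    # Atoms from the posterior base measure: a fresh draw from H with probability
    # alpha/(alpha+n), otherwise a uniformly chosen observation.
    from_prior = rng.random(truncation) < alpha / alpha_post
    atoms = np.where(from_prior,
                     base_sampler(rng, truncation),
                     rng.choice(data, size=truncation))
    return atoms, weights

data = [0.2, 0.5, 0.5, 3.1]
atoms, weights = posterior_dp_draw(data, alpha=2.0,
                                   base_sampler=lambda rng, size: rng.normal(size=size))
print(weights[np.isin(atoms, data)].sum())   # posterior mass sitting exactly on the observations
</syntaxhighlight>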
=Posterior consistency=
If we take the frequentist view of probability, we believe there is a true probability distribution P_0 that generated the data. In that case the Dirichlet process turns out to be consistent in the weak topology, which means that for every weak neighbourhood U of P_0, the posterior probability of U converges to 1.
=Bernstein–Von Mises theorem=
In order to interpret the credible sets as confidence sets, a Bernstein–von Mises theorem is needed. In case of the Dirichlet process we compare the posterior distribution with the empirical process \mathbb{P}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}. Suppose \mathcal{F} is a P_0-Donsker class, i.e.
:
\sqrt n \left( \mathbb{P}_n - P_0 \right) \rightsquigarrow G_{P_0}
for some Brownian bridge G_{P_0}. Then, under a suitable integrability condition on an envelope function of \mathcal{F}, it holds P_0-almost surely that
:
\sqrt{n} \left( P - \mathbb{P}_n \right) \mid X_1, \cdots, X_n \rightsquigarrow G_{P_0}.
This implies that the credible sets constructed from the posterior are asymptotic confidence sets, and that Bayesian inference based on the Dirichlet process is asymptotically also valid frequentist inference.
Use in Dirichlet mixture models
File:DP clustering simulation.png
To understand what Dirichlet processes are and the problem they solve we consider the example of data clustering. It is a common situation that data points are assumed to be distributed in a hierarchical fashion where each data point belongs to a (randomly chosen) cluster and the members of a cluster are further distributed randomly within that cluster.
=Example 1=
For example, we might be interested in how people will vote on a number of questions in an upcoming election. A reasonable model for this situation might be to classify each voter as a liberal, a conservative or a moderate and then model the event that a voter says "Yes" to any particular question as a Bernoulli random variable with the probability dependent on which political cluster they belong to. By looking at how votes were cast in previous years on similar pieces of legislation one could fit a predictive model using a simple clustering algorithm such as k-means. That algorithm, however, requires knowing in advance the number of clusters that generated the data. In many situations, it is not possible to determine this ahead of time, and even when we can reasonably assume a number of clusters we would still like to be able to check this assumption. For example, in the voting example above the division into liberal, conservative and moderate might not be finely tuned enough; attributes such as religion, class or race could also be critical for modelling voter behaviour, resulting in more clusters in the model.
=Example 2=
As another example, we might be interested in modelling the velocities of galaxies using a simple model assuming that the velocities are clustered, for instance by assuming each velocity is distributed according to the normal distribution v_i \sim N(\mu_{z_i}, \sigma^2), where the i-th observation belongs to the z_i-th cluster of galaxies with common expected velocity \mu_{z_i}.
We consider this example in more detail. A first naive model is to presuppose that there are K fixed clusters of normally distributed velocities with a common known fixed variance \sigma^2. Denoting the event that the i-th observation is in the k-th cluster as z_i = k, we can write this model as:
:
\begin{align}
(v_i \mid z_i=k, \mu_k) & \sim N(\mu_k,\sigma^2) \\
\operatorname{P} (z_i=k) & = \pi_k \\
(\boldsymbol{\pi}\mid \alpha) & \sim \operatorname{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K\right) \\
\mu_k & \sim H(\lambda)
\end{align}
That is, we assume that the data belong to K distinct clusters with means \mu_k, and that \pi_k is the (unknown) prior probability of a data point belonging to the k-th cluster. We assume that we have no initial information distinguishing the clusters, which is captured by the symmetric prior \operatorname{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K\right); here \mathbf{1}_K denotes a vector of length K in which each element is 1. We further assign independent and identical prior distributions H(\lambda) to each of the cluster means, where H may be any parametric distribution with parameters \lambda. The hyper-parameters \alpha and \lambda are taken to be known fixed constants, chosen to reflect our prior beliefs about the system. To understand the connection to Dirichlet process priors, we rewrite this model in an equivalent but more suggestive form:
:
\begin{align}
(v_i \mid \tilde{\mu}_i) & \sim N(\tilde{\mu}_i,\sigma^2) \\
\tilde{\mu}_i & \sim G=\sum_{k=1}^K \pi_k \delta_{\mu_k} (\tilde{\mu}_i) \\
(\boldsymbol{\pi}\mid \alpha) & \sim \operatorname{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K \right) \\
\mu_k & \sim H(\lambda)
\end{align}
Instead of imagining that each data point is first assigned a cluster and then drawn from the distribution associated to that cluster, we now think of each observation as being associated with a parameter \tilde{\mu}_i drawn from some discrete distribution G with support on the K means. That is, we treat \tilde{\mu}_i as being drawn from the random distribution G, and our prior information is incorporated into the model by the distribution over distributions G.
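A generative sketch of this finite model (assuming, for illustration, H(\lambda) = N(0, 10^2), a known common variance, and arbitrary values of K and \alpha):
<syntaxhighlight lang="python">
import numpy as np

def sample_finite_mixture(n, K, alpha, sigma, rng=np.random.default_rng(4)):
    """Draw n observations from the finite K-cluster model above."""
    pi = rng.dirichlet(np.full(K, alpha / K))      # mixing weights pi ~ Dir(alpha/K * 1_K)
    mu = rng.normal(0.0, 10.0, size=K)             # cluster means mu_k ~ H(lambda) = N(0, 10^2)
    z = rng.choice(K, size=n, p=pi)                # latent cluster assignments
    v = rng.normal(mu[z], sigma)                   # observations v_i ~ N(mu_{z_i}, sigma^2)
    return v, z

velocities, z = sample_finite_mixture(300, K=4, alpha=1.0, sigma=1.0)
print(np.bincount(z, minlength=4))                 # how many observations each cluster received
</syntaxhighlight>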
File:Parameter estimation process infinite Gaussian mixture model.webm
We would now like to extend this model to work without pre-specifying a fixed number of clusters K. Mathematically, this means we would like to select a random discrete prior distribution G(\tilde{\mu}_i) = \sum_{k=1}^{\infty} \pi_k \delta_{\mu_k}(\tilde{\mu}_i), where the cluster means \mu_k are again distributed according to H(\lambda) and the distribution over the infinitely many weights \pi_k is the infinite-dimensional generalization of the Dirichlet distribution. This is exactly what is accomplished by the model:
:
\begin{align}
(v_i \mid \tilde{\mu}_i) & \sim N(\tilde{\mu}_i,\sigma^2)\\
\tilde{\mu}_i & \sim G \\
G & \sim \operatorname{DP}(H(\lambda),\alpha)
\end{align}
With this in hand we can better understand the computational merits of the Dirichlet process. Suppose that we wanted to draw n observations from the naive model with exactly K clusters. A simple algorithm for doing this would be to draw the K values \mu_k from H(\lambda), a vector of weights \boldsymbol{\pi} from \operatorname{Dir}\left(\frac{\alpha}{K}\cdot\mathbf{1}_K\right), and then for each observation independently sample the cluster z_i with probability \pi_{z_i} and the value of the observation according to N\left(\mu_{z_i}, \sigma^2\right). This algorithm clearly breaks down when we allow infinitely many clusters, because it would require sampling the infinite-dimensional parameter \boldsymbol{\pi}. It is nevertheless still possible to sample observations v_i: one can, for example, use the Chinese restaurant representation described above and calculate the probability for each used cluster and for a new cluster to be created, which avoids having to specify \boldsymbol{\pi} explicitly.
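A generative sketch of this strategy, which uses the Chinese restaurant representation to create clusters on the fly instead of sampling \boldsymbol{\pi} (assuming a normal base distribution over the cluster means and a known common variance; all values are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def sample_dp_mixture(n, alpha, sigma, mean_prior_sampler, rng=np.random.default_rng(5)):
    """Draw n observations from the infinite Gaussian mixture with a DP prior."""
    cluster_means, cluster_counts, observations = [], [], []
    for _ in range(n):
        # An existing cluster is chosen proportionally to its occupancy,
        # a brand new cluster proportionally to alpha.
        weights = np.array(cluster_counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(cluster_means):
            cluster_means.append(mean_prior_sampler(rng))   # new cluster mean drawn from H(lambda)
            cluster_counts.append(0)
        cluster_counts[k] += 1
        observations.append(rng.normal(cluster_means[k], sigma))
    return np.array(observations), np.array(cluster_means)

obs, means = sample_dp_mixture(500, alpha=1.0, sigma=1.0,
                               mean_prior_sampler=lambda rng: rng.normal(0.0, 10.0))
print(len(means), "clusters generated for 500 observations")
</syntaxhighlight>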
Fitting the model described above based on observed data D means finding the posterior distribution p(\boldsymbol{\pi}, \boldsymbol{\mu} \mid D) over the cluster probabilities and their associated means. In the infinite-dimensional case it is obviously impossible to write down the posterior explicitly. It is, however, possible to draw samples from the posterior using a modified Gibbs sampler. This is the critical fact that makes the Dirichlet process prior useful for inference.
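A minimal collapsed Gibbs sampler for this model, assuming a known common variance \sigma^2 and the conjugate choice H(\lambda) = N(\mu_0, \tau^2) so that the cluster means can be integrated out analytically (an illustrative sketch, not an optimized or general implementation):
<syntaxhighlight lang="python">
import numpy as np

def dp_mixture_gibbs(data, alpha, sigma, mu0, tau, n_iters=100, rng=np.random.default_rng(6)):
    """Collapsed Gibbs sampling of cluster assignments for a DP mixture of Gaussians."""
    data = np.asarray(data, dtype=float)
    z = np.zeros(len(data), dtype=int)           # start with all points in a single cluster

    def predictive(x, members):
        """p(x | members) with the cluster mean integrated out (normal-normal model)."""
        prec = 1.0 / tau**2 + len(members) / sigma**2
        post_mean = (mu0 / tau**2 + np.sum(members) / sigma**2) / prec
        var = sigma**2 + 1.0 / prec
        return np.exp(-0.5 * (x - post_mean)**2 / var) / np.sqrt(2.0 * np.pi * var)

    for _ in range(n_iters):
        for i in range(len(data)):
            z[i] = -1                            # temporarily remove point i from its cluster
            labels = [k for k in np.unique(z) if k >= 0]
            weights = [len(data[z == k]) * predictive(data[i], data[z == k]) for k in labels]
            weights.append(alpha * predictive(data[i], np.array([])))   # open a new cluster
            weights = np.array(weights)
            choice = rng.choice(len(weights), p=weights / weights.sum())
            z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
        z = np.unique(z, return_inverse=True)[1]  # relabel clusters as 0..K-1
    return z

data = np.concatenate([np.random.normal(-5.0, 1.0, 50), np.random.normal(5.0, 1.0, 50)])
assignments = dp_mixture_gibbs(data, alpha=1.0, sigma=1.0, mu0=0.0, tau=10.0)
print(len(np.unique(assignments)), "clusters inferred")
</syntaxhighlight>
The sampler visits each data point in turn and reassigns it either to one of the currently occupied clusters (with probability proportional to the cluster's occupancy times the marginal likelihood of the point under that cluster) or to a newly created cluster (with probability proportional to \alpha times the prior predictive density), in the spirit of the Chinese restaurant representation.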
Applications of the Dirichlet process
Dirichlet processes are frequently used in Bayesian nonparametric statistics. "Nonparametric" here does not mean a parameter-less model, rather a model in which representations grow as more data are observed. Bayesian nonparametric models have gained considerable popularity in the field of machine learning because of the above-mentioned flexibility, especially in unsupervised learning. In a Bayesian nonparametric model, the prior and posterior distributions are not parametric distributions, but stochastic processes.{{cite book
| title = Bayesian Nonparametrics
| isbn = 978-0-521-51346-3
| publisher = Cambridge University Press
|author=Nils Lid Hjort |author2=Chris Holmes, Peter Müller |author3=Stephen G. Walker
| year = 2010
}} The fact that the Dirichlet distribution is a probability distribution on the simplex of non-negative vectors that sum to one makes it a good candidate to model distributions over distributions or distributions over functions. Additionally, the nonparametric nature of this model makes it an ideal candidate for clustering problems where the distinct number of clusters is unknown beforehand. The Dirichlet process has also been used for developing mixture-of-experts models in the context of supervised learning algorithms (regression or classification settings), for instance mixtures of Gaussian process experts, where the number of required experts must be inferred from the data.Sotirios P. Chatzis, "A Latent Variable Gaussian Process Model with Pitman-Yor Process Priors for Multiclass Classification," Neurocomputing, vol. 120, pp. 482–489, Nov. 2013. {{doi|10.1016/j.neucom.2013.04.029}}Sotirios P. Chatzis, Yiannis Demiris, "Nonparametric mixtures of Gaussian processes with power-law behaviour," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 12, pp. 1862–1871, Dec. 2012. {{doi|10.1109/TNNLS.2012.2217986}}
As draws from a Dirichlet process are discrete, an important use is as a prior probability in infinite mixture models. In this case, S is the parametric set of component distributions. The generative process is therefore that a sample is drawn from a Dirichlet process, and for each data point, in turn, a value is drawn from this sample distribution and used as the component distribution for that data point. The fact that there is no limit to the number of distinct components which may be generated makes this kind of model appropriate for the case when the number of mixture components is not well defined in advance. For example, the infinite mixture of Gaussians model,{{cite journal
| last = Rasmussen
| first = Carl
| title = The Infinite Gaussian Mixture Model
| journal = Advances in Neural Information Processing Systems
| volume = 12
| pages = 554–560
| year = 2000
| url = http://www.gatsby.ucl.ac.uk/~edward/pub/inf.mix.nips.99.pdf
}} as well as associated mixture regression models, e.g.Sotirios P. Chatzis, Dimitrios Korkinof, and Yiannis Demiris, "A nonparametric Bayesian approach toward robot learning by demonstration," Robotics and Autonomous Systems, vol. 60, no. 6, pp. 789–802, June 2012. {{doi|10.1016/j.robot.2012.02.005}}
The infinite nature of these models also lends them to natural language processing applications, where it is often desirable to treat the vocabulary as an infinite, discrete set.
The Dirichlet Process can also be used for nonparametric hypothesis testing, i.e. to develop Bayesian nonparametric versions of the classical nonparametric hypothesis tests, e.g. sign test, Wilcoxon rank-sum test, Wilcoxon signed-rank test, etc.
For instance, Bayesian nonparametric versions of the Wilcoxon rank-sum test and the Wilcoxon signed-rank test have been developed by using the imprecise Dirichlet process, a prior ignorance Dirichlet process. {{Citation needed|date=June 2020}}
Related distributions
- The Pitman–Yor process is a generalization of the Dirichlet process to accommodate power-law tails
- The hierarchical Dirichlet process extends the ordinary Dirichlet process for modelling grouped data.
References
External links
- [https://web.archive.org/web/20150219021331/https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf Introduction to the Dirichlet Distribution and Related Processes by Frigyik, Kapila and Gupta]
- [http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/Teh2010a.pdf Yee Whye Teh's overview of Dirichlet processes]
- [https://web.archive.org/web/20070524045420/http://www.cs.toronto.edu/~beal/npbayes/ Webpage for the NIPS 2003 workshop on non-parametric Bayesian methods]
- [http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps Michael Jordan's NIPS 2005 tutorial: Nonparametric Bayesian Methods: Dirichlet Processes, Chinese Restaurant Processes and All That]
- [http://www.maths.bris.ac.uk/~maxvd/cribsheet.pdf Peter Green's summary of construction of Dirichlet Processes]
- [http://www.stats.bris.ac.uk/~peter/papers/GreenCDP.pdf Peter Green's paper on probabilistic models of Dirichlet Processes with implications for statistical modelling and analysis]
- [http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf Zoubin Ghahramani's UAI 2005 tutorial on Nonparametric Bayesian methods]
- [http://ClusterAnalysis.org GIMM software for performing cluster analysis using Infinite Mixture Models]
- [https://archive.today/20121215093339/http://www.ece.sunysb.edu/~zyweng/dpcluster.html A Toy Example of Clustering using Dirichlet Process.] by Zhiyuan Weng
{{Stochastic processes}}
{{Peter Gustav Lejeune Dirichlet}}
{{DEFAULTSORT:Dirichlet Process}}