Laplace's approximation

{{Distinguish|text = Laplace's method, which is based on an essentially identical construction. Whereas Laplace's method focuses on the limiting behaviour of an integral, Laplace's approximation is not used in the limit and considers both the integral and the integrand. This naming distinction may not be universal}}

Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.{{cite book |first1=Robert E. |last1=Kass |first2=Luke |last2=Tierney |first3=Joseph B. |last3=Kadane |chapter=Laplace’s method in Bayesian analysis |title=Statistical Multiple Integration |series=Contemporary Mathematics |year=1991 |volume=115 |pages=89–100 |isbn=0-8218-5122-5 |doi=10.1090/conm/115/07 }}{{cite web|title=Information Theory, Inference and Learning Algorithms, chapter 27: Laplace's method|first=David J. C.|last=MacKay|url=http://www.inference.org.uk/mackay/itprnn/ps/341.342.pdf|year=2003}} The approximation is justified by the Bernstein–von Mises theorem, which states that, under regularity conditions, the error of the approximation tends to 0 as the number of data points tends to infinity.{{cite book|last=Hartigan |first=J. A. |authorlink=John A. Hartigan |chapter=Asymptotic Normality of Posterior Distributions |title=Bayes Theory |series=Springer Series in Statistics |location=New York |publisher=Springer |year=1983 |pages=107–118 |isbn= 978-1-4613-8244-7|doi=10.1007/978-1-4613-8242-3_11 }}{{cite book |first1=Robert E. |last1=Kass |first2=Luke |last2=Tierney |first3=Joseph B. |last3=Kadane |chapter=The Validity of Posterior Expansions Based on Laplace's Method |pages=473–488 |editor-first=S. |editor-last=Geisser |editor2-first=J. S. |editor2-last=Hodges |editor3-first=S. J. |editor3-last=Press |editor4-first=A. |editor4-last=Zellner |title=Bayesian and Likelihood Methods in Statistics and Econometrics |location= |publisher=Elsevier |year=1990 |isbn=0-444-88376-2 }}

For example, consider a regression or classification model with data set $\{x_n,y_n\}_{n=1,\ldots,N}$ comprising inputs $x$ and outputs $y$, with an (unknown) parameter vector $\theta$ of length $D$. The likelihood is denoted $p({\bf y}|{\bf x},\theta)$ and the parameter prior $p(\theta)$. Suppose one wants to approximate the joint density of outputs and parameters $p({\bf y},\theta|{\bf x})$. Bayes' formula reads:

: $p({\bf y},\theta|{\bf x})\;=\;p({\bf y}|{\bf x},\theta)p(\theta|{\bf x})\;=\;p({\bf y}|{\bf x})p(\theta|{\bf y},{\bf x})\;\simeq\;\tilde q(\theta)\;=\;Zq(\theta).$

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, also equal to the product of the marginal likelihood $p({\bf y}|{\bf x})$ and the posterior $p(\theta|{\bf y},{\bf x})$. The prior is taken not to depend on the inputs, so $p(\theta|{\bf x})=p(\theta)$. Seen as a function of $\theta$, the joint is an un-normalised density.

In Laplace's approximation, we approximate the joint by an un-normalised Gaussian $\tilde q(\theta)=Zq(\theta)$, where $q$ denotes an approximate density, $\tilde q$ an un-normalised density and $Z$ the normalisation constant of $\tilde q$ (independent of $\theta$). Since the marginal likelihood $p({\bf y}|{\bf x})$ does not depend on the parameter $\theta$ and the posterior $p(\theta|{\bf y},{\bf x})$ normalises over $\theta$, we can immediately identify them with $Z$ and $q(\theta)$ of our approximation, respectively.

Laplace's approximation is

: $p({\bf y},\theta|{\bf x})\;\simeq\;p({\bf y},\hat\theta|{\bf x})\exp\big(-\tfrac{1}{2}(\theta-\hat\theta)^\top S^{-1}(\theta-\hat\theta)\big)\;=\;\tilde q(\theta),$

where we have defined

: $\begin{align}
\hat\theta &\;=\; \operatorname{argmax}_\theta \log p({\bf y},\theta|{\bf x}),\\
S^{-1} &\;=\; -\left.\nabla_\theta\nabla_\theta\log p({\bf y},\theta|{\bf x})\right|_{\theta=\hat\theta},
\end{align}$

Here, $\hat\theta$ is the location of a mode of the joint target density, also known as the maximum a posteriori or MAP point, and $S^{-1}$ is the $D\times D$ positive definite matrix of second derivatives of the negative log joint target density at the mode $\theta=\hat\theta$. Thus, the Gaussian approximation matches the value and the log-curvature of the un-normalised target density at the mode. The value of $\hat\theta$ is usually found using a gradient-based method.
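
One way to see where this form comes from is to expand the log joint to second order around the mode. This is a standard argument, sketched here under the assumption that the log joint is twice differentiable at $\hat\theta$. Because the gradient vanishes at the mode, the first-order term drops out:

: $\begin{align}
\log p({\bf y},\theta|{\bf x}) &\;\simeq\; \log p({\bf y},\hat\theta|{\bf x}) + (\theta-\hat\theta)^\top\left.\nabla_\theta\log p({\bf y},\theta|{\bf x})\right|_{\theta=\hat\theta} + \tfrac{1}{2}(\theta-\hat\theta)^\top\left.\nabla_\theta\nabla_\theta\log p({\bf y},\theta|{\bf x})\right|_{\theta=\hat\theta}(\theta-\hat\theta)\\
&\;=\; \log p({\bf y},\hat\theta|{\bf x}) - \tfrac{1}{2}(\theta-\hat\theta)^\top S^{-1}(\theta-\hat\theta).
\end{align}$

Exponentiating the last line gives the un-normalised Gaussian $\tilde q(\theta)$.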

In summary, we have

: $\begin{align}
q(\theta) &\;=\; {\cal N}(\theta|\mu=\hat\theta,\Sigma=S),\\
\log Z &\;=\; \log p({\bf y},\hat\theta|{\bf x}) + \tfrac{1}{2}\log|S| + \tfrac{D}{2}\log(2\pi),
\end{align}$

for the approximate posterior over $\theta$ and the approximate log marginal likelihood respectively.
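
The two summary formulas can be checked on a small example where the exact marginal likelihood is known. The sketch below is illustrative only: the Beta–Bernoulli model, the function names <code>laplace_beta_bernoulli</code> and <code>exact_log_marginal</code>, and the default prior parameters are assumptions made for this example and do not come from the cited sources. Here $D=1$, $\theta$ is a success probability, and the mode is found with Newton's method as one possible gradient-based search.

```python
import math


def laplace_beta_bernoulli(k, n, a=2.0, b=2.0, max_iter=100, tol=1e-12):
    """Laplace's approximation for k successes in n Bernoulli trials
    under a Beta(a, b) prior (hypothetical example model, D = 1).

    Returns (theta_hat, S, log_Z): the mode, the variance of q and the
    approximate log marginal likelihood.
    """
    # Exponents of theta and (1 - theta) in the joint p(y, theta).
    alpha = k + a - 1.0
    beta = n - k + b - 1.0
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("no interior mode: need k + a > 1 and n - k + b > 1")
    # Log normalising constant of the Beta prior, log B(a, b).
    log_prior_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def log_joint(t):
        return alpha * math.log(t) + beta * math.log(1.0 - t) - log_prior_norm

    def grad(t):
        return alpha / t - beta / (1.0 - t)

    def hess(t):
        return -alpha / t**2 - beta / (1.0 - t) ** 2

    # Newton's method on the log joint, kept strictly inside (0, 1).
    theta = 0.5
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)
        theta = min(max(theta - step, 1e-12), 1.0 - 1e-12)
        if abs(step) < tol:
            break

    # S is the inverse of the negative second derivative at the mode.
    S = -1.0 / hess(theta)
    # log Z = log p(y, theta_hat) + (1/2) log|S| + (D/2) log(2 pi), with D = 1.
    log_Z = log_joint(theta) + 0.5 * math.log(S) + 0.5 * math.log(2.0 * math.pi)
    return theta, S, log_Z


def exact_log_marginal(k, n, a=2.0, b=2.0):
    """Exact log marginal likelihood, log B(k + a, n - k + b) - log B(a, b)."""
    def log_beta(p, q):
        return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)

    return log_beta(k + a, n - k + b) - log_beta(a, b)


theta_hat, S, log_Z = laplace_beta_bernoulli(k=7, n=10)
print(theta_hat, S, log_Z, exact_log_marginal(k=7, n=10))
```

For this model the posterior is itself a Beta density rather than a Gaussian, so the two log marginal likelihoods printed at the end are expected to be close but not identical.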

The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's approximation is widely used and was pioneered in the context of neural networks by David MacKay,{{cite journal |last= MacKay |first= David J. C. |date= 1992 |journal= Neural Computation |url= https://authors.library.caltech.edu/13792/1/MACnc92a.pdf |title= Bayesian Interpolation|publisher= MIT Press |volume= 4 |issue= 3 |pages= 415–447 |doi= 10.1162/neco.1992.4.3.415 |s2cid= 1762283 }} and for Gaussian processes by Williams and Barber.{{cite journal |last1= Williams |first1= Christopher K. I. |last2= Barber |first2= David |date= 1998 |journal= IEEE Transactions on Pattern Analysis and Machine Intelligence|url= https://publications.aston.ac.uk/id/eprint/4491/1/IEEE_transactions_on_pattern_analysis_20%2812%29.pdf |title =Bayesian classification with Gaussian Processes |publisher= IEEE |volume= 20 |issue= 12 |pages= 1342–1351 |doi= 10.1109/34.735807 }}
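
The symmetry and locality weaknesses described above can be seen on a skewed target. The sketch below uses a hypothetical Beta(2, 10) density on the interval (0, 1) as the target; the choice of density and the function name <code>skewed_target_check</code> are assumptions made purely for illustration. It fits the Gaussian at the mode and reports how far the Gaussian's mean is from the exact mean and how much Gaussian mass falls below zero, where the target has none.

```python
import math


def normal_cdf(x, mean, var):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def skewed_target_check(a=2.0, b=10.0):
    """Compare a Beta(a, b) target (a, b > 1) with its Gaussian fit at the mode."""
    # Mode of the Beta density, where the Gaussian is centred.
    mode = (a - 1.0) / (a + b - 2.0)
    # Negative second derivative of the log density at the mode gives 1 / S.
    precision = (a - 1.0) / mode**2 + (b - 1.0) / (1.0 - mode) ** 2
    S = 1.0 / precision
    return {
        "gaussian_mean": mode,            # a symmetric fit has mean = mode
        "exact_mean": a / (a + b),        # the skewed target's mean differs
        "gaussian_variance": S,
        # Gaussian probability assigned to theta < 0, outside the support.
        "mass_below_zero": normal_cdf(0.0, mode, S),
    }


for name, value in skewed_target_check().items():
    print(name, value)
```

Because the fit uses only the value and curvature at the mode, it cannot register the long right tail of the target, so the Gaussian's mean sits below the exact mean and a noticeable share of its mass lies outside the target's support.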


References
