Laplace's approximation

Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.[1][2] The approximation is justified by the Bernstein–von Mises theorem, which states that under regularity conditions the posterior converges to a Gaussian in large samples.[3][4]

For example, a (possibly non-linear) regression or classification model with data set [math]\displaystyle{ \{x_n,y_n\}_{n=1,\ldots,N} }[/math] comprising inputs [math]\displaystyle{ x }[/math] and outputs [math]\displaystyle{ y }[/math] has (unknown) parameter vector [math]\displaystyle{ \theta }[/math] of length [math]\displaystyle{ D }[/math]. The likelihood is denoted [math]\displaystyle{ p({\bf y}|{\bf x},\theta) }[/math] and the parameter prior [math]\displaystyle{ p(\theta) }[/math]. Suppose one wants to approximate the joint density of outputs and parameters [math]\displaystyle{ p({\bf y},\theta|{\bf x}) }[/math]

[math]\displaystyle{ p({\bf y},\theta|{\bf x})\;=\;p({\bf y}|{\bf x},\theta)p(\theta|{\bf x})\;=\;p({\bf y}|{\bf x})p(\theta|{\bf y},{\bf x})\;\simeq\;\tilde q(\theta)\;=\;Zq(\theta). }[/math]

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, equal to the product of the marginal likelihood [math]\displaystyle{ p({\bf y}|{\bf x}) }[/math] and the posterior [math]\displaystyle{ p(\theta|{\bf y},{\bf x}) }[/math]. Seen as a function of [math]\displaystyle{ \theta }[/math], the joint is an un-normalised density. In Laplace's approximation, we approximate the joint by an un-normalised Gaussian [math]\displaystyle{ \tilde q(\theta)=Zq(\theta) }[/math], where [math]\displaystyle{ q }[/math] denotes the (normalised) approximate density, [math]\displaystyle{ \tilde q }[/math] the un-normalised density, and [math]\displaystyle{ Z }[/math] a constant (independent of [math]\displaystyle{ \theta }[/math]). Since the marginal likelihood [math]\displaystyle{ p({\bf y}|{\bf x}) }[/math] does not depend on the parameter [math]\displaystyle{ \theta }[/math], and the posterior [math]\displaystyle{ p(\theta|{\bf y},{\bf x}) }[/math] normalises over [math]\displaystyle{ \theta }[/math], we can immediately identify them with [math]\displaystyle{ Z }[/math] and [math]\displaystyle{ q(\theta) }[/math] of our approximation, respectively. Laplace's approximation is

[math]\displaystyle{ p({\bf y},\theta|{\bf x})\;\simeq\;p({\bf y},\hat\theta|{\bf x})\exp\big(-\tfrac{1}{2}(\theta-\hat\theta)^{\mathsf T}S^{-1}(\theta-\hat\theta)\big)\;=\;\tilde q(\theta), }[/math]

where we have defined

[math]\displaystyle{ \begin{align} \hat\theta &\;=\; \operatorname{argmax}_\theta \log p({\bf y},\theta|{\bf x}),\\ S^{-1} &\;=\; -\left.\nabla_\theta\nabla_\theta\log p({\bf y},\theta|{\bf x})\right|_{\theta=\hat\theta}, \end{align} }[/math]

where [math]\displaystyle{ \hat\theta }[/math] is the location of a mode of the joint target density, also known as the maximum a posteriori (MAP) point, and [math]\displaystyle{ S^{-1} }[/math] is the [math]\displaystyle{ D\times D }[/math] positive-definite matrix of second derivatives of the negative log joint target density at the mode [math]\displaystyle{ \theta=\hat\theta }[/math]. Thus, the Gaussian approximation matches the value and the curvature of the un-normalised target density at the mode. The value of [math]\displaystyle{ \hat\theta }[/math] is usually found using a gradient-based method, e.g. Newton's method.
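
As an illustration only, the following Python sketch makes the construction concrete for a hypothetical logistic-regression model with a standard normal prior on [math]\displaystyle{ \theta }[/math]: it builds the un-normalised log joint, locates [math]\displaystyle{ \hat\theta }[/math] by minimising the negative log joint with a quasi-Newton optimiser, and evaluates [math]\displaystyle{ S^{-1} }[/math] at the mode. The synthetic data and all names (log_joint, theta_hat, S_inv) are assumptions of this sketch, not part of the method's definition.

import numpy as np
from scipy.optimize import minimize

# Synthetic data for a hypothetical logistic-regression model (illustration only).
rng = np.random.default_rng(0)
N, D = 100, 3                                  # number of data points, parameter dimension
x = rng.normal(size=(N, D))                    # inputs x_n
true_theta = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x @ true_theta)))  # binary outputs y_n

def log_joint(theta):
    # Un-normalised log target: log p(y | x, theta) + log p(theta), with an N(0, I) prior.
    logits = x @ theta
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))          # Bernoulli log-likelihood
    log_prior = -0.5 * theta @ theta - 0.5 * D * np.log(2.0 * np.pi)  # standard normal prior
    return log_lik + log_prior

# MAP estimate: maximise log p(y, theta | x) by minimising its negative.
theta_hat = minimize(lambda th: -log_joint(th), x0=np.zeros(D), method="BFGS").x

# Curvature at the mode: Hessian of the negative log joint. For the Bernoulli likelihood
# this is X^T W X with W = diag(p(1 - p)), plus the identity contributed by the N(0, I) prior.
p = 1.0 / (1.0 + np.exp(-x @ theta_hat))
S_inv = x.T @ ((p * (1.0 - p))[:, None] * x) + np.eye(D)
S = np.linalg.inv(S_inv)                       # covariance of the Gaussian approximation

In summary, we have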

[math]\displaystyle{ \begin{align} q(\theta) &\;=\; {\cal N}(\theta|\mu=\hat\theta,\Sigma=S),\\ \log Z &\;=\; \log p({\bf y},\hat\theta|{\bf x}) + \tfrac{1}{2}\log|S| + \tfrac{D}{2}\log(2\pi), \end{align} }[/math]

for the approximate posterior over [math]\displaystyle{ \theta }[/math] and the approximate log marginal likelihood, respectively. In the special case of Bayesian linear regression with a Gaussian prior, the approximation is exact. The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used; it was pioneered in the context of neural networks by David MacKay,[5] and for Gaussian processes by Williams and Barber.[6]
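
Continuing the hypothetical sketch above, the approximate posterior [math]\displaystyle{ q(\theta) }[/math] and the approximate log marginal likelihood [math]\displaystyle{ \log Z }[/math] can be assembled from the illustrative quantities theta_hat, S and log_joint; here multivariate_normal merely packages the Gaussian so that approximate posterior samples can be drawn.

from scipy.stats import multivariate_normal

# Approximate posterior q(theta) = N(theta | theta_hat, S).
q = multivariate_normal(mean=theta_hat, cov=S)

# Approximate log marginal likelihood:
# log Z = log p(y, theta_hat | x) + (1/2) log|S| + (D/2) log(2*pi).
_, logdet_S = np.linalg.slogdet(S)
log_Z = log_joint(theta_hat) + 0.5 * logdet_S + 0.5 * D * np.log(2.0 * np.pi)

posterior_samples = q.rvs(size=1000, random_state=rng)  # draws from the approximate posterior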

References

  1. Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1991). "Laplace’s method in Bayesian analysis". Statistical Multiple Integration. Contemporary Mathematics. 115. pp. 89–100. doi:10.1090/conm/115/07. ISBN 0-8218-5122-5. 
  2. MacKay, David J. C. (2003). "Information Theory, Inference and Learning Algorithms, chapter 27: Laplace's method". http://www.inference.org.uk/mackay/itprnn/ps/341.342.pdf. 
  3. Walker, A. M. (1969). "On the Asymptotic Behaviour of Posterior Distributions". Journal of the Royal Statistical Society. Series B (Methodological) 31 (1): 80–88. 
  4. Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1990). "The Validity of Posterior Expansions Based on Laplace's Method". In Geisser, S.; Hodges, J. S.; Press, S. J. et al. (eds.). Bayesian and Likelihood Methods in Statistics and Econometrics. Elsevier. pp. 473–488. ISBN 0-444-88376-2.
  5. MacKay, David J. C. (1992). "Bayesian Interpolation". Neural Computation (MIT Press) 4 (3): 415–447. doi:10.1162/neco.1992.4.3.415. https://authors.library.caltech.edu/13792/1/MACnc92a.pdf. 
  6. Williams, Christopher K. I.; Barber, David (1998). "Bayesian classification with Gaussian Processes". IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (12): 1342–1351. doi:10.1109/34.735807. https://publications.aston.ac.uk/id/eprint/4491/1/IEEE_transactions_on_pattern_analysis_20%2812%29.pdf.