Laplace's approximation

Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.[1][2] The approximation is justified by the Bernstein–von Mises theorem, which states that under regularity conditions the posterior converges to a Gaussian in large samples.[3][4]

For example, a (possibly non-linear) regression or classification model with data set $\{x_n, y_n\}_{n=1,\ldots,N}$ comprising inputs x and outputs y has (unknown) parameter vector θ of length D. The likelihood is denoted p(𝐲|𝐱,θ) and the parameter prior p(θ). Suppose one wants to approximate the joint density of outputs and parameters p(𝐲,θ|𝐱)

$$p(\mathbf{y},\theta\mid\mathbf{x}) \;=\; p(\mathbf{y}\mid\mathbf{x},\theta)\,p(\theta) \;=\; p(\mathbf{y}\mid\mathbf{x})\,p(\theta\mid\mathbf{y},\mathbf{x}) \;\simeq\; \tilde{q}(\theta) \;=\; Z q(\theta).$$

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, equal to the product of the marginal likelihood p(𝐲|𝐱) and the posterior p(θ|𝐲,𝐱). Seen as a function of θ, the joint is an un-normalised density. In Laplace's approximation we approximate the joint by an un-normalised Gaussian q̃(θ) = Zq(θ), where we use q to denote the approximate (normalised) density, q̃ the un-normalised density, and Z a constant independent of θ. Since the marginal likelihood p(𝐲|𝐱) does not depend on the parameter θ and the posterior p(θ|𝐲,𝐱) normalises over θ, we can immediately identify them with Z and q(θ) of our approximation, respectively. Laplace's approximation is

$$p(\mathbf{y},\theta\mid\mathbf{x}) \;\simeq\; p(\mathbf{y},\hat{\theta}\mid\mathbf{x}) \exp\!\left( -\tfrac{1}{2} (\theta-\hat{\theta})^{\top} S^{-1} (\theta-\hat{\theta}) \right) \;=\; \tilde{q}(\theta),$$

where we have defined

$$\hat{\theta} \;=\; \operatorname*{arg\,max}_{\theta} \log p(\mathbf{y},\theta\mid\mathbf{x}), \qquad S^{-1} \;=\; -\,\nabla_{\theta}\nabla_{\theta} \log p(\mathbf{y},\theta\mid\mathbf{x}) \big|_{\theta=\hat{\theta}},$$

where θ̂ is the location of a mode of the joint target density, also known as the maximum a posteriori (MAP) point, and S⁻¹ is the D×D positive definite matrix of second derivatives of the negative log joint target density at the mode θ = θ̂. Thus, the Gaussian approximation matches the value and the curvature of the un-normalised target density at the mode. The value of θ̂ is usually found using a gradient-based method, e.g. Newton's method. In summary, we have

$$q(\theta) \;=\; \mathcal{N}\!\left(\theta \,\middle|\, \mu=\hat{\theta},\, \Sigma=S \right), \qquad \log Z \;=\; \log p(\mathbf{y},\hat{\theta}\mid\mathbf{x}) + \tfrac{1}{2} \log |S| + \tfrac{D}{2} \log(2\pi),$$

for the approximate posterior over θ and the approximate log marginal likelihood, respectively. In the special case of Bayesian linear regression with a Gaussian prior, the approximation is exact. The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used and was pioneered in the context of neural networks by David MacKay,[5] and for Gaussian processes by Williams and Barber.[6]
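
As a concrete illustration of the recipe above, the following is a minimal sketch for a Bayesian logistic-regression model with Gaussian prior θ ∼ 𝒩(0, αI); the model, the variable names and the prior scale α are assumptions made for this example rather than anything prescribed by the method. Newton's method locates the mode θ̂, the negative Hessian of the log joint at θ̂ gives S⁻¹, and the approximate log marginal likelihood follows from the formula for log Z above.

```python
# Sketch only: Laplace's approximation for Bayesian logistic regression
# with prior theta ~ N(0, alpha*I).  Model and names are illustrative assumptions.
import numpy as np

def log_joint(theta, X, y, alpha):
    """log p(y, theta | x): Bernoulli log-likelihood plus N(0, alpha*I) log-prior."""
    f = X @ theta
    loglik = np.sum(y * f - np.logaddexp(0.0, f))   # sum_n log Bernoulli(y_n | sigmoid(f_n))
    D = theta.size
    logprior = -0.5 * (theta @ theta) / alpha - 0.5 * D * np.log(2.0 * np.pi * alpha)
    return loglik + logprior

def laplace_fit(X, y, alpha=1.0, iters=100, tol=1e-10):
    """Return the mode theta_hat, covariance S and approximate log-evidence log Z."""
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))                          # sigmoid(X theta)
        grad = X.T @ (y - p) - theta / alpha                          # gradient of log joint
        H = X.T @ (X * (p * (1.0 - p))[:, None]) + np.eye(D) / alpha  # S^{-1}: negative Hessian
        step = np.linalg.solve(H, grad)                               # Newton step
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    S = np.linalg.inv(H)                                              # covariance of q(theta)
    _, logdet_S = np.linalg.slogdet(S)
    log_Z = log_joint(theta, X, y, alpha) + 0.5 * logdet_S + 0.5 * D * np.log(2.0 * np.pi)
    return theta, S, log_Z                                            # q = N(theta | theta_hat, S)

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ np.array([1.0, -2.0, 0.5])))).astype(float)
theta_hat, S, log_Z = laplace_fit(X, y, alpha=10.0)
```

Replacing the logistic likelihood with a Gaussian likelihood that is linear in θ reproduces Bayesian linear regression, for which the expressions above are exact; this provides a convenient correctness check for such an implementation.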

References

  1. Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1991). "Laplace’s method in Bayesian analysis". Statistical Multiple Integration. Contemporary Mathematics. 115. pp. 89–100. doi:10.1090/conm/115/07. ISBN 0-8218-5122-5. 
  2. MacKay, David J. C. (2003). "Information Theory, Inference and Learning Algorithms, chapter 27: Laplace's method". http://www.inference.org.uk/mackay/itprnn/ps/341.342.pdf. 
  3. Walker, A. M. (1969). "On the Asymptotic Behaviour of Posterior Distributions". Journal of the Royal Statistical Society. Series B (Methodological) 31 (1): 80–88. 
  4. Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1990). "The Validity of Posterior Expansions Based on Laplace's Method". in Geisser, S.; Hodges, J. S.; Press, S. J. et al.. Bayesian and Likelihood Methods in Statistics and Econometrics. Elsevier. pp. 473–488. ISBN 0-444-88376-2. 
  5. MacKay, David J. C. (1992). "Bayesian Interpolation". Neural Computation (MIT Press) 4 (3): 415–447. doi:10.1162/neco.1992.4.3.415. https://authors.library.caltech.edu/13792/1/MACnc92a.pdf. 
  6. Williams, Christopher K. I.; Barber, David (1998). "Bayesian classification with Gaussian Processes". PAMI (IEEE) 20 (12): 1342–1351. doi:10.1109/34.735807. https://publications.aston.ac.uk/id/eprint/4491/1/IEEE_transactions_on_pattern_analysis_20%2812%29.pdf.