Kullback–Leibler divergence

From HandWiki
Short description: Mathematical statistics measurement

In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence[1]), denoted [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math], is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q.[2][3] A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a distance, it is not a metric, the most familiar type of distance: it is not symmetric in the two distributions (in contrast to variation of information), and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence,[4] a generalization of squared distance, and for certain classes of distributions (notably an exponential family), it satisfies a generalized Pythagorean theorem (which applies to squared distances).[5]

In the simple case, a relative entropy of 0 indicates that the two distributions in question are identical (almost everywhere). Relative entropy is a nonnegative function of two distributions or measures. It has diverse applications, both theoretical, such as characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference; and practical, such as applied statistics, fluid mechanics, neuroscience and bioinformatics.

Introduction and context

Consider two probability distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math]. Usually, [math]\displaystyle{ P }[/math] represents the data, the observations, or a measured probability distribution. Distribution [math]\displaystyle{ Q }[/math] represents instead a theory, a model, a description or an approximation of [math]\displaystyle{ P }[/math]. The Kullback–Leibler divergence is then interpreted as the average difference in the number of bits required for encoding samples of [math]\displaystyle{ P }[/math] using a code optimized for [math]\displaystyle{ Q }[/math] rather than one optimized for [math]\displaystyle{ P }[/math]. Note that the roles of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] can be reversed in some situations where that is easier to compute, such as with the expectation–maximization (EM) algorithm and evidence lower bound (ELBO) computations.

Etymology

The relative entropy was introduced by Solomon Kullback and Richard Leibler in (Kullback & Leibler) as "the mean information for discrimination between [math]\displaystyle{ H_1 }[/math] and [math]\displaystyle{ H_2 }[/math] per observation from [math]\displaystyle{ \mu_1 }[/math]",[6] where one is comparing two probability measures [math]\displaystyle{ \mu_1, \mu_2 }[/math], and [math]\displaystyle{ H_1, H_2 }[/math] are the hypotheses that one is selecting from measures [math]\displaystyle{ \mu_1, \mu_2 }[/math] (respectively). They denoted this by [math]\displaystyle{ I(1:2) }[/math], and defined the "'divergence' between [math]\displaystyle{ \mu_1 }[/math] and [math]\displaystyle{ \mu_2 }[/math]" as the symmetrized quantity [math]\displaystyle{ J(1,2) = I(1:2) + I(2:1) }[/math], which had already been defined and used by Harold Jeffreys in 1948.[7] In (Kullback 1959), the symmetrized form is again referred to as the "divergence", and the relative entropies in each direction are referred to as "directed divergences" between two distributions;[8] Kullback preferred the term discrimination information.[9] The term "divergence" is in contrast to a distance (metric), since the symmetrized divergence does not satisfy the triangle inequality.[10] Numerous references to earlier uses of the symmetrized divergence and to other statistical distances are given in (Kullback 1959). The asymmetric "directed divergence" has come to be known as the Kullback–Leibler divergence, while the symmetrized "divergence" is now referred to as the Jeffreys divergence.

Definition

For discrete probability distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] defined on the same sample space, [math]\displaystyle{ \mathcal{X} }[/math], the relative entropy from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is defined[11] to be

[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \sum_{x\in\mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right), }[/math]

which is equivalent to

[math]\displaystyle{ D_\text{KL}(P \parallel Q) = -\sum_{x\in\mathcal{X}} P(x) \log\left(\frac{Q(x)}{P(x)}\right). }[/math]

In other words, it is the expectation of the logarithmic difference between the probabilities [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], where the expectation is taken using the probabilities [math]\displaystyle{ P }[/math].

Relative entropy is defined in this way only if, for all [math]\displaystyle{ x }[/math], [math]\displaystyle{ Q(x) = 0 }[/math] implies [math]\displaystyle{ P(x) = 0 }[/math] (absolute continuity). Otherwise, it is often defined as [math]\displaystyle{ +\infty }[/math],[1] but the value [math]\displaystyle{ +\infty }[/math] is possible even if [math]\displaystyle{ Q(x) \ne 0 }[/math] everywhere,[12][13] provided that [math]\displaystyle{ \mathcal{X} }[/math] is infinite. Analogous comments apply to the continuous and general measure cases defined below.

Whenever [math]\displaystyle{ P(x) }[/math] is zero, the contribution of the corresponding term is interpreted as zero because

[math]\displaystyle{ \lim_{x \to 0^{+}} x \log(x) = 0. }[/math]
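
The discrete definition and its two conventions can be sketched in Python as follows. This is a minimal illustration, assuming both distributions are given as equal-length lists of probabilities over the same outcomes; the function name kl_divergence and its base parameter (2 for bits, e for nats) are illustrative choices, not part of any particular library.

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for two discrete distributions given as probability lists."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same outcomes")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention: a term with P(x) = 0 contributes zero
        if qx == 0.0:
            return math.inf  # P(x) > 0 but Q(x) = 0: absolute continuity fails
        total += px * math.log(px / qx)
    return total / math.log(base)  # convert from nats to the requested base

print(kl_divergence([0.5, 0.5, 0.0], [0.25, 0.25, 0.5], base=2))  # 1.0 (bits)
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))                      # inf
```

The second call returns infinity because the second outcome has positive probability under P but zero probability under Q.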

For distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] of a continuous random variable, relative entropy is defined to be the integral:[14]:p. 55

[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \int_{-\infty}^\infty p(x) \log\left(\frac{p(x)}{q(x)}\right)\, dx }[/math]

where [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] denote the probability densities of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math].

More generally, if [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] are probability measures over a set [math]\displaystyle{ {\mathcal{X}} }[/math], and [math]\displaystyle{ P }[/math] is absolutely continuous with respect to [math]\displaystyle{ Q }[/math], then the relative entropy from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is defined as

[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} \log\left(\frac{dP}{dQ}\right)\, dP, }[/math]

where [math]\displaystyle{ \frac{dP}{dQ} }[/math] is the Radon–Nikodym derivative of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math], and provided the expression on the right-hand side exists. Equivalently (by the chain rule), this can be written as

[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} \log\left(\frac{dP}{dQ}\right) \frac{dP}{dQ}\, dQ, }[/math]

which is the entropy of [math]\displaystyle{ P }[/math] relative to [math]\displaystyle{ Q }[/math]. Continuing in this case, if [math]\displaystyle{ \mu }[/math] is any measure on [math]\displaystyle{ \mathcal{X} }[/math] for which the densities [math]\displaystyle{ p = \frac{dP}{d\mu} }[/math] and [math]\displaystyle{ q = \frac{dQ}{d\mu} }[/math] exist (meaning that [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] are absolutely continuous with respect to [math]\displaystyle{ \mu }[/math]), then the relative entropy from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is given as

[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} p \log\left(\frac{p}{q}\right)\, d\mu. }[/math]

Note that there is no loss of generality in assuming the existence of densities, since [math]\displaystyle{ \mu }[/math] can always be taken to be [math]\displaystyle{ \mu = \frac{1}{2} \left(P+Q\right) }[/math]. The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base [math]\displaystyle{ e }[/math] if information is measured in nats. Most formulas involving relative entropy hold regardless of the base of the logarithm.

Various conventions exist for referring to [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] in words. Often it is referred to as the divergence between [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], but this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be described as the divergence of [math]\displaystyle{ P }[/math] from [math]\displaystyle{ Q }[/math] or as the divergence from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math]. This reflects the asymmetry in Bayesian inference, which starts from a prior [math]\displaystyle{ Q }[/math] and updates to the posterior [math]\displaystyle{ P }[/math]. Another common way to refer to [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] is as the relative entropy of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math].

Basic example

Kullback[3] gives the following example (Table 2.1, Example 2.1). Let P and Q be the distributions shown in the table and figure. P is the distribution on the left side of the figure, a binomial distribution with [math]\displaystyle{ N = 2 }[/math] and [math]\displaystyle{ p = 0.4 }[/math]. Q is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes [math]\displaystyle{ x = }[/math] 0, 1, 2 (i.e. [math]\displaystyle{ \mathcal{X}=\{0,1,2\} }[/math]), each with probability [math]\displaystyle{ p = 1/3 }[/math].

Two distributions to illustrate relative entropy

| [math]\displaystyle{ x }[/math] | 0 | 1 | 2 |
| Distribution [math]\displaystyle{ P(x) }[/math] | [math]\displaystyle{ \frac 9{25} }[/math] | [math]\displaystyle{ \frac{12}{25} }[/math] | [math]\displaystyle{ \frac 4{25} }[/math] |
| Distribution [math]\displaystyle{ Q(x) }[/math] | [math]\displaystyle{ \frac 1 3 }[/math] | [math]\displaystyle{ \frac 1 3 }[/math] | [math]\displaystyle{ \frac 1 3 }[/math] |

Relative entropies [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] and [math]\displaystyle{ D_\text{KL}(Q \parallel P) }[/math] are calculated as follows. This example uses the natural log with base e, designated ln, to get results in nats (see units of information).

[math]\displaystyle{ \begin{align} D_\text{KL}(P \parallel Q) &= \sum_{x\in\mathcal{X}} P(x) \ln\left(\frac{P(x)}{Q(x)}\right) \\ &= \frac{9}{25} \ln\left(\frac{9/25}{1/3}\right) + \frac{12}{25} \ln\left(\frac{12/25}{1/3}\right) + \frac{4}{25} \ln\left(\frac{4/25}{1/3}\right) \\ & = \frac{1}{25} \left(32 \ln(2) + 55 \ln(3) - 50 \ln(5) \right) \approx 0.0852996 \end{align} }[/math]
[math]\displaystyle{ \begin{align} D_\text{KL}(Q \parallel P) &= \sum_{x\in\mathcal{X}} Q(x) \ln\left(\frac{Q(x)}{P(x)}\right) \\ &= \frac{1}{3} \ln\left(\frac{1/3}{9/25}\right) + \frac{1}{3} \ln\left(\frac{1/3}{12/25}\right) + \frac{1}{3} \ln\left(\frac{1/3}{4/25}\right) \\ &= \frac{1}{3} \left(-4 \ln(2) - 6 \ln(3) + 6 \ln(5) \right) \approx 0.097455 \end{align} }[/math]
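
The two values above can be checked numerically. The following Python sketch recomputes both sums and compares them with the closed forms; the helper name kl is an illustrative choice, and the code assumes all probabilities are strictly positive, as they are in this example.

```python
from fractions import Fraction
from math import log

P = [Fraction(9, 25), Fraction(12, 25), Fraction(4, 25)]  # binomial, N = 2, p = 0.4
Q = [Fraction(1, 3)] * 3                                   # uniform on {0, 1, 2}

def kl(p, q):
    """Discrete relative entropy in nats (all probabilities assumed positive)."""
    return sum(float(px) * log(px / qx) for px, qx in zip(p, q))

d_pq = kl(P, Q)
d_qp = kl(Q, P)

# Closed forms taken from the worked example.
closed_pq = (32 * log(2) + 55 * log(3) - 50 * log(5)) / 25
closed_qp = (-4 * log(2) - 6 * log(3) + 6 * log(5)) / 3

print(round(d_pq, 7), round(d_qp, 6))  # 0.0852996 0.097455
```

The two directions give different values, which illustrates the asymmetry of relative entropy.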

Interpretations

The relative entropy from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is often denoted [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math].

In the context of machine learning, [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] is often called the information gain achieved if [math]\displaystyle{ P }[/math] were used instead of [math]\displaystyle{ Q }[/math], which is currently used. By analogy with information theory, it is called the relative entropy of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math]. In the context of coding theory, [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] can be construed as measuring the expected number of extra bits required to code samples from [math]\displaystyle{ P }[/math] using a code optimized for [math]\displaystyle{ Q }[/math] rather than the code optimized for [math]\displaystyle{ P }[/math].

Expressed in the language of Bayesian inference, [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] is a measure of the information gained by revising one's beliefs from the prior probability distribution [math]\displaystyle{ Q }[/math] to the posterior probability distribution [math]\displaystyle{ P }[/math]. In other words, it is the amount of information lost when [math]\displaystyle{ Q }[/math] is used to approximate [math]\displaystyle{ P }[/math].[15] In applications, [math]\displaystyle{ P }[/math] typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while [math]\displaystyle{ Q }[/math] typically represents a theory, model, description, or approximation of [math]\displaystyle{ P }[/math]. In order to find a distribution [math]\displaystyle{ Q }[/math] that is closest to [math]\displaystyle{ P }[/math], we can minimize KL divergence and compute an information projection.

While it is a statistical distance, it is not a metric, the most familiar type of distance, but instead it is a divergence.[4] While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] does not equal [math]\displaystyle{ D_\text{KL}(Q \parallel P) }[/math], and the asymmetry is an important part of the geometry.[4] The infinitesimal form of relative entropy, specifically its Hessian, gives a metric tensor that equals the Fisher information metric; see § Fisher information metric. Relative entropy satisfies a generalized Pythagorean theorem for exponential families (geometrically interpreted as dually flat manifolds), and this allows one to minimize relative entropy by geometric means, for example by information projection and in maximum likelihood estimation.[5]

Relative entropy is a special case of a broader class of statistical divergences called f-divergences as well as the class of Bregman divergences, and it is the only such divergence over probabilities that is a member of both classes.

Arthur Hobson proved that relative entropy is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy.[16] Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.

Motivation

Illustration of the relative entropy for two normal distributions. The typical asymmetry is clearly visible.

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value [math]\displaystyle{ x_i }[/math] out of a set of possibilities [math]\displaystyle{ X }[/math] can be seen as representing an implicit probability distribution [math]\displaystyle{ q(x_i)=2^{-\ell_i} }[/math] over [math]\displaystyle{ X }[/math], where [math]\displaystyle{ \ell_i }[/math] is the length of the code for [math]\displaystyle{ x_i }[/math] in bits. Therefore, relative entropy can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution [math]\displaystyle{ Q }[/math] is used, compared to using a code based on the true distribution [math]\displaystyle{ P }[/math]: it is the excess entropy.

[math]\displaystyle{ \begin{align} D_\text{KL}(P\parallel Q) &= \sum_{x\in\mathcal{X}} p(x) \log \frac{1}{q(x)} - \sum_{x\in\mathcal{X}} p(x) \log \frac{1}{p(x)} \\[5pt] &= \Eta(P, Q) - \Eta(P) \end{align} }[/math]

where [math]\displaystyle{ \Eta(P,Q) }[/math] is the cross entropy of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], and [math]\displaystyle{ \Eta(P) }[/math] is the entropy of [math]\displaystyle{ P }[/math] (which is the same as the cross-entropy of P with itself).
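
As a small numerical illustration of this decomposition, the Python sketch below uses two made-up distributions whose probabilities are powers of two, so that the implied code lengths are whole numbers of bits. The specific values are assumptions chosen for convenience.

```python
from math import log2

p = [0.5, 0.25, 0.25]   # "true" distribution P (illustrative values)
q = [0.25, 0.25, 0.5]   # distribution Q the code was optimized for (illustrative)

# Expected code length when using code lengths log2(1/q(x)): the cross entropy.
cross_entropy = sum(px * log2(1 / qx) for px, qx in zip(p, q))
# Expected code length when using code lengths log2(1/p(x)): the entropy of P.
entropy = sum(px * log2(1 / px) for px in p)
# Relative entropy computed directly from its definition, in bits.
kl = sum(px * log2(px / qx) for px, qx in zip(p, q))

print(cross_entropy, entropy, kl)  # 1.75 1.5 0.25
```

Here the code built for Q costs 1.75 bits per symbol on average, the code built for P costs 1.5 bits, and the difference of 0.25 bits is the relative entropy.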

The relative entropy [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] can be thought of geometrically as a statistical distance, a measure of how far the distribution Q is from the distribution P. Geometrically it is a divergence: an asymmetric, generalized form of squared distance. The cross-entropy [math]\displaystyle{ \Eta(P,Q) }[/math] is itself such a measurement (formally a loss function), but it cannot be thought of as a distance, since [math]\displaystyle{ \Eta(P,P)=:\Eta(P) }[/math] is not zero. This can be fixed by subtracting [math]\displaystyle{ \Eta(P) }[/math] to make [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] agree more closely with our notion of distance, as the excess loss. The resulting function is asymmetric, and while this can be symmetrized (see § Symmetrised divergence), the asymmetric form is more useful. See § Interpretations for more on the geometric interpretation.

Relative entropy relates to the "rate function" in the theory of large deviations.[17][18]

Properties

  • Relative entropy is always non-negative, [math]\displaystyle{ D_\text{KL}(P \parallel Q) \geq 0, }[/math] a result known as Gibbs' inequality, with [math]\displaystyle{ D_\text{KL}(P\parallel Q) }[/math] equal to zero if and only if [math]\displaystyle{ P = Q }[/math] almost everywhere. The entropy [math]\displaystyle{ \Eta(P) }[/math] thus sets a minimum value for the cross-entropy [math]\displaystyle{ \Eta(P, Q) }[/math], the expected number of bits required when using a code based on [math]\displaystyle{ Q }[/math] rather than [math]\displaystyle{ P }[/math]; and the Kullback–Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value [math]\displaystyle{ x }[/math] drawn from [math]\displaystyle{ X }[/math], if a code is used corresponding to the probability distribution [math]\displaystyle{ Q }[/math], rather than the "true" distribution [math]\displaystyle{ P }[/math].
  • No upper bound exists for the general case. However, it has been shown that if [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] are two discrete probability distributions built by distributing the same discrete quantity, then the maximum value of [math]\displaystyle{ D_\text{KL}(P\parallel Q) }[/math] can be calculated.[19]
  • Relative entropy remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. For example, if a transformation is made from variable [math]\displaystyle{ x }[/math] to variable [math]\displaystyle{ y(x) }[/math], then, since [math]\displaystyle{ P(x) \, dx = P(y) \, dy }[/math] and [math]\displaystyle{ Q(x) \, dx = Q(y) \, dy }[/math] the relative entropy may be rewritten: [math]\displaystyle{ \begin{align} D_\text{KL}(P \parallel Q) &= \int_{x_a}^{x_b} P(x) \log\left(\frac{P(x)}{Q(x)}\right)\, dx \\[6pt] &= \int_{y_a}^{y_b} P(y) \log\left(\frac{P(y)\, \frac{dy}{dx}}{Q(y)\, \frac{dy}{dx}}\right)\, dy \\ &= \int_{y_a}^{y_b} P(y)\log\left(\frac{P(y)}{Q(y)}\right)\, dy \end{align} }[/math] where [math]\displaystyle{ y_a = y(x_a) }[/math] and [math]\displaystyle{ y_b = y(x_b) }[/math]. Although it was assumed that the transformation was continuous, this need not be the case. This also shows that the relative entropy produces a dimensionally consistent quantity, since if [math]\displaystyle{ x }[/math] is a dimensioned variable, [math]\displaystyle{ P(x) }[/math] and [math]\displaystyle{ Q(x) }[/math] are also dimensioned, since e.g. [math]\displaystyle{ P(x) \, dx }[/math] is dimensionless. The argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a more fundamental quantity than some other properties in information theory[20] (such as self-information or Shannon entropy), which can become undefined or negative for non-discrete probabilities.
  • Relative entropy is additive for independent distributions in much the same way as Shannon entropy. If [math]\displaystyle{ P_1, P_2 }[/math] are independent distributions, with the joint density [math]\displaystyle{ P(x, y) = P_1(x)P_2(y) }[/math], and [math]\displaystyle{ Q, Q_1, Q_2 }[/math] likewise, then [math]\displaystyle{ D_\text{KL}(P \parallel Q) = D_\text{KL}(P_1 \parallel Q_1) + D_\text{KL}(P_2 \parallel Q_2). }[/math]
  • Relative entropy [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] is convex in the pair of probability mass functions [math]\displaystyle{ (p,q) }[/math], i.e. if [math]\displaystyle{ (p_1,q_1) }[/math] and [math]\displaystyle{ (p_2,q_2) }[/math] are two pairs of probability mass functions, then [math]\displaystyle{ D_\text{KL}(\lambda p_1 + (1 - \lambda) p_2 \parallel \lambda q_1 + (1 - \lambda) q_2) \le \lambda D_\text{KL}(p_1 \parallel q_1) + (1 - \lambda)D_\text{KL}(p_2 \parallel q_2) \text{ for } 0 \le \lambda \le 1. }[/math]
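
Some of the properties listed above (non-negativity, additivity for independent distributions, and joint convexity) can be spot-checked numerically. The Python sketch below does this for a few made-up discrete distributions; it is a sanity check on particular values, not a proof, and the helper names are illustrative.

```python
from itertools import product
from math import log

def kl(p, q):
    """Discrete D_KL(p || q) in nats; assumes strictly positive probabilities."""
    return sum(px * log(px / qx) for px, qx in zip(p, q))

def mix(u, v, lam):
    """Convex combination lam * u + (1 - lam) * v of two probability lists."""
    return [lam * x + (1 - lam) * y for x, y in zip(u, v)]

# Additivity: build joint distributions from independent components.
p1, q1 = [0.7, 0.3], [0.4, 0.6]
p2, q2 = [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]
p_joint = [a * b for a, b in product(p1, p2)]
q_joint = [a * b for a, b in product(q1, q2)]
additive_gap = kl(p_joint, q_joint) - (kl(p1, q1) + kl(p2, q2))

# Joint convexity: mix two pairs defined on the same two outcomes.
pa, qa = [0.7, 0.3], [0.4, 0.6]
pb, qb = [0.1, 0.9], [0.5, 0.5]
lam = 0.3
lhs = kl(mix(pa, pb, lam), mix(qa, qb, lam))
rhs = lam * kl(pa, qa) + (1 - lam) * kl(pb, qb)

print(kl(p1, q1) >= 0, abs(additive_gap) < 1e-12, lhs <= rhs)  # True True True
```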

Duality formula for variational inference

The following result, due to Donsker and Varadhan,[21] is known as Donsker and Varadhan's variational formula.

Theorem [Duality Formula for Variational Inference] — Let [math]\displaystyle{ \Theta }[/math] be a set endowed with an appropriate [math]\displaystyle{ \sigma }[/math]-field [math]\displaystyle{ \mathcal{F} }[/math], and two probability measures [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], which formulate two probability spaces [math]\displaystyle{ (\Theta,\mathcal{F},P) }[/math] and [math]\displaystyle{ (\Theta,\mathcal{F},Q) }[/math] with [math]\displaystyle{ Q \ll P }[/math]. ([math]\displaystyle{ Q \ll P }[/math] indicates that [math]\displaystyle{ Q }[/math] is absolutely continuous with respect to [math]\displaystyle{ P }[/math].) Let [math]\displaystyle{ h }[/math] be a real-valued integrable random variable on [math]\displaystyle{ (\Theta,\mathcal{F},P) }[/math]. Then the following equality holds

[math]\displaystyle{ \log E_P[\exp h] = \sup_{Q \ll P} \{ E_Q[h] - D_\text{KL}(Q \parallel P)\}. }[/math]

Further, the supremum on the right-hand side is attained if and only if

[math]\displaystyle{ \frac{dQ(\theta)}{dP(\theta)} = \frac{\exp h(\theta)}{E_P[\exp h]}, }[/math]

almost surely with respect to probability measure [math]\displaystyle{ P }[/math], where [math]\displaystyle{ \frac{dQ(\theta)}{dP(\theta)} }[/math] denotes the Radon–Nikodym derivative of [math]\displaystyle{ Q }[/math] with respect to [math]\displaystyle{ P }[/math].

For an alternative proof using measure theory, see [22].
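
On a finite set the duality formula can be verified directly. The Python sketch below is a minimal check under the assumption that the set has three points, with made-up values for the measure P and the function h. It confirms that the stated maximizer attains log E_P[exp h] and that two other choices of Q stay below it.

```python
from math import exp, log

P = [0.2, 0.5, 0.3]     # reference measure P on three points (illustrative)
h = [1.0, -0.5, 2.0]    # values of the function h (illustrative)

def kl(q, p):
    """Discrete D_KL(q || p) in nats, skipping terms with q(x) = 0."""
    return sum(qx * log(qx / px) for qx, px in zip(q, p) if qx > 0)

def objective(q):
    """E_Q[h] - D_KL(Q || P), the quantity maximized on the right-hand side."""
    return sum(qx * hx for qx, hx in zip(q, h)) - kl(q, P)

z = sum(px * exp(hx) for px, hx in zip(P, h))   # E_P[exp h]
log_mgf = log(z)                                # left-hand side

# Maximizer from the theorem: dQ/dP = exp(h) / E_P[exp h].
q_star = [px * exp(hx) / z for px, hx in zip(P, h)]

print(abs(objective(q_star) - log_mgf) < 1e-12)                        # True
print(objective(P) <= log_mgf, objective([0.1, 0.1, 0.8]) <= log_mgf)  # True True
```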

Examples

Multivariate normal distributions

Suppose that we have two multivariate normal distributions, with means [math]\displaystyle{ \mu_0, \mu_1 }[/math] and with (non-singular) covariance matrices [math]\displaystyle{ \Sigma_0, \Sigma_1. }[/math] If the two distributions have the same dimension, [math]\displaystyle{ k }[/math], then the relative entropy between the distributions is as follows:[23]:p. 13

[math]\displaystyle{ D_\text{KL}\left(\mathcal{N}_0 \parallel \mathcal{N}_1\right) = \frac{1}{2}\left( \operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) - k + \left(\mu_1 - \mu_0\right)^\mathsf{T} \Sigma_1^{-1}\left(\mu_1 - \mu_0\right) + \ln\left(\frac{\det\Sigma_1}{\det\Sigma_0}\right) \right). }[/math]

The logarithm in the last term must be taken to base e since all terms apart from the last are base-e logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by [math]\displaystyle{ \ln(2) }[/math] yields the divergence in bits.

In a numerical implementation, it is helpful to express the result in terms of the Cholesky decompositions [math]\displaystyle{ L_0, L_1 }[/math] such that [math]\displaystyle{ \Sigma_0 = L_0L_0^T }[/math] and [math]\displaystyle{ \Sigma_1 = L_1L_1^T }[/math]. Then with [math]\displaystyle{ M }[/math] and [math]\displaystyle{ y }[/math] solutions to the triangular linear systems [math]\displaystyle{ L_1 M = L_0 }[/math], and [math]\displaystyle{ L_1 y = \mu_1 - \mu_0 }[/math],

[math]\displaystyle{ D_\text{KL}\left(\mathcal{N}_0 \parallel \mathcal{N}_1\right) = \frac{1}{2}\left( \sum_{i,j=1}^k M_{ij}^2 - k + |y|^2 + 2\sum_{i=1}^k \ln \frac{(L_1)_{ii}}{(L_0)_{ii}} \right). }[/math]
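
The Cholesky-based evaluation can be sketched in plain Python as follows. The trace term [math]\displaystyle{ \operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) }[/math] equals the sum of the squares of all entries of [math]\displaystyle{ M }[/math], which is what the code accumulates. The helper names (cholesky, forward_solve, kl_gaussians) and the test matrices are illustrative assumptions; in practice a linear algebra library would normally replace the hand-written helpers. The result is compared with a direct evaluation of the trace and determinant formula in the 2×2 case.

```python
from math import log, sqrt

def cholesky(a):
    """Lower-triangular L with a = L L^T (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L x = b for lower-triangular L by forward substitution."""
    x = []
    for i, row in enumerate(L):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x

def kl_gaussians(mu0, sigma0, mu1, sigma1):
    """D_KL(N(mu0, sigma0) || N(mu1, sigma1)) in nats via Cholesky factors."""
    k = len(mu0)
    L0, L1 = cholesky(sigma0), cholesky(sigma1)
    # Solve L1 M = L0 one column at a time; only the sum of squares is needed.
    cols = [forward_solve(L1, [L0[i][j] for i in range(k)]) for j in range(k)]
    frob_sq = sum(v * v for col in cols for v in col)
    y = forward_solve(L1, [a - b for a, b in zip(mu1, mu0)])
    log_det = 2.0 * sum(log(L1[i][i] / L0[i][i]) for i in range(k))
    return 0.5 * (frob_sq - k + sum(v * v for v in y) + log_det)

mu0, sigma0 = [0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]]
mu1, sigma1 = [1.0, -1.0], [[1.0, 0.2], [0.2, 3.0]]

# Direct evaluation of the trace/determinant formula for the 2x2 case.
det0 = sigma0[0][0] * sigma0[1][1] - sigma0[0][1] ** 2
det1 = sigma1[0][0] * sigma1[1][1] - sigma1[0][1] ** 2
inv1 = [[sigma1[1][1] / det1, -sigma1[0][1] / det1],
        [-sigma1[1][0] / det1, sigma1[0][0] / det1]]
trace = sum(inv1[i][j] * sigma0[j][i] for i in range(2) for j in range(2))
d = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
quad = sum(d[i] * inv1[i][j] * d[j] for i in range(2) for j in range(2))
direct = 0.5 * (trace - 2 + quad + log(det1 / det0))

print(abs(kl_gaussians(mu0, sigma0, mu1, sigma1) - direct) < 1e-12)  # True
```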

A special case, and a common quantity in variational inference, is the relative entropy between a diagonal multivariate normal and a standard normal distribution (with zero mean and unit variance):

[math]\displaystyle{ D_\text{KL}\left( \mathcal{N}\left(\left(\mu_1, \ldots, \mu_k\right)^\mathsf{T}, \operatorname{diag} \left(\sigma_1^2, \ldots, \sigma_k^2\right)\right) \parallel \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right) \right) = {1 \over 2} \sum_{i=1}^k \left(\sigma_i^2 + \mu_i^2 - 1 - \ln\left(\sigma_i^2\right)\right). }[/math]
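
This special case reduces to a sum over coordinates and is straightforward to code. The sketch below is a minimal Python version, assuming the means and variances are passed as plain lists; the function name is an illustrative choice.

```python
from math import log

def kl_diag_vs_standard(mu, var):
    """D_KL(N(mu, diag(var)) || N(0, I)) in nats; var holds the variances sigma_i^2."""
    return 0.5 * sum(v + m * m - 1.0 - log(v) for m, v in zip(mu, var))

print(kl_diag_vs_standard([0.0, 0.0], [1.0, 1.0]))  # 0.0, the distributions coincide
print(kl_diag_vs_standard([1.0, -2.0], [0.5, 2.0]))
```

In the second call the two logarithm terms cancel (ln 0.5 + ln 2 = 0), so the value is (0.5 + 1 + 2 + 4 - 2) / 2 = 2.75 nats up to rounding.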

Relation to metrics

While relative entropy is a statistical distance, it is not a metric on the space of probability distributions, but instead it is a divergence.[4] While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] does not equal [math]\displaystyle{ D_\text{KL}(Q \parallel P) }[/math], and while this can be symmetrized (see § Symmetrised divergence), the asymmetry is an important part of the geometry.[4]

Relative entropy generates a topology on the space of probability distributions. More concretely, if [math]\displaystyle{ \{P_1,P_2,\ldots\} }[/math] is a sequence of distributions such that

[math]\displaystyle{ \lim_{n \to \infty} D_\text{KL}(P_n\parallel Q) = 0 }[/math]

then it is said that

[math]\displaystyle{ P_n \xrightarrow{D} Q . }[/math]

Pinsker's inequality entails that

[math]\displaystyle{ P_n \xrightarrow{D} P \Rightarrow P_n \xrightarrow{TV} P, }[/math]

where the latter stands for the usual convergence in total variation.
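
For discrete distributions, Pinsker's inequality bounds the total variation distance by the square root of half the relative entropy (measured in nats). The Python sketch below illustrates the implication on a made-up sequence of distributions approaching a fixed Q; the sequence and helper names are assumptions chosen for illustration.

```python
from math import log, sqrt

def kl(p, q):
    """Discrete D_KL(p || q) in nats, skipping terms with p(x) = 0."""
    return sum(px * log(px / qx) for px, qx in zip(p, q) if px > 0)

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

Q = [0.5, 0.3, 0.2]
results = []
for n in (1, 10, 100, 1000):
    eps = 0.2 / n
    P_n = [0.5 + eps, 0.3 - eps, 0.2]   # a sequence approaching Q
    results.append((kl(P_n, Q), total_variation(P_n, Q)))

for d, tv in results:
    print(f"KL = {d:.3e}, TV = {tv:.3e}, Pinsker bound = {sqrt(d / 2):.3e}")
```

In every row the total variation distance stays below the Pinsker bound, so driving the relative entropy to zero forces the total variation distance to zero as well.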

Fisher information metric

Relative entropy is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] are both parameterized by some (possibly multi-dimensional) parameter [math]\displaystyle{ \theta }[/math]. Consider then two nearby values of [math]\displaystyle{ P = P(\theta) }[/math] and [math]\displaystyle{ Q = P(\theta_0) }[/math] so that the parameter [math]\displaystyle{ \theta }[/math] differs by only a small amount from the parameter value [math]\displaystyle{ \theta_0 }[/math]. Specifically, up to first order one has (using the Einstein summation convention)

[math]\displaystyle{ P(\theta) = P(\theta_0) + \Delta\theta_j \, P_j(\theta_0) + \cdots }[/math]

with [math]\displaystyle{ \Delta\theta_j = (\theta - \theta_0)_j }[/math] a small change of [math]\displaystyle{ \theta }[/math] in the [math]\displaystyle{ j }[/math] direction, and [math]\displaystyle{ P_j\left(\theta_0\right) = \frac{\partial P}{\partial \theta_j}(\theta_0) }[/math] the corresponding rate of change in the probability distribution. Since relative entropy has an absolute minimum 0 for [math]\displaystyle{ P = Q }[/math], i.e. [math]\displaystyle{ \theta = \theta_0 }[/math], it changes only to second order in the small parameters [math]\displaystyle{ \Delta\theta_j }[/math]. More formally, as for any minimum, the first derivatives of the divergence vanish

[math]\displaystyle{ \left.\frac{\partial}{\partial\theta_j}\right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0)) = 0, }[/math]

and by the Taylor expansion one has up to second order

[math]\displaystyle{ D_\text{KL}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2} \, \Delta\theta_j \, \Delta\theta_k \, g_{jk}(\theta_0) + \cdots }[/math]

where the Hessian matrix of the divergence

[math]\displaystyle{ g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial\theta_j\, \partial\theta_k} \right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0)) }[/math]

must be positive semidefinite. Letting [math]\displaystyle{ \theta_0 }[/math] vary (and dropping the subindex 0) the Hessian [math]\displaystyle{ g_{jk}(\theta) }[/math] defines a (possibly degenerate) Riemannian metric on the θ parameter space, called the Fisher information metric.
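
The second-order behaviour can be illustrated with a one-parameter family. The Python sketch below uses the Bernoulli family as an assumed example, for which the Fisher information at θ is 1/(θ(1 − θ)), and compares the exact relative entropy with the quadratic approximation for shrinking parameter offsets.

```python
from math import log

def kl_bernoulli(a, b):
    """D_KL(Bernoulli(a) || Bernoulli(b)) in nats, for 0 < a, b < 1."""
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

theta0 = 0.3
fisher = 1.0 / (theta0 * (1.0 - theta0))   # Fisher information of Bernoulli(theta0)

ratios = []
for delta in (1e-1, 1e-2, 1e-3):
    exact = kl_bernoulli(theta0 + delta, theta0)
    quadratic = 0.5 * fisher * delta ** 2   # (1/2) * delta^2 * g(theta0)
    ratios.append(exact / quadratic)

print(ratios)  # the ratios approach 1 as delta shrinks
```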

Fisher information metric theorem

When [math]\displaystyle{ p(x, \rho) }[/math] satisfies the following regularity conditions:

[math]\displaystyle{ \frac{\partial \log(p)}{\partial \rho}, \frac{\partial^2 \log(p)}{\partial \rho^2}, \frac{\partial^3 \log(p)}{\partial \rho^3} }[/math] exist,
[math]\displaystyle{ \begin{align} \left|\frac{\partial p}{\partial \rho}\right| &\lt F(x): \int_{x=0}^\infty F(x)\,dx \lt \infty, \\ \left|\frac{\partial^2 p}{\partial \rho^2}\right| &\lt G(x): \int_{x=0}^\infty G(x)\,dx \lt \infty \\ \left|\frac{\partial^3 \log(p)}{\partial \rho^3}\right| &\lt H(x): \int_{x=0}^\infty p(x, 0)H(x)\,dx \lt \xi \lt \infty \end{align} }[/math]

where ξ is independent of ρ

[math]\displaystyle{ \left.\int_{x=0}^\infty \frac{\partial p(x, \rho)}{\partial \rho}\right|_{\rho=0}\, dx = \left.\int_{x=0}^\infty \frac{\partial^2 p(x, \rho)}{\partial \rho^2}\right|_{\rho=0}\, dx = 0 }[/math]

then:

[math]\displaystyle{ \mathcal{D}(p(x, 0) \parallel p(x, \rho)) = \frac{c\rho^2}{2} + \mathcal{O}\left(\rho^3\right) \text{ as } \rho \to 0. }[/math]

Variation of information

Another information-theoretic metric is variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.

Relation to other quantities of information theory

Many of the other quantities of information theory can be interpreted as applications of relative entropy to specific cases.

Self-information

Main page: Information content

The self-information, also known as the information content of a signal, random variable, or event is defined as the negative logarithm of the probability of the given outcome occurring.

When applied to a discrete random variable, the self-information can be represented as

[math]\displaystyle{ \operatorname{I}(m) = D_\text{KL}\left(\delta_{im} \parallel \{p_i\}\right), }[/math]

which is the relative entropy of the probability distribution [math]\displaystyle{ P(i) }[/math] from a Kronecker delta representing certainty that [math]\displaystyle{ i = m }[/math], i.e. the number of extra bits that must be transmitted to identify [math]\displaystyle{ i }[/math] if only the probability distribution [math]\displaystyle{ P(i) }[/math] is available to the receiver, not the fact that [math]\displaystyle{ i = m }[/math].
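
A short Python check of this identity, using a made-up distribution and outcome index (both assumptions for illustration):

```python
from math import log2

p = [0.5, 0.25, 0.125, 0.125]   # illustrative distribution {p_i}
m = 2                            # the outcome known to have occurred

delta = [1.0 if i == m else 0.0 for i in range(len(p))]
# Terms with delta_i = 0 contribute nothing, so only the term i = m survives.
kl = sum(d * log2(d / pi) for d, pi in zip(delta, p) if d > 0)

print(kl, -log2(p[m]))  # 3.0 3.0
```

Both expressions give 3 bits, the self-information of an outcome with probability 1/8.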

Mutual information

The mutual information,

[math]\displaystyle{ \begin{align} \operatorname{I}(X; Y) &= D_\text{KL}(P(X, Y) \parallel P(X)P(Y)) \\[5pt] &= \operatorname{E}_X \{D_\text{KL}(P(Y \mid X) \parallel P(Y))\} \\[5pt] &= \operatorname{E}_Y \{D_\text{KL}(P(X \mid Y) \parallel P(X))\} \end{align} }[/math]

is the relative entropy of the product [math]\displaystyle{ P(X)P(Y) }[/math] of the two marginal probability distributions from the joint probability distribution [math]\displaystyle{ P(X,Y) }[/math] — i.e. the expected number of extra bits that must be transmitted to identify [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability [math]\displaystyle{ P(X,Y) }[/math] is known, it is the expected number of extra bits that must on average be sent to identify [math]\displaystyle{ Y }[/math] if the value of [math]\displaystyle{ X }[/math] is not already known to the receiver.
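
The first two expressions for mutual information can be compared on a small joint distribution. The Python sketch below uses a made-up 2×2 joint table; the values and variable names are assumptions for illustration.

```python
from math import log2

# Illustrative joint distribution P(X, Y); rows index X, columns index Y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

px = [sum(row) for row in joint]         # marginal distribution of X
py = [sum(col) for col in zip(*joint)]   # marginal distribution of Y

# I(X;Y) as D_KL(P(X,Y) || P(X)P(Y)).
mi_joint = sum(joint[x][y] * log2(joint[x][y] / (px[x] * py[y]))
               for x in range(2) for y in range(2))

# I(X;Y) as the expectation over X of D_KL(P(Y|X) || P(Y)).
mi_cond = 0.0
for x in range(2):
    cond = [joint[x][y] / px[x] for y in range(2)]   # P(Y | X = x)
    mi_cond += px[x] * sum(c * log2(c / py[y]) for y, c in enumerate(cond))

print(mi_joint, mi_cond)  # both about 0.278 bits
```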

Shannon entropy

The Shannon entropy,

[math]\displaystyle{ \begin{align} \Eta(X) &= \operatorname{E}\left[\operatorname{I}_X(x)\right] \\ &= \log(N) - D_\text{KL}\left(p_X(x) \parallel P_U(X)\right) \end{align} }[/math]

is the number of bits which would have to be transmitted to identify [math]\displaystyle{ X }[/math] from [math]\displaystyle{ N }[/math] equally likely possibilities, less the relative entropy of the uniform distribution on the random variates of [math]\displaystyle{ X }[/math], [math]\displaystyle{ P_U(X) }[/math], from the true distribution [math]\displaystyle{ P(X) }[/math] — i.e. less the expected number of bits saved, which would have had to be sent if the value of [math]\displaystyle{ X }[/math] were coded according to the uniform distribution [math]\displaystyle{ P_U(X) }[/math] rather than the true distribution [math]\displaystyle{ P(X) }[/math].
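
A short sketch of this relation, assuming a hypothetical four-outcome distribution `p`, compares the entropy computed from its definition with the value obtained from the divergence to the uniform distribution:

```python
import math

p = [0.5, 0.25, 0.125, 0.125]      # hypothetical distribution of X
n = len(p)
uniform = [1.0 / n] * n

entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)
kl_to_uniform = sum(pi * math.log2(pi / ui) for pi, ui in zip(p, uniform) if pi > 0)

print(entropy)                        # 1.75 bits
print(math.log2(n) - kl_to_uniform)   # log2(4) - 0.25 = 1.75 bits
```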

Conditional entropy

The conditional entropy[24],

[math]\displaystyle{ \begin{align} \Eta(X \mid Y) &= \log(N) - D_\text{KL}(P(X, Y) \parallel P_U(X) P(Y)) \\[5pt] &= \log(N) - D_\text{KL}(P(X, Y) \parallel P(X) P(Y)) - D_\text{KL}(P(X) \parallel P_U(X)) \\[5pt] &= \Eta(X) - \operatorname{I}(X; Y) \\[5pt] &= \log(N) - \operatorname{E}_Y \left[D_\text{KL}\left(P\left(X \mid Y\right) \parallel P_U(X)\right)\right] \end{align} }[/math]

is the number of bits which would have to be transmitted to identify [math]\displaystyle{ X }[/math] from [math]\displaystyle{ N }[/math] equally likely possibilities, less the relative entropy of the product distribution [math]\displaystyle{ P_U(X) P(Y) }[/math] from the true joint distribution [math]\displaystyle{ P(X,Y) }[/math] — i.e. less the expected number of bits saved which would have had to be sent if the value of [math]\displaystyle{ X }[/math] were coded according to the uniform distribution [math]\displaystyle{ P_U(X) }[/math] rather than the conditional distribution [math]\displaystyle{ P(X|Y) }[/math] of [math]\displaystyle{ X }[/math] given [math]\displaystyle{ Y }[/math].
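
The chain of equalities above can be illustrated on a small example. This sketch assumes a hypothetical 2×2 joint table `joint` and evaluates the conditional entropy in three ways: from its direct definition, from the divergence to [math]\displaystyle{ P_U(X)P(Y) }[/math], and as entropy minus mutual information.

```python
import math

joint = [[0.4, 0.1],
         [0.1, 0.4]]   # hypothetical P(X, Y); rows index x, columns index y
n = len(joint)         # number of values X can take
p_x = [sum(row) for row in joint]
p_y = [sum(joint[x][y] for x in range(n)) for y in range(len(joint[0]))]
cells = [(x, y) for x in range(n) for y in range(len(p_y)) if joint[x][y] > 0]

# Direct definition: H(X | Y) = -sum p(x, y) log2 p(x | y)
h_direct = -sum(joint[x][y] * math.log2(joint[x][y] / p_y[y]) for x, y in cells)

# log(N) - D_KL(P(X, Y) || P_U(X) P(Y))
kl_uniform = sum(joint[x][y] * math.log2(joint[x][y] / (p_y[y] / n)) for x, y in cells)
h_via_kl = math.log2(n) - kl_uniform

# H(X) - I(X; Y)
h_x = -sum(px * math.log2(px) for px in p_x if px > 0)
mi = sum(joint[x][y] * math.log2(joint[x][y] / (p_x[x] * p_y[y])) for x, y in cells)
h_via_mi = h_x - mi

print(h_direct, h_via_kl, h_via_mi)   # all three agree (about 0.722 bits)
```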

Cross entropy

When we have a set of possible events, coming from the distribution p, we can encode them (with a lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g. the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)). If we know the distribution p in advance, we can devise an encoding that is optimal (e.g. using Huffman coding), meaning that the messages we encode have the shortest possible length on average (assuming the encoded events are sampled from p). In the ideal case, as in the example just given, this average length equals the Shannon entropy of p (denoted as [math]\displaystyle{ \Eta(p) }[/math]). However, if we use a different probability distribution (q) when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities. This new (larger) number is measured by the cross entropy between p and q.

The cross entropy between two probability distributions (p and q) measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p. The cross entropy for two distributions p and q over the same probability space is thus defined as follows.

[math]\displaystyle{ \Eta(p, q) = \operatorname{E}_p[-\log(q)] = \Eta(p) + D_\text{KL}(p \parallel q). }[/math]

For explicit derivation of this, see the Motivation section above.

Under this scenario, the relative entropy (KL divergence) can be interpreted as the extra number of bits, on average, that are needed (beyond [math]\displaystyle{ \Eta(p) }[/math]) for encoding the events because q, rather than p, was used to construct the encoding scheme.
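
As a worked sketch of the decomposition, take the three-event example p = (1/2, 1/4, 1/4) and a mismatched model q. The particular q below is a hypothetical choice for illustration, and code lengths are taken to be the ideal values [math]\displaystyle{ -\log_2 }[/math] of the design probability.

```python
import math

p = [0.5, 0.25, 0.25]   # true distribution of the events (A, B, C)
q = [0.25, 0.25, 0.5]   # hypothetical mismatched model used to build the code

entropy = -sum(pi * math.log2(pi) for pi in p)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

# Ideal code lengths are -log2 of the probability the code was designed for.
lengths_p = [-math.log2(pi) for pi in p]   # [1, 2, 2], e.g. the codewords 0, 10, 11
lengths_q = [-math.log2(qi) for qi in q]   # [2, 2, 1]
avg_len_p = sum(pi * l for pi, l in zip(p, lengths_p))
avg_len_q = sum(pi * l for pi, l in zip(p, lengths_q))

print(entropy, cross_entropy, kl)   # 1.5 1.75 0.25
```

Here the code built for q spends 1.75 bits per event on average instead of 1.5, and the quarter-bit excess is exactly the relative entropy.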

Bayesian updating

In Bayesian statistics, relative entropy can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: [math]\displaystyle{ p(x) \to p(x\mid I) }[/math]. If some new fact [math]\displaystyle{ Y = y }[/math] is discovered, it can be used to update the posterior distribution for [math]\displaystyle{ X }[/math] from [math]\displaystyle{ p(x\mid I) }[/math] to a new posterior distribution [math]\displaystyle{ p(x\mid y,I) }[/math] using Bayes' theorem:

[math]\displaystyle{ p(x \mid y, I) = \frac{p(y \mid x, I) p(x \mid I)}{p(y \mid I)} }[/math]

This distribution has a new entropy:

[math]\displaystyle{ \Eta\big(p(x \mid y, I)\big) = -\sum_x p(x \mid y, I) \log p(x \mid y, I), }[/math]

which may be less than or greater than the original entropy [math]\displaystyle{ \Eta(p(x\mid I)) }[/math]. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on [math]\displaystyle{ p(x\mid I) }[/math] instead of a new code based on [math]\displaystyle{ p(x\mid y, I) }[/math] would have added an expected number of bits:

[math]\displaystyle{ D_\text{KL}\big(p(x \mid y, I) \parallel p(x \mid I) \big) = \sum_x p(x \mid y, I) \log\left(\frac{p(x \mid y, I)}{p(x \mid I)}\right) }[/math]

to the message length. This therefore represents the amount of useful information, or information gain, about [math]\displaystyle{ X }[/math], that has been learned by discovering [math]\displaystyle{ Y = y }[/math].
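
A minimal sketch of one such update follows. The two-hypothesis prior and the likelihood values are hypothetical (a fair coin versus a coin that lands heads 90% of the time, after observing one head); the information gain is the relative entropy of the posterior from the prior.

```python
import math

prior = [0.5, 0.5]        # p(x | I): hypothetical prior over two coin types
likelihood = [0.5, 0.9]   # p(y = heads | x, I) for a fair and a biased coin

evidence = sum(l * p for l, p in zip(likelihood, prior))            # p(y | I)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]   # Bayes' theorem

info_gain = sum(
    post * math.log2(post / pri)
    for post, pri in zip(posterior, prior) if post > 0
)
print(posterior, info_gain)   # posterior is about [0.357, 0.643]
```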

If a further piece of data, [math]\displaystyle{ Y_2 = y_2 }[/math], subsequently comes in, the probability distribution for [math]\displaystyle{ x }[/math] can be updated further, to give a new best guess [math]\displaystyle{ p(x \mid y_1, y_2, I) }[/math]. If one reinvestigates the information gain for using [math]\displaystyle{ p(x \mid y_1,I) }[/math] rather than [math]\displaystyle{ p(x \mid I) }[/math], it turns out that it may be either greater or less than previously estimated:

[math]\displaystyle{ \sum_x p(x \mid y_1, y_2, I) \log\left(\frac{p(x \mid y_1, y_2, I)}{p(x \mid I)}\right) }[/math] may be less than, equal to, or greater than [math]\displaystyle{ \sum_x p(x \mid y_1, I) \log\left(\frac{p(x \mid y_1, I)}{p(x \mid I)}\right) }[/math]

and so the combined information gain does not obey the triangle inequality:

[math]\displaystyle{ D_\text{KL} \big( p(x \mid y_1, y_2, I) \parallel p(x \mid I) \big) }[/math] may be less than, equal to, or greater than [math]\displaystyle{ D_\text{KL}\big( p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I)\big) + D_\text{KL}\big(p(x \mid y_1, I) \parallel p(x \mid I)\big) }[/math]

All one can say is that on average, averaging using [math]\displaystyle{ p(y_2 \mid y_1, x, I) }[/math], the two sides will average out.

Bayesian experimental design

A common goal in Bayesian experimental design is to maximise the expected relative entropy between the prior and the posterior.[25] When posteriors are approximated to be Gaussian distributions, a design maximising the expected relative entropy is called Bayes d-optimal.

Discrimination information

Relative entropy [math]\displaystyle{ D_\text{KL}\bigl(p(x \mid H_1) \parallel p(x \mid H_0)\bigr) }[/math] can also be interpreted as the expected discrimination information for [math]\displaystyle{ H_1 }[/math] over [math]\displaystyle{ H_0 }[/math]: the mean information per sample for discriminating in favor of a hypothesis [math]\displaystyle{ H_1 }[/math] against a hypothesis [math]\displaystyle{ H_0 }[/math], when hypothesis [math]\displaystyle{ H_1 }[/math] is true.[26] Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for [math]\displaystyle{ H_1 }[/math] over [math]\displaystyle{ H_0 }[/math] to be expected from each sample.

The expected weight of evidence for [math]\displaystyle{ H_1 }[/math] over [math]\displaystyle{ H_0 }[/math] is not the same as the information gain expected per sample about the probability distribution [math]\displaystyle{ p(H) }[/math] of the hypotheses,

[math]\displaystyle{ D_\text{KL}(p(x \mid H_1) \parallel p(x \mid H_0)) \neq IG = D_\text{KL}(p(H \mid x) \parallel p(H \mid I)). }[/math]

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.

On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination information

The idea of relative entropy as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution [math]\displaystyle{ f }[/math] should be chosen which is as hard to discriminate from the original distribution [math]\displaystyle{ f_0 }[/math] as possible; so that the new data produces as small an information gain [math]\displaystyle{ D_\text{KL}(f \parallel f_0) }[/math] as possible.

For example, if one had a prior distribution [math]\displaystyle{ p(x,a) }[/math] over [math]\displaystyle{ x }[/math] and [math]\displaystyle{ a }[/math], and subsequently learnt the true distribution of [math]\displaystyle{ a }[/math] was [math]\displaystyle{ u(a) }[/math], then the relative entropy between the new joint distribution for [math]\displaystyle{ x }[/math] and [math]\displaystyle{ a }[/math], [math]\displaystyle{ q(x\mid a)u(a) }[/math], and the earlier prior distribution would be:

[math]\displaystyle{ D_\text{KL}(q(x \mid a)u(a) \parallel p(x, a)) = \operatorname{E}_{u(a)}\left\{D_\text{KL}(q(x \mid a) \parallel p(x \mid a))\right\} + D_\text{KL}(u(a) \parallel p(a)), }[/math]

i.e. the sum of the relative entropy of [math]\displaystyle{ p(a) }[/math], the prior distribution for [math]\displaystyle{ a }[/math], from the updated distribution [math]\displaystyle{ u(a) }[/math], plus the expected value (using the probability distribution [math]\displaystyle{ u(a) }[/math]) of the relative entropy of the prior conditional distribution [math]\displaystyle{ p(x\mid a) }[/math] from the new conditional distribution [math]\displaystyle{ q(x\mid a) }[/math]. (Note that the latter expected value is often called the conditional relative entropy (or conditional Kullback–Leibler divergence) and denoted by [math]\displaystyle{ D_\text{KL}(q(x\mid a) \parallel p(x\mid a)) }[/math].[3][24]:p. 22) This is minimized if [math]\displaystyle{ q(x\mid a)=p(x\mid a) }[/math] over the whole support of [math]\displaystyle{ u(a) }[/math]; and we note that this result incorporates Bayes' theorem, if the new distribution [math]\displaystyle{ u(a) }[/math] is in fact a δ function representing certainty that [math]\displaystyle{ a }[/math] has one particular value.
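
The decomposition and its minimizer can be checked numerically. In this sketch every table (`p_joint`, `u_a`, `q_x_given_a`) is a hypothetical choice made for illustration; the code verifies that the joint divergence splits into the two terms, and that setting [math]\displaystyle{ q(x\mid a) = p(x\mid a) }[/math] leaves only [math]\displaystyle{ D_\text{KL}(u(a) \parallel p(a)) }[/math].

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical prior joint p(x, a); rows index a, columns index x.
p_joint = [[0.3, 0.2],
           [0.1, 0.4]]
p_a = [sum(row) for row in p_joint]
p_x_given_a = [[v / p_a[a] for v in row] for a, row in enumerate(p_joint)]

u_a = [0.8, 0.2]                      # newly learnt distribution of a
q_x_given_a = [[0.5, 0.5],            # an arbitrary candidate conditional
               [0.3, 0.7]]

def joint_kl(q_cond):
    """D_KL(q(x|a) u(a) || p(x, a)), computed on the flattened joint."""
    new = [u_a[a] * q_cond[a][x] for a in range(2) for x in range(2)]
    old = [p_joint[a][x] for a in range(2) for x in range(2)]
    return kl(new, old)

expected_cond_kl = sum(u_a[a] * kl(q_x_given_a[a], p_x_given_a[a]) for a in range(2))
lhs = joint_kl(q_x_given_a)
rhs = expected_cond_kl + kl(u_a, p_a)

# Choosing q(x|a) = p(x|a) removes the first term, leaving only D_KL(u || p(a)).
minimum = joint_kl(p_x_given_a)
print(lhs, rhs, minimum, kl(u_a, p_a))
```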

MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the relative entropy continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising relative entropy from [math]\displaystyle{ m }[/math] to [math]\displaystyle{ p }[/math] with respect to [math]\displaystyle{ m }[/math] is equivalent to minimizing the cross-entropy of [math]\displaystyle{ p }[/math] and [math]\displaystyle{ m }[/math], since

[math]\displaystyle{ \Eta(p, m) = \Eta(p) + D_\text{KL}(p \parallel m), }[/math]

which is appropriate if one is trying to choose an adequate approximation to [math]\displaystyle{ p }[/math]. However, this is just as often not the task one is trying to achieve. Instead, just as often it is [math]\displaystyle{ m }[/math] that is some fixed prior reference measure, and [math]\displaystyle{ p }[/math] that one is attempting to optimise by minimising [math]\displaystyle{ D_\text{KL}(p \parallel m) }[/math] subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be [math]\displaystyle{ D_\text{KL}(p \parallel m) }[/math], rather than [math]\displaystyle{ \Eta(p,m) }[/math].

Relationship to available work

Pressure versus volume plot of available work from a mole of argon gas relative to ambient, calculated as [math]\displaystyle{ T_o }[/math] times the Kullback–Leibler divergence.

Surprisals[27] add where probabilities multiply. The surprisal for an event of probability [math]\displaystyle{ p }[/math] is defined as [math]\displaystyle{ s = k \ln(1 / p) }[/math]. If [math]\displaystyle{ k }[/math] is [math]\displaystyle{ 1 }[/math], [math]\displaystyle{ 1/\ln 2 }[/math], or [math]\displaystyle{ 1.38 \times 10^{-23} }[/math], then surprisal is in nats, bits, or [math]\displaystyle{ J/K }[/math] respectively, so that, for instance, there are [math]\displaystyle{ N }[/math] bits of surprisal for landing all "heads" on a toss of [math]\displaystyle{ N }[/math] coins.
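
The coin example can be reproduced in a few lines. This is a sketch only; the function name `surprisal` and the choice of ten coins are assumptions made for illustration.

```python
import math

def surprisal(p, k=1.0):
    """s = k * ln(1/p); k = 1 gives nats, k = 1/ln 2 gives bits."""
    return k * math.log(1.0 / p)

n_coins = 10                       # hypothetical number of coins
p_all_heads = 0.5 ** n_coins       # probabilities multiply
bits = surprisal(p_all_heads, k=1.0 / math.log(2))
per_coin = surprisal(0.5, k=1.0 / math.log(2))

print(bits, n_coins * per_coin)    # both are 10 bits, up to rounding
```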

Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal [math]\displaystyle{ S }[/math] (entropy) for a given set of control parameters (like pressure [math]\displaystyle{ P }[/math] or volume [math]\displaystyle{ V }[/math]). This constrained entropy maximization, both classically[28] and quantum mechanically,[29] minimizes Gibbs availability in entropy units[30] [math]\displaystyle{ A \equiv -k\ln(Z) }[/math] where [math]\displaystyle{ Z }[/math] is a constrained multiplicity or partition function.

When temperature [math]\displaystyle{ T }[/math] is fixed, free energy ([math]\displaystyle{ T \times A }[/math]) is also minimized. Thus if [math]\displaystyle{ T, V }[/math] and number of molecules [math]\displaystyle{ N }[/math] are constant, the Helmholtz free energy [math]\displaystyle{ F \equiv U - TS }[/math] (where [math]\displaystyle{ U }[/math] is energy and [math]\displaystyle{ S }[/math] is entropy) is minimized as a system "equilibrates." If [math]\displaystyle{ T }[/math] and [math]\displaystyle{ P }[/math] are held constant (say during processes in your body), the Gibbs free energy [math]\displaystyle{ G = U + PV - TS }[/math] is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature [math]\displaystyle{ T_o }[/math] and pressure [math]\displaystyle{ P_o }[/math] is [math]\displaystyle{ W = \Delta G = NkT_o\Theta(V/V_o) }[/math] where [math]\displaystyle{ V_o = NkT_o/P_o }[/math] and [math]\displaystyle{ \Theta(x) = x - 1 - \ln x \ge 0 }[/math] (see also Gibbs inequality).

More generally[31] the work available relative to some ambient is obtained by multiplying ambient temperature [math]\displaystyle{ T_o }[/math] by relative entropy or net surprisal [math]\displaystyle{ \Delta I \ge 0, }[/math] defined as the average value of [math]\displaystyle{ k\ln(p/p_o) }[/math] where [math]\displaystyle{ p_o }[/math] is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of [math]\displaystyle{ V_o }[/math] and [math]\displaystyle{ T_o }[/math] is thus [math]\displaystyle{ W = T_o \Delta I }[/math], where relative entropy

[math]\displaystyle{ \Delta I = Nk\left[\Theta\left(\frac{V}{V_o}\right) + \frac{3}{2}\Theta\left(\frac{T}{T_o}\right)\right]. }[/math]

The resulting contours of constant relative entropy, plotted in the figure for a mole of argon at standard temperature and pressure, put limits, for example, on the conversion of hot to cold, as in flame-powered air-conditioning or in an unpowered device that converts boiling water to ice water.[32] Thus relative entropy measures thermodynamic availability in bits.
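
The expression for [math]\displaystyle{ \Delta I }[/math] is straightforward to evaluate. The sketch below assumes one mole of a monatomic ideal gas and a hypothetical state (twice the ambient volume, at an ambient temperature of 298.15 K); the function names are illustrative.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
N_A = 6.02214076e23    # particles in one mole

def theta(x):
    """Theta(x) = x - 1 - ln(x), non-negative for x > 0 and zero at x = 1."""
    return x - 1.0 - math.log(x)

def available_work(v_ratio, t_ratio, t_ambient, n=N_A):
    """W = T_o * Delta I for a monatomic ideal gas, in joules."""
    delta_i = n * K_B * (theta(v_ratio) + 1.5 * theta(t_ratio))
    return t_ambient * delta_i

# Hypothetical state: twice the ambient volume, at ambient temperature (298.15 K).
work = available_work(v_ratio=2.0, t_ratio=1.0, t_ambient=298.15)
print(work)   # roughly 761 J
```

At [math]\displaystyle{ V = V_o }[/math] and [math]\displaystyle{ T = T_o }[/math] both [math]\displaystyle{ \Theta }[/math] terms vanish, so no work is available from a gas already in equilibrium with its surroundings.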

Quantum information theory

For density matrices [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] on a Hilbert space, the quantum relative entropy from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is defined to be

[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \operatorname{Tr}(P(\log(P) - \log(Q))). }[/math]

In quantum information science the minimum of [math]\displaystyle{ D_\text{KL}(P\parallel Q) }[/math] over all separable states [math]\displaystyle{ Q }[/math] can also be used as a measure of entanglement in the state [math]\displaystyle{ P }[/math].
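
To make the trace formula concrete, here is a small sketch restricted, as a simplifying assumption, to real symmetric, full-rank 2×2 density matrices, for which the matrix logarithm has a closed form. The helper names and the example matrices are hypothetical, and the result is in nats.

```python
import math

def sym2x2_log(m):
    """Matrix logarithm of a real symmetric positive-definite 2x2 matrix."""
    (a, b), (_, d) = m
    theta = 0.5 * math.atan2(2 * b, a - d)       # rotation that diagonalizes m
    c, s = math.cos(theta), math.sin(theta)
    mean, radius = (a + d) / 2, math.hypot((a - d) / 2, b)
    l1, l2 = math.log(mean + radius), math.log(mean - radius)
    # log(m) = l1 * v1 v1^T + l2 * v2 v2^T with v1 = (c, s), v2 = (-s, c)
    off = (l1 - l2) * c * s
    return [[l1 * c * c + l2 * s * s, off],
            [off, l1 * s * s + l2 * c * c]]

def quantum_relative_entropy(p, q):
    """Tr(P (log P - log Q)) in nats."""
    lp, lq = sym2x2_log(p), sym2x2_log(q)
    diff = [[lp[i][j] - lq[i][j] for j in range(2)] for i in range(2)]
    return sum(p[i][j] * diff[j][i] for i in range(2) for j in range(2))

P = [[0.75, 0.25],
     [0.25, 0.25]]     # hypothetical density matrix (unit trace, positive definite)
Q = [[0.5, 0.0],
     [0.0, 0.5]]       # maximally mixed state

print(quantum_relative_entropy(P, Q))   # positive
print(quantum_relative_entropy(P, P))   # zero, up to rounding
```

When both matrices are diagonal the expression reduces to the classical relative entropy of their diagonals.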

Relationship between models and reality

Just as relative entropy of "actual from ambient" measures thermodynamic availability, relative entropy of "reality from a model" is also useful even if the only clues we have about reality are some experimental measurements. In the former case relative entropy describes distance to equilibrium or (when multiplied by ambient temperature) the amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, how much the model has yet to learn.

Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to selecting a statistical model via the Akaike information criterion is particularly well described in papers[33] and a book[34] by Burnham and Anderson. In a nutshell, the relative entropy of reality from a model may be estimated, to within a constant additive term, by a function of the deviations observed between data and the model's predictions (like the mean squared deviation). Estimates of such divergence for models that share the same additive term can in turn be used to select among models.

When trying to fit parametrized models to data there are various estimators which attempt to minimize relative entropy, such as maximum likelihood and maximum spacing estimators.
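
For maximum likelihood, the connection can be seen in a toy case. The sketch below assumes hypothetical coin-flip data and a Bernoulli model searched over a coarse grid; the parameter that minimizes the relative entropy of the empirical distribution from the model coincides with the one that maximizes the likelihood.

```python
import math

data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # hypothetical coin flips (7 heads, 3 tails)
heads = sum(data) / len(data)
empirical = [1.0 - heads, heads]

def kl(p, q):
    """D_KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_likelihood(theta):
    """Log-likelihood of the data under a Bernoulli(theta) model."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

grid = [i / 100 for i in range(1, 100)]   # candidate Bernoulli parameters
theta_min_kl = min(grid, key=lambda t: kl(empirical, [1.0 - t, t]))
theta_max_ll = max(grid, key=log_likelihood)

print(theta_min_kl, theta_max_ll)   # both pick 0.7, the sample mean
```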

Symmetrised divergence

Kullback & Leibler (1951) also considered the symmetrized function:[6]

[math]\displaystyle{ D_\text{KL}(P \parallel Q) + D_\text{KL}(Q \parallel P) }[/math]

which they referred to as the "divergence", though today the "KL divergence" refers to the asymmetric function (see § Etymology for the evolution of the term). This function is symmetric and nonnegative, and had already been defined and used by Harold Jeffreys in 1948;[7] it is accordingly called the Jeffreys divergence.

This quantity has sometimes been used for feature selection in classification problems, where [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] are the conditional probability density functions of a feature under two different classes. In the banking and finance industries, this quantity is referred to as the Population Stability Index (PSI), and is used to assess distributional shifts in model features through time.
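
As a sketch of this correspondence, the sum of the two directed divergences can be compared with the single-sum form [math]\displaystyle{ \sum_i (a_i - e_i)\ln(a_i/e_i) }[/math] in which PSI is commonly written. The binned distributions `expected` and `actual` are hypothetical, and natural logarithms are assumed.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; assumes all entries are strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical binned feature distributions at two points in time.
expected = [0.25, 0.35, 0.25, 0.15]
actual = [0.20, 0.30, 0.30, 0.20]

jeffreys = kl(actual, expected) + kl(expected, actual)
psi = sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

print(jeffreys, psi)   # equal up to rounding, and unchanged if the roles are swapped
```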

An alternative is given via the [math]\displaystyle{ \lambda }[/math] divergence,

[math]\displaystyle{ D_\lambda(P \parallel Q) = \lambda D_\text{KL}(P \parallel \lambda P + (1 - \lambda)Q) + (1 - \lambda) D_\text{KL}(Q \parallel \lambda P + (1 - \lambda)Q), }[/math]

which can be interpreted as the expected information gain about [math]\displaystyle{ X }[/math] from discovering which probability distribution [math]\displaystyle{ X }[/math] is drawn from, [math]\displaystyle{ P }[/math] or [math]\displaystyle{ Q }[/math], if they currently have probabilities [math]\displaystyle{ \lambda }[/math] and [math]\displaystyle{ 1-\lambda }[/math] respectively.

The value [math]\displaystyle{ \lambda = 0.5 }[/math] gives the Jensen–Shannon divergence, defined by

[math]\displaystyle{ D_\text{JS} = \frac{1}{2} D_\text{KL} (P \parallel M) + \frac{1}{2} D_\text{KL}(Q \parallel M) }[/math]

where [math]\displaystyle{ M }[/math] is the average of the two distributions,

[math]\displaystyle{ M = \frac{1}{2}(P + Q). }[/math]

[math]\displaystyle{ D_{JS} }[/math] can also be interpreted as the capacity of a noisy information channel with two inputs giving the output distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math]. The Jensen–Shannon divergence, like all f-divergences, is locally proportional to the Fisher information metric. It is similar to the Hellinger metric (in the sense that it induces the same affine connection on a statistical manifold).

Furthermore, the Jensen–Shannon divergence can be generalized using abstract statistical M-mixtures relying on an abstract mean M.[35][36]
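
A minimal sketch of the Jensen–Shannon divergence for discrete distributions is given below. The function names and example distributions are assumptions made for illustration, and base-2 logarithms are used so the result is in bits.

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits; terms with p[i] == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # mixture M = (P + Q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.9, 0.1, 0.0]   # hypothetical distributions; note the zero entries,
q = [0.0, 0.2, 0.8]   # for which D_KL(P || Q) itself would be infinite

print(jensen_shannon(p, q), jensen_shannon(q, p))   # symmetric and finite
print(jensen_shannon([1.0, 0.0], [0.0, 1.0]))       # 1.0 bit for disjoint supports
```

Because the mixture is positive wherever either distribution is, the divergence stays finite even when the two supports differ.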

Relationship to other probability-distance measures

There are many other important measures of probability distance. Some of these are particularly connected with relative entropy. For example:

  • The total variation distance, [math]\displaystyle{ \delta(P,Q) }[/math]. This is connected to the divergence through Pinsker's inequality: [math]\displaystyle{ \delta(P, Q) \le \sqrt{\frac{1}{2} D_\text{KL}(P \parallel Q)}, }[/math] with the divergence measured in nats. Pinsker's inequality is vacuous for any distributions where [math]\displaystyle{ D_{\mathrm{KL}}(P\parallel Q) > 2 }[/math], since the total variation distance is at most [math]\displaystyle{ 1 }[/math]. For such distributions, an alternative bound can be used, due to Bretagnolle and Huber[37] (see also Tsybakov[38]): [math]\displaystyle{ \delta(P,Q) \le \sqrt{1-e^{ -D_{\mathrm{KL}}(P\parallel Q) }}. }[/math]
  • The family of Rényi divergences generalize relative entropy. Depending on the value of a certain parameter, [math]\displaystyle{ \alpha }[/math], various inequalities may be deduced.
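
The two bounds on total variation distance can be compared numerically. In this sketch the distribution pairs are hypothetical, and the divergence is computed with natural logarithms (nats), as both bounds require.

```python
import math

def kl_nats(p, q):
    """D_KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

pairs = [
    ([0.6, 0.4], [0.5, 0.5]),       # hypothetical pair with small divergence
    ([0.9, 0.1], [0.1, 0.9]),
    ([0.99, 0.01], [0.01, 0.99]),   # divergence above 2 nats
]
results = []
for p, q in pairs:
    d = kl_nats(p, q)
    tv = total_variation(p, q)
    pinsker = math.sqrt(d / 2)
    bretagnolle_huber = math.sqrt(1 - math.exp(-d))
    results.append((tv, pinsker, bretagnolle_huber))
    print(f"KL={d:.3f}  TV={tv:.3f}  Pinsker={pinsker:.3f}  BH={bretagnolle_huber:.3f}")
```

For the last pair the Pinsker bound exceeds 1 and so says nothing, while the Bretagnolle–Huber bound remains below 1.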

Other notable measures of distance include the Hellinger distance, histogram intersection, Chi-squared statistic, quadratic form distance, match distance, Kolmogorov–Smirnov distance, and earth mover's distance.[39]

Data differencing

Main page: Data differencing

Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical background for data differencing – the absolute entropy of a set of data in this sense being the data required to reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target given the source (minimum size of a patch).


References

  1. Csiszar, I (February 1975). "I-Divergence Geometry of Probability Distributions and Minimization Problems". Ann. Probab. 3 (1): 146–158. doi:10.1214/aop/1176996454. https://projecteuclid.org/journals/annals-of-probability/volume-3/issue-1/I-Divergence-Geometry-of-Probability-Distributions-and-Minimization-Problems/10.1214/aop/1176996454.full. 
  2. Kullback, S.; Leibler, R.A. (1951). "On information and sufficiency". Annals of Mathematical Statistics 22 (1): 79–86. doi:10.1214/aoms/1177729694. 
  3. Kullback, S. (1959), Information Theory and Statistics, John Wiley & Sons. Republished by Dover Publications in 1968; reprinted in 1978: ISBN 0-8446-5625-9.
  4. Amari 2016, p. 11.
  5. Amari 2016, p. 28.
  6. Kullback & Leibler 1951, p. 80.
  7. Jeffreys 1948, p. 158.
  8. Kullback 1959, p. 7.
  9. "Letter to the Editor: The Kullback–Leibler distance". The American Statistician 41 (4): 340–341. 1987. doi:10.1080/00031305.1987.10475510. 
  10. Kullback 1959, p. 6.
  11. MacKay, David J.C. (2003). Information Theory, Inference, and Learning Algorithms (First ed.). Cambridge University Press. p. 34. ISBN 9780521642989. https://books.google.com/books?id=AKuMj4PN_EMC&q=%22Kullback%E2%80%93Leibler+divergence%22. 
  12. "machine learning - What's the maximum value of Kullback-Leibler (KL) divergence". https://stats.stackexchange.com/questions/351947/whats-the-maximum-value-of-kullback-leibler-kl-divergence. 
  13. "integration - In what situations is the integral equal to infinity?". https://math.stackexchange.com/questions/20961/in-what-situations-is-the-integral-equal-to-infinity. 
  14. Bishop C. (2006). Pattern Recognition and Machine Learning
  15. Burnham, K. P.; Anderson, D. R. (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer. p. 51. ISBN 9780387953649. https://archive.org/details/modelselectionmu0000burn. 
  16. Hobson, Arthur (1971). Concepts in statistical mechanics. New York: Gordon and Breach. ISBN 978-0677032405. 
  17. Sanov, I.N. (1957). "On the probability of large deviations of random magnitudes". Mat. Sbornik 42 (84): 11–44. 
  18. Novak S.Y. (2011), Extreme Value Methods with Applications to Finance ch. 14.5 (Chapman & Hall). ISBN 978-1-4398-3574-6.
  19. arXiv:2008.05932.
  20. See the section "differential entropy – 4" in Relative Entropy video lecture by Sergio Verdú NIPS 2009
  21. Donsker, Monroe D.; Varadhan, SR Srinivasa (1983). "Asymptotic evaluation of certain Markov process expectations for large time. IV.". Communications on Pure and Applied Mathematics 36 (2): 183–212. doi:10.1002/cpa.3160360204. 
  22. Lee, Se Yoon (2021). "Gibbs sampler and coordinate ascent variational inference: A set-theoretical review". Communications in Statistics - Theory and Methods. doi:10.1080/03610926.2021.1921214. 
  23. Duchi J., "Derivations for Linear Algebra and Optimization".
  24. Cover, Thomas M.; Thomas, Joy A. (1991), Elements of Information Theory, John Wiley & Sons 
  25. Chaloner, K.; Verdinelli, I. (1995). "Bayesian experimental design: a review". Statistical Science 10 (3): 273–304. doi:10.1214/ss/1177009939. 
  26. Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. (2007). "Section 14.7.2. Kullback–Leibler Distance". Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge University Press. ISBN 978-0-521-88068-8. http://apps.nrbook.com/empanel/index.html#pg=756. 
  27. Myron Tribus (1961), Thermodynamics and Thermostatics (D. Van Nostrand, New York)
  28. Jaynes, E. T. (1957). "Information theory and statistical mechanics". Physical Review 106 (4): 620–630. doi:10.1103/physrev.106.620. Bibcode: 1957PhRv..106..620J. http://bayes.wustl.edu/etj/articles/theory.1.pdf. 
  29. Jaynes, E. T. (1957). "Information theory and statistical mechanics II". Physical Review 108 (2): 171–190. doi:10.1103/physrev.108.171. Bibcode: 1957PhRv..108..171J. http://bayes.wustl.edu/etj/articles/theory.2.pdf. 
  30. J.W. Gibbs (1873), "A method of geometrical representation of thermodynamic properties of substances by means of surfaces", reprinted in The Collected Works of J. W. Gibbs, Volume I Thermodynamics, ed. W. R. Longley and R. G. Van Name (New York: Longmans, Green, 1931) footnote page 52.
  31. Tribus, M.; McIrvine, E. C. (1971). "Energy and information". Scientific American 225 (3): 179–186. doi:10.1038/scientificamerican0971-179. Bibcode: 1971SciAm.225c.179T. 
  32. Fraundorf, P. (2007). "Thermal roots of correlation-based complexity". Complexity 13 (3): 18–26. doi:10.1002/cplx.20195. Bibcode: 2008Cmplx..13c..18F. http://www3.interscience.wiley.com/cgi-bin/abstract/117861985/ABSTRACT. 
  33. Burnham, K.P.; Anderson, D.R. (2001). "Kullback–Leibler information as a basis for strong inference in ecological studies". Wildlife Research 28 (2): 111–119. doi:10.1071/WR99107. 
  34. Burnham, K. P. and Anderson D. R. (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second Edition (Springer Science) ISBN 978-0-387-95364-9.
  35. Nielsen, Frank (2019). "On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means". Entropy 21 (5): 485. doi:10.3390/e21050485. PMID 33267199. Bibcode: 2019Entrp..21..485N. 
  36. Nielsen, Frank (2020). "On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid". Entropy 22 (2): 221. doi:10.3390/e22020221. PMID 33285995. Bibcode: 2020Entrp..22..221N. 
  37. Bretagnolle, J.; Huber, C, Estimation des densités: risque minimax, Séminaire de Probabilités, XII (Univ. Strasbourg, Strasbourg, 1976/1977), pp. 342–363, Lecture Notes in Math., 649, Springer, Berlin, 1978, Lemma 2.1 (French).
  38. Tsybakov, Alexandre B., Introduction to nonparametric estimation, Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer Series in Statistics. Springer, New York, 2009. xii+214 pp. ISBN:978-0-387-79051-0, Equation 2.25.
  39. Rubner, Y.; Tomasi, C.; Guibas, L. J. (2000). "The earth mover's distance as a metric for image retrieval". International Journal of Computer Vision 40 (2): 99–121. doi:10.1023/A:1026543900054. 

External links