Information projection
In information theory, the information projection or I-projection of a probability distribution q onto a set of distributions P is
- [math]\displaystyle{ p^* = \underset{p \in P}{\arg\min} \operatorname{D}_{\mathrm{KL}}(p||q) }[/math].
where [math]\displaystyle{ D_{\mathrm{KL}} }[/math] is the Kullback–Leibler divergence from q to p. Viewing the Kullback–Leibler divergence as a measure of distance, the I-projection [math]\displaystyle{ p^* }[/math] is the "closest" distribution to q of all the distributions in P.
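As a concrete illustration, the sketch below computes the I-projection of a discrete distribution onto the linear (hence convex) family of distributions with a prescribed mean, using a generic constrained optimizer. This is only an illustrative example, not something taken from the references: the alphabet, the reference distribution q, and the mean constraint are arbitrary choices.

```python
# Illustrative sketch: I-projection of q onto P = {p : E_p[f] = c} on a finite alphabet.
import numpy as np
from scipy.optimize import minimize

x = np.arange(5)                          # finite alphabet {0, 1, 2, 3, 4}
q = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # reference distribution q
f, c = x, 3.0                             # constraint defining P: E_p[f] = c

def kl(p, r):
    """D_KL(p || r) on a finite alphabet, with 0*log(0/r) taken as 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / r[mask]))

# Minimize D_KL(p || q) over the probability simplex subject to the mean constraint.
constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
               {"type": "eq", "fun": lambda p: p @ f - c}]
result = minimize(kl, x0=np.full(5, 0.2), args=(q,),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
p_star = result.x                          # the I-projection of q onto P
print("I-projection p*:", np.round(p_star, 4))
print("D_KL(p*||q):", kl(p_star, q))
```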
The I-projection is useful in setting up information geometry, notably because of the following inequality, which holds for every [math]\displaystyle{ p \in P }[/math] when P is convex:[1]
[math]\displaystyle{ \operatorname{D}_{\mathrm{KL}}(p||q) \geq \operatorname{D}_{\mathrm{KL}}(p||p^*) + \operatorname{D}_{\mathrm{KL}}(p^*||q) }[/math].
This inequality can be interpreted as an information-geometric analogue of the Pythagorean theorem, with the KL divergence playing the role of squared Euclidean distance.
It is worth noting that since [math]\displaystyle{ \operatorname{D}_{\mathrm{KL}}(p||q) }[/math] is non-negative and continuous in p, the optimization problem framed above has at least one minimizer whenever P is closed and non-empty. Furthermore, if P is convex, the minimizer is unique.
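The Pythagorean inequality can be checked numerically. The sketch below is again only an illustration (the alphabet, q, and the constraint are arbitrary choices); it uses the fact that the I-projection of q onto a linear family [math]\displaystyle{ \{p : \operatorname{E}_p[X] = c\} }[/math] is an exponential tilting of q, finds the tilt parameter by root finding, and compares the two sides of the inequality for a particular member of P. For a linear family the relation in fact holds with equality.

```python
# Illustrative sketch: check D(p||q) >= D(p||p*) + D(p*||q) for a linear family P.
import numpy as np
from scipy.optimize import brentq

x = np.arange(5)                          # finite alphabet
q = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # reference distribution q
c = 3.0                                   # P = {p : E_p[X] = c}, a linear (convex) family

def kl(p, r):
    return np.sum(p * np.log(p / r))      # all distributions here have full support

def tilt(t):
    w = q * np.exp(t * x)                 # exponential tilting of q
    return w / w.sum()

# The I-projection onto a linear family is an exponential tilt of q;
# choose the tilt parameter so that the mean constraint is met.
t_star = brentq(lambda t: tilt(t) @ x - c, -20.0, 20.0)
p_star = tilt(t_star)

# Any p in P should satisfy D(p||q) >= D(p||p*) + D(p*||q);
# for a linear family the two sides agree exactly.
p = np.array([0.05, 0.10, 0.05, 0.40, 0.40])   # another distribution with mean 3
lhs = kl(p, q)
rhs = kl(p, p_star) + kl(p_star, q)
print(lhs, rhs)
```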
The reverse I-projection, also known as the moment projection or M-projection, is
- [math]\displaystyle{ p^* = \underset{p \in P}{\arg\min} \operatorname{D}_{\mathrm{KL}}(q||p) }[/math].
Since the KL divergence is not symmetric in its arguments, the I-projection and the M-projection exhibit different behavior. For the I-projection, [math]\displaystyle{ p(x) }[/math] will typically under-estimate the support of [math]\displaystyle{ q(x) }[/math] and lock onto one of its modes: keeping [math]\displaystyle{ \operatorname{D}_{\mathrm{KL}}(p||q) }[/math] finite requires [math]\displaystyle{ p(x)=0 }[/math] wherever [math]\displaystyle{ q(x)=0 }[/math]. For the M-projection, [math]\displaystyle{ p(x) }[/math] will typically over-estimate the support of [math]\displaystyle{ q(x) }[/math]: keeping [math]\displaystyle{ \operatorname{D}_{\mathrm{KL}}(q||p) }[/math] finite requires [math]\displaystyle{ p(x) \gt 0 }[/math] wherever [math]\displaystyle{ q(x) \gt 0 }[/math].
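This contrast can be seen by fitting a single Gaussian to a bimodal target in both directions. The sketch below is a toy illustration (the grid, the target mixture, and the discretized Gaussian family are arbitrary choices): the I-projection concentrates on one mode, while the M-projection spreads out to cover both.

```python
# Illustrative sketch: I-projection vs. M-projection of a bimodal target onto
# a family of (discretized) single Gaussians, found by grid search.
import numpy as np

xs = np.linspace(-6, 6, 241)

def normalize(w):
    return w / w.sum()

def gauss(mu, sigma):
    return normalize(np.exp(-0.5 * ((xs - mu) / sigma) ** 2))

def kl(p, r, eps=1e-300):
    return np.sum(p * np.log((p + eps) / (r + eps)))

# Bimodal target q: mixture of two well-separated Gaussians.
q = normalize(0.5 * gauss(-2.0, 0.5) + 0.5 * gauss(2.0, 0.5))

# Candidate family P: single Gaussians on a (mu, sigma) grid.
params = [(mu, s) for mu in np.linspace(-3, 3, 61) for s in np.linspace(0.3, 4, 38)]
i_proj = min(params, key=lambda th: kl(gauss(*th), q))   # argmin_p D(p||q): mode-seeking
m_proj = min(params, key=lambda th: kl(q, gauss(*th)))   # argmin_p D(q||p): mass-covering

print("I-projection (mu, sigma):", i_proj)   # sits on one mode, small sigma
print("M-projection (mu, sigma):", m_proj)   # mu near 0, large sigma covers both modes
```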
The reverse I-projection plays a fundamental role in the construction of optimal e-variables.
The concept of information projection can be extended to arbitrary f-divergences and other divergences.[2]
References
- ↑ Cover, Thomas M.; Thomas, Joy A. (2006). Elements of Information Theory (2nd ed.). Hoboken, NJ: Wiley-Interscience. p. 367 (Theorem 11.6.1).
- ↑ Nielsen, Frank (2018). "What Is... an Information Projection?". Notices of the AMS 65 (3): 321–324. https://www.ams.org/journals/notices/201803/rnoti-p321.pdf.
- Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.