Information projection
In information theory, the information projection or I-projection of a probability distribution q onto a set of distributions P is
- [math]\displaystyle{ p^* = \underset{p \in P}{\arg\min} \operatorname{D}_{\mathrm{KL}}(p||q) }[/math].
where [math]\displaystyle{ D_{\mathrm{KL}} }[/math] is the Kullback–Leibler divergence from q to p. Viewing the Kullback–Leibler divergence as a measure of distance, the I-projection [math]\displaystyle{ p^* }[/math] is the "closest" distribution to q of all the distributions in P.
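As a concrete illustration, the sketch below computes the I-projection of a discrete distribution onto the linear (hence convex) family of distributions with a prescribed mean, using a generic constrained optimizer. This is only an illustrative example, not something taken from the references: the alphabet, the reference distribution q, and the mean constraint are arbitrary choices.

```python
# Illustrative sketch: I-projection of q onto P = {p : E_p[f] = c} on a finite alphabet.
import numpy as np
from scipy.optimize import minimize

x = np.arange(5)                          # finite alphabet {0, 1, 2, 3, 4}
q = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # reference distribution q
f, c = x, 3.0                             # constraint defining P: E_p[f] = c

def kl(p, r):
    """D_KL(p || r) on a finite alphabet, with 0*log(0/r) taken as 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / r[mask]))

# Minimize D_KL(p || q) over the probability simplex subject to the mean constraint.
constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
               {"type": "eq", "fun": lambda p: p @ f - c}]
result = minimize(kl, x0=np.full(5, 0.2), args=(q,),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
p_star = result.x                          # the I-projection of q onto P
print("I-projection p*:", np.round(p_star, 4))
print("D_KL(p*||q):", kl(p_star, q))
```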
The I-projection is useful in setting up information geometry, notably because of the following inequality, which holds for every [math]\displaystyle{ p \in P }[/math] when P is convex:[1]
[math]\displaystyle{ \operatorname{D}_{\mathrm{KL}}(p||q) \geq \operatorname{D}_{\mathrm{KL}}(p||p^*) + \operatorname{D}_{\mathrm{KL}}(p^*||q) }[/math].
This inequality can be interpreted as an information-geometric analogue of the Pythagorean theorem, with the KL divergence playing the role of squared Euclidean distance.
It is worth noting that since [math]\displaystyle{ \operatorname{D}_{\mathrm{KL}}(p||q) }[/math] is non-negative and continuous in p, the optimization problem framed above has at least one minimizer whenever P is closed and non-empty. Furthermore, if P is convex, the minimizer is unique.
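The Pythagorean inequality can be checked numerically. The sketch below is again only an illustration (the alphabet, q, and the constraint are arbitrary choices); it uses the fact that the I-projection of q onto a linear family [math]\displaystyle{ \{p : \operatorname{E}_p[X] = c\} }[/math] is an exponential tilting of q, finds the tilt parameter by root finding, and compares the two sides of the inequality for a particular member of P. For a linear family the relation in fact holds with equality.

```python
# Illustrative sketch: check D(p||q) >= D(p||p*) + D(p*||q) for a linear family P.
import numpy as np
from scipy.optimize import brentq

x = np.arange(5)                          # finite alphabet
q = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # reference distribution q
c = 3.0                                   # P = {p : E_p[X] = c}, a linear (convex) family

def kl(p, r):
    return np.sum(p * np.log(p / r))      # all distributions here have full support

def tilt(t):
    w = q * np.exp(t * x)                 # exponential tilting of q
    return w / w.sum()

# The I-projection onto a linear family is an exponential tilt of q;
# choose the tilt parameter so that the mean constraint is met.
t_star = brentq(lambda t: tilt(t) @ x - c, -20.0, 20.0)
p_star = tilt(t_star)

# Any p in P should satisfy D(p||q) >= D(p||p*) + D(p*||q);
# for a linear family the two sides agree exactly.
p = np.array([0.05, 0.10, 0.05, 0.40, 0.40])   # another distribution with mean 3
lhs = kl(p, q)
rhs = kl(p, p_star) + kl(p_star, q)
print(lhs, rhs)
```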
The reverse I-projection, also known as the moment projection or M-projection, is
- [math]\displaystyle{ p^* = \underset{p \in P}{\arg\min} \operatorname{D}_{\mathrm{KL}}(q||p) }[/math].
Since the KL divergence is not symmetric in its arguments, the I-projection and the M-projection exhibit different behavior. For the I-projection, [math]\displaystyle{ p(x) }[/math] will typically under-estimate the support of [math]\displaystyle{ q(x) }[/math] and lock onto one of its modes: keeping [math]\displaystyle{ \operatorname{D}_{\mathrm{KL}}(p||q) }[/math] finite requires [math]\displaystyle{ p(x)=0 }[/math] wherever [math]\displaystyle{ q(x)=0 }[/math]. For the M-projection, [math]\displaystyle{ p(x) }[/math] will typically over-estimate the support of [math]\displaystyle{ q(x) }[/math]: keeping [math]\displaystyle{ \operatorname{D}_{\mathrm{KL}}(q||p) }[/math] finite requires [math]\displaystyle{ p(x) \gt 0 }[/math] wherever [math]\displaystyle{ q(x) \gt 0 }[/math].
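This contrast can be seen by fitting a single Gaussian to a bimodal target in both directions. The sketch below is a toy illustration (the grid, the target mixture, and the discretized Gaussian family are arbitrary choices): the I-projection concentrates on one mode, while the M-projection spreads out to cover both.

```python
# Illustrative sketch: I-projection vs. M-projection of a bimodal target onto
# a family of (discretized) single Gaussians, found by grid search.
import numpy as np

xs = np.linspace(-6, 6, 241)

def normalize(w):
    return w / w.sum()

def gauss(mu, sigma):
    return normalize(np.exp(-0.5 * ((xs - mu) / sigma) ** 2))

def kl(p, r, eps=1e-300):
    return np.sum(p * np.log((p + eps) / (r + eps)))

# Bimodal target q: mixture of two well-separated Gaussians.
q = normalize(0.5 * gauss(-2.0, 0.5) + 0.5 * gauss(2.0, 0.5))

# Candidate family P: single Gaussians on a (mu, sigma) grid.
params = [(mu, s) for mu in np.linspace(-3, 3, 61) for s in np.linspace(0.3, 4, 38)]
i_proj = min(params, key=lambda th: kl(gauss(*th), q))   # argmin_p D(p||q): mode-seeking
m_proj = min(params, key=lambda th: kl(q, gauss(*th)))   # argmin_p D(q||p): mass-covering

print("I-projection (mu, sigma):", i_proj)   # sits on one mode, small sigma
print("M-projection (mu, sigma):", m_proj)   # mu near 0, large sigma covers both modes
```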
The reverse I-projection plays a fundamental role in the construction of optimal e-variables.
The concept of information projection can be extended to arbitrary f-divergences and other divergences.[2]
References
- ↑ Cover, Thomas M.; Thomas, Joy A. (2006). Elements of Information Theory (2nd ed.). Hoboken, NJ: Wiley-Interscience. p. 367 (Theorem 11.6.1).
- ↑ Nielsen, Frank (2018). "What Is... an Information Projection?". Notices of the AMS 65 (3): 321–324. https://www.ams.org/journals/notices/201803/rnoti-p321.pdf.
- Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.