Jensen–Shannon divergence

In probability theory and statistics, the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad)[1][2] or total divergence to the average.[3] It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as the Jensen–Shannon distance.[4][5][6]

Definition

Consider the set [math]\displaystyle{ M_+^1(A) }[/math] of probability distributions where [math]\displaystyle{ A }[/math] is a set provided with some σ-algebra of measurable subsets. In particular we can take [math]\displaystyle{ A }[/math] to be a finite or countable set with all subsets being measurable.

The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence [math]\displaystyle{ D(P \parallel Q) }[/math]. It is defined by

[math]\displaystyle{ {\rm JSD}(P \parallel Q)= \frac{1}{2}D(P \parallel M)+\frac{1}{2}D(Q \parallel M), }[/math]

where [math]\displaystyle{ M=\frac{1}{2}(P+Q) }[/math] is a mixture distribution of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math].
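For discrete distributions, the definition translates directly into code. The following sketch (in Python with NumPy; the helper names kl_divergence and js_divergence are ours, not from any particular library) computes the divergence using a base-2 logarithm:

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions (base 2).

    Terms with p[i] == 0 contribute nothing, by the usual convention.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def js_divergence(p, q):
    """Jensen-Shannon divergence; with base-2 logs the value lies in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                      # mixture M = (P + Q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print(js_divergence(p, q))                 # equals js_divergence(q, p) by symmetry
```

Note that SciPy's scipy.spatial.distance.jensenshannon returns the square root of this quantity (the Jensen–Shannon distance) and uses the natural logarithm by default.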

The geometric Jensen–Shannon divergence[7] (or G-Jensen–Shannon divergence) replaces the arithmetic-mean mixture [math]\displaystyle{ M }[/math] with the (normalized) geometric mean of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math]; this yields a closed-form formula for the divergence between two Gaussian distributions.
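For univariate Gaussians this can be made concrete, since both the normalized geometric mean of two Gaussian densities and the KL divergence between Gaussians have closed forms. The sketch below uses equal weights; the exact weighting and orientation conventions of the G-JSD in [7] may differ, so it should be read as an illustration rather than a faithful implementation of that paper:

```python
import numpy as np

def kl_gauss(mu0, s0, mu1, s1):
    """Closed-form KL divergence D(N(mu0, s0^2) || N(mu1, s1^2)), natural log."""
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

def geometric_mean_gauss(mu0, s0, mu1, s1):
    """Normalized geometric mean of two Gaussian densities (equal weights).

    sqrt(p * q), once renormalized, is again Gaussian: its precision is the
    average of the two precisions and its mean is the precision-weighted mean.
    """
    var = 1.0 / (0.5 * (1 / s0**2 + 1 / s1**2))
    mu = var * 0.5 * (mu0 / s0**2 + mu1 / s1**2)
    return mu, np.sqrt(var)

def g_jsd(mu0, s0, mu1, s1):
    """Average KL divergence from each Gaussian to their geometric mean."""
    mu_g, s_g = geometric_mean_gauss(mu0, s0, mu1, s1)
    return 0.5 * kl_gauss(mu0, s0, mu_g, s_g) + 0.5 * kl_gauss(mu1, s1, mu_g, s_g)

print(g_jsd(0.0, 1.0, 1.0, 2.0))
```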

A more general definition, allowing for the comparison of more than two probability distributions, is:

[math]\displaystyle{ \begin{align} {\rm JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) &= \sum_i \pi_i D( P_i \parallel M ) \\ &= H\left(M\right) - \sum_{i=1}^n \pi_i H(P_i) \end{align} }[/math]

where

[math]\displaystyle{ \begin{align} M &:= \sum_{i=1}^n \pi_i P_i \end{align} }[/math]

and [math]\displaystyle{ \pi_1, \ldots, \pi_n }[/math] are weights that are selected for the probability distributions [math]\displaystyle{ P_1, P_2, \ldots, P_n }[/math], and [math]\displaystyle{ H(P) }[/math] is the Shannon entropy for distribution [math]\displaystyle{ P }[/math]. For the two-distribution case described above,

[math]\displaystyle{ P_1=P, P_2=Q, \pi_1 = \pi_2 = \frac{1}{2}.\ }[/math]

Hence, for those two distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math],

[math]\displaystyle{ {\rm JSD}(P \parallel Q) = H(M) - \frac{1}{2}\bigg(H(P) + H(Q)\bigg) }[/math]
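The entropy form is convenient numerically, since it requires one entropy evaluation per distribution plus one for the mixture. A minimal sketch for n weighted discrete distributions (the function names are ours):

```python
import numpy as np

def shannon_entropy(p, base=2):
    """Shannon entropy H(p) of a discrete distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

def generalized_jsd(dists, weights, base=2):
    """JSD_{pi_1,...,pi_n}(P_1,...,P_n) = H(sum_i pi_i P_i) - sum_i pi_i H(P_i)."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    m = weights @ dists                    # mixture M = sum_i pi_i P_i
    return shannon_entropy(m, base) - sum(
        w * shannon_entropy(p, base) for w, p in zip(weights, dists))

dists = [[0.10, 0.40, 0.50], [0.80, 0.15, 0.05], [1/3, 1/3, 1/3]]
print(generalized_jsd(dists, [0.2, 0.3, 0.5]))
```

For two distributions with weights (1/2, 1/2) this reduces to the two-distribution divergence defined earlier.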

Bounds

The Jensen–Shannon divergence is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm:[8]

[math]\displaystyle{ 0 \leq {\rm JSD}( P \parallel Q ) \leq 1 }[/math].

With this normalization, it is a lower bound on the total variation distance between P and Q:

[math]\displaystyle{ {\rm JSD}(P\parallel Q) \le \frac12\|P-Q\|_1=\frac12\sum_{\omega\in\Omega}|P(\omega)-Q(\omega)| }[/math].

With base-e logarithm, which is commonly used in statistical thermodynamics, the upper bound is [math]\displaystyle{ \ln(2) }[/math]. In general, the bound in base b is [math]\displaystyle{ \log_{b}(2) }[/math]:

[math]\displaystyle{ 0 \leq {\rm JSD}( P \parallel Q ) \leq \log_b(2) }[/math].

More generally, for more than two probability distributions, the Jensen–Shannon divergence is bounded by [math]\displaystyle{ \log_{b}(n) }[/math]:[8]

[math]\displaystyle{ 0 \leq {\rm JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) \leq \log_{b}(n) }[/math].
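These bounds are easy to check numerically, for example by drawing random distributions and verifying that the base-2 divergence of two of them stays in [0, 1] and below their total variation distance (this sketch uses SciPy's jensenshannon, which returns the square root of the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # Jensen-Shannon *distance*

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    jsd = jensenshannon(p, q, base=2) ** 2         # divergence, base-2 logarithm
    tv = 0.5 * np.sum(np.abs(p - q))               # total variation distance
    assert 0.0 <= jsd <= 1.0                       # bounded by 1 in base 2
    assert jsd <= tv + 1e-12                       # lower bound on total variation
```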

Relation to mutual information

The Jensen–Shannon divergence is the mutual information between a random variable [math]\displaystyle{ X }[/math] drawn from the mixture distribution of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] and the binary indicator variable [math]\displaystyle{ Z }[/math] used to switch between [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] in producing the mixture. Let [math]\displaystyle{ X }[/math] be some abstract function on the underlying set of events that discriminates well between events, and choose the value of [math]\displaystyle{ X }[/math] according to [math]\displaystyle{ P }[/math] if [math]\displaystyle{ Z = 0 }[/math] and according to [math]\displaystyle{ Q }[/math] if [math]\displaystyle{ Z = 1 }[/math], where [math]\displaystyle{ Z }[/math] is equiprobable. That is, [math]\displaystyle{ X }[/math] is chosen according to the probability measure [math]\displaystyle{ M=(P+Q)/2 }[/math], and its distribution is the mixture distribution. We compute

[math]\displaystyle{ \begin{align} I(X; Z) &= H(X) - H(X|Z)\\ &= -\sum M \log M + \frac{1}{2} \left[ \sum P \log P + \sum Q \log Q \right] \\ &= -\sum \frac{P}{2} \log M - \sum \frac{Q}{2} \log M + \frac{1}{2} \left[ \sum P \log P + \sum Q \log Q \right] \\ &= \frac{1}{2} \sum P \left( \log P - \log M\right ) + \frac{1}{2} \sum Q \left( \log Q - \log M \right) \\ &= {\rm JSD}(P \parallel Q) \end{align} }[/math]

It follows from the above result that the Jensen–Shannon divergence is bounded by 0 and 1 because mutual information is non-negative and bounded by [math]\displaystyle{ H(Z) = 1 }[/math] in base 2 logarithm.
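The identity can be checked directly on a small example: build the joint distribution of [math]\displaystyle{ (X, Z) }[/math] with [math]\displaystyle{ Z }[/math] equiprobable, compute the mutual information from its definition, and compare it with the divergence (the helper name entropy is ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly multidimensional) distribution."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P = np.array([0.10, 0.40, 0.50])
Q = np.array([0.80, 0.15, 0.05])

# Joint distribution of (X, Z): the row z = 0 carries P/2, the row z = 1 carries Q/2.
joint = 0.5 * np.vstack([P, Q])
p_x = joint.sum(axis=0)                    # marginal of X: the mixture M
p_z = joint.sum(axis=1)                    # marginal of Z: (1/2, 1/2)

mutual_info = entropy(p_x) + entropy(p_z) - entropy(joint)    # I(X; Z)
jsd = entropy(p_x) - 0.5 * (entropy(P) + entropy(Q))          # H(M) - (H(P) + H(Q))/2
print(mutual_info, jsd)                    # the two values coincide
```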

One can apply the same principle to the joint distribution of two variables and the product of their marginal distributions (in analogy to the Kullback–Leibler divergence and mutual information) to measure how reliably one can decide whether a given response comes from the joint distribution or the product distribution, subject to the assumption that these are the only two possibilities.[9]

Quantum Jensen–Shannon divergence

The generalization from probability distributions to density matrices allows one to define the quantum Jensen–Shannon divergence (QJSD).[10][11] It is defined for a set of density matrices [math]\displaystyle{ (\rho_1,\ldots,\rho_n) }[/math] and a probability distribution [math]\displaystyle{ \pi=(\pi_1,\ldots,\pi_n) }[/math] as

[math]\displaystyle{ {\rm QJSD}(\rho_1,\ldots,\rho_n)= S\left(\sum_{i=1}^n \pi_i \rho_i\right)-\sum_{i=1}^n \pi_i S(\rho_i) }[/math]

where [math]\displaystyle{ S(\rho) }[/math] is the von Neumann entropy of [math]\displaystyle{ \rho }[/math]. This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives an upper bound on the amount of classical information encoded by the quantum states [math]\displaystyle{ (\rho_1,\ldots,\rho_n) }[/math] under the prior distribution [math]\displaystyle{ \pi }[/math] (see Holevo's theorem).[12] The quantum Jensen–Shannon divergence for [math]\displaystyle{ \pi=\left(\frac{1}{2},\frac{1}{2}\right) }[/math] and two density matrices is a symmetric function, everywhere defined, bounded, and equal to zero only if the two density matrices are the same. It is the square of a metric for pure states,[13] and it was recently shown that this metric property holds for mixed states as well.[14][15] The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.
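Since the von Neumann entropy of a density matrix equals the Shannon entropy of its eigenvalues, the quantity above can be sketched in a few lines (equal weights and 2×2 density matrices are chosen purely for illustration):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of rho."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]     # discard (numerically) zero eigenvalues
    return -np.sum(eigvals * np.log2(eigvals))

def qjsd(rhos, weights):
    """Holevo-information form: S(sum_i pi_i rho_i) - sum_i pi_i S(rho_i)."""
    mix = sum(w * r for w, r in zip(weights, rhos))
    return von_neumann_entropy(mix) - sum(
        w * von_neumann_entropy(r) for w, r in zip(weights, rhos))

# Two pure states: |0><0| and |+><+|.
rho0 = np.array([[1.0, 0.0], [0.0, 0.0]])
rho1 = 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])
print(qjsd([rho0, rho1], [0.5, 0.5]))
```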

Jensen–Shannon centroid

The centroid C* of a finite set of probability distributions can be defined as the minimizer of the sum of the Jensen–Shannon divergences between a candidate distribution [math]\displaystyle{ Q }[/math] and the prescribed set of distributions:

[math]\displaystyle{ C^*=\arg\min_{Q} \sum_{i=1}^n {\rm JSD}(P_i \parallel Q). }[/math]

An efficient algorithm[16] (CCCP), based on a difference of convex functions, has been reported for computing the Jensen–Shannon centroid of a set of discrete distributions (histograms).
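As a rough numerical illustration (this is not the CCCP algorithm of [16]; it simply minimizes the objective with a generic optimizer over a softmax parameterization of the simplex), the centroid of a few histograms can be approximated as follows:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import jensenshannon   # square root of the divergence

def js_centroid(hists):
    """Approximate argmin_Q sum_i JSD(P_i || Q) by direct numerical minimization."""
    hists = np.asarray(hists, dtype=float)

    def objective(theta):
        q = np.exp(theta - theta.max())
        q /= q.sum()                                # softmax keeps Q on the simplex
        return sum(jensenshannon(p, q, base=2) ** 2 for p in hists)

    res = minimize(objective, np.zeros(hists.shape[1]), method="Nelder-Mead")
    q = np.exp(res.x - res.x.max())
    return q / q.sum()

hists = [[0.10, 0.40, 0.50], [0.80, 0.15, 0.05], [0.30, 0.30, 0.40]]
print(js_centroid(hists))
```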

Applications

The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison,[17][18] in protein surface comparison,[19] in the social sciences,[20] in the quantitative study of history,[21] in fire experiments,[22] and in machine learning.[23]

Notes

  1. Frank Nielsen (2021). "On a variational definition for the Jensen-Shannon symmetrization of distances based on the information radius". Entropy (MDPI) 23 (4): 464. doi:10.3390/e23040464. 
  2. Hinrich Schütze; Christopher D. Manning (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press. p. 304. ISBN 978-0-262-13360-9. http://nlp.stanford.edu/fsnlp/. 
  3. Dagan, Ido; Lillian Lee; Fernando Pereira (1997). "Similarity-Based Methods For Word Sense Disambiguation". Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics: 56–63. doi:10.3115/979617.979625. Bibcode1997cmp.lg....8010D. http://citeseer.ist.psu.edu/dagan97similaritybased.html. Retrieved 2008-03-09. 
  4. Endres, D. M.; J. E. Schindelin (2003). "A new metric for probability distributions". IEEE Trans. Inf. Theory 49 (7): 1858–1860. doi:10.1109/TIT.2003.813506. https://research-repository.st-andrews.ac.uk/bitstream/10023/1591/1/Endres2003-IEEETransInfTheory49-NewMetric.pdf. 
  5. Österreicher, F.; I. Vajda (2003). "A new class of metric divergences on probability spaces and its statistical applications". Ann. Inst. Statist. Math. 55 (3): 639–653. doi:10.1007/BF02517812. 
  6. Fuglede, B.; Topsoe, F. (2004). "Jensen-Shannon divergence and Hilbert space embedding". Proceedings of the International Symposium on Information Theory, 2004. IEEE. pp. 30. doi:10.1109/ISIT.2004.1365067. ISBN 978-0-7803-8280-0. http://www.math.ku.dk/~topsoe/ISIT2004JSD.pdf. 
  7. Frank Nielsen (2019). "On the Jensen-Shannon symmetrization of distances relying on abstract means". Entropy (MDPI) 21 (5): 485. doi:10.3390/e21050485. PMID 33267199. Bibcode2019Entrp..21..485N. 
  8. Lin, J. (1991). "Divergence measures based on the Shannon entropy". IEEE Transactions on Information Theory 37 (1): 145–151. doi:10.1109/18.61115. https://www.cise.ufl.edu/~anand/sp06/jensen-shannon.pdf. 
  9. Schneidman, Elad; Bialek, W; Berry, M.J. 2nd (2003). "Synergy, Redundancy, and Independence in Population Codes". Journal of Neuroscience 23 (37): 11539–11553. doi:10.1523/JNEUROSCI.23-37-11539.2003. PMID 14684857. 
  10. Majtey, A.; Lamberti, P.; Prato, D. (2005). "Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states". Physical Review A 72 (5): 052310. doi:10.1103/PhysRevA.72.052310. Bibcode2005PhRvA..72e2310M. 
  11. Briët, Jop; Harremoës, Peter (2009). "Properties of classical and quantum Jensen-Shannon divergence". Physical Review A 79 (5): 052311. doi:10.1103/PhysRevA.79.052311. Bibcode2009PhRvA..79e2311B. 
  12. Holevo, A. S. (1973). "Bounds for the quantity of information transmitted by a quantum communication channel" (in Russian). Problemy Peredachi Informatsii 9: 3–11. English translation: Probl. Inf. Transm. 9: 177–183 (1975). MR456936.
  13. Braunstein, Samuel; Caves, Carlton (1994). "Statistical distance and the geometry of quantum states". Physical Review Letters 72 (22): 3439–3443. doi:10.1103/PhysRevLett.72.3439. PMID 10056200. Bibcode1994PhRvL..72.3439B. 
  14. Virosztek, Dániel (2021). "The metric property of the quantum Jensen-Shannon divergence". Advances in Mathematics 380: 107595. doi:10.1016/j.aim.2021.107595. 
  15. Sra, Suvrit (2019). "Metrics Induced by Quantum Jensen-Shannon-Renyí and Related Divergences". arXiv:1911.02643 [cs.IT].
  16. Frank Nielsen (2020). "On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid". Entropy (MDPI) 22 (2): 221. doi:10.3390/e22020221. PMID 33285995. 
  17. Sims, GE; Jun, SR; Wu, GA; Kim, SH (2009). "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions". Proceedings of the National Academy of Sciences of the United States of America 106 (8): 2677–82. doi:10.1073/pnas.0813249106. PMID 19188606. Bibcode2009PNAS..106.2677S. 
  18. Itzkovitz, S; Hodis, E; Segal, E (2010). "Overlapping codes within protein-coding sequences". Genome Research 20 (11): 1582–9. doi:10.1101/gr.105072.110. PMID 20841429. 
  19. Ofran, Y; Rost, B (2003). "Analysing six types of protein-protein interfaces". Journal of Molecular Biology 325 (2): 377–87. doi:10.1016/s0022-2836(02)01223-8. PMID 12488102. 
  20. DeDeo, Simon; Hawkins, Robert X. D.; Klingenstein, Sara; Hitchcock, Tim (2013). "Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems". Entropy 15 (6): 2246–2276. doi:10.3390/e15062246. Bibcode2013Entrp..15.2246D. 
  21. Klingenstein, Sara; Hitchcock, Tim; DeDeo, Simon (2014). "The civilizing process in London's Old Bailey". Proceedings of the National Academy of Sciences of the United States of America 111 (26): 9419–9424. doi:10.1073/pnas.1405984111. PMID 24979792. Bibcode2014PNAS..111.9419K. 
  22. Flavia-Corina Mitroi-Symeonidis; Ion Anghel; Nicuşor Minculete (2020). "Parametric Jensen-Shannon statistical complexity and its applications on full-scale compartment fire data". Symmetry 12 (1): 22. doi:10.3390/sym12010022. 
  23. Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). "Generative Adversarial Networks". NIPS. Bibcode2014arXiv1406.2661G. 
