Integral probability metric

Short description: Class of distance functions defined between probability distributions


In probability theory, integral probability metrics are types of distance functions between probability distributions, defined by how well a class of functions can distinguish the two distributions. Many important statistical distances are integral probability metrics, including the Wasserstein-1 distance and the total variation distance. In addition to theoretical importance, integral probability metrics are widely used in areas of statistics and machine learning.

The name "integral probability metric" was given by German statistician Alfred Müller;[1] the distances had also previously been called "metrics with a ζ-structure."[2]

Definition

Integral probability metrics (IPMs) are distances on the space of distributions over a set $\mathcal{X}$, defined by a class $\mathcal{F}$ of real-valued functions on $\mathcal{X}$ as
$$D_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right| = \sup_{f \in \mathcal{F}} \left| P f - Q f \right| ;$$
here the notation $P f$ refers to the expectation of $f$ under the distribution $P$. The absolute value in the definition is unnecessary, and often omitted, for the usual case where for every $f \in \mathcal{F}$ its negation $-f$ is also in $\mathcal{F}$.
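
For illustration, a minimal Python sketch of this definition for two distributions on a four-point set, taking the supremum by brute force over a small hand-picked function class (the distributions and critic functions below are arbitrary toy choices):

```python
import numpy as np

# Two distributions P and Q on the finite set {0, 1, 2, 3}.
support = np.arange(4)
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])

# A small hand-picked class F of real-valued "critic" functions on the support,
# each represented by its vector of values f(0), ..., f(3).
critics = {
    "identity":  support.astype(float),
    "indicator": (support >= 2).astype(float),      # f(x) = 1{x >= 2}
    "step":      np.array([-1.0, -1.0, 1.0, 1.0]),
}

# D_F(P, Q) = sup_{f in F} |E_P f(X) - E_Q f(Y)|, here a maximum over a finite F.
gaps = {name: abs(p @ f - q @ f) for name, f in critics.items()}
witness = max(gaps, key=gaps.get)
print(gaps)                        # expectation gap for each critic
print("witness function:", witness)
```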

The functions f being optimized over are sometimes called "critic" functions;[3] if a particular f* achieves the supremum, it is often termed a "witness function"[4] (it "witnesses" the difference in the distributions). These functions try to have large values for samples from P and small (likely negative) values for samples from Q; this can be thought of as a weaker version of classifiers, and indeed IPMs can be interpreted as the optimal risk of a particular classifier.[5]: sec. 4 

The choice of $\mathcal{F}$ determines the particular distance; more than one $\mathcal{F}$ can generate the same distance.[1]

For any choice of $\mathcal{F}$, $D_{\mathcal{F}}$ satisfies all the definitions of a metric except that we may have $D_{\mathcal{F}}(P, Q) = 0$ for some $P \ne Q$; this is variously termed a "pseudometric" or a "semimetric" depending on the community. For instance, using the class $\mathcal{F} = \{x \mapsto 0\}$, which contains only the zero function, $D_{\mathcal{F}}(P, Q)$ is identically zero. $D_{\mathcal{F}}$ is a metric if and only if $\mathcal{F}$ separates points on the space of probability distributions, i.e. for any $P \ne Q$ there is some $f \in \mathcal{F}$ such that $P f \ne Q f$;[1] most, but not all, common particular cases satisfy this property.
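
As a minimal sketch of the separation issue, take $\mathcal{F} = \{x \mapsto x\}$ on $\mathbb{R}$, containing only the identity function: the resulting $D_{\mathcal{F}}$ compares only means, so it vanishes for distinct distributions with equal means (the particular distributions below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)     # samples from P = N(0, 1)
y = rng.uniform(-1.0, 1.0, size=100_000)   # samples from Q = Uniform(-1, 1)

# With F = {identity}, D_F(P, Q) = |E_P[X] - E_Q[Y]|: approximately zero here
# even though P != Q, so this F does not separate points and D_F is only a pseudometric.
print(abs(x.mean() - y.mean()))
```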

Examples

All of these examples are metrics except when noted otherwise.

Relationship to f-divergences

The f-divergences are probably the best-known way to measure dissimilarity of probability distributions. It has been shown[5]: sec. 2  that the only distances which are both IPMs and f-divergences are of the form $c \, \mathrm{TV}(P, Q)$, where $c \in [0, \infty]$ and $\mathrm{TV}$ is the total variation distance between distributions.
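
For instance, under the convention $\mathrm{TV}(P, Q) = \sup_A |P(A) - Q(A)|$ (other conventions differ by a factor of 2), the total variation distance can be written both as an IPM over functions bounded by 1 and as an f-divergence:

```latex
% Total variation as an IPM and as an f-divergence, for P, Q with densities
% p, q with respect to a common dominating measure \mu.
\mathrm{TV}(P, Q)
  = \tfrac{1}{2} \sup_{\|f\|_\infty \le 1} \left| \int f \, dP - \int f \, dQ \right|
  = \tfrac{1}{2} \int |p - q| \, d\mu
  = \int q \, f_{\mathrm{TV}}\!\left(\tfrac{p}{q}\right) d\mu ,
  \qquad f_{\mathrm{TV}}(t) = \tfrac{1}{2} |t - 1| .
```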

One major difference between f-divergences and most IPMs is that when P and Q have disjoint support, all f-divergences take on a constant value;[17] by contrast, IPMs where functions in $\mathcal{F}$ are "smooth" can give "partial credit." For instance, consider the sequence $\delta_{1/n}$ of Dirac measures at $1/n$; this sequence converges in distribution to $\delta_0$, and many IPMs satisfy $D_{\mathcal{F}}(\delta_{1/n}, \delta_0) \to 0$, but no nonzero f-divergence can satisfy this. That is, many IPMs are continuous in weaker topologies than f-divergences. This property is sometimes of substantial importance,[18] although other options also exist, such as considering f-divergences between distributions convolved with continuous noise[18][19] or infimal convolutions between f-divergences and integral probability metrics.[20][21]
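
The contrast can be checked numerically for the Dirac example above: the Wasserstein-1 distance between $\delta_{1/n}$ and $\delta_0$ is $1/n$, while the total variation distance is 1 for every $n$. A minimal Python sketch, representing each Dirac measure by a single-point sample:

```python
from scipy.stats import wasserstein_distance

for n in (1, 10, 100, 1000):
    w1 = wasserstein_distance([1.0 / n], [0.0])  # Wasserstein-1(delta_{1/n}, delta_0) = 1/n
    tv = 1.0                                     # total variation stays at 1: disjoint supports
    print(f"n = {n:4d}   W1 = {w1:.4f}   TV = {tv:.1f}")
```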

Estimation from samples

Because IPM values between discrete distributions are often sensible, it is often reasonable to estimate $D_{\mathcal{F}}(P, Q)$ using a simple "plug-in" estimator: $D_{\mathcal{F}}(\hat{P}, \hat{Q})$, where $\hat{P}$ and $\hat{Q}$ are empirical measures of sample sets. These empirical distances can be computed exactly for some classes $\mathcal{F}$;[5] estimation quality varies depending on the distance, but can be minimax-optimal in certain settings.[14][22][23]
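
A minimal sketch of the plug-in approach for the one-dimensional Wasserstein-1 distance, where the distance between empirical measures can be computed exactly; the Gaussian sampling below is only an illustrative choice:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)   # sample set from P = N(0, 1)
y = rng.normal(0.3, 1.0, size=2000)   # sample set from Q = N(0.3, 1)

# Plug-in estimate D(P_hat, Q_hat): the exact Wasserstein-1 distance
# between the two empirical measures.
print(wasserstein_distance(x, y))     # close to the population value of 0.3
```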

When exact maximization is not available or too expensive, another commonly used scheme is to divide the samples into "training" sets (with empirical measures $\hat{P}_{\text{train}}$ and $\hat{Q}_{\text{train}}$) and "test" sets ($\hat{P}_{\text{test}}$ and $\hat{Q}_{\text{test}}$), find $\hat{f}$ approximately maximizing $|\hat{P}_{\text{train}} f - \hat{Q}_{\text{train}} f|$, then use $|\hat{P}_{\text{test}} \hat{f} - \hat{Q}_{\text{test}} \hat{f}|$ as an estimate.[24][12][25][26] This estimator can possibly be consistent, but has a negative bias.[24]: thm. 2  In fact, no unbiased estimator can exist for any IPM,[24]: thm. 3  although there is for instance an unbiased estimator of the squared maximum mean discrepancy.[4]
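
For instance, a sketch of the standard unbiased (U-statistic) estimator of the squared maximum mean discrepancy,[4] here with a Gaussian kernel; the kernel bandwidth and sampling setup are illustrative choices:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq_dists = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased U-statistic estimate of the squared MMD between sample sets x and y."""
    m, n = len(x), len(y)
    k_xx = gaussian_kernel(x, x, sigma)
    k_yy = gaussian_kernel(y, y, sigma)
    k_xy = gaussian_kernel(x, y, sigma)
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))  # off-diagonal pairs only
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2 * k_xy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))   # samples from P
y = rng.normal(0.5, 1.0, size=(500, 2))   # samples from Q
print(mmd2_unbiased(x, y))                # positive in expectation when P != Q, near zero when P = Q
```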

References

  1. Müller, Alfred (June 1997). "Integral Probability Metrics and Their Generating Classes of Functions". Advances in Applied Probability 29 (2): 429–443. doi:10.2307/1428011. 
  2. Zolotarev, V. M. (January 1984). "Probability Metrics". Theory of Probability & Its Applications 28 (2): 278–302. doi:10.1137/1128025. 
  3. Arjovsky, Martin; Chintala, Soumith; Bottou, Léon (2017-07-17). "Wasserstein Generative Adversarial Networks" (in en). International Conference on Machine Learning (PMLR): 214–223. https://proceedings.mlr.press/v70/arjovsky17a.html. 
  4. Gretton, Arthur; Borgwardt, Karsten M.; Rasch, Malte J.; Schölkopf, Bernhard; Smola, Alexander (2012). "A Kernel Two-Sample Test". Journal of Machine Learning Research 13: 723–773. https://jmlr.org/papers/volume13/gretton12a/gretton12a.pdf. 
  5. Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv:0901.2698 [cs.IT].
  6. Fukumizu, Kenji; Gretton, Arthur; Sun, Xiaohui; Schölkopf, Bernhard (2007). "Kernel Measures of Conditional Dependence". Advances in Neural Information Processing Systems 20. https://papers.nips.cc/paper_files/paper/2007/hash/3a0772443a0739141292a5429b952fe6-Abstract.html. 
  7. Sejdinovic, Dino; Sriperumbudur, Bharath; Gretton, Arthur; Fukumizu, Kenji (2013). "Equivalence of distance-based and RKHS-based statistics in hypothesis testing". The Annals of Statistics 41 (5): 2263–2291. doi:10.1214/13-aos1140. 
  8. Mroueh, Youssef; Li, Chun-Liang; Sercu, Tom; Raj, Anant; Cheng, Yu (2018). "Sobolev GAN". International Conference on Learning Representations. https://openreview.net/forum?id=SJA7xfb0b. 
  9. Uppal, Ananya; Singh, Shashank; Póczos, Barnabás (2019). "Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses". Advances in Neural Information Processing Systems 32. https://papers.nips.cc/paper_files/paper/2019/hash/fb09f481d40c4d3c0861a46bd2dc52c0-Abstract.html. 
  10. Uppal, Ananya; Singh, Shashank; Póczos, Barnabás (2020). "Robust Density Estimation under Besov IPM Losses". Advances in Neural Information Processing Systems 33: 5345–5355. https://papers.nips.cc/paper_files/paper/2020/hash/39d4b545fb02556829aab1db805021c3-Abstract.html. 
  11. Kim, Ilmun; Ramdas, Aaditya; Singh, Aarti; Wasserman, Larry (February 2021). "Classification accuracy as a proxy for two-sample testing". The Annals of Statistics 49 (1). doi:10.1214/20-AOS1962. 
  12. Lopez-Paz, David; Oquab, Maxime (2017). "Revisiting Classifier Two-Sample Tests". International Conference on Learning Representations. https://openreview.net/forum?id=SJkXfE5xx. 
  13. Arora, Sanjeev; Ge, Rong; Liang, Yingyu; Ma, Tengyu; Zhang, Yi (2017). "Generalization and Equilibrium in Generative Adversarial Nets (GANs)". International Conference on Machine Learning. 
  14. Ji, Kaiyi; Liang, Yingbin (2018). "Minimax Estimation of Neural Net Distance". Advances in Neural Information Processing Systems. 
  15. Stanczuk, Jan; Etmann, Christian; Kreusser, Lisa Maria; Schönlieb, Carola-Bibiane (2021). "Wasserstein GANs Work Because They Fail (To Approximate the Wasserstein Distance)". arXiv:2103.01678 [stat.ML].
  16. Mallasto, Anton; Montúfar, Guido; Gerolin, Augusto (2019). "How Well do WGANs Estimate the Wasserstein Metric?". arXiv:1910.03875 [cs.LG].
  17. Sutherland, Danica J.. "Computing Jensen-Shannon Divergence between discrete and continuous distribution". Stack Exchange Network. https://stats.stackexchange.com/a/621718. Retrieved 18 July 2023. 
  18. Arjovsky, Martin; Bottou, Léon (2017). "Towards Principled Methods for Training Generative Adversarial Networks". International Conference on Learning Representations. 
  19. Sønderby, Casper Kaae; Caballero, Jose; Theis, Lucas; Shi, Wenzhe; Huszár, Ferenc (2017). "Amortised MAP Inference for Image Super-resolution". International Conference on Learning Representations: Appendix C. 
  20. Stein, Viktor; Neumayer, Sebastian; Rux, Nicolaj; Steidl, Gabriele (2025-05-03). "Wasserstein gradient flows for Moreau envelopes of f-divergences in reproducing kernel Hilbert spaces" (in en). Analysis and Applications: 1–45. doi:10.1142/S0219530525500162. ISSN 0219-5305. https://www.worldscientific.com/doi/10.1142/S0219530525500162. 
  21. Birrell, Jeremiah; Dupuis, Paul; Katsoulakis, Markos A.; Pantazis, Yannis; Rey-Bellet, Luc (2022-01-01). "(f, Γ)-Divergences: interpolating between f-divergences and integral probability metrics". J. Mach. Learn. Res. 23 (1): 39:1816–39:1885. ISSN 1532-4435. https://dl.acm.org/doi/abs/10.5555/3586589.3586628. 
  22. Tolstikhin, Ilya O.; Sriperumbudur, Bharath K.; Schölkopf, Bernhard (2016). "Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels". Advances in Neural Information Processing Systems 29. https://papers.nips.cc/paper_files/paper/2016/hash/5055cbf43fac3f7e2336b27310f0b9ef-Abstract.html. 
  23. Singh, Shashank; Póczos, Barnabás (2018). "Minimax Distribution Estimation in Wasserstein Distance". arXiv:1802.08855 [math.ST].
  24. Bińkowski, Mikołaj; Sutherland, Danica J.; Arbel, Michael; Gretton, Arthur (2018). "Demystifying MMD GANs". International Conference on Learning Representations. https://openreview.net/forum?id=r1lUOzWCW. 
  25. Liu, Feng; Xu, Wenkai; Lu, Jie; Zhang, Guangquan; Gretton, Arthur; Sutherland, Danica J. (2020). "Learning Deep Kernels for Non-Parametric Two-Sample Tests". International Conference on Machine Learning: 6316–6326. https://proceedings.mlr.press/v119/liu20m.html. 
  26. Kübler, Jonas M.; Jitkrittum, Wittawat; Schölkopf, Bernhard; Muandet, Krikamol (2021). "A Witness Two-Sample Test". International Conference on Artificial Intelligence and Statistics: 1403–1419. https://proceedings.mlr.press/v151/kubler22a.html.