Ball divergence

Ball Divergence (BD) is a nonparametric two-sample statistic that quantifies the discrepancy between two probability measures $\mu$ and $\nu$ on a metric space $(V,\rho)$.[1] It is defined by integrating the squared difference of the measures over all closed balls in $V$. Let $\bar{B}(u,r)=\{w\in V:\rho(u,w)\le r\}$ be the closed ball of radius $r\ge 0$ centered at $u\in V$. Equivalently, one may set $r=\rho(u,v)$ and write $\bar{B}(u,\rho(u,v))$. The Ball Divergence is then defined by

$$\mathrm{BD}(\mu,\nu)=\iint_{V\times V}\left[\mu\big(\bar{B}(u,\rho(u,v))\big)-\nu\big(\bar{B}(u,\rho(u,v))\big)\right]^{2}\big[\mu(du)\,\mu(dv)+\nu(du)\,\nu(dv)\big].$$

This measure can be seen as an integral of Harald Cramér's distance over all possible pairs of points. By summing squared differences of $\mu$ and $\nu$ over balls at all scales, BD captures both global and local discrepancies between distributions, yielding a robust, scale-sensitive comparison. Moreover, since BD is defined as the integral of a squared measure difference, it is always non-negative, and $\mathrm{BD}(\mu,\nu)=0$ if and only if $\mu=\nu$.
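
As an illustrative special case (using standard CDF notation $F_\mu$, not introduced in the reference): on the real line with $\rho(u,v)=|u-v|$, the closed ball $\bar{B}(u,|u-v|)$ is the interval $[\,u-|u-v|,\;u+|u-v|\,]$, so the ball measures reduce to increments of the cumulative distribution functions,

$$\mu\big(\bar{B}(u,|u-v|)\big)=F_{\mu}\big(u+|u-v|\big)-F_{\mu}\big((u-|u-v|)^{-}\big),$$

and BD compares $F_{\mu}$ and $F_{\nu}$ over all such intervals, which is the sense in which it aggregates Cramér-type CDF discrepancies over all pairs of points.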

Testing for equal distributions

Next, we give a sample version of Ball Divergence. For convenience, decompose the Ball Divergence into two parts:

$$A=\iint_{V\times V}\left[\mu-\nu\right]^{2}\big(\bar{B}(u,\rho(u,v))\big)\,\mu(du)\,\mu(dv)$$

and

$$C=\iint_{V\times V}\left[\mu-\nu\right]^{2}\big(\bar{B}(u,\rho(u,v))\big)\,\nu(du)\,\nu(dv),$$

where $[\mu-\nu]^{2}(\bar{B}(u,\rho(u,v)))$ abbreviates $\left[\mu(\bar{B}(u,\rho(u,v)))-\nu(\bar{B}(u,\rho(u,v)))\right]^{2}$. Thus $\mathrm{BD}(\mu,\nu)=A+C$.

Let $\delta(x,y,z)=I\big(z\in\bar{B}(x,\rho(x,y))\big)$ indicate whether the point $z$ lies in the ball $\bar{B}(x,\rho(x,y))$. Given two independent samples $\{X_1,\dots,X_n\}$ from $\mu$ and $\{Y_1,\dots,Y_m\}$ from $\nu$, define

$$A^{X}_{ij}=\frac{1}{n}\sum_{u=1}^{n}\delta(X_i,X_j,X_u),\qquad A^{Y}_{ij}=\frac{1}{m}\sum_{v=1}^{m}\delta(X_i,X_j,Y_v),$$
$$C^{X}_{kl}=\frac{1}{n}\sum_{u=1}^{n}\delta(Y_k,Y_l,X_u),\qquad C^{Y}_{kl}=\frac{1}{m}\sum_{v=1}^{m}\delta(Y_k,Y_l,Y_v),$$

where $A^{X}_{ij}$ is the proportion of samples from the probability measure $\mu$ located in the ball $\bar{B}(X_i,\rho(X_i,X_j))$ and $A^{Y}_{ij}$ is the proportion of samples from the probability measure $\nu$ located in the same ball. Likewise, $C^{X}_{kl}$ and $C^{Y}_{kl}$ are the proportions of samples from $\mu$ and $\nu$, respectively, located in the ball $\bar{B}(Y_k,\rho(Y_k,Y_l))$. The sample versions of $A$ and $C$ are as follows:

$$A_{n,m}=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\left(A^{X}_{ij}-A^{Y}_{ij}\right)^{2},\qquad C_{n,m}=\frac{1}{m^{2}}\sum_{k,l=1}^{m}\left(C^{X}_{kl}-C^{Y}_{kl}\right)^{2}.$$

Finally, the sample Ball Divergence is

$$\mathrm{BD}_{n,m}=A_{n,m}+C_{n,m}.$$

It can be proved that $\mathrm{BD}_{n,m}$ is a consistent estimator of $\mathrm{BD}(\mu,\nu)$. Moreover, if $\frac{n}{n+m}\to\tau$ for some $\tau\in(0,1)$, then under the null hypothesis $\mathrm{BD}_{n,m}$ converges in distribution to a mixture of chi-squared distributions, whereas under the alternative hypothesis it converges to a normal distribution.
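
A minimal NumPy sketch of the plug-in estimator $\mathrm{BD}_{n,m}$ follows directly from the definitions above. It assumes a Euclidean metric $\rho$; the function name `ball_divergence` and the brute-force broadcasting (which costs on the order of $(n+m)^{3}$ comparisons) are illustrative choices, not the optimized implementation accompanying the original paper.

```python
import numpy as np

def ball_divergence(X, Y):
    """Plug-in sample Ball Divergence BD_{n,m} for samples X (n x d), Y (m x d)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]          # treat a 1-D input as n scalar observations
    if Y.ndim == 1:
        Y = Y[:, None]
    # Pairwise distance matrices rho(X_i, X_u), rho(X_i, Y_v), rho(Y_k, Y_l).
    dXX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    dXY = np.linalg.norm(X[:, None] - Y[None, :], axis=-1)
    dYY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    # A^X_{ij} / A^Y_{ij}: fractions of X- / Y-points in the closed ball
    # centered at X_i with radius rho(X_i, X_j).
    AX = (dXX[:, None, :] <= dXX[:, :, None]).mean(axis=2)
    AY = (dXY[:, None, :] <= dXX[:, :, None]).mean(axis=2)
    # C^X_{kl} / C^Y_{kl}: the same for balls centered at Y_k with radius rho(Y_k, Y_l).
    CX = (dXY.T[:, None, :] <= dYY[:, :, None]).mean(axis=2)
    CY = (dYY[:, None, :] <= dYY[:, :, None]).mean(axis=2)
    # BD_{n,m} = A_{n,m} + C_{n,m}.
    return ((AX - AY) ** 2).mean() + ((CX - CY) ** 2).mean()
```

In practice, a p-value for the two-sample test is typically obtained by a permutation procedure: pool the two samples, repeatedly reshuffle the group labels, recompute the statistic, and report the proportion of permuted statistics at least as large as the observed $\mathrm{BD}_{n,m}$.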

Properties

  1. The square root of Ball Divergence is a symmetric divergence but not a metric, because it does not satisfy the triangle inequality.
  2. It can be shown that Ball Divergence, the energy distance,[2] and the maximum mean discrepancy (MMD)[3] are unified within the variogram framework; for details, see Remark 2.4 in [1].

Homogeneity test

Ball Divergence admits a straightforward extension to the $K$-sample setting. Suppose $\mu_1,\dots,\mu_K$ are $K\,(\ge 2)$ probability measures on a Banach space $(V,\|\cdot\|)$, with metric $\rho(u,v)=\|u-v\|$. Define the $K$-sample BD by

$$D(\mu_1,\dots,\mu_K)=\sum_{1\le k<l\le K}\iint_{V\times V}\left[\mu_k\big(\bar{B}(u,\rho(u,v))\big)-\mu_l\big(\bar{B}(u,\rho(u,v))\big)\right]^{2}\big[\mu_k(du)\,\mu_k(dv)+\mu_l(du)\,\mu_l(dv)\big].$$

Since each summand is a two-sample Ball Divergence, it follows that $D(\mu_1,\dots,\mu_K)=0$ if and only if $\mu_1=\mu_2=\cdots=\mu_K$.
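
The corresponding sample statistic can be assembled from the two-sample estimator. A sketch, reusing the illustrative `ball_divergence` function above:

```python
from itertools import combinations

def k_sample_ball_divergence(samples):
    """K-sample Ball Divergence: the sum of the two-sample statistic over all
    pairs of samples, mirroring the definition of D above."""
    return sum(ball_divergence(a, b) for a, b in combinations(samples, 2))
```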

By employing closed balls to define a metric distribution function, one obtains an alternative homogeneity measure.[4]

Given a probability measure $\tilde{\mu}$ on a metric space $(V,\rho)$, its metric distribution function (MDF) is defined by

$$F^{M}_{\tilde{\mu}}(u,v)=\tilde{\mu}\big(\bar{B}(u,\rho(u,v))\big)=\mathbb{E}\big[\delta(u,v,X)\big],\qquad u,v\in V,$$

where $X\sim\tilde{\mu}$ and $\bar{B}(u,r)=\{w\in V:\rho(u,w)\le r\}$ is the closed ball of radius $r\ge 0$ centered at $u$, so that $\delta(u,v,X)=I\big(X\in\bar{B}(u,\rho(u,v))\big)$. More generally, when $V=V_1\times\cdots\times V_q$ is a product of metric spaces $(V_k,\rho_k)$, one takes $\delta(u,v,X)=\prod_{k=1}^{q}\mathbf{1}\{X^{(k)}\in\bar{B}_k(u_k,\rho_k(u_k,v_k))\}$.


If $X_1,\dots,X_N$ are i.i.d. draws from $\tilde{\mu}$, the empirical version is

$$F^{M}_{\tilde{\mu},N}(u,v)=\frac{1}{N}\sum_{i=1}^{N}\delta(u,v,X_i).$$
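
Concretely, the empirical MDF at a pair $(u,v)$ is the fraction of sample points whose distance from $u$ does not exceed $\rho(u,v)$. A minimal sketch, again assuming a Euclidean metric (the function name is illustrative):

```python
import numpy as np

def empirical_mdf(sample, u, v):
    """Empirical metric distribution function F^M_{mu,N}(u, v): the fraction of
    sample points lying in the closed ball centered at u with radius rho(u, v)."""
    r = np.linalg.norm(u - v)                  # ball radius rho(u, v)
    d = np.linalg.norm(sample - u, axis=-1)    # distances rho(u, X_i)
    return float((d <= r).mean())
```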

Based on the MDF, one obtains a homogeneity measure, the metric Cramér–von Mises (MCVM) statistic:

$$\mathrm{MCVM}(\mu_k\,\|\,\mu)=\iint_{V\times V}w(u,v)\left[F^{M}_{\mu_k}(u,v)-F^{M}_{\mu}(u,v)\right]^{2}d\mu_k(u)\,d\mu_k(v),$$

where $\mu=\sum_{k=1}^{K}p_k\mu_k$ is the mixture of the $\mu_k$ with weights $p_1,\dots,p_K$, and $w(u,v)=\exp\!\big(-\tfrac{\rho(u,v)^{2}}{2\sigma^{2}}\big)$ is a Gaussian weight. The overall MCVM is then

$$\mathrm{MCVM}(\mu_1,\dots,\mu_K)=\sum_{k=1}^{K}p_k^{2}\,\mathrm{MCVM}(\mu_k\,\|\,\mu).$$

The empirical MCVM is given by

$$\widehat{\mathrm{MCVM}}(\mu_k\,\|\,\mu)=\frac{1}{n_k^{2}}\sum_{X_i^{(k)},\,X_j^{(k)}\in\mathcal{X}_k}w\big(X_i^{(k)},X_j^{(k)}\big)\left[F^{M}_{\mu_k,n_k}\big(X_i^{(k)},X_j^{(k)}\big)-F^{M}_{\mu,n}\big(X_i^{(k)},X_j^{(k)}\big)\right]^{2},$$

where $\mathcal{X}_k=\{X_1^{(k)},\dots,X_{n_k}^{(k)}\}$ is an i.i.d. sample from $\mu_k$, $F^{M}_{\mu,n}$ is the empirical MDF of the pooled sample of size $n=\sum_{k=1}^{K}n_k$, and $\hat{p}_k=n_k/n$. A practical choice for $\sigma^{2}$ is the median of the squared distances $\{\rho(X,X')^{2}:X,X'\in\bigcup_{k=1}^{K}\mathcal{X}_k\}$.
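
Putting the pieces together, the following is a minimal NumPy sketch of the empirical overall statistic $\sum_{k}\hat{p}_k^{2}\,\widehat{\mathrm{MCVM}}(\mu_k\,\|\,\mu)$, assuming a Euclidean metric and the median heuristic for $\sigma^{2}$; the function name `mcvm_statistic` is illustrative.

```python
import numpy as np

def mcvm_statistic(samples, sigma2=None):
    """Empirical overall MCVM for K samples (a list of arrays of shape (n_k, d))."""
    pooled = np.vstack(samples)
    n = len(pooled)
    # All pairwise distances within the pooled sample.
    D = np.linalg.norm(pooled[:, None] - pooled[None, :], axis=-1)
    if sigma2 is None:
        # Median-of-squared-distances heuristic for the Gaussian weight.
        sigma2 = np.median(D[np.triu_indices(n, k=1)] ** 2)
    stat, start = 0.0, 0
    for Xk in samples:
        nk = len(Xk)
        idx = np.arange(start, start + nk)   # rows of sample k within `pooled`
        start += nk
        dk = D[np.ix_(idx, idx)]             # rho(X_i^(k), X_j^(k))
        # Empirical MDFs evaluated at every pair (X_i^(k), X_j^(k)):
        # Fk uses only sample k; Fn uses the pooled sample of size n.
        Fk = (dk[:, None, :] <= dk[:, :, None]).mean(axis=2)
        Fn = (D[idx][:, None, :] <= dk[:, :, None]).mean(axis=2)
        w = np.exp(-dk ** 2 / (2.0 * sigma2))
        stat += (nk / n) ** 2 * (w * (Fk - Fn) ** 2).mean()
    return stat
```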

References

  1. Pan, Wenliang; Tian, Yuan; Wang, Xueqin; Zhang, Heping (2018). "Ball Divergence: Nonparametric two sample test". The Annals of Statistics 46 (3): 1109–1137. doi:10.1214/17-AOS1579. ISSN 0090-5364. PMID 30344356.
  2. Székely, Gábor J.; Rizzo, Maria L. (2013). "Energy statistics: A class of statistics based on distances". Journal of Statistical Planning and Inference 143 (8): 1249–1272. doi:10.1016/j.jspi.2013.03.018. ISSN 0378-3758.
  3. Gretton, Arthur; Borgwardt, Karsten M.; Rasch, Malte; Schölkopf, Bernhard; Smola, Alexander J. (2007). "A Kernel Method for the Two-Sample-Problem". Advances in Neural Information Processing Systems 19. The MIT Press. pp. 513–520. doi:10.7551/mitpress/7503.003.0069. ISBN 978-0-262-25691-9.
  4. Wang, X.; Zhu, J.; Pan, W.; Zhu, J.; Zhang, H. (2023). "Nonparametric Statistical Inference via Metric Distribution Function in Metric Spaces". Journal of the American Statistical Association 119 (548): 2772–2784. doi:10.1080/01621459.2023.2277417.