Sub-Gaussian distribution



In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. This property gives subgaussian distributions their name.

Often in analysis, we divide an object (such as a random variable) into two parts, a central bulk and a distant tail, then analyze each separately. In probability, this division usually goes like "everything interesting happens near the center; the tail event is so rare that we may safely ignore it." Subgaussian distributions are worth studying because the gaussian distribution is well understood, so we can give sharp bounds on the rarity of the tail event. Similarly, the subexponential distributions are also worth studying.

Formally, the probability distribution of a random variable $X$ is called subgaussian if there is a positive constant $C$ such that for every $t \ge 0$,

$$\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/C^2).$$
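As a sanity check, the standard normal itself satisfies this defining bound with $C = \sqrt{2}$, since $\operatorname{P}(|Z| \ge t) = \operatorname{erfc}(t/\sqrt{2}) \le 2e^{-t^2/2}$ by the Chernoff bound. A minimal numerical check (standard-library Python):

```python
import math

# The standard normal satisfies the defining tail bound with C = sqrt(2):
# P(|Z| >= t) = erfc(t / sqrt(2)) <= 2 exp(-t^2 / 2).
for t in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
    tail = math.erfc(t / math.sqrt(2))
    assert tail <= 2 * math.exp(-t * t / 2)
print("tail bound holds at the sampled points")
```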

There are many equivalent definitions. For example, a random variable $X$ is sub-Gaussian iff its tail is dominated (up to a constant) by the tail of a Gaussian:

$$\operatorname{P}(|X| \ge t) \le c\,\operatorname{P}(|Z| \ge t) \quad \text{for all } t > 0,$$

where $c > 0$ is a constant and $Z$ is a mean-zero Gaussian random variable.[1]: Theorem 2.6

Definitions

Subgaussian norm

The subgaussian norm of $X$, denoted $\|X\|_{\psi_2}$, is
$$\|X\|_{\psi_2} = \inf\{c > 0 : \operatorname{E}[\exp(X^2/c^2)] \le 2\}.$$
In other words, it is the Orlicz norm of $X$ generated by the Orlicz function $\Phi(u) = e^{u^2} - 1$. By condition (2) below, subgaussian random variables can be characterized as exactly those random variables with finite subgaussian norm.
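The infimum in this definition can be computed numerically, since $\operatorname{E}[\exp(X^2/c^2)]$ is decreasing in $c$. A minimal sketch (pure Python): for $Z \sim \mathcal{N}(0,1)$ we have the closed form $\operatorname{E}[\exp(\lambda Z^2)] = (1 - 2\lambda)^{-1/2}$ for $\lambda < 1/2$, and bisection recovers $\|Z\|_{\psi_2} = \sqrt{8/3}$:

```python
import math

def psi2_norm(mgf_of_square, lo=1e-6, hi=1e6, iters=200):
    """Subgaussian norm: smallest c with E[exp(X^2/c^2)] <= 2.

    mgf_of_square(lam) should return E[exp(lam * X^2)] (may be inf).
    The expectation decreases in c, so we can bisect on c.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mgf_of_square(1.0 / mid**2) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

# For Z ~ N(0,1): E[exp(lam Z^2)] = (1 - 2 lam)^(-1/2) for lam < 1/2.
def gaussian_mgf_sq(lam):
    return (1 - 2 * lam) ** -0.5 if lam < 0.5 else math.inf

c = psi2_norm(gaussian_mgf_sq)
print(c, math.sqrt(8 / 3))  # both ≈ 1.63299
```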

Variance proxy

If there exists some $s^2$ such that $\operatorname{E}[e^{(X - \operatorname{E}[X])t}] \le e^{\frac{s^2 t^2}{2}}$ for all real $t$, then $s^2$ is called a variance proxy, and the smallest such $s^2$ is called the optimal variance proxy and denoted $\|X\|_{vp}^2$.

Since $\operatorname{E}[e^{(X - \operatorname{E}[X])t}] = e^{\frac{\sigma^2 t^2}{2}}$ when $X \sim \mathcal{N}(\mu, \sigma^2)$ is Gaussian, we then have $\|X\|_{vp}^2 = \sigma^2$, as it should.
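For a non-Gaussian example: a Rademacher variable (uniform on $\{-1, +1\}$) has MGF $\cosh t$, and $\cosh t \le e^{t^2/2}$ shows its variance proxy is at most $1$; since a proxy can never fall below the variance (which is $1$ here), it is exactly $1$. A quick numerical check:

```python
import math

# For X uniform on {-1, +1} (Rademacher), E[e^{tX}] = cosh(t).
# cosh(t) <= e^{t^2/2} for all t (compare Taylor series term by term),
# so ||X||_vp^2 <= 1; it equals 1 because a variance proxy can never
# be smaller than the variance.
ok = all(math.cosh(t) <= math.exp(t * t / 2)
         for t in [k / 10 for k in range(-100, 101)])
print(ok)  # True
```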

Equivalent definitions

Let $X$ be a random variable with zero mean, and let $K_1, K_2, K_3, \ldots$ be positive constants. The following conditions are equivalent: (Proposition 2.5.2 [2])

  1. Tail probability bound: $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K_1^2)$ for all $t \ge 0$;
  2. Finite subgaussian norm: $\|X\|_{\psi_2} = K_2 < \infty$;
  3. Moment: $\operatorname{E}|X|^p \le 2K_3^p\,\Gamma\left(\frac{p}{2} + 1\right)$ for all $p \ge 1$, where $\Gamma$ is the Gamma function;
  4. Moment: $\operatorname{E}|X|^p \le K_4^p\, p^{p/2}$ for all $p \ge 1$;
  5. Moment-generating function (of $X$), or variance proxy[3][4]: $\operatorname{E}[e^{(X - \operatorname{E}[X])t}] \le e^{\frac{K_5^2 t^2}{2}}$ for all $t$;
  6. Moment-generating function (of $X^2$): $\operatorname{E}[e^{X^2 t^2}] \le e^{K_6^2 t^2}$ for all $t \in [-1/K_6, +1/K_6]$;
  7. Union bound: for some $c > 0$, $\operatorname{E}[\max\{|X_1 - \operatorname{E}[X]|, \ldots, |X_n - \operatorname{E}[X]|\}] \le c\sqrt{\log n}$ for all $n > c$, where $X_1, \ldots, X_n$ are i.i.d. copies of $X$;
  8. Subexponential: $X^2$ has a subexponential distribution.

Furthermore, the constant $K_i$ is the same in definitions (1) to (5), up to an absolute constant. So, for example, given a random variable satisfying (1) and (2), the minimal constants $K_1, K_2$ in the two definitions satisfy $K_1 \le c K_2$ and $K_2 \le c' K_1$, where $c, c'$ are absolute constants independent of the random variable.

Proof of equivalence

As an example, the first four definitions are equivalent by the proof below.

Proof. (1)$\Rightarrow$(3) By the layer cake representation,
$$\operatorname{E}|X|^p = \int_0^\infty \operatorname{P}(|X|^p \ge t)\,dt = \int_0^\infty p t^{p-1}\operatorname{P}(|X| \ge t)\,dt \le 2\int_0^\infty p t^{p-1}\exp\left(-\frac{t^2}{K_1^2}\right)dt.$$

After a change of variables $u = t^2/K_1^2$, we find that
$$\operatorname{E}|X|^p \le 2K_1^p\,\frac{p}{2}\int_0^\infty u^{\frac{p}{2}-1}e^{-u}\,du = 2K_1^p\,\frac{p}{2}\,\Gamma\left(\frac{p}{2}\right) = 2K_1^p\,\Gamma\left(\frac{p}{2}+1\right).$$

(3)$\Rightarrow$(2) By the Taylor series $e^x = 1 + \sum_{p=1}^\infty \frac{x^p}{p!}$,
$$\operatorname{E}[\exp(\lambda X^2)] = 1 + \sum_{p=1}^\infty \frac{\lambda^p \operatorname{E}[X^{2p}]}{p!} \le 1 + \sum_{p=1}^\infty \frac{2\lambda^p K_3^{2p}\,\Gamma(p+1)}{p!} = 1 + 2\sum_{p=1}^\infty (\lambda K_3^2)^p = \frac{2}{1 - \lambda K_3^2} - 1 \quad \text{for } \lambda K_3^2 < 1,$$
which is less than or equal to $2$ for $\lambda \le \frac{1}{3K_3^2}$. Let $K_2 \ge 3^{1/2} K_3$; then $\operatorname{E}[\exp(X^2/K_2^2)] \le 2$.


(2)$\Rightarrow$(1) By Markov's inequality,
$$\operatorname{P}(|X| \ge t) = \operatorname{P}\left(\exp\left(\frac{X^2}{K_2^2}\right) \ge \exp\left(\frac{t^2}{K_2^2}\right)\right) \le \frac{\operatorname{E}[\exp(X^2/K_2^2)]}{\exp(t^2/K_2^2)} \le 2\exp\left(-\frac{t^2}{K_2^2}\right).$$
(3)$\Leftrightarrow$(4) by the asymptotic formula for the Gamma function: $\Gamma(p/2+1) \sim \sqrt{\pi p}\left(\frac{p}{2e}\right)^{p/2}$.
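The implication (1)$\Rightarrow$(3) can be sanity-checked numerically for the standard normal, whose tail bound holds with $K_1 = \sqrt{2}$ and whose absolute moments have the closed form $\operatorname{E}|Z|^p = 2^{p/2}\,\Gamma\!\left(\frac{p+1}{2}\right)/\sqrt{\pi}$:

```python
import math

# Sanity check of (1) => (3) for Z ~ N(0,1): the tail bound
# P(|Z| >= t) <= 2 exp(-t^2/2) holds with K1 = sqrt(2), so the proof
# gives E|Z|^p <= 2 * K1^p * Gamma(p/2 + 1).
# Closed form for comparison: E|Z|^p = 2^(p/2) Gamma((p+1)/2) / sqrt(pi).
K1 = math.sqrt(2)
for p in range(1, 31):
    moment = 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)
    bound = 2 * K1 ** p * math.gamma(p / 2 + 1)
    assert moment <= bound, p
print("moment bound verified for p = 1..30")
```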

From the proof, we can extract a cycle of three inequalities:

  • If $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K^2)$, then $\operatorname{E}|X|^p \le 2K^p\,\Gamma\left(\frac{p}{2}+1\right)$ for all $p \ge 1$.
  • If $\operatorname{E}|X|^p \le 2K^p\,\Gamma\left(\frac{p}{2}+1\right)$ for all $p \ge 1$, then $\|X\|_{\psi_2} \le 3^{1/2} K$.
  • If $\|X\|_{\psi_2} \le K$, then $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K^2)$.

In particular, the constants $K$ provided by the definitions are the same up to a constant factor, so we can say that the definitions are equivalent up to a constant independent of $X$.

Similarly, because up to a positive multiplicative constant, $\Gamma(p/2+1) = p^{p/2} \times \left((2e)^{-1/2}\,p^{1/(2p)}\right)^p$ for all $p \ge 1$, the definitions (3) and (4) are also equivalent up to a constant.

Basic properties

Basic properties —

  • If $X$ is subgaussian and $k > 0$, then $\|kX\|_{\psi_2} = k\|X\|_{\psi_2}$ and $\|kX\|_{vp} = k\|X\|_{vp}$.
  • (Triangle inequality) If $X, Y$ are subgaussian, then $\|X+Y\|_{vp}^2 \le (\|X\|_{vp} + \|Y\|_{vp})^2$.
  • (Chernoff bound) If $X$ is subgaussian, then $\operatorname{Pr}(X - \operatorname{E}[X] \ge t) \le e^{-\frac{t^2}{2\|X\|_{vp}^2}}$ for all $t \ge 0$.

$X \lesssim X'$ means that $X \le CX'$, where the positive constant $C$ is independent of $X$ and $X'$.

Subgaussian deviation bound — If $X$ is subgaussian, then $\|X - \operatorname{E}[X]\|_{\psi_2} \lesssim \|X\|_{\psi_2}$.

Independent subgaussian sum bound — If $X, Y$ are subgaussian and independent, then $\|X+Y\|_{vp}^2 \le \|X\|_{vp}^2 + \|Y\|_{vp}^2$.

Corollary — Linear sums of subgaussian random variables are subgaussian.

Partial converse (Matoušek 2008, Lemma 2.4) — If $\operatorname{E}[X] = 0$, $\operatorname{E}[X^2] = 1$, and $\ln \operatorname{Pr}(X \ge t) \le -\frac{1}{2}at^2$ for all $t > 0$, then $\ln \operatorname{E}[e^{tX}] \le C_a t^2$ for all $t > 0$, where $C_a > 0$ depends on $a$ only.


Corollary (Matoušek 2008, Lemma 2.2) — Let $X_1, \ldots, X_n$ be independent random variables with the same upper subgaussian tail, $\ln \operatorname{Pr}(X_i \ge t) \le -\frac{1}{2}at^2$ for all $t > 0$, and with $\operatorname{E}[X_i] = 0$, $\operatorname{E}[X_i^2] = 1$. Then for any unit vector $v \in \mathbb{R}^n$, the linear sum $\sum_i v_i X_i$ has a subgaussian tail:
$$\ln \operatorname{Pr}\left(\sum_i v_i X_i \ge t\right) \le -C_a t^2,$$
where $C_a > 0$ depends only on $a$.
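This can be illustrated by Monte Carlo, under the assumption that the $X_i$ are Rademacher signs (for which Hoeffding's inequality gives the explicit constant $\operatorname{Pr}(\sum_i v_i X_i \ge t) \le e^{-t^2/2}$ for a unit vector $v$):

```python
import math
import random

# Illustration: weighted sums of Rademacher signs keep a subgaussian
# tail.  For a unit vector v, Hoeffding's inequality gives
# Pr(sum_i v_i X_i >= t) <= exp(-t^2 / 2).
random.seed(0)
n, trials, t = 50, 20000, 2.0
v = [1.0 / math.sqrt(n)] * n          # a unit vector
hits = 0
for _ in range(trials):
    s = sum(vi * random.choice((-1.0, 1.0)) for vi in v)
    hits += (s >= t)
empirical = hits / trials
print(empirical, math.exp(-t * t / 2))  # empirical tail vs bound 0.1353...
```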


Concentration

Gaussian concentration inequality for Lipschitz functions (Tao 2012, Theorem 2.1.12.) — If $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz, and $X \sim N(0, I)$ is a standard gaussian vector, then $f(X)$ concentrates around its expectation at the rate
$$\operatorname{Pr}(f(X) - \operatorname{E}[f(X)] \ge t) \le e^{-\frac{2t^2}{\pi^2 L^2}},$$
and similarly for the other tail.



Strictly subgaussian

Expanding the cumulant generating function,
$$\frac{1}{2}s^2 t^2 \ge \ln \operatorname{E}[e^{tX}] = \frac{1}{2}\operatorname{Var}[X]\,t^2 + \frac{\kappa_3}{6}t^3 + \cdots,$$
we find that $\operatorname{Var}[X] \le \|X\|_{vp}^2$. At the edge of possibility, a random variable $X$ satisfying $\operatorname{Var}[X] = \|X\|_{vp}^2$ is called strictly subgaussian.

Properties

Theorem.[5] Let X be a subgaussian random variable with mean zero. If all zeros of its characteristic function are real, then X is strictly subgaussian.

Corollary. If X1,,Xn are independent and strictly subgaussian, then any linear sum of them is strictly subgaussian.

Examples

By calculating the characteristic functions, we can show that some distributions are strictly subgaussian: symmetric uniform distribution, symmetric Bernoulli distribution.

Since a symmetric uniform distribution is strictly subgaussian, its convolution with itself is strictly subgaussian. That is, the symmetric triangular distribution is strictly subgaussian.

Since the symmetric Bernoulli distribution is strictly subgaussian, any symmetric Binomial distribution is strictly subgaussian.

Examples

distribution | $\|X\|_{\psi_2}$ | $\|X\|_{vp}^2$ | strictly subgaussian?
gaussian distribution $\mathcal{N}(0,1)$ | $\sqrt{8/3}$ | $1$ | Yes
mean-zero Bernoulli distribution $p\delta_q + q\delta_{-p}$ | solution $t$ of $p e^{(q/t)^2} + q e^{(p/t)^2} = 2$ | $\frac{p-q}{2(\log p - \log q)}$ | Iff $p = 0, 1/2, 1$
symmetric Bernoulli distribution $\frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$ | $\frac{1}{\sqrt{\ln 2}}$ | $1$ | Yes
symmetric uniform distribution $U(-1,1)$ | solution $t$ of $\int_0^1 e^{x^2/t^2}\,dx = 2$, approximately $0.7727$ | $1/3$ | Yes
arbitrary distribution on interval $[a,b]$ | | $\le \left(\frac{b-a}{2}\right)^2$ |

The optimal variance proxy $\|X\|_{vp}^2$ is known for many standard probability distributions, including the beta, Bernoulli, Dirichlet[6], Kumaraswamy, triangular[7], truncated Gaussian, and truncated exponential.[8]

Bernoulli distribution

Let $p + q = 1$ with $p, q > 0$. Let $X$ follow the centered Bernoulli distribution $p\delta_q + q\delta_{-p}$, so that it has mean zero. Then $\|X\|_{vp}^2 = \frac{p-q}{2(\log p - \log q)}$.[5] Its subgaussian norm is $t$, where $t$ is the unique positive solution to $p e^{(q/t)^2} + q e^{(p/t)^2} = 2$.
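Both quantities are easy to evaluate numerically; the sketch below takes $p = 0.3$ as an illustrative choice and solves the subgaussian-norm equation by bisection (the left-hand side is decreasing in $t$):

```python
import math

# Optimal variance proxy and subgaussian norm of the centered
# Bernoulli distribution p*delta_q + q*delta_{-p}, for p = 0.3.
p = 0.3
q = 1 - p
vp2 = (p - q) / (2 * (math.log(p) - math.log(q)))

# Subgaussian norm: bisect for the t solving
# p e^{(q/t)^2} + q e^{(p/t)^2} = 2 (left side decreasing in t).
lo, hi = 0.1, 10.0
for _ in range(200):
    t = 0.5 * (lo + hi)
    if p * math.exp((q / t) ** 2) + q * math.exp((p / t) ** 2) <= 2:
        hi = t
    else:
        lo = t

# A variance proxy always dominates the variance: pq <= vp2 <= 1/4.
print(vp2, hi)
```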

Let $X$ be a random variable with symmetric Bernoulli distribution (or Rademacher distribution). That is, $X$ takes values $-1$ and $1$ with probabilities $1/2$ each. Since $X^2 = 1$, it follows that
$$\|X\|_{\psi_2} = \inf\{c > 0 : \operatorname{E}[\exp(X^2/c^2)] \le 2\} = \inf\{c > 0 : \exp(1/c^2) \le 2\} = \frac{1}{\sqrt{\ln 2}},$$
and hence $X$ is a subgaussian random variable.

Bounded distributions

[Figure: some commonly used bounded distributions.]

Bounded distributions have no tail at all, so clearly they are subgaussian.

If $X$ is bounded within the interval $[a, b]$, Hoeffding's lemma states that $\|X\|_{vp}^2 \le \left(\frac{b-a}{2}\right)^2$. Hoeffding's inequality is the Chernoff bound obtained using this fact.
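Assembling the pieces: for i.i.d. $X_i \sim U(0,1)$, each centered term has variance proxy at most $1/4$ by Hoeffding's lemma, the proxies add over independent sums, and the Chernoff bound then gives $\operatorname{Pr}\left(\sum_i (X_i - \tfrac12) \ge t\right) \le e^{-2t^2/n}$. A seeded Monte Carlo check:

```python
import math
import random

# Hoeffding's inequality assembled from the facts above: if
# X_i ~ U(0,1) are i.i.d., each centered X_i - 1/2 has variance proxy
# <= ((1-0)/2)^2 = 1/4, so S = sum_i (X_i - 1/2) has proxy <= n/4 and
# Pr(S >= t) <= exp(-t^2 / (2 * n/4)) = exp(-2 t^2 / n).
random.seed(1)
n, trials, t = 20, 20000, 3.0
bound = math.exp(-2 * t * t / n)      # = exp(-0.9) ≈ 0.4066
hits = 0
for _ in range(trials):
    s = sum(random.random() - 0.5 for _ in range(n))
    hits += (s >= t)
print(hits / trials, bound)
```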

Convolutions

[Figure: density of a mixture of three normal distributions (μ = 5, 10, 15, σ = 2) with equal weights; each component is shown as a weighted density integrating to 1/3.]

Since the sum of subgaussian random variables is still subgaussian, the convolution of subgaussian distributions is still subgaussian. In particular, any convolution of the normal distribution with any bounded distribution is subgaussian.

Mixtures

Given subgaussian distributions $X_1, X_2, \ldots, X_n$, we can construct an additive mixture $X$ as follows: first pick an index $i \in \{1, 2, \ldots, n\}$ with probability $p_i$, then sample from $X_i$.

Since $\operatorname{E}\left[\exp\left(\frac{X^2}{c^2}\right)\right] = \sum_i p_i \operatorname{E}\left[\exp\left(\frac{X_i^2}{c^2}\right)\right]$, we have $\|X\|_{\psi_2} \le \max_i \|X_i\|_{\psi_2}$, and so the mixture is subgaussian.

In particular, any gaussian mixture is subgaussian.

More generally, a mixture of infinitely many subgaussian distributions is also subgaussian, provided the subgaussian norms have a finite supremum: $\|X\|_{\psi_2} \le \sup_i \|X_i\|_{\psi_2} < \infty$.
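The bound $\|X\|_{\psi_2} \le \max_i \|X_i\|_{\psi_2}$ can be checked on a concrete two-component mixture of scaled Rademacher variables (an illustrative choice):

```python
import math

# Two-component mixture: X1 in {-1, +1} (psi2 norm 1/sqrt(ln 2)) and
# X2 in {-2, +2} (psi2 norm 2/sqrt(ln 2)), each picked with
# probability 1/2.  The mixture satisfies
# E[exp(X^2/c^2)] = 0.5 e^{1/c^2} + 0.5 e^{4/c^2};
# bisect for the smallest c making this <= 2, and compare it with the
# larger component norm.
def mixture_mgf(c):
    return 0.5 * math.exp(1 / c**2) + 0.5 * math.exp(4 / c**2)

lo, hi = 0.5, 10.0
for _ in range(200):
    c = 0.5 * (lo + hi)
    if mixture_mgf(c) <= 2:
        hi = c
    else:
        lo = c

max_component = 2 / math.sqrt(math.log(2))   # ≈ 2.402
print(hi, max_component)
```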

Subgaussian random vectors

So far, we have discussed subgaussianity for real-valued random variables. The definition extends to random vectors by the same guiding idea: a subgaussian random vector is one whose tail decays fast in every direction.

Let $X$ be a random vector taking values in $\mathbb{R}^n$.

Define.

  • $\|X\|_{\psi_2} := \sup_{v \in S^{n-1}} \|v^\top X\|_{\psi_2}$, where $S^{n-1}$ is the unit sphere in $\mathbb{R}^n$; similarly for the variance proxy, $\|X\|_{vp} := \sup_{v \in S^{n-1}} \|v^\top X\|_{vp}$.
  • $X$ is subgaussian iff $\|X\|_{\psi_2} < \infty$.

Theorem. (Theorem 3.4.6 [2]) For any positive integer $n$, the uniformly distributed random vector $X \sim U(\sqrt{n}\,S^{n-1})$ is subgaussian, with $\|X\|_{\psi_2} \lesssim 1$.

This is not so surprising, because as $n \to \infty$, the projection of $U(\sqrt{n}\,S^{n-1})$ onto the first coordinate converges in distribution to the standard normal distribution.

Maximum inequalities

Theorem — If $X_1, \ldots, X_n$ are mean-zero subgaussians, with $\|X_i\|_{vp}^2 \le \sigma^2$, then for any $\delta > 0$, we have $\max(X_1, \ldots, X_n) \le \sigma\sqrt{2\ln\frac{n}{\delta}}$ with probability $\ge 1 - \delta$.


Theorem (Exercise 2.5.10 [2]) — If $X_1, X_2, \ldots$ are subgaussians, with $\|X_n\|_{\psi_2} \le K$, then
$$\operatorname{E}\left[\sup_n \frac{|X_n|}{\sqrt{1 + \ln n}}\right] \lesssim K, \qquad \operatorname{E}\left[\max_{1 \le n \le N} |X_n|\right] \lesssim K\sqrt{\ln N}.$$
Further, the bound is sharp: when $X_1, X_2, \ldots$ are IID samples of $\mathcal{N}(0,1)$, we have $\operatorname{E}\left[\max_{1 \le n \le N} |X_n|\right] \gtrsim \sqrt{\ln N}$.[9]


Theorem (over a finite set[10]) — If $X_1, \ldots, X_n$ are subgaussian, with $\|X_i\|_{vp}^2 \le \sigma^2$, then
$$\operatorname{E}\left[\max_i (X_i - \operatorname{E}[X_i])\right] \le \sigma\sqrt{2\ln n}, \qquad \operatorname{Pr}\left(\max_i (X_i - \operatorname{E}[X_i]) > t\right) \le n e^{-\frac{t^2}{2\sigma^2}},$$
$$\operatorname{E}\left[\max_i |X_i - \operatorname{E}[X_i]|\right] \le \sigma\sqrt{2\ln(2n)}, \qquad \operatorname{Pr}\left(\max_i |X_i - \operatorname{E}[X_i]| > t\right) \le 2n e^{-\frac{t^2}{2\sigma^2}}.$$

Corollary (over a convex polytope) — Fix a finite set of vectors $v_1, \ldots, v_n$. If $X$ is a random vector such that each $\|v_i^\top X\|_{vp}^2 \le \sigma^2$, then the above 4 inequalities hold with $\max_{v \in \operatorname{conv}(v_1, \ldots, v_n)} (v^\top X - \operatorname{E}[v^\top X])$ replacing $\max_i (X_i - \operatorname{E}[X_i])$. Here, $\operatorname{conv}(v_1, \ldots, v_n)$ is the convex hull of the vectors $v_1, \ldots, v_n$.

Theorem (subgaussian random vectors) — If $X$ is a random vector in $\mathbb{R}^d$ such that $\|v^\top X\|_{vp}^2 \le \sigma^2$ for all $v$ on the unit sphere $S$, then
$$\operatorname{E}\left[\max_{v \in S} v^\top X\right] = \operatorname{E}\left[\max_{v \in S} |v^\top X|\right] \le 4\sigma\sqrt{d}.$$
For any $\delta > 0$, with probability at least $1 - \delta$,
$$\max_{v \in S} v^\top X = \max_{v \in S} |v^\top X| \le 4\sigma\sqrt{d} + 2\sigma\sqrt{2\log(1/\delta)}.$$
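The finite-set bounds above can be checked by a seeded Monte Carlo with standard normal variables ($\sigma = 1$):

```python
import math
import random

# Check of the finite-set maximum bounds for n i.i.d. N(0,1)
# variables (sigma = 1): E[max_i X_i] <= sqrt(2 ln n) and
# Pr(max_i X_i > t) <= n exp(-t^2/2).
random.seed(2)
n, trials, t = 100, 2000, 3.5
maxima = [max(random.gauss(0, 1) for _ in range(n)) for _ in range(trials)]
emp_mean = sum(maxima) / trials
emp_tail = sum(m > t for m in maxima) / trials
print(emp_mean, math.sqrt(2 * math.log(n)))   # empirical mean vs bound ≈ 3.03
print(emp_tail, n * math.exp(-t * t / 2))     # empirical tail vs bound
```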

Inequalities

Theorem. (Theorem 2.6.1 [2]) There exists a positive constant $C$ such that, given any number of independent mean-zero subgaussian random variables $X_1, \ldots, X_n$,
$$\left\|\sum_{i=1}^n X_i\right\|_{\psi_2}^2 \le C \sum_{i=1}^n \|X_i\|_{\psi_2}^2.$$

Theorem. (Hoeffding's inequality) (Theorem 2.6.3 [2]) There exists a positive constant $c$ such that, given any number of independent mean-zero subgaussian random variables $X_1, \ldots, X_N$,
$$\operatorname{P}\left(\left|\sum_{i=1}^N X_i\right| \ge t\right) \le 2\exp\left(-\frac{ct^2}{\sum_{i=1}^N \|X_i\|_{\psi_2}^2}\right) \quad \text{for all } t > 0.$$

Theorem. (Bernstein's inequality) (Theorem 2.8.1 [2]) There exists a positive constant $c$ such that, given any number of independent mean-zero subexponential random variables $X_1, \ldots, X_N$,
$$\operatorname{P}\left(\left|\sum_{i=1}^N X_i\right| \ge t\right) \le 2\exp\left(-c\min\left(\frac{t^2}{\sum_{i=1}^N \|X_i\|_{\psi_1}^2}, \frac{t}{\max_i \|X_i\|_{\psi_1}}\right)\right).$$

Theorem. (Khinchine inequality) (Exercise 2.6.5 [2]) There exists a positive constant $C$ such that, given any number of independent mean-zero, variance-one subgaussian random variables $X_1, \ldots, X_N$ with $\|X_i\|_{\psi_2} \le K$, any $p \ge 2$, and any coefficients $a_1, \ldots, a_N$,
$$\left(\sum_{i=1}^N a_i^2\right)^{1/2} \le \left\|\sum_{i=1}^N a_i X_i\right\|_{L^p} \le CK\sqrt{p}\left(\sum_{i=1}^N a_i^2\right)^{1/2}.$$

Hanson-Wright inequality

The Hanson-Wright inequality states that if a random vector $X$ is subgaussian in a certain sense, then any quadratic form $X^\top A X$ of this vector is also subgaussian/subexponential. Further, the upper bound on the tail of $X^\top A X$ is uniform.

A weak version of the following theorem was proved in (Hanson, Wright, 1971).[11] There are many extensions and variants. Much like the central limit theorem, the Hanson-Wright inequality is more a cluster of theorems with the same purpose, than a single theorem. The purpose is to take a subgaussian vector and uniformly bound its quadratic forms.

Theorem.[12][13] There exists a constant c, such that:

Let $n$ be a positive integer. Let $X_1, \ldots, X_n$ be independent random variables such that each satisfies $\operatorname{E}[X_i] = 0$. Combine them into a random vector $X = (X_1, \ldots, X_n)$. For any $n \times n$ matrix $A$, we have
$$\operatorname{P}\left(|X^\top A X - \operatorname{E}[X^\top A X]| > t\right) \le \max\left(2e^{-\frac{ct^2}{K^4\|A\|_F^2}},\ 2e^{-\frac{ct}{K^2\|A\|}}\right) = 2\exp\left[-c\min\left(\frac{t^2}{K^4\|A\|_F^2}, \frac{t}{K^2\|A\|}\right)\right],$$
where $K = \max_i \|X_i\|_{\psi_2}$, $\|A\|_F = \sqrt{\sum_{ij} A_{ij}^2}$ is the Frobenius norm of the matrix, and $\|A\| = \max_{\|x\|_2 = 1} \|Ax\|_2$ is the operator norm of the matrix.

In words, the quadratic form $X^\top A X$ has its tail uniformly bounded by an exponential or a gaussian, whichever is larger.


In the statement of the theorem, the constant $c$ is an "absolute constant", meaning that it has no dependence on $n, X_1, \ldots, X_n, A$. It is a mathematical constant, much like $\pi$ and $e$.
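The matrix quantities in the bound are straightforward to compute. The sketch below (pure Python, with a small example matrix of our own choosing) evaluates the Frobenius norm, the operator norm via power iteration on $A^\top A$, and the two branches of the exponent, taking $K = 1$ for illustration:

```python
import math

# Matrix norms appearing in the Hanson-Wright bound, for a small
# example matrix.  Operator norm via power iteration on A^T A.
A = [[3.0, 1.0],
     [0.0, 2.0]]

frob = math.sqrt(sum(a * a for row in A for a in row))  # ||A||_F = sqrt(14)

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def operator_norm(M, iters=1000):
    Mt = [list(col) for col in zip(*M)]      # transpose
    x = [1.0] * len(M[0])
    for _ in range(iters):
        y = matvec(Mt, matvec(M, x))         # (M^T M) x
        s = math.sqrt(sum(v * v for v in y))
        x = [v / s for v in y]
    # Rayleigh quotient of M^T M at the top eigenvector is ||M||^2.
    lam = sum(a * b for a, b in zip(x, matvec(Mt, matvec(M, x))))
    return math.sqrt(lam)

op = operator_norm(A)

# Exponent in the tail bound (taking K = 1): the t^2 (gaussian) branch
# is active for small t, the t (exponential) branch for large t.
def exponent(t, K=1.0):
    return min(t * t / (K**4 * frob**2), t / (K**2 * op))

print(frob, op, exponent(1.0), exponent(100.0))
```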

Consequences

Theorem (subgaussian concentration).[12] There exists a constant c, such that:

Let $n, m$ be positive integers. Let $X_1, \ldots, X_n$ be independent random variables such that each satisfies $\operatorname{E}[X_i] = 0$ and $\operatorname{E}[X_i^2] = 1$. Combine them into a random vector $X = (X_1, \ldots, X_n)$. For any $m \times n$ matrix $A$, we have
$$\operatorname{P}\left(\left|\|AX\|_2 - \|A\|_F\right| > t\right) \le 2e^{-\frac{ct^2}{K^4\|A\|^2}}.$$
In words, the random vector $AX$ is concentrated on a spherical shell of radius $\|A\|_F$, in the sense that $\|AX\|_2 - \|A\|_F$ is subgaussian, with subgaussian norm at most $\sqrt{3/c}\,\|A\|K^2$.
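A seeded Monte Carlo illustration of this shell concentration, using a fixed matrix and Rademacher coordinates (which are mean zero, variance one):

```python
import math
import random

# Illustration of subgaussian concentration: for a fixed matrix A and
# a Rademacher vector X, ||AX||_2 stays close to the Frobenius norm
# ||A||_F, with deviations on the (much smaller) scale of ||A||.
random.seed(3)
m, n, trials = 30, 300, 200
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]  # fixed once
frob = math.sqrt(sum(a * a for row in A for a in row))

def norm_AX():
    x = [random.choice((-1.0, 1.0)) for _ in range(n)]
    return math.sqrt(sum(sum(a * xi for a, xi in zip(row, x)) ** 2
                         for row in A))

devs = [abs(norm_AX() - frob) for _ in range(trials)]
avg_rel_dev = sum(devs) / trials / frob
print(avg_rel_dev)  # modest relative to ||A||_F: ||AX|| hugs the shell
```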


Notes

  1. Wainwright MJ. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge: Cambridge University Press; 2019. doi:10.1017/9781108627771, ISBN 9781108627771.
  2. Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge: Cambridge University Press.
  3. Kahane, J. (1960). "Propriétés locales des fonctions à séries de Fourier aléatoires". Studia Mathematica 19: 1–25. doi:10.4064/sm-19-1-1-25. 
  4. Buldygin, V. V.; Kozachenko, Yu. V. (1980). "Sub-Gaussian random variables". Ukrainian Mathematical Journal 32 (6): 483–489. doi:10.1007/BF01087176. 
  5. Bobkov, S. G.; Chistyakov, G. P.; Götze, F. (2023-08-03). "Strictly subgaussian probability distributions". arXiv:2308.01749 [math.PR].
  6. Marchal, Olivier; Arbel, Julyan (2017). "On the sub-Gaussianity of the Beta and Dirichlet distributions". Electronic Communications in Probability 22. doi:10.1214/17-ECP92. 
  7. Arbel, Julyan; Marchal, Olivier; Nguyen, Hien D. (2020). "On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables". Esaim: Probability and Statistics 24: 39–55. doi:10.1051/ps/2019018. 
  8. Barreto, Mathias; Marchal, Olivier; Arbel, Julyan (2024). "Optimal sub-Gaussian variance proxy for truncated Gaussian and exponential random variables". arXiv:2403.08628 [math.ST].
  9. Kamath, Gautam. "Bounds on the expectation of the maximum of samples from a gaussian." (2015)
  10. "MIT 18.S997 | Spring 2015 | High-Dimensional Statistics, Chapter 1. Sub-Gaussian Random Variables" (in en). https://ocw.mit.edu/courses/18-s997-high-dimensional-statistics-spring-2015/a69e2f53bb2eeb9464520f3027fc61e6_MIT18_S997S15_Chapter1.pdf. 
  11. Hanson, D. L.; Wright, F. T. (1971). "A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables". The Annals of Mathematical Statistics 42 (3): 1079–1083. doi:10.1214/aoms/1177693335. ISSN 0003-4851. 
  12. Rudelson, Mark; Vershynin, Roman (January 2013). "Hanson-Wright inequality and sub-gaussian concentration". Electronic Communications in Probability 18: 1–9. doi:10.1214/ECP.v18-2865. ISSN 1083-589X. https://projecteuclid.org/journals/electronic-communications-in-probability/volume-18/issue-none/Hanson-Wright-inequality-and-sub-gaussian-concentration/10.1214/ECP.v18-2865.full.
  13. Vershynin, Roman (2018). "6. Quadratic Forms, Symmetrization, and Contraction". High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press. pp. 127–146. doi:10.1017/9781108231596.009. ISBN 978-1-108-41519-4. https://doi.org/10.1017/9781108231596.009. 
