Bretagnolle-Huber inequality

In information theory, the Bretagnolle-Huber inequality bounds the total variation distance between two probability distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] by a concave and bounded function of the Kullback-Leibler divergence [math]\displaystyle{ D_\mathrm{KL}(P \parallel Q) }[/math]. The bound can be viewed as an alternative to the well-known Pinsker's inequality: when [math]\displaystyle{ D_\mathrm{KL}(P \parallel Q) }[/math] is large (larger than 2, for instance[1]), Pinsker's inequality is vacuous, while the Bretagnolle-Huber bound remains strictly smaller than 1 and hence non-vacuous. It is used in statistics and machine learning to prove information-theoretic lower bounds relying on hypothesis testing.[2]

Formal statement

Preliminary definitions

Let [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] be two probability distributions on a measurable space [math]\displaystyle{ (\mathcal{X}, \mathcal{F}) }[/math]. Recall that the total variation between [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] is defined by

[math]\displaystyle{ d_\mathrm{TV}(P,Q) = \sup_{A \in \mathcal{F}} \{|P(A)-Q(A)| \}. }[/math]

The Kullback-Leibler divergence is defined as follows:

[math]\displaystyle{ D_\mathrm{KL}(P \parallel Q) = \begin{cases} \int_{\mathcal{X}} \log\bigl(\frac{dP}{dQ}\bigr)\, dP & \text{if} \ P \ll Q, \\[1mm] +\infty & \text{otherwise}. \end{cases} }[/math]

In the above, the notation [math]\displaystyle{ P\ll Q }[/math] stands for absolute continuity of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math], and [math]\displaystyle{ \frac{dP}{dQ} }[/math] stands for the Radon-Nikodym derivative of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math].
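
For distributions on a finite set, both quantities can be computed directly. The following minimal Python sketch (an illustration only; the function names and example probability vectors are chosen here for convenience and are not part of the original definitions) evaluates them:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two finite distributions:
    sup_A |P(A) - Q(A)| = (1/2) * sum_x |p(x) - q(x)|."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) in nats; returns +inf
    when P is not absolutely continuous with respect to Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p, q = [0.5, 0.3, 0.2], [0.2, 0.4, 0.4]
print(tv_distance(p, q), kl_divergence(p, q))
```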

General statement

The Bretagnolle-Huber inequality says:

[math]\displaystyle{ d_\mathrm{TV}(P,Q) \leq \sqrt{1-\exp(-D_\mathrm{KL}(P \parallel Q))} \leq 1 - \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) }[/math]
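
As a numerical sanity check (a sketch only, using a pair of Bernoulli distributions chosen here for illustration), one can compare the exact total variation distance with both forms of the bound and with Pinsker's bound [math]\displaystyle{ \sqrt{D_\mathrm{KL}(P \parallel Q)/2} }[/math]:

```python
import math

def bernoulli_kl(p, q):
    """D_KL(Ber(p) || Ber(q)) in nats, for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.5, 0.999                       # two Bernoulli means (illustrative values)
tv = abs(p - q)                         # exact TV distance for Bernoulli laws
kl = bernoulli_kl(p, q)

bh_sqrt = math.sqrt(1 - math.exp(-kl))  # first Bretagnolle-Huber bound
bh_weak = 1 - 0.5 * math.exp(-kl)       # weaker second form
pinsker = math.sqrt(kl / 2)             # Pinsker's bound, here larger than 1

print(f"TV = {tv:.3f} <= {bh_sqrt:.3f} <= {bh_weak:.3f}; Pinsker: {pinsker:.3f}")
```

In this example the Kullback-Leibler divergence exceeds 2, so Pinsker's bound is larger than 1 and therefore uninformative, whereas both Bretagnolle-Huber expressions remain below 1.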

Alternative version

The following version is directly implied by the bound above, but some authors[2] prefer stating it this way. Let [math]\displaystyle{ A\in \mathcal{F} }[/math] be any event. Then

[math]\displaystyle{ P(A) + Q(\bar{A}) \geq \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) }[/math]

where [math]\displaystyle{ \bar{A} = \mathcal{X} \setminus A }[/math] is the complement of [math]\displaystyle{ A }[/math].

Indeed, by definition of the total variation, for any [math]\displaystyle{ A \in \mathcal{F} }[/math],

[math]\displaystyle{ \begin{align} Q(A) - P(A) \leq d_\mathrm{TV}(P,Q) & \leq 1- \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) \\ & = Q(A) + Q(\bar{A}) - \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) \end{align} }[/math]

Rearranging, we obtain the claimed lower bound on [math]\displaystyle{ P(A)+Q(\bar{A}) }[/math].
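
The event form can be verified exhaustively on a small finite space. The sketch below (with arbitrary three-point distributions chosen only for illustration) enumerates every event [math]\displaystyle{ A }[/math] and checks the lower bound on [math]\displaystyle{ P(A)+Q(\bar{A}) }[/math]:

```python
import math
from itertools import chain, combinations

p = [0.5, 0.3, 0.2]   # P on the three-point space {0, 1, 2}
q = [0.1, 0.2, 0.7]   # Q on the same space

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # D_KL(P || Q)
lower = 0.5 * math.exp(-kl)

points = range(len(p))
all_events = chain.from_iterable(combinations(points, r) for r in range(len(p) + 1))
for A in all_events:
    p_A = sum(p[x] for x in A)
    q_A_complement = sum(q[x] for x in points if x not in A)
    assert p_A + q_A_complement >= lower     # event form of Bretagnolle-Huber
print(f"P(A) + Q(complement of A) >= {lower:.3f} for every event A")
```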

Proof

We prove the main statement following the ideas in Tsybakov's book (Lemma 2.6, page 89),[3] which differ from the original proof[4] (see C. Canonne's note[1] for a modernized transcription of the original argument).

The proof is in two steps:

1. Prove, using the Cauchy-Schwarz inequality, that the total variation distance is controlled by the Bhattacharyya coefficient [math]\displaystyle{ \int \sqrt{PQ} }[/math], whose square appears on the right-hand side of the inequality below:

[math]\displaystyle{ 1-d_\mathrm{TV}(P,Q)^2 \geq \left(\int \sqrt{PQ}\right)^2 }[/math]

2. Prove by a clever application of Jensen’s inequality that

[math]\displaystyle{ \left(\int \sqrt{PQ}\right)^2 \geq \exp(-D_\mathrm{KL}(P \parallel Q)) }[/math]
  • Step 1:

First notice that

[math]\displaystyle{ d_\mathrm{TV}(P,Q) = 1-\int \min(P,Q) = \int \max(P,Q) -1 }[/math]

To see this, denote [math]\displaystyle{ A^* = \arg\max_{A\in \mathcal{F}} |P(A)-Q(A)| }[/math] and, without loss of generality, assume that [math]\displaystyle{ P(A^*)\gt Q(A^*) }[/math], so that [math]\displaystyle{ d_\mathrm{TV}(P,Q)=P(A^*)-Q(A^*) }[/math] (up to null sets, [math]\displaystyle{ A^* }[/math] is the set where the density of [math]\displaystyle{ P }[/math] exceeds that of [math]\displaystyle{ Q }[/math]). Then we can rewrite

[math]\displaystyle{ d_\mathrm{TV}(P,Q) = \int_{A^*} \max(P,Q) - \int_{A^*} \min(P,Q) }[/math]

Adding and subtracting [math]\displaystyle{ \int_{\bar{A^*}} \max(P,Q) \text{ or } \int_{\bar{A^*}}\min(P,Q) }[/math], we obtain both identities.

Then,

[math]\displaystyle{ \begin{align} 1-d_\mathrm{TV}(P,Q)^2 & = (1-d_\mathrm{TV}(P,Q))(1+d_\mathrm{TV}(P,Q)) \\ & = \int \min(P,Q) \int \max(P,Q) \geq \left(\int \sqrt{\min(P,Q) \max(P,Q)}\right)^2 \\ & = \left(\int \sqrt{PQ}\right)^2 \end{align} }[/math]

where the inequality is the Cauchy-Schwarz inequality and the last equality holds because [math]\displaystyle{ PQ = \min(P,Q) \max(P,Q) }[/math].

  • Step 2:

We write [math]\displaystyle{ (\cdot)^2=\exp(2\log(\cdot)) }[/math] and apply Jensen's inequality:

[math]\displaystyle{ \begin{align} \left(\int \sqrt{PQ}\right)^2 &= \exp\left(2\log\left(\int \sqrt{PQ}\right)\right) = \exp\left(2\log\left(\int P\sqrt{\frac{Q}{P}}\right)\right) \\ & = \exp\left(2\log\left(\mathbb{E}_P\left[\sqrt{\frac{Q}{P}}\right]\right)\right) \geq \exp\left(\mathbb{E}_P\left[\log\left(\frac{Q}{P}\right)\right]\right) = \exp(-D_\mathrm{KL}(P \parallel Q)) \end{align} }[/math]

Combining the results of steps 1 and 2 leads to the claimed bound on the total variation.
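
Both steps can also be checked numerically. The following sketch (restricted to randomly drawn finite-support distributions, for illustration only) verifies the chain [math]\displaystyle{ 1-d_\mathrm{TV}(P,Q)^2 \geq \left(\int \sqrt{PQ}\right)^2 \geq \exp(-D_\mathrm{KL}(P \parallel Q)) }[/math]:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))        # random distribution on 5 points
    q = rng.dirichlet(np.ones(5))
    tv = 0.5 * np.abs(p - q).sum()
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    kl = np.sum(p * np.log(p / q))
    assert 1 - tv**2 >= bc**2 - 1e-12    # step 1 (Cauchy-Schwarz)
    assert bc**2 >= np.exp(-kl) - 1e-12  # step 2 (Jensen)
print("both steps hold for all sampled pairs")
```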

Examples of applications

Sample complexity of biased coin tosses[1]

The question is the following: how many coin tosses are needed to distinguish a fair coin from a biased one?

Assume that we have two coins: a fair coin (Bernoulli distributed with mean [math]\displaystyle{ p_1=1/2 }[/math]) and an [math]\displaystyle{ \epsilon }[/math]-biased coin ([math]\displaystyle{ p_2=1/2+\epsilon }[/math]). Then, in order to identify the biased coin with probability at least [math]\displaystyle{ 1-\delta }[/math] (for some [math]\displaystyle{ \delta\gt 0 }[/math]), the number of tosses [math]\displaystyle{ n }[/math] must satisfy

[math]\displaystyle{ n\geq \frac{1}{2\epsilon^2}\log\left(\frac{1}{2\delta}\right) }[/math].

In order to obtain this lower bound, we first note that identifying the biased coin with probability at least [math]\displaystyle{ 1-\delta }[/math] forces the total variation distance between the distributions of the two sequences of [math]\displaystyle{ n }[/math] samples to be at least [math]\displaystyle{ 1-2\delta }[/math]: if the decision event [math]\displaystyle{ A }[/math] satisfies [math]\displaystyle{ P_2^n(A)\geq 1-\delta }[/math] and [math]\displaystyle{ P_1^n(A)\leq \delta }[/math], then [math]\displaystyle{ d_\mathrm{TV}(P_1^n,P_2^n)\geq P_2^n(A)-P_1^n(A)\geq 1-2\delta }[/math]. Here [math]\displaystyle{ P_1^n }[/math] and [math]\displaystyle{ P_2^n }[/math] denote the respective joint distributions of the [math]\displaystyle{ n }[/math] coin tosses for each coin.

We then have

[math]\displaystyle{ \begin{align} (1-2\delta)^2 & \leq d_\mathrm{TV}\left(P_1^n, P_2^n \right)^2 \\ & \leq 1-e^{-D_\mathrm{KL}(P_1^n \parallel P_2^n)} \\ &= 1-e^{-nD_\mathrm{KL}(P_1 \parallel P_2)}\\ & = 1-e^{-n\frac{\log(1/(1-4\epsilon^2))}{2}} \end{align} }[/math]

The result is obtained by rearranging the terms.
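
For concreteness, the short sketch below evaluates the sample-size bound for illustrative values [math]\displaystyle{ \epsilon=0.1 }[/math] and [math]\displaystyle{ \delta=0.05 }[/math] (chosen arbitrarily here) and checks it against the displayed chain of inequalities:

```python
import math

eps, delta = 0.1, 0.05      # illustrative bias and failure probability

# Sample-size lower bound stated above: n >= log(1/(2*delta)) / (2*eps^2)
n = math.ceil(math.log(1 / (2 * delta)) / (2 * eps ** 2))

# Per-toss KL divergence between Ber(1/2) and Ber(1/2 + eps),
# equal to (1/2) * log(1/(1 - 4*eps^2)) as in the display above
kl_per_toss = 0.5 * math.log(1 / (1 - 4 * eps ** 2))

# With this many tosses (and delta <= 1/2), the right-hand side of the
# displayed chain is indeed at least (1 - 2*delta)^2
assert 1 - math.exp(-n * kl_per_toss) >= (1 - 2 * delta) ** 2
print("n >=", n, "coin tosses")
```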

Information-theoretic lower bound for K-armed bandit games[2]

In the multi-armed bandit setting, a lower bound on the minimax regret of any bandit algorithm can be proved using the Bretagnolle-Huber inequality and its consequence for hypothesis testing (see Chapter 15 of Bandit Algorithms[2]).

History

The result was first proved in 1979 by Jean Bretagnolle and Catherine Huber (now Huber-Carol), and published in the proceedings of the Strasbourg Probability Seminar.[4] Alexandre Tsybakov's book[3] features an early republication of the inequality and its attribution to Bretagnolle and Huber, where it is presented as an early and less general version of Assouad's lemma (see Notes 2.8 therein). An improvement of the constant in the Bretagnolle-Huber inequality was proved in 2014 as a consequence of an extension of Fano's inequality.[5]

References

  1. Canonne, Clément (2022). "A short note on an inequality between KL and TV". arXiv:2202.07198 [math.PR].
  2. Lattimore, Tor; Szepesvari, Csaba (2020). Bandit Algorithms. Cambridge University Press. https://tor-lattimore.com/downloads/book/book.pdf. Retrieved 18 August 2022.
  3. Tsybakov, Alexandre B. (2010). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer. doi:10.1007/b13794. ISBN 978-1-4419-2709-5. OCLC 757859245.
  4. Bretagnolle, J.; Huber, C. (1978). "Estimation des densités : Risque minimax". Séminaire de Probabilités XII. Lecture Notes in Mathematics. Berlin, Heidelberg: Springer. pp. 342–363. doi:10.1007/bfb0064610. ISBN 978-3-540-08761-8.
  5. Gerchinovitz, Sébastien; Ménard, Pierre; Stoltz, Gilles (2020). "Fano's Inequality for Random Variables". Statistical Science 35 (2). doi:10.1214/19-sts716. ISSN 0883-4237.