Bretagnolle–Huber inequality
In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] by a concave and bounded function of the Kullback–Leibler divergence [math]\displaystyle{ D_\mathrm{KL}(P \parallel Q) }[/math]. The bound can be viewed as an alternative to the well-known Pinsker's inequality: when [math]\displaystyle{ D_\mathrm{KL}(P \parallel Q) }[/math] is large (larger than 2, for instance[1]), Pinsker's inequality is vacuous, while the Bretagnolle–Huber bound remains bounded and hence non-vacuous. It is used in statistics and machine learning to prove information-theoretic lower bounds relying on hypothesis testing.[2]
Formal statement
Preliminary definitions
Let [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] be two probability distributions on a measurable space [math]\displaystyle{ (\mathcal{X}, \mathcal{F}) }[/math]. Recall that the total variation distance between [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] is defined by
- [math]\displaystyle{ d_\mathrm{TV}(P,Q) = \sup_{A \in \mathcal{F}} \{|P(A)-Q(A)| \}. }[/math]
The Kullback–Leibler divergence is defined as follows:
- [math]\displaystyle{ D_\mathrm{KL}(P \parallel Q) = \begin{cases} \int_{\mathcal{X}} \log\bigl(\frac{dP}{dQ}\bigr)\, dP & \text{if } P \ll Q, \\[1mm] +\infty & \text{otherwise}. \end{cases} }[/math]
In the above, the notation [math]\displaystyle{ P\ll Q }[/math] stands for absolute continuity of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math], and [math]\displaystyle{ \frac{dP}{dQ} }[/math] stands for the Radon–Nikodym derivative of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math].
General statement
The Bretagnolle–Huber inequality says:
- [math]\displaystyle{ d_\mathrm{TV}(P,Q) \leq \sqrt{1-\exp(-D_\mathrm{KL}(P \parallel Q))} \leq 1 - \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) }[/math]
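As an informal numerical illustration (not part of the statement above), the following Python sketch evaluates the total variation distance, the two Bretagnolle–Huber bounds, and, for comparison, the Pinsker bound [math]\displaystyle{ \sqrt{D_\mathrm{KL}(P \parallel Q)/2} }[/math] for a few pairs of Bernoulli distributions; the parameter values and helper names are arbitrary choices. For the largest divergence in this example, the Pinsker bound exceeds 1 and is therefore vacuous, while the Bretagnolle–Huber bound is not.

```python
import math

def tv_bernoulli(p, q):
    """Total variation distance between Bernoulli(p) and Bernoulli(q)."""
    return abs(p - q)

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence D_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p = 0.5
for q in (0.6, 0.9, 0.999):            # arbitrary example parameters
    tv = tv_bernoulli(p, q)
    kl = kl_bernoulli(p, q)
    bh = math.sqrt(1 - math.exp(-kl))  # Bretagnolle-Huber bound
    bh_weak = 1 - 0.5 * math.exp(-kl)  # weaker explicit form of the bound
    pinsker = math.sqrt(kl / 2)        # Pinsker's bound, for comparison
    print(f"q={q}: TV={tv:.3f} <= {bh:.3f} <= {bh_weak:.3f}  (Pinsker: {pinsker:.3f})")
```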
Alternative version
The following version is directly implied by the bound above, but some authors[2] prefer stating it this way. Let [math]\displaystyle{ A\in \mathcal{F} }[/math] be any event. Then
- [math]\displaystyle{ P(A) + Q(\bar{A}) \geq \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) }[/math]
where [math]\displaystyle{ \bar{A} = \mathcal{X} \smallsetminus A }[/math] is the complement of [math]\displaystyle{ A }[/math].
Indeed, by definition of the total variation, for any [math]\displaystyle{ A \in \mathcal{F} }[/math],
- [math]\displaystyle{ \begin{align} Q(A) - P(A) \leq d_\mathrm{TV}(P,Q) & \leq 1- \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) \\ & = Q(A) + Q(\bar{A}) - \frac{1}{2}\exp(-D_\mathrm{KL}(P \parallel Q)) \end{align} }[/math]
Rearranging, we obtain the claimed lower bound on [math]\displaystyle{ P(A)+Q(\bar{A}) }[/math].
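For a concrete check of this version (again not part of the original presentation), the sketch below enumerates every event on a small finite support and verifies the lower bound; the support size, probabilities, and variable names are arbitrary.

```python
import math
from itertools import chain, combinations

# Two arbitrary distributions on a four-point support (purely illustrative).
support = range(4)
P = [0.4, 0.3, 0.2, 0.1]
Q = [0.1, 0.2, 0.3, 0.4]

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))   # D_KL(P || Q)
lower = 0.5 * math.exp(-kl)                           # right-hand side of the bound

# Enumerate every event A (every subset of the support) and take the worst case.
events = chain.from_iterable(combinations(support, r) for r in range(len(support) + 1))
worst = min(sum(P[i] for i in A) + sum(Q[i] for i in support if i not in A)
            for A in events)

print(f"min over events A of P(A) + Q(complement of A) = {worst:.4f} >= {lower:.4f}")
```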
Proof
We prove the main statement following the ideas in Tsybakov's book (Lemma 2.6, page 89),[3] which differ from the original proof[4] (see C. Canonne's note[1] for a modernized presentation of their argument).
The proof is in two steps:
1. Prove, using the Cauchy–Schwarz inequality, that the total variation distance is related to the Bhattacharyya coefficient [math]\displaystyle{ \int \sqrt{PQ} }[/math]:
- [math]\displaystyle{ 1-d_\mathrm{TV}(P,Q)^2 \geq \left(\int \sqrt{PQ}\right)^2 }[/math]
2. Prove by a clever application of Jensen’s inequality that
- [math]\displaystyle{ \left(\int \sqrt{PQ}\right)^2 \geq \exp(-D_\mathrm{KL}(P \parallel Q)) }[/math]
- Step 1:
- First notice that
- [math]\displaystyle{ d_\mathrm{TV}(P,Q) = 1-\int \min(P,Q) = \int \max(P,Q) -1 }[/math]
- To see this, denote [math]\displaystyle{ A^* = \arg\max_{A\in \mathcal{F}} |P(A)-Q(A)| }[/math] and, without loss of generality, assume that [math]\displaystyle{ P(A^*)\gt Q(A^*) }[/math], so that [math]\displaystyle{ d_\mathrm{TV}(P,Q)=P(A^*)-Q(A^*) }[/math]. Then we can rewrite
- [math]\displaystyle{ d_\mathrm{TV}(P,Q) = \int_{A^*} \max(P,Q) - \int_{A^*} \min(P,Q) }[/math]
- And then, adding and subtracting [math]\displaystyle{ \int_{\bar{A^*}} \max(P,Q) \text{ or } \int_{\bar{A^*}}\min(P,Q) }[/math], we obtain both identities (using that [math]\displaystyle{ P \geq Q }[/math] on [math]\displaystyle{ A^* }[/math] and [math]\displaystyle{ Q \geq P }[/math] on [math]\displaystyle{ \bar{A^*} }[/math], so that [math]\displaystyle{ \int_{A^*} \max(P,Q) + \int_{\bar{A^*}} \min(P,Q) = \int P = 1 }[/math]).
- Then
- [math]\displaystyle{ \begin{align} 1-d_\mathrm{TV}(P,Q)^2 & = (1-d_\mathrm{TV}(P,Q))(1+d_\mathrm{TV}(P,Q)) \\ & = \int \min(P,Q) \int \max(P,Q) \\ & \geq \left(\int \sqrt{\min(P,Q) \max(P,Q)}\right)^2 \\ & = \left(\int \sqrt{PQ}\right)^2 \end{align} }[/math]
- because [math]\displaystyle{ PQ = \min(P,Q) \max(P,Q). }[/math]
- Step 2:
- We write [math]\displaystyle{ (\cdot)^2=\exp(2\log(\cdot)) }[/math] and apply Jensen's inequality:
- [math]\displaystyle{ \begin{align} \left(\int \sqrt{PQ}\right)^2 &= \exp\left(2\log\left(\int \sqrt{PQ}\right)\right) \\ & = \exp\left(2\log\left(\int P\sqrt{\frac{Q}{P}}\right)\right) \\ & =\exp\left(2\log\left(\operatorname{E}_P \left[\left(\sqrt{\frac{P}{Q}}\right)^{-1} \, \right] \right) \right) \\ & \geq \exp\left(\operatorname{E}_P\left[-\log\left(\frac{P}{Q} \right)\right] \right) = \exp(-D_\mathrm{KL}(P \parallel Q)) \end{align} }[/math]
- Combining the results of steps 1 and 2 leads to the claimed bound on the total variation.
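Both intermediate inequalities can be checked numerically for discrete distributions. The sketch below does so for one arbitrarily chosen pair of distributions; it is only an illustration of the two steps, not a substitute for the proof.

```python
import math

# Arbitrary discrete distributions used only to illustrate the two proof steps.
P = [0.5, 0.25, 0.25]
Q = [0.2, 0.3, 0.5]

tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))      # total variation distance
bc = sum(math.sqrt(p * q) for p, q in zip(P, Q))      # Bhattacharyya coefficient
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))   # KL divergence D_KL(P || Q)

# Step 1: 1 - TV^2 >= (Bhattacharyya coefficient)^2
print(1 - tv**2 >= bc**2)
# Step 2: (Bhattacharyya coefficient)^2 >= exp(-KL)
print(bc**2 >= math.exp(-kl))
```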
Examples of applications
Sample complexity of biased coin tosses
Source:[1]
The question is: how many coin tosses are needed to distinguish a fair coin from a biased one?
Assume you have two coins, a fair coin (Bernoulli distributed with mean [math]\displaystyle{ p_1=1/2 }[/math]) and an [math]\displaystyle{ \varepsilon }[/math]-biased coin ([math]\displaystyle{ p_2=1/2+\varepsilon }[/math]). Then, in order to identify the biased coin with probability at least [math]\displaystyle{ 1-\delta }[/math] (for some [math]\displaystyle{ \delta\gt 0 }[/math]), the number of coin tosses [math]\displaystyle{ n }[/math] must satisfy
- [math]\displaystyle{ n\geq \frac{1}{2\varepsilon^2}\log\left(\frac{1}{2\delta}\right). }[/math]
To obtain this lower bound, note that any procedure identifying the correct coin with probability at least [math]\displaystyle{ 1-\delta }[/math] under both hypotheses forces the total variation distance between the joint distributions of the two sequences of [math]\displaystyle{ n }[/math] tosses to be at least [math]\displaystyle{ (1-\delta)-\delta = 1-2\delta }[/math]. Denote by [math]\displaystyle{ P_1^n }[/math] and [math]\displaystyle{ P_2^n }[/math] the respective joint distributions of the [math]\displaystyle{ n }[/math] coin tosses for each coin.
We then have
- [math]\displaystyle{ \begin{align} (1-2\delta)^2 & \leq d_\mathrm{TV}\left(P_1^n, P_2^n \right)^2 \\[4pt] & \leq 1-e^{-D_\mathrm{KL}(P_1^n \parallel P_2^n)} \\[4pt] &= 1-e^{-nD_\mathrm{KL}(P_1 \parallel P_2)}\\[4pt] & = 1-e^{-n\frac{\log(1/(1-4\varepsilon^2))}{2}} \end{align} }[/math]
The result is obtained by rearranging the terms.
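As a rough numerical illustration (not taken from the cited note), the sketch below computes the smallest [math]\displaystyle{ n }[/math] for which the displayed necessary condition [math]\displaystyle{ (1-2\delta)^2 \leq 1-e^{-nD_\mathrm{KL}(P_1 \parallel P_2)} }[/math] can hold, for one arbitrary choice of [math]\displaystyle{ \varepsilon }[/math] and [math]\displaystyle{ \delta }[/math].

```python
import math

def kl_bernoulli(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

eps, delta = 0.1, 0.05                 # arbitrary example parameters
kl = kl_bernoulli(0.5, 0.5 + eps)      # equals log(1/(1 - 4*eps**2)) / 2

# Smallest n such that (1 - 2*delta)^2 <= 1 - exp(-n*kl).
n = 1
while 1 - math.exp(-n * kl) < (1 - 2 * delta) ** 2:
    n += 1
print(f"At least n = {n} tosses are needed for the condition to hold.")
```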
Information-theoretic lower bound for k-armed bandit games
In the multi-armed bandit problem, a lower bound on the minimax regret of any bandit algorithm can be proved using Bretagnolle–Huber and its consequences for hypothesis testing (see Chapter 15 of Bandit Algorithms[2]).
History
The result was first proved in 1979 by Jean Bretagnolle and Catherine Huber, and published in the proceedings of the Strasbourg Probability Seminar.[4] Alexandre Tsybakov's book[3] features an early re-publication of the inequality and its attribution to Bretagnolle and Huber, where it is presented as an early and less general version of Assouad's lemma (see Notes 2.8). A constant improvement on the Bretagnolle–Huber inequality was proved in 2014 as a consequence of an extension of Fano's inequality.[5]
See also
- Total variation for a list of upper bounds
References
- 1. Canonne, Clément (2022). "A short note on an inequality between KL and TV". arXiv:2202.07198 [math.PR].
- 2. Lattimore, Tor; Szepesvari, Csaba (2020). Bandit Algorithms. Cambridge University Press. https://tor-lattimore.com/downloads/book/book.pdf. Retrieved 18 August 2022.
- 3. Tsybakov, Alexandre B. (2010). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer. doi:10.1007/b13794. ISBN 978-1-4419-2709-5. OCLC 757859245.
- 4. Bretagnolle, J.; Huber, C. (1978). "Estimation des densités : Risque minimax". Séminaire de Probabilités XII. Lecture Notes in Mathematics. Berlin, Heidelberg: Springer. pp. 342–363. doi:10.1007/bfb0064610. ISBN 978-3-540-08761-8.
- 5. Gerchinovitz, Sébastien; Ménard, Pierre; Stoltz, Gilles (2020). "Fano's Inequality for Random Variables". Statistical Science 35 (2). doi:10.1214/19-sts716. ISSN 0883-4237.
Original source: https://en.wikipedia.org/wiki/Bretagnolle–Huber_inequality