Proof of Stein's example


Stein's example is an important result in decision theory which can be stated as

The ordinary decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk in dimension at least 3.

The following is an outline of its proof.[1] The reader is referred to the main article for more information.

Sketched proof

The risk function of the decision rule [math]\displaystyle{ d(\mathbf{x}) = \mathbf{x} }[/math] is

[math]\displaystyle{ R(\theta,d) = \operatorname{E}_\theta[ |\mathbf{\theta - X}|^2] }[/math]
[math]\displaystyle{ =\int (\mathbf{\theta - x})^T(\mathbf{\theta - x}) \left( \frac{1}{2\pi} \right)^{n/2} e^{(-1/2) (\mathbf{\theta - x})^T (\mathbf{\theta - x}) } m(dx) }[/math]
[math]\displaystyle{ = n. }[/math]
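As a numerical sanity check, this risk can be estimated by simulation. Below is a minimal sketch; the dimension, mean vector, and sample size are arbitrary illustrative choices, not part of the proof.

```python
import numpy as np

# Monte Carlo check that the risk of d(x) = x is n.
# n, theta, and the replication count are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n = 5
theta = rng.normal(size=n)                     # any fixed mean vector
X = rng.normal(loc=theta, size=(100_000, n))   # rows are draws of X ~ N(theta, I_n)
risk = np.mean(np.sum((X - theta) ** 2, axis=1))
print(risk)  # approximately n = 5
```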

Now consider the decision rule

[math]\displaystyle{ d'(\mathbf{x}) = \mathbf{x} - \frac{\alpha}{|\mathbf{x}|^2}\mathbf{x} }[/math]

where [math]\displaystyle{ \alpha }[/math] is a positive constant to be chosen below (the optimal choice will turn out to be [math]\displaystyle{ \alpha = n-2 }[/math]). We will show that [math]\displaystyle{ d' }[/math] is a better decision rule than [math]\displaystyle{ d }[/math]. The risk function is

[math]\displaystyle{ R(\theta,d') = \operatorname{E}_\theta\left[ \left|\mathbf{\theta - X} + \frac{\alpha}{|\mathbf{X}|^2}\mathbf{X}\right|^2\right] }[/math]
[math]\displaystyle{ = \operatorname{E}_\theta\left[ |\mathbf{\theta - X}|^2 + 2(\mathbf{\theta - X})^T\frac{\alpha}{|\mathbf{X}|^2}\mathbf{X} + \frac{\alpha^2}{|\mathbf{X}|^4}|\mathbf{X}|^2 \right] }[/math]
[math]\displaystyle{ = \operatorname{E}_\theta\left[ |\mathbf{\theta - X}|^2 \right] + 2\alpha\operatorname{E}_\theta\left[\frac{\mathbf{(\theta-X)^T X}}{|\mathbf{X}|^2}\right] + \alpha^2\operatorname{E}_\theta\left[\frac{1}{|\mathbf{X}|^2} \right] }[/math]

— a quadratic in [math]\displaystyle{ \alpha }[/math]. We may simplify the middle term using integration by parts. For [math]\displaystyle{ 1\leq i \leq n }[/math] and any "well-behaved" function [math]\displaystyle{ h:\mathbf{x} \mapsto h(\mathbf{x}) \in \mathbb{R} }[/math], meaning continuously differentiable and growing slowly enough for large [math]\displaystyle{ x_i }[/math] that the boundary term below vanishes, we have, conditionally on [math]\displaystyle{ X_j=x_j }[/math] for [math]\displaystyle{ j\neq i }[/math]:

[math]\displaystyle{ \operatorname{E}_\theta [ (\theta_i - X_i) h(\mathbf{X}) \mid X_j=x_j\ (j\neq i) ]= \int (\theta_i - x_i) h(\mathbf{x}) \left( \frac{1}{2\pi} \right)^{1/2} e^{ -(x_i-\theta_i)^2/2 } m(dx_i) }[/math]
[math]\displaystyle{ = \left[ h(\mathbf{x}) \left( \frac{1}{2\pi} \right)^{1/2} e^{-(x_i-\theta_i)^2/2 } \right]^{x_i=\infty}_{x_i=-\infty} - \int \frac{\partial h}{\partial x_i}(\mathbf{x}) \left( \frac{1}{2\pi} \right)^{1/2} e^{-(x_i-\theta_i)^2/2 } m(dx_i) }[/math]
[math]\displaystyle{ = - \operatorname{E}_\theta \left[ \frac{\partial h}{\partial x_i}(\mathbf{X}) \,\Big|\, X_j=x_j\ (j\neq i) \right], }[/math]

where the integration by parts uses [math]\displaystyle{ \tfrac{d}{dx_i} e^{-(x_i-\theta_i)^2/2} = (\theta_i - x_i)\, e^{-(x_i-\theta_i)^2/2} }[/math], and the boundary term vanishes under the growth condition on [math]\displaystyle{ h }[/math].

Therefore, taking expectations over the remaining coordinates (the tower property of conditional expectation),

[math]\displaystyle{ \operatorname{E}_\theta [ (\theta_i - X_i) h(\mathbf{X})]= - \operatorname{E}_\theta \left[ \frac{\partial h}{\partial x_i}(\mathbf{X}) \right]. }[/math]

(This result is known as Stein's lemma.)
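Stein's lemma is easy to verify numerically for a concrete well-behaved function. The sketch below uses the arbitrary choice [math]\displaystyle{ h(\mathbf{x}) = \sin(x_1) }[/math] (so that [math]\displaystyle{ \partial h/\partial x_1 = \cos(x_1) }[/math]); the mean vector and sample size are likewise illustrative.

```python
import numpy as np

# Monte Carlo check of Stein's lemma, E[(theta_i - X_i) h(X)] = -E[dh/dx_i (X)],
# for the arbitrary well-behaved choice h(x) = sin(x_1), i = 1.
rng = np.random.default_rng(1)
theta = np.array([0.5, -1.0, 2.0])              # arbitrary mean vector, n = 3
X = rng.normal(loc=theta, size=(1_000_000, 3))
lhs = np.mean((theta[0] - X[:, 0]) * np.sin(X[:, 0]))
rhs = -np.mean(np.cos(X[:, 0]))
print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```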

Now, we choose

[math]\displaystyle{ h(\mathbf{x}) = \frac{x_i}{|\mathbf{x}|^2}. }[/math]

If [math]\displaystyle{ h }[/math] met the "well-behaved" condition (it does not, because of the singularity at the origin, but this can be remedied; see below), we would have

[math]\displaystyle{ \frac{\partial h}{\partial x_i} = \frac{1}{|\mathbf{x}|^2} - \frac{2 x_i^2}{|\mathbf{x}|^4} }[/math]

and so

[math]\displaystyle{ \operatorname{E}_\theta\left[\frac{\mathbf{(\theta-X)^T X}}{|\mathbf{X}|^2}\right] = \sum_{i=1}^n \operatorname{E}_\theta \left[ (\theta_i - X_i) \frac{X_i}{|\mathbf{X}|^2} \right] }[/math]
[math]\displaystyle{ = - \sum_{i=1}^n \operatorname{E}_\theta \left[ \frac{1}{|\mathbf{X}|^2} - \frac{2 X_i^2}{|\mathbf{X}|^4} \right] }[/math]
[math]\displaystyle{ = -(n-2)\operatorname{E}_\theta \left[\frac{1}{|\mathbf{X}|^2}\right]. }[/math]
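Both the partial derivative and the resulting [math]\displaystyle{ n-2 }[/math] factor can be checked symbolically for a concrete dimension. The sketch below (with the arbitrary choice [math]\displaystyle{ n = 4 }[/math]) verifies the divergence identity [math]\displaystyle{ \sum_i \partial(x_i/|\mathbf{x}|^2)/\partial x_i = (n-2)/|\mathbf{x}|^2 }[/math] behind the last step.

```python
import sympy as sp

# Symbolic check of the two displayed identities for a concrete dimension:
# d/dx_i (x_i/|x|^2) = 1/|x|^2 - 2 x_i^2/|x|^4, and the sum over i is (n-2)/|x|^2.
n = 4  # arbitrary illustrative dimension
x = sp.symbols(f"x1:{n + 1}", real=True)
r2 = sum(xi ** 2 for xi in x)                  # |x|^2
partial = sp.diff(x[0] / r2, x[0])
print(sp.simplify(partial - (1 / r2 - 2 * x[0] ** 2 / r2 ** 2)))  # 0
divergence = sum(sp.diff(xi / r2, xi) for xi in x)
print(sp.simplify(divergence - (n - 2) / r2))                     # 0
```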

Then returning to the risk function of [math]\displaystyle{ d' }[/math]:

[math]\displaystyle{ R(\theta,d') = n - 2\alpha(n-2)\operatorname{E}_\theta\left[\frac{1}{|\mathbf{X}|^2}\right] + \alpha^2\operatorname{E}_\theta\left[\frac{1}{|\mathbf{X}|^2} \right]. }[/math]

This quadratic in [math]\displaystyle{ \alpha }[/math] is minimized at

[math]\displaystyle{ \alpha = n-2, }[/math]

giving

[math]\displaystyle{ R(\theta,d') = R(\theta,d) - (n-2)^2\operatorname{E}_\theta\left[\frac{1}{|\mathbf{X}|^2} \right] }[/math]

Since [math]\displaystyle{ 0 \lt \operatorname{E}_\theta\left[ 1/|\mathbf{X}|^2 \right] \lt \infty }[/math] whenever [math]\displaystyle{ n \geq 3 }[/math] (for [math]\displaystyle{ n \leq 2 }[/math] this expectation is infinite, which is why the result requires dimension at least 3), this satisfies

[math]\displaystyle{ R(\theta,d') \lt R(\theta,d) }[/math]

for every [math]\displaystyle{ \theta }[/math], making [math]\displaystyle{ d }[/math] an inadmissible decision rule.
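The dominance can also be seen empirically. The following sketch compares Monte Carlo estimates of the two risks; [math]\displaystyle{ \theta }[/math], [math]\displaystyle{ n }[/math], and the sample size are arbitrary illustrative choices, and the observed gap approximates [math]\displaystyle{ (n-2)^2\operatorname{E}_\theta[1/|\mathbf{X}|^2] }[/math].

```python
import numpy as np

# Monte Carlo comparison of the risks of d(x) = x and the shrinkage rule d'.
# theta and the replication count are arbitrary; the improvement holds for
# every theta once n >= 3.
rng = np.random.default_rng(2)
n = 5
theta = np.full(n, 1.0)
X = rng.normal(loc=theta, size=(200_000, n))
risk_d = np.mean(np.sum((X - theta) ** 2, axis=1))        # ~ n
shrink = 1.0 - (n - 2) / np.sum(X ** 2, axis=1)           # 1 - alpha/|x|^2
risk_dp = np.mean(np.sum((shrink[:, None] * X - theta) ** 2, axis=1))
print(risk_d, risk_dp)  # risk_dp falls strictly below risk_d
```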

It remains to justify the use of

[math]\displaystyle{ h(\mathbf{x})= \frac{x_i}{|\mathbf{x}|^2}. }[/math]

This function is not continuously differentiable, since it is singular at [math]\displaystyle{ \mathbf{x}=0 }[/math]. However, the function

[math]\displaystyle{ h(\mathbf{x}) = \frac{x_i}{\varepsilon + |\mathbf{x}|^2} }[/math]

is continuously differentiable, and after following the algebra through and letting [math]\displaystyle{ \varepsilon \to 0 }[/math], one obtains the same result.
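For a concrete dimension, this limiting argument can be checked symbolically. The sketch below (with [math]\displaystyle{ n = 3 }[/math], an arbitrary choice) differentiates the regularized function and lets [math]\displaystyle{ \varepsilon \to 0 }[/math].

```python
import sympy as sp

# Symbolic check of the regularized function: its partial derivative is
# 1/(eps + |x|^2) - 2 x_i^2/(eps + |x|^2)^2, which recovers the earlier
# expression as eps -> 0 (away from x = 0). n = 3 is an arbitrary choice.
eps = sp.symbols("epsilon", positive=True)
x1, x2, x3 = sp.symbols("x1 x2 x3", real=True)
r2 = x1 ** 2 + x2 ** 2 + x3 ** 2
d = sp.diff(x1 / (eps + r2), x1)
print(sp.simplify(d - (1 / (eps + r2) - 2 * x1 ** 2 / (eps + r2) ** 2)))  # 0
print(sp.limit(d, eps, 0))  # 1/|x|^2 - 2*x1^2/|x|^4
```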


References

  1. Samworth, Richard (December 2012). "Stein's Paradox". Eureka 62: 38–41. http://www.statslab.cam.ac.uk/~rjs57/SteinParadox.pdf.