Law of total expectation

From HandWiki

The proposition in probability theory known as the law of total expectation,[1] the law of iterated expectations[2] (LIE), Adam's law,[3] the tower rule,[4] and the smoothing theorem,[5] among other names, states that if [math]\displaystyle{ X }[/math] is a random variable whose expected value [math]\displaystyle{ \operatorname{E}(X) }[/math] is defined, and [math]\displaystyle{ Y }[/math] is any random variable on the same probability space, then

[math]\displaystyle{ \operatorname{E} (X) = \operatorname{E} ( \operatorname{E} ( X \mid Y)), }[/math]

i.e., the expected value of the conditional expected value of [math]\displaystyle{ X }[/math] given [math]\displaystyle{ Y }[/math] is the same as the expected value of [math]\displaystyle{ X }[/math].

One special case states that if [math]\displaystyle{ {\left\{A_i\right\}}_i }[/math] is a finite or countable partition of the sample space, then

[math]\displaystyle{ \operatorname{E} (X) = \sum_i{\operatorname{E}(X \mid A_i) \operatorname{P}(A_i)}. }[/math]

Note: The conditional expected value E(X | Z) is a random variable whose value depends on the value of Z. The conditional expected value of X given the event Z = z is a function of z: if we write E(X | Z = z) = g(z), then the random variable E(X | Z) is g(Z). Similar comments apply to the conditional covariance.
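The relationship between g(z) and the random variable g(Z) can be checked by hand on a small discrete distribution. The sketch below is only an illustration: the joint probabilities in `joint` are hypothetical values chosen for the example, and exact fractions are used so that the two sides can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Z = z); the numbers are illustrative only.
joint = {
    (0, 0): F(1, 8), (1, 0): F(3, 8),
    (0, 1): F(1, 4), (2, 1): F(1, 4),
}

def marginal_z(joint):
    """Return P(Z = z) for every value z that occurs."""
    p = {}
    for (_, z), pr in joint.items():
        p[z] = p.get(z, F(0)) + pr
    return p

def g(joint, z):
    """g(z) = E(X | Z = z), an ordinary function of the number z."""
    p_z = marginal_z(joint)[z]
    return sum(x * pr for (x, zz), pr in joint.items() if zz == z) / p_z

p_z = marginal_z(joint)

# E(X), computed directly from the joint pmf.
e_x = sum(x * pr for (x, _), pr in joint.items())

# E(g(Z)) = E(E(X | Z)): average g over the distribution of Z.
e_g = sum(g(joint, z) * p for z, p in p_z.items())

print(e_x, e_g)
```

Both printed values agree, as the law of total expectation requires for any joint distribution with a defined mean.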


Example

Suppose that only two factories supply light bulbs to the market. Factory [math]\displaystyle{ X }[/math]'s bulbs work for an average of 5000 hours, whereas factory [math]\displaystyle{ Y }[/math]'s bulbs work for an average of 4000 hours. It is known that factory [math]\displaystyle{ X }[/math] supplies 60% of the total bulbs available. What is the expected length of time that a purchased bulb will work for?

Applying the law of total expectation, we have:

[math]\displaystyle{ \begin{align} \operatorname{E} (L) &= \operatorname{E}(L \mid X) \operatorname{P}(X)+\operatorname{E}(L \mid Y) \operatorname{P}(Y) \\[3pt] &= 5000(0.6)+4000(0.4)\\[2pt] &=4600 \end{align} }[/math]


where

  • [math]\displaystyle{ \operatorname{E} (L) }[/math] is the expected life of the bulb;
  • [math]\displaystyle{ \operatorname{P}(X)={6 \over 10} }[/math] is the probability that the purchased bulb was manufactured by factory [math]\displaystyle{ X }[/math];
  • [math]\displaystyle{ \operatorname{P}(Y)={4 \over 10} }[/math] is the probability that the purchased bulb was manufactured by factory [math]\displaystyle{ Y }[/math];
  • [math]\displaystyle{ \operatorname{E}(L \mid X)=5000 }[/math] is the expected lifetime of a bulb manufactured by [math]\displaystyle{ X }[/math];
  • [math]\displaystyle{ \operatorname{E}(L \mid Y)=4000 }[/math] is the expected lifetime of a bulb manufactured by [math]\displaystyle{ Y }[/math].

Thus each purchased light bulb has an expected lifetime of 4600 hours.
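The same arithmetic can be written as a short script, together with a rough simulation as a sanity check. This is a sketch under a stated assumption: the example gives only the mean lifetimes, so the simulation assumes exponentially distributed lifetimes, which is a choice made here purely for illustration. The helper name `simulate` is likewise hypothetical.

```python
import random
from fractions import Fraction as F

# Figures taken from the example above.
p_factory = {"X": F(6, 10), "Y": F(4, 10)}   # P(X), P(Y)
mean_life = {"X": 5000, "Y": 4000}           # E(L | X), E(L | Y), in hours

# Law of total expectation over the two-factory partition.
expected_life = sum(mean_life[f] * p_factory[f] for f in p_factory)

def simulate(n, seed=0):
    """Monte Carlo estimate of E(L).

    ASSUMPTION (not part of the example): lifetimes are exponentially
    distributed with the stated factory means.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        factory = "X" if rng.random() < 0.6 else "Y"
        total += rng.expovariate(1.0 / mean_life[factory])
    return total / n

estimate = simulate(200_000)
print(expected_life)  # exact weighted average
print(estimate)       # simulated average, close to the exact value
```

Any other lifetime distribution with the same two conditional means would give the same exact answer, since the formula uses only those means and the factory probabilities.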

Proof in the finite and countable cases

Let the random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], defined on the same probability space, assume a finite or countably infinite set of finite values. Assume that [math]\displaystyle{ \operatorname{E}[X] }[/math] is defined, i.e. [math]\displaystyle{ \min (\operatorname{E}[X_+], \operatorname{E}[X_-]) \lt \infty }[/math]. If [math]\displaystyle{ \{A_i\} }[/math] is a partition of the probability space [math]\displaystyle{ \Omega }[/math], then

[math]\displaystyle{ \operatorname{E} (X) = \sum_i{\operatorname{E}(X \mid A_i) \operatorname{P}(A_i)}. }[/math]


Proof.

[math]\displaystyle{ \begin{align} \operatorname{E} \left( \operatorname{E} (X \mid Y) \right) &= \operatorname{E} \Bigg[ \sum_x x \cdot \operatorname{P}(X=x \mid Y) \Bigg] \\[6pt] &=\sum_y \Bigg[ \sum_x x \cdot \operatorname{P}(X=x \mid Y=y) \Bigg] \cdot \operatorname{P}(Y=y) \\[6pt] &=\sum_y \sum_x x \cdot \operatorname{P}(X=x, Y=y). \end{align} }[/math]

If the series is finite, the order of summation can be switched, and the previous expression becomes

[math]\displaystyle{ \begin{align} \sum_x \sum_y x \cdot \operatorname{P}(X=x, Y=y)&=\sum_x x\sum_y \operatorname{P}(X=x, Y=y)\\[6pt] &=\sum_x x \cdot \operatorname{P}(X=x)\\[6pt] &=\operatorname{E}(X). \end{align} }[/math]

If, on the other hand, the series is infinite, then its convergence cannot be conditional, due to the assumption that [math]\displaystyle{ \min (\operatorname{E}[X_+], \operatorname{E}[X_-] ) \lt \infty. }[/math] The series converges absolutely if both [math]\displaystyle{ \operatorname{E}[X_+] }[/math] and [math]\displaystyle{ \operatorname{E}[X_-] }[/math] are finite, and diverges to [math]\displaystyle{ +\infty }[/math] or [math]\displaystyle{ -\infty }[/math] when either [math]\displaystyle{ \operatorname{E}[X_+] }[/math] or [math]\displaystyle{ \operatorname{E}[X_-] }[/math] is infinite. In both scenarios, the above summations may be exchanged without affecting the sum.
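To see why some such assumption is needed, it may help to look at a standard double series whose two iterated sums disagree. The array [math]\displaystyle{ a_{m,n} }[/math] below is not taken from the argument above; it is a generic illustration in which the positive terms and the negative terms each sum to infinity, which is the situation the condition on [math]\displaystyle{ \operatorname{E}[X_+] }[/math] and [math]\displaystyle{ \operatorname{E}[X_-] }[/math] rules out.

```latex
a_{m,n} =
\begin{cases}
  1  & \text{if } n = m, \\
  -1 & \text{if } n = m + 1, \\
  0  & \text{otherwise,}
\end{cases}
\qquad m, n \ge 0.

\sum_{m=0}^{\infty} \sum_{n=0}^{\infty} a_{m,n}
  = \sum_{m=0}^{\infty} (1 - 1) = 0,
\qquad
\sum_{n=0}^{\infty} \sum_{m=0}^{\infty} a_{m,n}
  = 1 + \sum_{n=1}^{\infty} (1 - 1) = 1.
```

Each row contains one [math]\displaystyle{ +1 }[/math] and one [math]\displaystyle{ -1 }[/math], whereas the column [math]\displaystyle{ n = 0 }[/math] contains only a [math]\displaystyle{ +1 }[/math], so the order of summation changes the result.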

Proof in the general case

Let [math]\displaystyle{ (\Omega,\mathcal{F},\operatorname{P}) }[/math] be a probability space on which two sub σ-algebras [math]\displaystyle{ \mathcal{G}_1 \subseteq \mathcal{G}_2 \subseteq \mathcal{F} }[/math] are defined. For a random variable [math]\displaystyle{ X }[/math] on such a space, the smoothing law states that if [math]\displaystyle{ \operatorname{E}[X] }[/math] is defined, i.e. [math]\displaystyle{ \min(\operatorname{E}[X_+], \operatorname{E}[X_-])\lt \infty }[/math], then

[math]\displaystyle{ \operatorname{E}[ \operatorname{E}[X \mid \mathcal{G}_2] \mid \mathcal{G}_1] = \operatorname{E}[X \mid \mathcal{G}_1]\quad\text{(a.s.)}. }[/math]

Proof. Since a conditional expectation is a Radon–Nikodym derivative, verifying the following two properties establishes the smoothing law:

  • [math]\displaystyle{ \operatorname{E}[ \operatorname{E}[X \mid \mathcal{G}_2] \mid \mathcal{G}_1] \mbox{ is } \mathcal{G}_1 }[/math]-measurable
  • [math]\displaystyle{ \int_{G_1} \operatorname{E}[ \operatorname{E}[X \mid \mathcal{G}_2] \mid \mathcal{G}_1] d\operatorname{P} = \int_{G_1} X d\operatorname{P}, }[/math] for all [math]\displaystyle{ G_1 \in \mathcal{G}_1. }[/math]

The first of these properties holds by definition of the conditional expectation. To prove the second one, note that

[math]\displaystyle{ \begin{align} \min\left(\int_{G_1}X_+\, d\operatorname{P}, \int_{G_1}X_-\, d\operatorname{P}\right) &\leq \min\left(\int_\Omega X_+\, d\operatorname{P}, \int_\Omega X_-\, d\operatorname{P}\right)\\[4pt] &=\min(\operatorname{E}[X_+], \operatorname{E}[X_-]) \lt \infty, \end{align} }[/math]

so the integral [math]\displaystyle{ \textstyle \int_{G_1}X\, d\operatorname{P} }[/math] is defined (that is, it is not of the form [math]\displaystyle{ \infty - \infty }[/math]).

The second property thus holds since [math]\displaystyle{ G_1 \in \mathcal{G}_1 \subseteq \mathcal{G}_2 }[/math] implies

[math]\displaystyle{ \int_{G_1} \operatorname{E}[ \operatorname{E}[X \mid \mathcal{G}_2] \mid \mathcal{G}_1] d\operatorname{P} = \int_{G_1} \operatorname{E}[X \mid \mathcal{G}_2] d\operatorname{P} = \int_{G_1} X d\operatorname{P}. }[/math]

Corollary. In the special case when [math]\displaystyle{ \mathcal{G}_1 = \{\emptyset,\Omega \} }[/math] and [math]\displaystyle{ \mathcal{G}_2 = \sigma(Y) }[/math], the smoothing law reduces to

[math]\displaystyle{ \operatorname{E}[ \operatorname{E}[X \mid Y]] = \operatorname{E}[X]. }[/math]
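On a finite sample space the smoothing law can be verified directly, because a sub σ-algebra is generated by a partition and the conditional expectation is the probability-weighted average over each block. The following is a minimal sketch under those assumptions; the sample space, probabilities, values of X, and the two partitions (`fine` generating the larger σ-algebra, `coarse` the smaller one) are all hypothetical choices for illustration.

```python
from fractions import Fraction as F

# Hypothetical finite probability space with six outcomes.
omega = range(6)
prob = [F(1, 12), F(2, 12), F(3, 12), F(1, 12), F(2, 12), F(3, 12)]
x = [4, -1, 2, 7, 0, 5]  # values X(w) for w = 0, ..., 5

# Partitions generating G_2 (finer) and G_1 (coarser), so G_1 is contained in G_2.
fine = [{0, 1}, {2, 3}, {4}, {5}]
coarse = [{0, 1, 2, 3}, {4, 5}]

def cond_exp(values, partition):
    """Conditional expectation given the sigma-algebra generated by `partition`.

    Returns a list indexed by outcome: on each block the result is constant
    and equal to the probability-weighted average of `values` over the block.
    """
    out = [None] * len(values)
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(values[w] * prob[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out

inner = cond_exp(x, fine)       # E[X | G_2]
lhs = cond_exp(inner, coarse)   # E[ E[X | G_2] | G_1 ]
rhs = cond_exp(x, coarse)       # E[X | G_1]
print(lhs == rhs)

# Corollary: the trivial sigma-algebra corresponds to the one-block partition.
trivial = [set(omega)]
print(cond_exp(inner, trivial)[0] == cond_exp(x, trivial)[0])
```

Both comparisons hold for any choice of probabilities and values, provided every block of the finer partition lies inside a block of the coarser one.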

Alternative proof for [math]\displaystyle{ \operatorname{E}[ \operatorname{E}[X \mid Y]] = \operatorname{E}[X]. }[/math]

This is a simple consequence of the measure-theoretic definition of conditional expectation. By definition, [math]\displaystyle{ \operatorname{E}[X \mid Y] := \operatorname{E}[X \mid \sigma(Y)] }[/math] is a [math]\displaystyle{ \sigma(Y) }[/math]-measurable random variable that satisfies

[math]\displaystyle{ \int_{A}\operatorname{E}[X \mid Y] d\operatorname{P} = \int_{A} X d\operatorname{P}, }[/math]

for every measurable set [math]\displaystyle{ A \in \sigma(Y) }[/math]. Taking [math]\displaystyle{ A = \Omega }[/math] proves the claim.

Proof of partition formula

[math]\displaystyle{ \begin{align} \sum\limits_i\operatorname{E}(X\mid A_i)\operatorname{P}(A_i) &=\sum\limits_i\int\limits_\Omega X(\omega)\operatorname{P}(d\omega\mid A_i)\cdot\operatorname{P}(A_i)\\ &=\sum\limits_i\int\limits_\Omega X(\omega)\operatorname{P}(d\omega\cap A_i)\\ &=\sum\limits_i\int\limits_\Omega X(\omega)I_{A_i}(\omega)\operatorname{P}(d\omega)\\ &=\sum\limits_i\operatorname{E}(XI_{A_i}), \end{align} }[/math]

where [math]\displaystyle{ I_{A_i} }[/math] is the indicator function of the set [math]\displaystyle{ A_i }[/math].

If the partition [math]\displaystyle{ {\{A_i\}}_{i=0}^n }[/math] is finite, then, by linearity, the previous expression becomes

[math]\displaystyle{ \operatorname{E}\left(\sum\limits_{i=0}^n XI_{A_i}\right)=\operatorname{E}(X), }[/math]

and we are done.

If, however, the partition [math]\displaystyle{ {\{A_i\}}_{i=0}^\infty }[/math] is infinite, then we use the dominated convergence theorem to show that

[math]\displaystyle{ \operatorname{E}\left(\sum\limits_{i=0}^n XI_{A_i}\right)\to\operatorname{E}(X). }[/math]

Indeed, for every [math]\displaystyle{ n\geq 0 }[/math],

[math]\displaystyle{ \left|\sum_{i=0}^n XI_{A_i}\right|\leq |X|I_{\mathop{\bigcup}\limits_{i=0}^n A_i}\leq |X|. }[/math]

Since every element of the set [math]\displaystyle{ \Omega }[/math] belongs to exactly one set [math]\displaystyle{ A_i }[/math] of the partition, it is straightforward to verify that the sequence [math]\displaystyle{ {\left\{\sum_{i=0}^n XI_{A_i}\right\}}_{n=0}^\infty }[/math] converges pointwise to [math]\displaystyle{ X }[/math]. Under the assumption that [math]\displaystyle{ \operatorname{E}|X|\lt \infty }[/math], the dominated convergence theorem yields the desired result.
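The convergence of the partial sums can be observed on a concrete countable partition. The sketch below assumes a hypothetical setup not taken from the text: a random variable N with P(N = i) = (1/2)^(i+1) for i = 0, 1, 2, ..., the partition A_i = {N = i}, and X = N, for which E(X) = 1 and E|X| is finite.

```python
from fractions import Fraction as F

def p_a(i):
    """P(A_i) for the hypothetical partition A_i = {N = i}."""
    return F(1, 2 ** (i + 1))

def e_x_given_a(i):
    """E(X | A_i): with X = N, the variable X equals i on A_i."""
    return F(i)

def partial_sum(n):
    """Sum of E(X | A_i) P(A_i) for i = 0, ..., n."""
    return sum(e_x_given_a(i) * p_a(i) for i in range(n + 1))

target = F(1)  # E(X) for this particular distribution

# Gap between E(X) and the partial sums for increasing n.
gaps = [target - partial_sum(n) for n in (0, 5, 10, 20, 40)]
for gap in gaps:
    print(float(gap))
```

The printed gaps shrink towards zero, which is the statement that the partial sums of the partition formula converge to E(X) in this example.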

See also