Posterior probability

From HandWiki
Short description: Conditional probability distribution used in Bayesian statistics

Posterior probability is a type of conditional probability in Bayesian statistics. In common usage, the term posterior probability refers to the conditional probability [math]\displaystyle{ P(A \mid B) }[/math] of an event [math]\displaystyle{ A }[/math] given [math]\displaystyle{ B }[/math] which comes from an application of Bayes' theorem [math]\displaystyle{ P(A \mid B) = {P(B \mid A) P(A)}/{P(B)} }[/math]. Because Bayes' theorem relates the two conditional probabilities [math]\displaystyle{ P(A\mid B) }[/math] and [math]\displaystyle{ P(B \mid A) }[/math] and is symmetric in [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math], the term posterior is somewhat informal – for example, it may be used to emphasize which of the two conditional probabilities is known and which we are inferring.[1][2]

Similarly, a posterior probability distribution is a conditional probability distribution obtained by applying the distributional form of Bayes' theorem.

Definition in the distributional case

In variational Bayesian methods, the posterior probability is the probability of the parameters [math]\displaystyle{ \theta }[/math] given the evidence [math]\displaystyle{ X }[/math], and is denoted [math]\displaystyle{ p(\theta |X) }[/math].

It contrasts with the likelihood function, which is the probability of the evidence given the parameters: [math]\displaystyle{ p(X|\theta) }[/math].

The two are related as follows:

Given a prior belief that the parameter [math]\displaystyle{ \theta }[/math] has the probability distribution function [math]\displaystyle{ p(\theta) }[/math], and given that the observations [math]\displaystyle{ x }[/math] have a likelihood [math]\displaystyle{ p(x|\theta) }[/math], the posterior probability is defined as

[math]\displaystyle{ p(\theta|x) = \frac{p(x|\theta)}{p(x)}p(\theta) }[/math][3]

where [math]\displaystyle{ p(x) }[/math] is the normalizing constant and is calculated as

[math]\displaystyle{ p(x) = \int p(x|\theta)p(\theta)d\theta }[/math]

for continuous [math]\displaystyle{ \theta }[/math], or by summing [math]\displaystyle{ p(x|\theta)p(\theta) }[/math] over all possible values of [math]\displaystyle{ \theta }[/math] for discrete [math]\displaystyle{ \theta }[/math].[4]
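As a minimal sketch of the discrete case, the snippet below multiplies likelihood by prior for each value of the parameter and divides by their sum. The coin example, the uniform prior and the function name `discrete_posterior` are illustrative assumptions and do not come from the cited sources.

```python
from math import comb


def discrete_posterior(prior, likelihood):
    """Return p(theta | x) for a discrete parameter theta.

    prior:      dict mapping theta -> p(theta)
    likelihood: dict mapping theta -> p(x | theta) for the observed x
    """
    # Unnormalized posterior: likelihood times prior for each theta.
    unnormalized = {t: likelihood[t] * prior[t] for t in prior}
    # Normalizing constant p(x): sum over all possible values of theta.
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("observed data has zero probability under the prior")
    return {t: u / evidence for t, u in unnormalized.items()}


# Hypothetical example: a coin whose heads probability is 0.25, 0.5 or 0.75,
# with a uniform prior, after observing 3 heads in 4 tosses.
thetas = [0.25, 0.5, 0.75]
prior = {t: 1 / 3 for t in thetas}
likelihood = {t: comb(4, 3) * t**3 * (1 - t) for t in thetas}
posterior = discrete_posterior(prior, likelihood)
```

Because the prior here is uniform, the posterior simply rescales the likelihood values so that they sum to 1. A non-uniform prior would shift the weights accordingly.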

The posterior probability is therefore proportional to the product Likelihood · Prior probability.


Example

Suppose a school's students are 60% boys and 40% girls. The girls wear trousers or skirts in equal numbers; all boys wear trousers. An observer sees a randomly chosen student from a distance; all the observer can see is that this student is wearing trousers. What is the probability that this student is a girl? The correct answer can be computed using Bayes' theorem.

The event [math]\displaystyle{ G }[/math] is that the student observed is a girl, and the event [math]\displaystyle{ T }[/math] is that the student observed is wearing trousers. To compute the posterior probability [math]\displaystyle{ P(G|T) }[/math], we first need to know:

  • [math]\displaystyle{ P(G) }[/math], or the probability that the student is a girl regardless of any other information. Since the observer sees a random student, meaning that all students have the same probability of being observed, and the percentage of girls among the students is 40%, this probability equals 0.4.
  • [math]\displaystyle{ P(B) }[/math], or the probability that the student is not a girl (i.e. a boy) regardless of any other information ([math]\displaystyle{ B }[/math] is the complementary event to [math]\displaystyle{ G }[/math]). This is 60%, or 0.6.
  • [math]\displaystyle{ P(T|G) }[/math], or the probability of the student wearing trousers given that the student is a girl. As they are as likely to wear skirts as trousers, this is 0.5.
  • [math]\displaystyle{ P(T|B) }[/math], or the probability of the student wearing trousers given that the student is a boy. This is given as 1.
  • [math]\displaystyle{ P(T) }[/math], or the probability of a (randomly selected) student wearing trousers regardless of any other information. Since [math]\displaystyle{ P(T) = P(T|G)P(G) + P(T|B)P(B) }[/math] (via the law of total probability), this is [math]\displaystyle{ P(T)= 0.5\times0.4 + 1\times0.6 = 0.8 }[/math].

Given all this information, the posterior probability of the observer having spotted a girl given that the observed student is wearing trousers can be computed by substituting these values in the formula:

[math]\displaystyle{ P(G|T) = \frac{P(T|G) P(G)}{P(T)} = \frac{0.5 \times 0.4}{0.8} = 0.25. }[/math]
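The same arithmetic can be checked with a few lines of code. This is only a sketch: the function and parameter names are made up for illustration, and the default values are the ones stated in the example.

```python
def posterior_girl_given_trousers(p_girl=0.4,
                                  p_trousers_given_girl=0.5,
                                  p_trousers_given_boy=1.0):
    """Compute P(G | T) from the prior P(G) and the two likelihoods."""
    # B is the complement of G, so P(B) = 1 - P(G).
    p_boy = 1.0 - p_girl
    # Law of total probability: P(T) = P(T|G)P(G) + P(T|B)P(B).
    p_trousers = p_trousers_given_girl * p_girl + p_trousers_given_boy * p_boy
    # Bayes' theorem.
    return p_trousers_given_girl * p_girl / p_trousers


p_g_given_t = posterior_girl_given_trousers()
print(round(p_g_given_t, 4))  # 0.25
```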

An intuitive way to solve this is to assume the school has N students, so that the number of boys is 0.6N and the number of girls is 0.4N. If N is sufficiently large, the total number of trouser wearers is 0.6N + 50% of 0.4N = 0.8N, and the number of girls wearing trousers is 50% of 0.4N = 0.2N. Among the trouser wearers, the share of girls is therefore 0.2N/0.8N = 25%. In other words, if you separated out the group of trouser wearers, a quarter of that group would be girls. So if you see trousers, the most you can deduce is that you are looking at a single random sample from a subset of students of whom 25% are girls, and the chance that this student is a girl is 25%. Bayes' theorem problems with a finite number of outcomes can be solved by this kind of counting.
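The counting argument can be sketched with exact fractions, which shows that the result does not depend on N. The function name is hypothetical, and the school sizes are arbitrary choices made for illustration.

```python
from fractions import Fraction


def girl_share_among_trouser_wearers(n_students):
    """Share of girls among trouser wearers in a school of n_students."""
    boys = Fraction(6, 10) * n_students    # 60% of students are boys
    girls = Fraction(4, 10) * n_students   # 40% of students are girls
    girl_trousers = girls / 2              # half of the girls wear trousers
    trouser_wearers = boys + girl_trousers # all boys wear trousers
    return girl_trousers / trouser_wearers


# Arbitrary school sizes; N cancels out of the ratio each time.
shares = [girl_share_among_trouser_wearers(n) for n in (10, 1000, 10**6)]
```

Every entry of `shares` is the fraction 1/4, matching the 25% obtained from Bayes' theorem.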


Calculation

The posterior probability distribution of one random variable given the value of another can be calculated with Bayes' theorem by multiplying the prior probability distribution by the likelihood function, and then dividing by the normalizing constant, as follows:

[math]\displaystyle{ f_{X\mid Y=y}(x)={f_X(x) \mathcal L_{X\mid Y=y}(x) \over {\int_{-\infty}^\infty f_X(u) \mathcal L_{X\mid Y=y}(u)\,du}} }[/math]

This gives the posterior probability density function for a random variable [math]\displaystyle{ X }[/math] given the data [math]\displaystyle{ Y=y }[/math], where

  • [math]\displaystyle{ f_X(x) }[/math] is the prior density of [math]\displaystyle{ X }[/math],
  • [math]\displaystyle{ \mathcal L_{X\mid Y=y}(x) = f_{Y\mid X=x}(y) }[/math] is the likelihood function as a function of [math]\displaystyle{ x }[/math],
  • [math]\displaystyle{ \int_{-\infty}^\infty f_X(u) \mathcal L_{X\mid Y=y}(u)\,du }[/math] is the normalizing constant, and
  • [math]\displaystyle{ f_{X\mid Y=y}(x) }[/math] is the posterior density of [math]\displaystyle{ X }[/math] given the data [math]\displaystyle{ Y=y }[/math].
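When the normalizing integral has no convenient closed form, it can be approximated numerically. The sketch below evaluates prior times likelihood on an evenly spaced grid and normalizes with the trapezoidal rule. The coin-tossing setup (a uniform prior on [0, 1] and 7 heads in 10 tosses), the grid size and all names are assumptions made for illustration.

```python
from math import comb


def posterior_density_on_grid(prior_density, likelihood, grid):
    """Approximate the posterior density of X on an evenly spaced grid."""
    # Numerator of Bayes' theorem: prior density times likelihood.
    unnormalized = [prior_density(x) * likelihood(x) for x in grid]
    step = grid[1] - grid[0]
    # Trapezoidal approximation of the normalizing constant.
    normalizer = step * (sum(unnormalized)
                         - 0.5 * (unnormalized[0] + unnormalized[-1]))
    return [u / normalizer for u in unnormalized]


N_TOSSES, N_HEADS = 10, 7  # hypothetical data Y = y


def uniform_prior(x):
    """Prior density of X: uniform on [0, 1] (zero elsewhere)."""
    return 1.0


def binomial_likelihood(x):
    """Likelihood of the observed head count as a function of x."""
    return comb(N_TOSSES, N_HEADS) * x**N_HEADS * (1 - x)**(N_TOSSES - N_HEADS)


# The prior vanishes outside [0, 1], so the integral reduces to that interval.
grid = [i / 1000 for i in range(1001)]
density = posterior_density_on_grid(uniform_prior, binomial_likelihood, grid)
```

In this setup the exact posterior is a Beta(8, 4) density, so the grid values can be compared against a known answer. In general the accuracy depends on the grid spacing and the smoothness of the integrand.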

Credible interval

Posterior probability is a conditional probability conditioned on randomly observed data, and hence is itself a random variable. For a random variable, it is important to summarize the amount of uncertainty. One way to do this is to provide a credible interval of the posterior probability.
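One common choice is an equal-tailed interval, which cuts the same probability mass from each tail of the posterior. The sketch below computes one from a gridded posterior density. The function name, the 95% level and the example density (proportional to x^7 (1 - x)^3 on [0, 1]) are assumptions for illustration, and other definitions such as highest-density intervals are equally valid.

```python
def equal_tailed_interval(grid, density, mass=0.95):
    """Return (lower, upper) grid points of an equal-tailed credible interval.

    grid:    evenly spaced parameter values
    density: posterior density at those values (need not be normalized)
    """
    step = grid[1] - grid[0]
    weights = [d * step for d in density]
    total = sum(weights)  # renormalize so the discrete weights sum to 1
    tail = (1.0 - mass) / 2.0
    cumulative = 0.0
    lower = upper = None
    for x, w in zip(grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= tail:
            lower = x
        if cumulative >= 1.0 - tail:
            upper = x
            break
    if upper is None:  # guard against rounding when mass is close to 1
        upper = grid[-1]
    return lower, upper


# Hypothetical posterior: density proportional to x^7 (1 - x)^3 on [0, 1].
grid = [i / 1000 for i in range(1001)]
density = [x**7 * (1 - x)**3 for x in grid]
lower, upper = equal_tailed_interval(grid, density)
```

The interval endpoints are resolved only to the nearest grid point, so a finer grid gives a more precise interval.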


Classification

In classification, posterior probabilities reflect the uncertainty of assigning an observation to a particular class (see also class membership probabilities). While statistical classification methods by definition generate posterior probabilities, machine learning methods usually supply membership values that do not carry any probabilistic confidence. It is desirable to transform or re-scale membership values into class membership probabilities, since these are comparable and more easily applied in post-processing.
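One simple way to re-scale raw membership values is the softmax function, sketched below. It is used here as an assumed example of such a transformation, not as the method the article prescribes. It yields non-negative values that sum to 1, but it does not by itself guarantee that those values are well calibrated. The scores are made up.

```python
from math import exp


def softmax(scores):
    """Map arbitrary real-valued class scores to values that sum to 1."""
    # Subtracting the maximum avoids overflow without changing the result.
    shift = max(scores)
    exps = [exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical membership values for three classes.
scores = [2.0, 0.5, -1.0]
probabilities = softmax(scores)
```

The ranking of the classes is preserved, so the class with the highest score also receives the highest probability.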


  1. "bayesian - Posterior vs conditional probability". 
  2. "What is the difference between conditional and posterior probability?". 
  3. Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer. pp. 21–24. ISBN 978-0-387-31073-2. 
  4. Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari and Donald B. Rubin (2014). Bayesian Data Analysis. CRC Press. p. 7. ISBN 978-1-4398-4095-5. 
