Bayes classifier


In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.[1]

Definition

Suppose a pair [math]\displaystyle{ (X,Y) }[/math] takes values in [math]\displaystyle{ \mathbb{R}^d \times \{1,2,\dots,K\} }[/math], where [math]\displaystyle{ Y }[/math] is the class label of an element whose features are given by [math]\displaystyle{ X }[/math]. Assume that the conditional distribution of X, given that the label Y takes the value r, is given by

[math]\displaystyle{ (X\mid Y=r) \sim P_r }[/math] for [math]\displaystyle{ r=1,2,\dots,K }[/math]

where "[math]\displaystyle{ \sim }[/math]" means "is distributed as", and where [math]\displaystyle{ P_r }[/math] denotes a probability distribution.

A classifier is a rule that assigns to an observation X=x a guess or estimate of the unobserved label Y. In theoretical terms, a classifier is a measurable function [math]\displaystyle{ C: \mathbb{R}^d \to \{1,2,\dots,K\} }[/math], with the interpretation that C classifies the point x to the class C(x). The probability of misclassification, or risk, of a classifier C is defined as

[math]\displaystyle{ \mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}. }[/math]

The Bayes classifier is

[math]\displaystyle{ C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r \mid X=x). }[/math]

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, [math]\displaystyle{ \operatorname{P}(Y=r \mid X=x) }[/math]. The Bayes classifier is a useful benchmark in statistical classification.
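When the class priors [math]\displaystyle{ \operatorname{P}(Y=r) }[/math] and the class-conditional distributions [math]\displaystyle{ P_r }[/math] are fully known, the Bayes classifier can be evaluated directly, since [math]\displaystyle{ \operatorname{P}(Y=r \mid X=x) }[/math] is proportional to [math]\displaystyle{ \operatorname{P}(Y=r)\,p_r(x) }[/math], where [math]\displaystyle{ p_r }[/math] denotes the density of [math]\displaystyle{ P_r }[/math]. A minimal sketch in Python, assuming hypothetical one-dimensional Gaussian class-conditional densities with made-up parameter values, is:

 import numpy as np
 
 # Hypothetical, fully known model: K = 3 classes with Gaussian class-conditional
 # densities P_r = N(mu_r, sigma_r^2) and known priors P(Y = r).
 priors = np.array([0.5, 0.3, 0.2])   # P(Y = r) for r = 1, 2, 3 (illustrative)
 means = np.array([-2.0, 0.0, 3.0])   # illustrative means mu_r
 stds = np.array([1.0, 0.5, 1.5])     # illustrative standard deviations sigma_r
 
 def gaussian_pdf(x, mu, sigma):
     """Density of N(mu, sigma^2) evaluated at x."""
     return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
 
 def bayes_classifier(x):
     """Return argmax_r P(Y = r | X = x); the posterior is proportional to
     P(Y = r) * p_r(x), so the normalising constant can be ignored."""
     scores = priors * gaussian_pdf(x, means, stds)
     return int(np.argmax(scores)) + 1  # class labels are 1, ..., K
 
 print([bayes_classifier(x) for x in (-2.5, 0.1, 2.0)])  # classify a few example points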

The excess risk of a general classifier [math]\displaystyle{ C }[/math] (possibly depending on some training data) is defined as [math]\displaystyle{ \mathcal{R}(C) - \mathcal{R}(C^\text{Bayes}). }[/math] This non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.[2]

If the components [math]\displaystyle{ x_i }[/math] of [math]\displaystyle{ x }[/math] are assumed to be mutually independent conditionally on the class label, we obtain the naive Bayes classifier, where [math]\displaystyle{ C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r)\prod_{i=1}^{d}P_r(x_i). }[/math]
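A minimal sketch of this rule, under the assumption that each per-feature class-conditional density [math]\displaystyle{ P_r(x_i) }[/math] is a univariate Gaussian whose mean and standard deviation are estimated from labelled data (the data and parameter values below are purely illustrative), is:

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 # Illustrative training data: two classes (labels 1 and 2), d = 2 features.
 X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))  # class 1
 X2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(200, 2))  # class 2
 X = np.vstack([X1, X2])
 y = np.array([1] * 200 + [2] * 200)
 
 def fit_gaussian_nb(X, y):
     """Estimate class priors and per-feature Gaussian parameters (naive independence)."""
     classes = np.unique(y)
     priors = np.array([np.mean(y == c) for c in classes])
     means = np.array([X[y == c].mean(axis=0) for c in classes])
     stds = np.array([X[y == c].std(axis=0) for c in classes])
     return classes, priors, means, stds
 
 def predict_nb(x, classes, priors, means, stds):
     """argmax_r P(Y=r) * prod_i P_r(x_i), computed in log space for numerical stability."""
     log_lik = -0.5 * ((x - means) / stds) ** 2 - np.log(stds * np.sqrt(2 * np.pi))
     scores = np.log(priors) + log_lik.sum(axis=1)
     return classes[np.argmax(scores)]
 
 model = fit_gaussian_nb(X, y)
 print(predict_nb(np.array([1.8, 0.9]), *model))  # should come out as class 2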

Properties

A proof that the Bayes classifier is optimal, i.e. that its risk (the Bayes error rate) is the smallest achievable, proceeds as follows.

Consider the two-class case, with labels [math]\displaystyle{ Y \in \{0,1\} }[/math]. Write [math]\displaystyle{ R(h) }[/math] for the risk of a classifier [math]\displaystyle{ h }[/math] and [math]\displaystyle{ R^* }[/math] for the Bayes risk, the smallest achievable risk. Let the posterior probability of a point belonging to class 1 be [math]\displaystyle{ \eta(x)=\operatorname{P}(Y=1\mid X=x) }[/math]. Define the classifier [math]\displaystyle{ h^* }[/math] as

[math]\displaystyle{ \mathcal{h}^*(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 0.5,\\ 0&\text{otherwise.}\end{cases} }[/math]

Then we have the following results:

(a) [math]\displaystyle{ R(h^*)=R^* }[/math], i.e. [math]\displaystyle{ h^* }[/math] is a Bayes classifier,

(b) For any classifier [math]\displaystyle{ h }[/math], the excess risk satisfies [math]\displaystyle{ R(h)-R^*=2\mathbb{E}_X\left[|\eta(X)-0.5|\cdot \mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}\right] }[/math]

(c) [math]\displaystyle{ R^* = \mathbb{E}_X\left[\min(\eta(X),1-\eta(X))\right] }[/math]

(d) [math]\displaystyle{ R^* = \frac{1}{2} - \frac{1}{2}\mathbb E [|2\eta(X) - 1|] }[/math]
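As an illustration, suppose [math]\displaystyle{ \operatorname{P}(Y=1)=\operatorname{P}(Y=0)=\tfrac{1}{2} }[/math], [math]\displaystyle{ (X\mid Y=1)\sim N(\mu,1) }[/math] and [math]\displaystyle{ (X\mid Y=0)\sim N(-\mu,1) }[/math] for some positive [math]\displaystyle{ \mu }[/math]. Then [math]\displaystyle{ \eta(x)=1/(1+e^{-2\mu x}) }[/math], so [math]\displaystyle{ h^* }[/math] assigns class 1 exactly when [math]\displaystyle{ x\geqslant 0 }[/math], and the Bayes risk is the probability that a class-1 point falls below zero or a class-0 point falls at or above zero: [math]\displaystyle{ R^*=\tfrac{1}{2}\Phi(-\mu)+\tfrac{1}{2}\bigl(1-\Phi(\mu)\bigr)=\Phi(-\mu), }[/math] where [math]\displaystyle{ \Phi }[/math] denotes the standard normal distribution function.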


Proof of (a): For any classifier [math]\displaystyle{ h }[/math], we have

[math]\displaystyle{ R(h) = \mathbb{E}_{XY}\left[ \mathbb{I}_{ \left\{h(X)\ne Y \right\}} \right] }[/math]

[math]\displaystyle{ =\mathbb{E}_X\mathbb{E}_{Y|X}[\mathbb{I}_{ \left\{h(X)\ne Y \right\}} ] }[/math] (due to Fubini's theorem)

[math]\displaystyle{ = \mathbb{E}_X[\eta(X)\mathbb{I}_{ \left\{h(X)=0\right\}} +(1-\eta(X))\mathbb{I}_{ \left\{h(X)=1 \right\}} ] }[/math]

Notice that [math]\displaystyle{ R(h) }[/math] is minimised pointwise by taking, for every [math]\displaystyle{ x\in \mathbb{R}^d }[/math],

[math]\displaystyle{ h(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 1-\eta(x),\\ 0&\text{otherwise.}\end{cases} }[/math]

Therefore the minimum possible risk is the Bayes risk, [math]\displaystyle{ R^*= R(h^*) }[/math].


Proof of (b):

[math]\displaystyle{ \begin{aligned} R(h)-R^* &= R(h)-R(h^*)\\ &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h(X)=1\right\}}-\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}-(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\ &=\mathbb{E}_X[|2\eta(X)-1|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]\\ &= 2\mathbb{E}_X[|\eta(X)-0.5|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}] \end{aligned} }[/math]


Proof of (c):

[math]\displaystyle{ \begin{aligned} R(h^*) &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\ &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \end{aligned} }[/math]

where the second equality holds because [math]\displaystyle{ h^*(X)=1 }[/math] exactly when [math]\displaystyle{ \eta(X)\geqslant 1-\eta(X) }[/math], so the integrand always equals the smaller of [math]\displaystyle{ \eta(X) }[/math] and [math]\displaystyle{ 1-\eta(X) }[/math].

Proof of (d):

[math]\displaystyle{ \begin{aligned} R(h^*) &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \\ &= \frac{1}{2} - \mathbb{E}_X[\max(\eta(X) - 1/2,1/2-\eta(X))]\\ &=\frac{1}{2} - \frac{1}{2} \mathbb E [|2\eta(X) - 1|] \end{aligned} }[/math]
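Results (c) and (d) can also be checked numerically. The following sketch (Python, using the two-class Gaussian example above with the hypothetical choice [math]\displaystyle{ \mu=1 }[/math]) estimates the risk of [math]\displaystyle{ h^* }[/math] by simulation and compares it with the right-hand sides of (c) and (d); all three values should be close to [math]\displaystyle{ \Phi(-1)\approx 0.159 }[/math].

 import numpy as np
 
 rng = np.random.default_rng(0)
 mu, n = 1.0, 200_000  # illustrative parameter and sample size
 
 # Simulate the two-class Gaussian model: P(Y=1) = P(Y=0) = 1/2,
 # X | Y=1 ~ N(mu, 1), X | Y=0 ~ N(-mu, 1).
 y = rng.integers(0, 2, size=n)
 x = rng.normal(loc=np.where(y == 1, mu, -mu), scale=1.0)
 
 # Posterior eta(x) = P(Y=1 | X=x) and the Bayes classifier h*(x) = 1{eta(x) >= 1/2}.
 eta = 1.0 / (1.0 + np.exp(-2.0 * mu * x))
 h_star = (eta >= 0.5).astype(int)
 
 empirical_risk = np.mean(h_star != y)                 # estimate of R(h*)
 rhs_c = np.mean(np.minimum(eta, 1.0 - eta))           # right-hand side of (c)
 rhs_d = 0.5 - 0.5 * np.mean(np.abs(2.0 * eta - 1.0))  # right-hand side of (d)
 
 print(empirical_risk, rhs_c, rhs_d)  # all approximately Phi(-1) ~ 0.1587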

General case

In the general case, where an element can belong to any of n categories, the proof that the Bayes classifier minimises the probability of misclassification uses the tower property of expectation. Writing [math]\displaystyle{ \hat{y}(x) }[/math] for the class that a classifier assigns to the observation [math]\displaystyle{ x }[/math], the misclassification probability satisfies

[math]\displaystyle{ \begin{align} \operatorname{P}(Y\ne \hat{y}(X)) &= \mathbb{E}_X\left[\mathbb{E}_{Y|X}\left(\mathbb{I}_{\{Y\ne \hat{y}(X)\}}\mid X\right)\right]\\ &= \mathbb{E}_X\left[\operatorname{P}(Y=1\mid X)\mathbb{I}_{\{\hat{y}(X)\ne 1\}}+\operatorname{P}(Y=2\mid X)\mathbb{I}_{\{\hat{y}(X)\ne 2\}}+\dots+\operatorname{P}(Y=n\mid X)\mathbb{I}_{\{\hat{y}(X)\ne n\}}\right] \end{align} }[/math]

For each fixed [math]\displaystyle{ x }[/math], the expression inside the outer expectation equals [math]\displaystyle{ 1-\operatorname{P}(Y=\hat{y}(x)\mid X=x) }[/math], so the misclassification probability is minimised by taking [math]\displaystyle{ \hat{y}(x)= \underset{k \in \{1,2,\dots, n\}}{\operatorname{argmax}} \operatorname{P}(Y=k\mid X=x) }[/math]

for each observation x.

References

  1. Devroye, L.; Gyorfi, L.; Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer. ISBN 0-387-94618-7. 
  2. Farago, A.; Lugosi, G. (1993). "Strong universal consistency of neural network classifiers". IEEE Transactions on Information Theory 39 (4): 1146–1151. doi:10.1109/18.243433. https://dl.acm.org/doi/abs/10.1109/18.243433.