Bayes classifier

{{Short description|Classification algorithm in statistics}}

{{Bayesian statistics}}

In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.{{cite book |author = Devroye, L. |author2=Gyorfi, L. |author3=Lugosi, G. |name-list-style=amp |year = 1996 |title = A probabilistic theory of pattern recognition| publisher=Springer | isbn=0-3879-4618-7}}

==Definition==

Suppose a pair (X,Y) takes values in \mathbb{R}^d \times \{1,2,\dots,K\}, where Y is the class label of an element whose features are given by X. Assume that the conditional distribution of X, given that the label Y takes the value r, is given by

(X\mid Y=r) \sim P_r \quad \text{for} \quad r=1,2,\dots,K

where "\sim" means "is distributed as", and where P_r denotes a probability distribution.

A classifier is a rule that assigns to an observation X=x a guess or estimate of what the unobserved label Y=r actually was. In theoretical terms, a classifier is a measurable function C: \mathbb{R}^d \to \{1,2,\dots,K\}, with the interpretation that C classifies the point x to the class C(x). The probability of misclassification, or risk, of a classifier C is defined as

\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.

The Bayes classifier is

C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r \mid X=x).

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, \operatorname{P}(Y=r \mid X=x). The Bayes classifier is a useful benchmark in statistical classification.
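When the joint distribution of (X,Y) is fully known, the Bayes classifier can be written down directly. The following sketch assumes, purely for illustration, a two-class model with priors 0.6 and 0.4 and unit-variance Gaussian class-conditional distributions; the rule returns the class maximising the (unnormalised) posterior.

<syntaxhighlight lang="python">
# Minimal sketch of the Bayes classifier for a hypothetical, fully known model:
#   P(Y=1) = 0.6,  X | Y=1 ~ N(0, 1)
#   P(Y=2) = 0.4,  X | Y=2 ~ N(2, 1)
# The rule returns argmax_r P(Y=r | X=x); by Bayes' theorem the normalising
# constant P(X=x) is common to all classes and can be dropped.
import math

PRIORS = {1: 0.6, 2: 0.4}      # assumed class priors P(Y=r)
MEANS = {1: 0.0, 2: 2.0}       # assumed class-conditional means (unit variance)

def gaussian_pdf(x, mean, var=1.0):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_classifier(x):
    """argmax over classes of prior times class-conditional density."""
    return max(PRIORS, key=lambda r: PRIORS[r] * gaussian_pdf(x, MEANS[r]))

print(bayes_classifier(0.3))   # -> 1
print(bayes_classifier(1.8))   # -> 2
</syntaxhighlight>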

The excess risk of a general classifier C (possibly depending on some training data) is defined as \mathcal{R}(C) - \mathcal{R}(C^\text{Bayes}).

This non-negative quantity is thus important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.{{Cite journal|url=https://dl.acm.org/doi/abs/10.1109/18.243433 | doi = 10.1109/18.243433 | title = Strong universal consistency of neural network classifiers|year = 1993|last1 = Farago|first1 = A.|last2 = Lugosi|first2 = G.|journal = IEEE Transactions on Information Theory|volume = 39|issue = 4|pages = 1146–1151|url-access = subscription}}
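Consistency can be illustrated by simulation. The sketch below, in which all distributions and sample sizes are assumptions chosen for illustration, trains a simple plug-in classifier that estimates the two class means from n training points and compares its risk with the Bayes risk on a large test sample; the estimated excess risk shrinks as n grows.

<syntaxhighlight lang="python">
# Illustrative simulation of excess risk under an assumed model:
#   P(Y=0) = P(Y=1) = 0.5,  X | Y=y ~ N(2y, 1)
# The plug-in classifier assigns x to the class with the nearest estimated mean.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    y = rng.integers(0, 2, size=n)
    x = rng.normal(2.0 * y, 1.0)
    return x, y

def bayes_rule(x):
    return (x > 1.0).astype(int)          # optimal threshold for this model

def plug_in_rule(x, m0, m1):
    return (np.abs(x - m1) < np.abs(x - m0)).astype(int)

x_test, y_test = sample(200_000)
bayes_risk = np.mean(bayes_rule(x_test) != y_test)

for n in (50, 500, 50_000):
    x_tr, y_tr = sample(n)
    m0, m1 = x_tr[y_tr == 0].mean(), x_tr[y_tr == 1].mean()
    risk = np.mean(plug_in_rule(x_test, m0, m1) != y_test)
    print(n, round(risk - bayes_risk, 4))  # excess risk decreases with n
</syntaxhighlight>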

If the components x_i of x are assumed to be conditionally independent given the class label, one obtains the naive Bayes classifier, where C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r)\prod_{i=1}^{d}P_r(x_i).
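A sketch of this decision rule, assuming (for illustration only) two classes, d = 2 features, and unit-variance Gaussian per-feature densities P_r(x_i), is given below.

<syntaxhighlight lang="python">
# Minimal sketch of the naive Bayes rule argmax_r P(Y=r) * prod_i P_r(x_i),
# with hypothetical per-class, per-feature Gaussian means and unit variances.
import math

PRIORS = {1: 0.5, 2: 0.5}
FEATURE_MEANS = {1: (0.0, 0.0), 2: (2.0, 1.0)}   # assumed means, one per feature

def pdf(x, mean):
    return math.exp(-(x - mean) ** 2 / 2) / math.sqrt(2 * math.pi)

def naive_bayes(x):
    """Treat the features as conditionally independent given the class."""
    def score(r):
        s = PRIORS[r]
        for xi, mi in zip(x, FEATURE_MEANS[r]):
            s *= pdf(xi, mi)
        return s
    return max(PRIORS, key=score)

print(naive_bayes((0.2, -0.1)))   # -> 1
print(naive_bayes((1.9, 1.2)))    # -> 2
</syntaxhighlight>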

==Properties==

The proof that the Bayes classifier is optimal, and that the Bayes error rate is therefore the minimal achievable risk, proceeds as follows.

Consider the binary setting in which each point is classified into one of the two classes Y = \{0,1\}. Let R(h) denote the risk of a classifier h and R^* the Bayes risk, and let the posterior probability of a point belonging to class 1 be \eta(x)=\Pr(Y=1\mid X=x). Define the classifier h^* as

\mathcal{h}^*(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 0.5,\\ 0&\text{otherwise.}\end{cases}

Then we have the following results:

{{ordered list | list-style-type = lower-alpha

| 1 = R(h^*)=R^*, i.e. h^* is a Bayes classifier,

| 2 = For any classifier h, the excess risk satisfies R(h)-R^*=2\mathbb{E}_X\left[|\eta(X)-0.5|\cdot \mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}\right]

| 3 = R^* = \mathbb{E}_X\left[\min(\eta(X),1-\eta(X))\right]

| 4 = R^* = \frac{1}{2} - \frac{1}{2}\mathbb E [|2\eta(X) - 1|]

}}

Proof of (a): For any classifier h, we have

\begin{align}

R(h) &= \mathbb{E}_{XY}\left[ \mathbb{I}_{ \left\{h(X)\ne Y \right\}} \right] \\

&=\mathbb{E}_X\mathbb{E}_{Y|X}[\mathbb{I}_{ \left\{h(X)\ne Y \right\}} ] \\

&= \mathbb{E}_X[\eta(X)\mathbb{I}_{ \left\{h(X)=0\right\}} +(1-\eta(X))\mathbb{I}_{ \left\{h(X)=1 \right\}} ]

\end{align}

where the second line follows from the law of total expectation (Fubini's theorem).

Notice that R(h) is minimised by taking, for every x \in \mathbb{R}^d,

h(x) = \begin{cases}

1&\text{if }\eta(x)\geqslant 1-\eta(x),\\

0&\text{otherwise.}

\end{cases}

Therefore the minimum possible risk is the Bayes risk, R^*= R(h^*).

Proof of (b):

\begin{align}

R(h)-R^* &= R(h)-R(h^*)\\

&= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h(X)=1\right\}}-\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}-(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\

&=\mathbb{E}_X[|2\eta(X)-1|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]\\

&= 2\mathbb{E}_X[|\eta(X)-0.5|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]

\end{align}
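Identity (b) can be checked numerically. The sketch below uses an assumed model (equal priors, X \mid Y=y \sim N(2y,1)) and an arbitrary suboptimal threshold rule, and compares the two sides of the identity by Monte Carlo simulation.

<syntaxhighlight lang="python">
# Monte Carlo check of identity (b) under the assumed model
# P(Y=0)=P(Y=1)=0.5, X|Y=y ~ N(2y, 1), with the suboptimal rule h(x) = 1{x > 1.7}.
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
y = rng.integers(0, 2, size=n)
x = rng.normal(2.0 * y, 1.0)

# Posterior eta(x) = Pr(Y=1 | X=x) for this model (equal priors cancel).
eta = np.exp(-(x - 2.0) ** 2 / 2) / (np.exp(-x ** 2 / 2) + np.exp(-(x - 2.0) ** 2 / 2))
h_star = (eta >= 0.5).astype(int)    # Bayes rule, i.e. threshold at x = 1
h = (x > 1.7).astype(int)            # an arbitrary suboptimal classifier

lhs = np.mean(h != y) - np.mean(h_star != y)           # R(h) - R*
rhs = 2 * np.mean(np.abs(eta - 0.5) * (h != h_star))   # right-hand side of (b)
print(round(lhs, 4), round(rhs, 4))  # the two estimates agree up to Monte Carlo error
</syntaxhighlight>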

Proof of (c):

\begin{align}

R(h^*) &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\

&= \mathbb{E}_X[\min(\eta(X),1-\eta(X))]

\end{align}

Proof of (d):

\begin{align}

R(h^*) &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \\

&= \frac{1}{2} - \mathbb{E}_X[\max(\eta(X) - 1/2,1/2-\eta(X))]\\

&=\frac{1}{2} - \frac{1}{2} \mathbb E [|2\eta(X) - 1|]

\end{align}
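Identities (c) and (d), together with the risk of h^* from (a), can likewise be verified by simulation under the same assumed model; the three Monte Carlo estimates below should agree up to sampling error.

<syntaxhighlight lang="python">
# Numerical sanity check of (a), (c) and (d) under the assumed model
# P(Y=0)=P(Y=1)=0.5, X|Y=y ~ N(2y, 1).
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
y = rng.integers(0, 2, size=n)
x = rng.normal(2.0 * y, 1.0)

# Posterior eta(x) = Pr(Y=1 | X=x); equal priors cancel in Bayes' theorem.
eta = np.exp(-(x - 2.0) ** 2 / 2) / (np.exp(-x ** 2 / 2) + np.exp(-(x - 2.0) ** 2 / 2))
h_star = (eta >= 0.5).astype(int)

print(np.mean(h_star != y))                      # empirical R(h*)
print(np.mean(np.minimum(eta, 1 - eta)))         # identity (c)
print(0.5 - 0.5 * np.mean(np.abs(2 * eta - 1)))  # identity (d)
</syntaxhighlight>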

===General case===

The proof that the Bayes classifier minimises the classification error in the general case, where each element can belong to any of n categories, proceeds by the tower property of expectation as follows.

\begin{align}

\mathbb{E}\left[\mathbb{I}_{\{Y\ne \hat{y}(X)\}}\right] &= \mathbb{E}_X\,\mathbb{E}_{Y|X}\left[\mathbb{I}_{\{Y\ne \hat{y}(x)\}}\mid X=x\right]\\

&= \mathbb{E}_X \left[\Pr(Y=1|X=x)\,\mathbb{I}_{\{\hat{y}(x)\in\{2,3,\dots,n\}\}} + \Pr(Y=2|X=x)\,\mathbb{I}_{\{\hat{y}(x)\in\{1,3,\dots,n\}\}} + \dots + \Pr(Y=n|X=x)\,\mathbb{I}_{\{\hat{y}(x)\in\{1,2,3,\dots,n-1\}\}}\right]

\end{align}

For each observation x, the inner conditional expectation equals 1-\Pr(Y=\hat{y}(x)\mid X=x), so the overall expectation is minimised by minimising each term pointwise, that is, by classifying each observation to the class with the largest posterior probability, h(x) = \underset{k}{\operatorname{argmax}}\,\Pr(Y=k\mid X=x).
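As a tiny worked example with hypothetical posterior probabilities for a fixed x, the conditional error probability 1-\Pr(Y=\hat{y}\mid X=x) is smallest for the argmax class:

<syntaxhighlight lang="python">
# Hypothetical posteriors Pr(Y=k | X=x) for one observation x, with n = 3 classes.
posterior = {1: 0.5, 2: 0.3, 3: 0.2}

# Conditional probability of error for each possible prediction y_hat.
error_prob = {k: 1.0 - p for k, p in posterior.items()}

print(min(error_prob, key=error_prob.get))   # -> 1, the argmax of the posterior
</syntaxhighlight>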

==See also==

==References==