Bayes classifier

{{Short description|Classification algorithm in statistics}}

{{Bayesian statistics}}

In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.{{cite book |author = Devroye, L. |author2=Gyorfi, L. |author3=Lugosi, G. |name-list-style=amp |year = 1996 |title = A probabilistic theory of pattern recognition| publisher=Springer | isbn=0-3879-4618-7}}

==Definition==

Suppose a pair (X,Y) takes values in \mathbb{R}^d \times \{1,2,\dots,K\}, where Y is the class label of an element whose features are given by X. Assume that the conditional distribution of X, given that the label Y takes the value r, is given by

(X\mid Y=r) \sim P_r \quad \text{for} \quad r=1,2,\dots,K

where "\sim" means "is distributed as", and where P_r denotes a probability distribution.

A classifier is a rule that assigns to an observation X=x a guess or estimate of what the unobserved label Y=r actually was. In theoretical terms, a classifier is a measurable function C: \mathbb{R}^d \to \{1,2,\dots,K\}, with the interpretation that C classifies the point x to the class C(x). The probability of misclassification, or risk, of a classifier C is defined as

\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.

The Bayes classifier is

C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r \mid X=x).

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, \operatorname{P}(Y=r \mid X=x). The Bayes classifier is a useful benchmark in statistical classification.
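When the joint distribution of (X,Y) is fully known, the Bayes classifier can be written down directly. The following sketch assumes, purely for illustration, a two-class model with priors 0.6 and 0.4 and unit-variance Gaussian class-conditional distributions; the rule returns the class maximising the (unnormalised) posterior.

<syntaxhighlight lang="python">
# Minimal sketch of the Bayes classifier for a hypothetical, fully known model:
#   P(Y=1) = 0.6,  X | Y=1 ~ N(0, 1)
#   P(Y=2) = 0.4,  X | Y=2 ~ N(2, 1)
# The rule returns argmax_r P(Y=r | X=x); by Bayes' theorem the normalising
# constant P(X=x) is common to all classes and can be dropped.
import math

PRIORS = {1: 0.6, 2: 0.4}      # assumed class priors P(Y=r)
MEANS = {1: 0.0, 2: 2.0}       # assumed class-conditional means (unit variance)

def gaussian_pdf(x, mean, var=1.0):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_classifier(x):
    """argmax over classes of prior times class-conditional density."""
    return max(PRIORS, key=lambda r: PRIORS[r] * gaussian_pdf(x, MEANS[r]))

print(bayes_classifier(0.3))   # -> 1
print(bayes_classifier(1.8))   # -> 2
</syntaxhighlight>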

The excess risk of a general classifier C (possibly depending on some training data) is defined as \mathcal{R}(C) - \mathcal{R}(C^\text{Bayes}).

This non-negative quantity is thus important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.{{Cite journal|url=https://dl.acm.org/doi/abs/10.1109/18.243433 | doi = 10.1109/18.243433 | title = Strong universal consistency of neural network classifiers|year = 1993|last1 = Farago|first1 = A.|last2 = Lugosi|first2 = G.|journal = IEEE Transactions on Information Theory|volume = 39|issue = 4|pages = 1146–1151|url-access = subscription}}
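Consistency can be illustrated by simulation. The sketch below, in which all distributions and sample sizes are assumptions chosen for illustration, trains a simple plug-in classifier that estimates the two class means from n training points and compares its risk with the Bayes risk on a large test sample; the estimated excess risk shrinks as n grows.

<syntaxhighlight lang="python">
# Illustrative simulation of excess risk under an assumed model:
#   P(Y=0) = P(Y=1) = 0.5,  X | Y=y ~ N(2y, 1)
# The plug-in classifier assigns x to the class with the nearest estimated mean.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    y = rng.integers(0, 2, size=n)
    x = rng.normal(2.0 * y, 1.0)
    return x, y

def bayes_rule(x):
    return (x > 1.0).astype(int)          # optimal threshold for this model

def plug_in_rule(x, m0, m1):
    return (np.abs(x - m1) < np.abs(x - m0)).astype(int)

x_test, y_test = sample(200_000)
bayes_risk = np.mean(bayes_rule(x_test) != y_test)

for n in (50, 500, 50_000):
    x_tr, y_tr = sample(n)
    m0, m1 = x_tr[y_tr == 0].mean(), x_tr[y_tr == 1].mean()
    risk = np.mean(plug_in_rule(x_test, m0, m1) != y_test)
    print(n, round(risk - bayes_risk, 4))  # excess risk decreases with n
</syntaxhighlight>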

If the components x_i of x are assumed to be conditionally independent given the class label, one obtains the naive Bayes classifier, where C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r)\prod_{i=1}^{d}P_r(x_i).
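A sketch of this decision rule, assuming (for illustration only) two classes, d = 2 features, and unit-variance Gaussian per-feature densities P_r(x_i), is given below.

<syntaxhighlight lang="python">
# Minimal sketch of the naive Bayes rule argmax_r P(Y=r) * prod_i P_r(x_i),
# with hypothetical per-class, per-feature Gaussian means and unit variances.
import math

PRIORS = {1: 0.5, 2: 0.5}
FEATURE_MEANS = {1: (0.0, 0.0), 2: (2.0, 1.0)}   # assumed means, one per feature

def pdf(x, mean):
    return math.exp(-(x - mean) ** 2 / 2) / math.sqrt(2 * math.pi)

def naive_bayes(x):
    """Treat the features as conditionally independent given the class."""
    def score(r):
        s = PRIORS[r]
        for xi, mi in zip(x, FEATURE_MEANS[r]):
            s *= pdf(xi, mi)
        return s
    return max(PRIORS, key=score)

print(naive_bayes((0.2, -0.1)))   # -> 1
print(naive_bayes((1.9, 1.2)))    # -> 2
</syntaxhighlight>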

==Properties==

The proof that the Bayes classifier is optimal, and that the Bayes error rate is therefore the minimal achievable risk, proceeds as follows.

Consider the binary setting in which each point is classified into one of the two classes Y = \{0,1\}. Let R(h) denote the risk of a classifier h and R^* the Bayes risk, and let the posterior probability of a point belonging to class 1 be \eta(x)=\Pr(Y=1\mid X=x). Define the classifier h^* as

\mathcal{h}^*(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 0.5,\\ 0&\text{otherwise.}\end{cases}

Then we have the following results:

{{ordered list | list-style-type = lower-alpha

| 1 = R(h^*)=R^*, i.e. h^* is a Bayes classifier,

| 2 = For any classifier h, the excess risk satisfies R(h)-R^*=2\mathbb{E}_X\left[|\eta(X)-0.5|\cdot \mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}\right]

| 3 = R^* = \mathbb{E}_X\left[\min(\eta(X),1-\eta(X))\right]

| 4 = R^* = \frac{1}{2} - \frac{1}{2}\mathbb E [|2\eta(X) - 1|]

}}

Proof of (a): For any classifier h, we have

\begin{align}

R(h) &= \mathbb{E}_{XY}\left[ \mathbb{I}_{ \left\{h(X)\ne Y \right\}} \right] \\

&=\mathbb{E}_X\mathbb{E}_{Y|X}[\mathbb{I}_{ \left\{h(X)\ne Y \right\}} ] \\

&= \mathbb{E}_X[\eta(X)\mathbb{I}_{ \left\{h(X)=0\right\}} +(1-\eta(X))\mathbb{I}_{ \left\{h(X)=1 \right\}} ]

\end{align}

where the second line follows from the law of total expectation (Fubini's theorem).

Notice that R(h) is minimised by taking, for every x \in \mathbb{R}^d,

h(x) = \begin{cases}

1&\text{if }\eta(x)\geqslant 1-\eta(x),\\

0&\text{otherwise.}

\end{cases}

Therefore the minimum possible risk is the Bayes risk, R^*= R(h^*).

Proof of (b):

\begin{align}

R(h)-R^* &= R(h)-R(h^*)\\

&= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h(X)=1\right\}}-\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}-(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\

&=\mathbb{E}_X[|2\eta(X)-1|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]\\

&= 2\mathbb{E}_X[|\eta(X)-0.5|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]

\end{align}
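Identity (b) can be checked numerically. The sketch below uses an assumed model (equal priors, X \mid Y=y \sim N(2y,1)) and an arbitrary suboptimal threshold rule, and compares the two sides of the identity by Monte Carlo simulation.

<syntaxhighlight lang="python">
# Monte Carlo check of identity (b) under the assumed model
# P(Y=0)=P(Y=1)=0.5, X|Y=y ~ N(2y, 1), with the suboptimal rule h(x) = 1{x > 1.7}.
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
y = rng.integers(0, 2, size=n)
x = rng.normal(2.0 * y, 1.0)

# Posterior eta(x) = Pr(Y=1 | X=x) for this model (equal priors cancel).
eta = np.exp(-(x - 2.0) ** 2 / 2) / (np.exp(-x ** 2 / 2) + np.exp(-(x - 2.0) ** 2 / 2))
h_star = (eta >= 0.5).astype(int)    # Bayes rule, i.e. threshold at x = 1
h = (x > 1.7).astype(int)            # an arbitrary suboptimal classifier

lhs = np.mean(h != y) - np.mean(h_star != y)           # R(h) - R*
rhs = 2 * np.mean(np.abs(eta - 0.5) * (h != h_star))   # right-hand side of (b)
print(round(lhs, 4), round(rhs, 4))  # the two estimates agree up to Monte Carlo error
</syntaxhighlight>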

Proof of (c):

\begin{align}

R(h^*) &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\

&= \mathbb{E}_X[\min(\eta(X),1-\eta(X))]

\end{align}

Proof of (d):

\begin{align}

R(h^*) &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \\

&= \frac{1}{2} - \mathbb{E}_X[\max(\eta(X) - 1/2,1/2-\eta(X))]\\

&=\frac{1}{2} - \frac{1}{2} \mathbb E [|2\eta(X) - 1|]

\end{align}
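Identities (c) and (d), together with the risk of h^* from (a), can likewise be verified by simulation under the same assumed model; the three Monte Carlo estimates below should agree up to sampling error.

<syntaxhighlight lang="python">
# Numerical sanity check of (a), (c) and (d) under the assumed model
# P(Y=0)=P(Y=1)=0.5, X|Y=y ~ N(2y, 1).
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
y = rng.integers(0, 2, size=n)
x = rng.normal(2.0 * y, 1.0)

# Posterior eta(x) = Pr(Y=1 | X=x); equal priors cancel in Bayes' theorem.
eta = np.exp(-(x - 2.0) ** 2 / 2) / (np.exp(-x ** 2 / 2) + np.exp(-(x - 2.0) ** 2 / 2))
h_star = (eta >= 0.5).astype(int)

print(np.mean(h_star != y))                      # empirical R(h*)
print(np.mean(np.minimum(eta, 1 - eta)))         # identity (c)
print(0.5 - 0.5 * np.mean(np.abs(2 * eta - 1)))  # identity (d)
</syntaxhighlight>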

===General case===

The proof that the Bayes classifier minimises the classification error in the general case, where each element can belong to any of n categories, proceeds by the tower property of expectation as follows.

\begin{align}

\mathbb{E}\left[\mathbb{I}_{\{Y\ne \hat{y}(X)\}}\right] &= \mathbb{E}_X\,\mathbb{E}_{Y|X}\left[\mathbb{I}_{\{Y\ne \hat{y}(x)\}}\mid X=x\right]\\

&= \mathbb{E}_X \left[\Pr(Y=1|X=x)\,\mathbb{I}_{\{\hat{y}(x)\in\{2,3,\dots,n\}\}} + \Pr(Y=2|X=x)\,\mathbb{I}_{\{\hat{y}(x)\in\{1,3,\dots,n\}\}} + \dots + \Pr(Y=n|X=x)\,\mathbb{I}_{\{\hat{y}(x)\in\{1,2,3,\dots,n-1\}\}}\right]

\end{align}

For each observation x, the inner conditional expectation equals 1-\Pr(Y=\hat{y}(x)\mid X=x), so the overall expectation is minimised by minimising each term pointwise, that is, by classifying each observation to the class with the largest posterior probability, h(x) = \underset{k}{\operatorname{argmax}}\,\Pr(Y=k\mid X=x).
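As a tiny worked example with hypothetical posterior probabilities for a fixed x, the conditional error probability 1-\Pr(Y=\hat{y}\mid X=x) is smallest for the argmax class:

<syntaxhighlight lang="python">
# Hypothetical posteriors Pr(Y=k | X=x) for one observation x, with n = 3 classes.
posterior = {1: 0.5, 2: 0.3, 3: 0.2}

# Conditional probability of error for each possible prediction y_hat.
error_prob = {k: 1.0 - p for k, p in posterior.items()}

print(min(error_prob, key=error_prob.get))   # -> 1, the argmax of the posterior
</syntaxhighlight>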

==See also==

==References==