Kernel Fisher discriminant analysis

In statistics, kernel Fisher discriminant analysis (KFD),{{cite book|last=Mika|first=S |author2=Rätsch, G. |author3=Weston, J. |author4=Schölkopf, B. |author5=Müller, KR |date=1999 |title=Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468) |chapter=Fisher discriminant analysis with kernels |author5-link=Klaus-Robert Müller |volume=IX|pages=41–48|doi=10.1109/NNSP.1999.788121|isbn=978-0-7803-5673-3 |citeseerx=10.1.1.35.9904 |s2cid=8473401 }} also known as generalized discriminant analysis{{cite journal|last=Baudat|first=G.|author2=Anouar, F. |title=Generalized discriminant analysis using a kernel approach|journal=Neural Computation|year=2000|volume=12|issue=10|pages=2385–2404|doi=10.1162/089976600300014980|citeseerx=10.1.1.412.760|pmid=11032039|s2cid=7036341 }} and kernel discriminant analysis,{{cite journal|last=Li|first=Y.|author2=Gong, S. |author3=Liddell, H. |title=Recognising trajectories of facial identities using kernel discriminant analysis|journal=Image and Vision Computing|year=2003|volume=21|issue=13–14|pages=1077–1086|doi=10.1016/j.imavis.2003.08.010|citeseerx=10.1.1.2.6315}} is a kernelized version of linear discriminant analysis (LDA). It is named after Ronald Fisher.

Linear discriminant analysis

Intuitively, the idea of LDA is to find a projection where class separation is maximized. Given two sets of labeled data, \mathbf{C}_1 and \mathbf{C}_2, we can calculate the mean value of each class, \mathbf{m}_1 and \mathbf{m}_2, as

:

\mathbf{m}_i = \frac{1}{l_i}\sum_{n=1}^{l_i}\mathbf{x}_n^i,

where l_i is the number of examples of class \mathbf{C}_i. The goal of linear discriminant analysis is to give a large separation of the class means while also keeping the in-class variance small.{{cite book|last=Bishop|first=CM|title=Pattern Recognition and Machine Learning|year=2006|publisher=Springer|location=New York, NY}} This is formulated as maximizing, with respect to \mathbf{w}, the following ratio:

:

J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}},

where \mathbf{S}_B is the between-class covariance matrix and \mathbf{S}_W is the total within-class covariance matrix:

:

\begin{align}

\mathbf{S}_B & = (\mathbf{m}_2-\mathbf{m}_1)(\mathbf{m}_2-\mathbf{m}_1)^{\text{T}} \\

\mathbf{S}_W & = \sum_{i=1,2}\sum_{n=1}^{l_i}(\mathbf{x}_n^i-\mathbf{m}_i)(\mathbf{x}_n^i-\mathbf{m}_i)^{\text{T}}.

\end{align}

The maximum of the above ratio is attained at

:

\mathbf{w} \propto \mathbf{S}^{-1}_W(\mathbf{m}_2-\mathbf{m}_1),

as can be shown by the Lagrange multiplier method (sketch of proof):

Maximizing J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}} is equivalent to maximizing

: \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}

subject to

: \mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} = 1 .

This, in turn, is equivalent to maximizing I(\mathbf{w}, \lambda) = \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w} - \lambda (\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} - 1) , where \lambda is the Lagrange multiplier.

At the maximum, the derivatives of I(\mathbf{w}, \lambda) with respect to \mathbf{w} and \lambda must be zero. Setting \frac{dI}{d\mathbf{w}} = \mathbf{0} yields

: \mathbf{S}_B\mathbf{w} - \lambda \mathbf{S}_W \mathbf{w} = \mathbf{0},

which is trivially satisfied by \mathbf{w} = c \mathbf{S}^{-1}_W(\mathbf{m}_2-\mathbf{m}_1) and \lambda = (\mathbf{m}_2-\mathbf{m}_1)^{\text{T}} \mathbf{S}^{-1}_W(\mathbf{m}_2-\mathbf{m}_1).
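For illustration, the closed-form solution above can be computed directly. The following NumPy sketch (the synthetic data, the class arrays X1 and X2, and the normalization of \mathbf{w} are illustrative assumptions, not part of the original formulation) returns a direction proportional to \mathbf{S}^{-1}_W(\mathbf{m}_2-\mathbf{m}_1):

<syntaxhighlight lang="python">
import numpy as np

def lda_direction(X1, X2):
    """Fisher discriminant direction for two classes.

    X1, X2: arrays of shape (l_1, d) and (l_2, d), one example per row.
    Returns a unit vector proportional to S_W^{-1} (m_2 - m_1).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of outer products of the centred examples.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# Toy example with two Gaussian classes.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(50, 2))
w = lda_direction(X1, X2)
</syntaxhighlight>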

Extending LDA

To extend LDA to non-linear mappings, the data, given as the \ell points \mathbf{x}_i, can be mapped to a new feature space, F, via some function \phi. In this new feature space, the function that needs to be maximized is

: J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}},

where

:

\begin{align}

\mathbf{S}_B^{\phi} & = \left (\mathbf{m}_2^{\phi}-\mathbf{m}_1^{\phi} \right ) \left (\mathbf{m}_2^{\phi}-\mathbf{m}_1^{\phi} \right )^{\text{T}} \\

\mathbf{S}_W^{\phi} & = \sum_{i=1,2} \sum_{n=1}^{l_i} \left (\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi} \right ) \left (\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi} \right)^{\text{T}},

\end{align}

and

: \mathbf{m}_i^{\phi} = \frac{1}{l_i}\sum_{j=1}^{l_i}\phi(\mathbf{x}_j^i).

Further, note that \mathbf{w}\in F. Explicitly computing the mappings \phi(\mathbf{x}_i) and then performing LDA can be computationally expensive, and in many cases intractable; for example, F may be infinite-dimensional. Thus, rather than explicitly mapping the data to F, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and replacing each dot product in the new feature space with a kernel function, k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{y}).
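A common choice is the Gaussian (RBF) kernel; the sketch below (the kernel and its width sigma are illustrative assumptions, since KFD works with any positive-definite kernel) evaluates the kernel matrix between two sets of points, which is all the reformulated algorithm requires in place of the explicit mappings \phi:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    X: (n, d) array, Y: (m, d) array; returns the (n, m) matrix of kernel values.
    """
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
</syntaxhighlight>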

LDA can be reformulated in terms of dot products by first noting that \mathbf{w} will have an expansion of the form{{cite book|last=Schölkopf|first=B|author2=Herbrich, R. |author3=Smola, A. |title=Computational Learning Theory|chapter=A Generalized Representer Theorem|volume=2111|pages=416–426|year=2001|doi=10.1007/3-540-44581-1_27|citeseerx=10.1.1.42.8617|series=Lecture Notes in Computer Science|isbn=978-3-540-42343-0}}

: \mathbf{w} = \sum_{i=1}^l\alpha_i\phi(\mathbf{x}_i).

Then note that

: \mathbf{w}^{\text{T}}\mathbf{m}_i^{\phi} = \frac{1}{l_i}\sum_{j=1}^{l}\sum_{k=1}^{l_i}\alpha_j k \left (\mathbf{x}_j,\mathbf{x}_k^i \right ) = \mathbf{\alpha}^{\text{T}}\mathbf{M}_i,

where

: (\mathbf{M}_i)_j = \frac{1}{l_i}\sum_{k=1}^{l_i}k(\mathbf{x}_j,\mathbf{x}_k^i).
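As a small sketch (assuming a kernel function such as the rbf_kernel above, an array X holding all l training points, and an array Xi holding the l_i points of class i), the vector \mathbf{M}_i is simply a row-wise mean of kernel evaluations:

<syntaxhighlight lang="python">
import numpy as np

def class_kernel_mean(kernel, X, Xi):
    """Vector M_i with entries (M_i)_j = (1/l_i) * sum_k k(x_j, x_k^i)."""
    return kernel(X, Xi).mean(axis=1)  # shape (l,)
</syntaxhighlight>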

The numerator of J(\mathbf{w}) can then be written as:

: \mathbf{w}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{w} = \mathbf{w}^{\text{T}} \left (\mathbf{m}_2^{\phi}-\mathbf{m}_1^{\phi} \right ) \left (\mathbf{m}_2^{\phi}-\mathbf{m}_1^{\phi} \right )^{\text{T}}\mathbf{w} = \mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha}, \qquad \text{where} \qquad \mathbf{M} = (\mathbf{M}_2-\mathbf{M}_1)(\mathbf{M}_2-\mathbf{M}_1)^{\text{T}}.

Similarly, the denominator can be written as

: \mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}=\mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha},\qquad \text{where} \qquad \mathbf{N} = \sum_{j=1,2}\mathbf{K}_j(\mathbf{I}-\mathbf{1}_{l_j})\mathbf{K}_j^{\text{T}},

with the (n,m)^{\text{th}} component of \mathbf{K}_j defined as k(\mathbf{x}_n,\mathbf{x}_m^j), \mathbf{I} the identity matrix, and \mathbf{1}_{l_j} the matrix with all entries equal to 1/l_j. This identity can be derived by starting from the expression for \mathbf{w}^{\text{T}} \mathbf{S}_W^{\phi}\mathbf{w} and using the expansion of \mathbf{w} together with the definitions of \mathbf{S}_W^{\phi} and \mathbf{m}_i^{\phi}:

:

\begin{align}

\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}

&= \left(\sum_{i=1}^l\alpha_i\phi^{\text{T}}(\mathbf{x}_i)\right)\left(\sum_{j=1,2}\sum_{n =1}^{l_j} \left (\phi(\mathbf{x}_n^j)-\mathbf{m}_j^{\phi} \right ) \left (\phi(\mathbf{x}_n^j)-\mathbf{m}_j^{\phi} \right )^{\text{T}}\right) \left(\sum_{k=1}^l\alpha_k\phi(\mathbf{x}_k)\right)\\

&= \sum_{j=1,2}\sum_{i=1}^l\sum_{n =1}^{l_j}\sum_{k=1}^l \left (\alpha_i\phi^{\text{T}} (\mathbf{x}_i) \left (\phi(\mathbf{x}_n^j)-\mathbf{m}_j^{\phi} \right ) \left (\phi(\mathbf{x}_n^j)-\mathbf{m}_j^{\phi} \right )^{\text{T}} \alpha_k\phi(\mathbf{x}_k) \right ) \\

& = \sum_{j=1,2}\sum_{i=1}^l\sum_{n =1}^{l_j}\sum_{k=1}^l \left(\alpha_i k(\mathbf{x}_i,\mathbf{x}_n^j)-\frac{1}{l_j}\sum_{p=1}^{l_j}\alpha_i k(\mathbf{x}_i,\mathbf{x}_p^j)\right) \left(\alpha_k k(\mathbf{x}_k,\mathbf{x}_n^j)-\frac{1}{l_j}\sum_{q=1}^{l_j}\alpha_k k(\mathbf{x}_k,\mathbf{x}_q^j)\right) \\

& = \sum_{j=1,2}\left( \sum_{i=1}^l\sum_{n =1}^{l_j}\sum_{k=1}^l \left ( \alpha_i\alpha_k k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_n^j) - \frac{2\alpha_i\alpha_k}{l_j}\sum_{p=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_p^j) + \frac{\alpha_i\alpha_k}{l_j^2}\sum_{p=1}^{l_j}\sum_{q=1}^{l_j} k(\mathbf{x}_i,\mathbf{x}_p^j)k(\mathbf{x}_k,\mathbf{x}_q^j) \right) \right )\\

& = \sum_{j=1,2}\left( \sum_{i=1}^l\sum_{n =1}^{l_j}\sum_{k=1}^l\left( \alpha_i\alpha_k k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_n^j) - \frac{\alpha_i\alpha_k}{l_j}\sum_{p=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_p^j) \right)\right) \\[6pt]

& = \sum_{j=1,2} \left( \mathbf{\alpha}^{\text{T}} \mathbf{K}_j\mathbf{K}_j^{\text{T}}\mathbf{\alpha} - \mathbf{\alpha}^{\text{T}} \mathbf{K}_j\mathbf{1}_{l_j}\mathbf{K}_j^{\text{T}}\mathbf{\alpha} \right) \\[4pt]

& = \mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha}.

\end{align}

With these equations for the numerator and denominator of J(\mathbf{w}), the equation for J can be rewritten as

: J(\mathbf{\alpha}) = \frac{\mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha}}{\mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha}}.

Then, differentiating and setting equal to zero gives

: (\mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha})\mathbf{N}\mathbf{\alpha} = (\mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha})\mathbf{M}\mathbf{\alpha}.

Since only the direction of \mathbf{w}, and hence the direction of \mathbf{\alpha}, matters, the above can be solved for \mathbf{\alpha} as

: \mathbf{\alpha} = \mathbf{N}^{-1}(\mathbf{M}_2- \mathbf{M}_1).

Note that in practice, \mathbf{N} is usually singular and so a multiple of the identity is added to it

: \mathbf{N}_{\epsilon} = \mathbf{N}+\epsilon\mathbf{I}.

Given the solution for \mathbf{\alpha}, the projection of a new data point is given by

: y(\mathbf{x}) = (\mathbf{w}\cdot\phi(\mathbf{x})) = \sum_{i=1}^l\alpha_ik(\mathbf{x}_i,\mathbf{x}).
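Putting the pieces together, a minimal two-class KFD sketch in NumPy (the Gaussian kernel, its width sigma, the regularization constant eps, and the synthetic data are illustrative assumptions) computes \mathbf{M}_1, \mathbf{M}_2 and \mathbf{N}, solves for \mathbf{\alpha} using the regularized \mathbf{N}_{\epsilon}, and projects a new point:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kfd_fit(X1, X2, sigma=1.0, eps=1e-3):
    """Return the coefficient vector alpha and the stacked training data."""
    X = np.vstack([X1, X2])                     # all l = l_1 + l_2 training points
    M1 = rbf_kernel(X, X1, sigma).mean(axis=1)  # (M_1)_j = (1/l_1) sum_k k(x_j, x_k^1)
    M2 = rbf_kernel(X, X2, sigma).mean(axis=1)
    N = np.zeros((len(X), len(X)))
    for Xj in (X1, X2):
        Kj = rbf_kernel(X, Xj, sigma)           # l x l_j block of kernel values
        lj = Xj.shape[0]
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
    # alpha = N_eps^{-1} (M_2 - M_1), with N regularized by eps * I.
    alpha = np.linalg.solve(N + eps * np.eye(len(X)), M2 - M1)
    return alpha, X

def kfd_project(alpha, X, x_new, sigma=1.0):
    """y(x) = sum_i alpha_i k(x_i, x)."""
    return rbf_kernel(np.atleast_2d(x_new), X, sigma) @ alpha

# Toy example.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))
X2 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(40, 2))
alpha, X = kfd_fit(X1, X2)
y_new = kfd_project(alpha, X, np.array([1.0, 1.0]))
</syntaxhighlight>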

Multi-class KFD

The extension to cases where there are more than two classes is relatively straightforward.{{cite book|last=Duda|first=R. |author2=Hart, P. |author3=Stork, D.|title=Pattern Classification|year=2001|publisher=Wiley|location=New York, NY}}{{cite journal|last=Zhang|first=J.|author2=Ma, K.K.|title=Kernel fisher discriminant for texture classification|year=2004}} Let c be the number of classes. Then multi-class KFD involves projecting the data into a (c-1)-dimensional space using (c-1) discriminant functions

:

y_i = \mathbf{w}_i^{\text{T}}\phi(\mathbf{x}) \qquad i= 1,\ldots,c-1.

This can be written in matrix notation

:

\mathbf{y} = \mathbf{W}^{\text{T}}\phi(\mathbf{x}),

where the \mathbf{w}_i are the columns of \mathbf{W}. Further, the between-class covariance matrix is now

:

\mathbf{S}_B^{\phi} = \sum_{i=1}^c l_i(\mathbf{m}_i^{\phi}-\mathbf{m}^{\phi})(\mathbf{m}_i^{\phi}-\mathbf{m}^{\phi})^{\text{T}},

where \mathbf{m}^\phi is the mean of all the data in the new feature space. The within-class covariance matrix is

:

\mathbf{S}_W^{\phi} = \sum_{i=1}^c \sum_{n=1}^{l_i}(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi})(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi})^{\text{T}}.

The solution is now obtained by maximizing

:

J(\mathbf{W}) = \frac{\left|\mathbf{W}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{W}\right|}{\left|\mathbf{W}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{W}\right|}.

The kernel trick can again be used and the goal of multi-class KFD becomes

:

\mathbf{A}^* = \underset{\mathbf{A}}{\operatorname{argmax}} \frac{\left|\mathbf{A}^{\text{T}}\mathbf{M}\mathbf{A}\right|}{\left|\mathbf{A}^{\text{T}}\mathbf{N}\mathbf{A}\right|},

where \mathbf{A} = [\mathbf{\alpha}_1,\ldots,\mathbf{\alpha}_{c-1}] and

:

\begin{align}

\mathbf{M} & = \sum_{j=1}^cl_j(\mathbf{M}_j-\mathbf{M}_{*})(\mathbf{M}_j-\mathbf{M}_{*})^{\text{T}} \\

\mathbf{N} & = \sum_{j=1}^c\mathbf{K}_j(\mathbf{I}-\mathbf{1}_{l_j})\mathbf{K}_j^{\text{T}}.

\end{align}

The \mathbf{M}_i are defined as in the above section and \mathbf{M}_{*} is defined as

:

(\mathbf{M}_{*})_j = \frac{1}{l}\sum_{k=1}^{l}k(\mathbf{x}_j,\mathbf{x}_k).

\mathbf{A}^{*} can then be computed by finding the (c-1) leading eigenvectors of \mathbf{N}^{-1}\mathbf{M}. Furthermore, the projection of a new input, \mathbf{x}_t, is given by

:

\mathbf{y}(\mathbf{x}_t) = \left(\mathbf{A}^{*}\right)^{\text{T}}\mathbf{K}_t,

where the i^{\text{th}} component of \mathbf{K}_t is given by k(\mathbf{x}_i,\mathbf{x}_t).
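A sketch of the multi-class computation in the same spirit (the Gaussian kernel and the regularization of \mathbf{N} are again illustrative assumptions; the leading eigenvectors are taken from \mathbf{N}^{-1}\mathbf{M} after regularizing \mathbf{N}):

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def multiclass_kfd_fit(class_data, sigma=1.0, eps=1e-3):
    """class_data: list of (l_i, d) arrays, one per class. Returns A* and the stacked data."""
    X = np.vstack(class_data)
    l = len(X)
    M_star = rbf_kernel(X, X, sigma).mean(axis=1)  # (M_*)_j = (1/l) sum_k k(x_j, x_k)
    M = np.zeros((l, l))
    N = np.zeros((l, l))
    for Xj in class_data:
        lj = len(Xj)
        Kj = rbf_kernel(X, Xj, sigma)
        Mj = Kj.mean(axis=1)
        M += lj * np.outer(Mj - M_star, Mj - M_star)
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
    # (c-1) leading eigenvectors of N^{-1} M, with N regularized to keep it invertible.
    evals, evecs = np.linalg.eig(np.linalg.solve(N + eps * np.eye(l), M))
    order = np.argsort(evals.real)[::-1]
    A = evecs[:, order[: len(class_data) - 1]].real
    return A, X

def multiclass_kfd_project(A, X, x_new, sigma=1.0):
    """y(x_t) = A^T K_t with (K_t)_i = k(x_i, x_t)."""
    K_t = rbf_kernel(X, np.atleast_2d(x_new), sigma)[:, 0]
    return A.T @ K_t
</syntaxhighlight>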

Classification using KFD

In both two-class and multi-class KFD, the class label of a new input can be assigned as

:

f(\mathbf{x}) = \arg\min_j D(\mathbf{y}(\mathbf{x}),\bar{\mathbf{y}}_j),

where \bar{\mathbf{y}}_j is the projected mean for class j and D(\cdot,\cdot) is a distance function.
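Continuing the sketches above, a minimal nearest-projected-mean rule (the Euclidean distance is an illustrative choice of D):

<syntaxhighlight lang="python">
import numpy as np

def kfd_classify(y_new, class_projection_means):
    """Return the index j minimizing D(y(x), y_bar_j), with D the Euclidean distance.

    y_new: projection of the new input, shape (c-1,) or a scalar in the two-class case.
    class_projection_means: list of projected class means with matching shape.
    """
    dists = [np.linalg.norm(np.atleast_1d(y_new) - np.atleast_1d(m))
             for m in class_projection_means]
    return int(np.argmin(dists))
</syntaxhighlight>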

Applications

Kernel discriminant analysis has been used in a variety of applications. These include:

  • Face recognition{{cite journal|last=Liu|first=Q.|author2=Lu, H. |author3=Ma, S. |title=Improving kernel Fisher discriminant analysis for face recognition|journal=IEEE Transactions on Circuits and Systems for Video Technology|year=2004|volume=14|issue=1|pages=42–49|doi=10.1109/tcsvt.2003.818352|s2cid=39657721 }}{{cite journal|last=Liu|first=Q.|author2=Huang, R. |author3=Lu, H. |author4=Ma, S. |title=Face recognition using kernel-based Fisher discriminant analysis|journal=IEEE International Conference on Automatic Face and Gesture Recognition|year=2002}} and detection{{cite book|last=Kurita|first=T.|author2=Taguchi, T. |title=Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition |chapter=A modification of kernel-based Fisher discriminant analysis for face detection |pages=300–305|year=2002|doi=10.1109/AFGR.2002.1004170|citeseerx=10.1.1.100.3568|isbn=978-0-7695-1602-8|s2cid=7581426 }}{{cite journal|last=Feng|first=Y.|author2=Shi, P. |title=Face detection based on kernel fisher discriminant analysis|journal=IEEE International Conference on Automatic Face and Gesture Recognition|year=2004}}
  • Hand-written digit recognition{{cite journal|last=Yang|first=J.|author2=Frangi, AF |author3=Yang, JY |author4=Zhang, D. |author5=Jin, Z. |title=KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|year=2005|volume=27|issue=2|pages=230–244|doi=10.1109/tpami.2005.33 |pmid=15688560|citeseerx=10.1.1.330.1179|s2cid=9771368 }}
  • Palmprint recognition{{cite journal|last=Wang|first=Y.|author2=Ruan, Q. |title=Kernel fisher discriminant analysis for palmprint recognition|journal=International Conference on Pattern Recognition|year=2006}}
  • Classification of malignant and benign cluster microcalcifications{{cite journal|last=Wei|first=L.|author2=Yang, Y. |author3=Nishikawa, R.M. |author4= Jiang, Y. |title=A study on several machine-learning methods for classification of malignant and benign clustered microcalcifications|journal=IEEE Transactions on Medical Imaging|year=2005|volume=24|issue=3|pages=371–380|doi=10.1109/tmi.2004.842457|pmid=15754987 |s2cid=36691320 }}
  • Seed classification
  • Search for the Higgs boson at CERN{{cite journal|last=Malmgren|first=T.|title=An iterative nonlinear discriminant analysis program: IDA 1.0|journal=Computer Physics Communications|year=1997|volume=106|issue=3|pages=230–236|doi=10.1016/S0010-4655(97)00100-8|bibcode=1997CoPhC.106..230M }}

See also

References

{{Reflist}}