f-divergence

{{Short description|Function that measures dissimilarity between two probability distributions}}

{{more footnotes|date=September 2015}}

{{DISPLAYTITLE:f-divergence}}

In probability theory, an f-divergence is a function D_f(P\| Q), generated by a convex function f, that measures the difference between two probability distributions P and Q. Many common divergences, such as the KL-divergence, the Hellinger distance, and the total variation distance, are special cases of f-divergence.

History

These divergences were introduced by Alfréd Rényi{{cite conference |title= On measures of entropy and information|first= Alfréd|last= Rényi |year= 1961|conference=The 4th Berkeley Symposium on Mathematical Statistics and Probability, 1960| publisher= University of California Press|location= Berkeley, CA|pages= 547–561|url= http://digitalassets.lib.berkeley.edu/math/ucb/text/math_s4_v1_article-27.pdf}} (Eq. (4.20)) in the same paper in which he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by {{harvtxt|Csiszár|1963}}, {{harvtxt|Morimoto|1963}} and {{harvtxt|Ali|Silvey|1966}} and are sometimes known as Csiszár f-divergences, Csiszár–Morimoto divergences, or Ali–Silvey distances.

Definition

= Non-singular case =

Let P and Q be two probability distributions over a space \Omega, such that P\ll Q, that is, P is absolutely continuous with respect to Q (so that Q(A)=0 implies P(A)=0 for every measurable set A). Then, for a convex function f: [0, +\infty)\to(-\infty, +\infty] such that f(x) is finite for all x > 0, f(1)=0, and f(0)=\lim_{t\to 0^+} f(t) (which could be infinite), the f-divergence of P from Q is defined as

: D_f(P\parallel Q) \equiv \int_{\Omega} f\left(\frac{dP}{dQ}\right)\,dQ.

We call f the generator of D_f.

In concrete applications, there is usually a reference distribution \mu on \Omega (for example, when \Omega = \R^n, the reference distribution is the Lebesgue measure) such that P, Q \ll \mu; we can then use the Radon–Nikodym theorem to take their probability densities p and q, giving

: D_f(P\parallel Q) = \int_{\Omega} f\left(\frac{p(x)}{q(x)}\right)q(x)\,d\mu(x).

When there is no such reference distribution ready at hand, we can simply define \mu = P+Q, and proceed as above. This is a useful technique in more abstract proofs.
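For illustration, the following is a minimal sketch (not part of the formal definition) of this formula for discrete distributions on a finite set, taking \mu to be the counting measure; the function names f_divergence and kl_generator are illustrative choices, not a standard API.

<syntaxhighlight lang="python">
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i) for discrete P, Q.

    Assumes P << Q, i.e. p_i = 0 wherever q_i = 0; the singular case
    is treated in the next section.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = q > 0
    return float(np.sum(q[support] * f(p[support] / q[support])))

def kl_generator(t):
    # Generator of the KL-divergence: f(t) = t ln t, with f(0) = 0 by continuity.
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, kl_generator))  # D_KL(P || Q)
print(np.sum(p * np.log(p / q)))         # the same value, computed directly
</syntaxhighlight>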

= Extension to [[singular measures]] =

The above definition can be extended to cases where P\ll Q is no longer satisfied (Definition 7.1 of {{Cite book |last1=Polyanskiy |first1=Yury |url=https://people.lids.mit.edu/yp/homepage/data/itbook-export.pdf |title=Information Theory: From Coding to Learning (draft of October 20, 2022) |last2=Yihong |first2=Wu |publisher=Cambridge University Press |year=2022 |archive-url=https://web.archive.org/web/20230201222557/https://people.lids.mit.edu/yp/homepage/data/itbook-export.pdf |archive-date=2023-02-01}}).

Since f is convex, and f(1) = 0 , the function \frac{f(x)}{x-1} must be nondecreasing, so there exists f'(\infty) := \lim_{x\to\infty}f(x)/x, taking value in (-\infty, +\infty].

Since for any p(x)>0 we have \lim_{q(x)\to 0} q(x)f \left(\frac{p(x)}{q(x)}\right) = p(x)f'(\infty) , we can extend the definition of the f-divergence to the case P\not\ll Q by setting D_f(P\parallel Q) = \int_{\{q > 0\}} q(x) f\left(\frac{p(x)}{q(x)}\right)\,d\mu(x) + f'(\infty)\, P[q=0].
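As a hedged illustration of this extension (a sketch only; the helper name f_divergence_extended and the example distributions are illustrative), the term f'(\infty)\, P[q=0] is added to the discrete formula:

<syntaxhighlight lang="python">
import numpy as np

def f_divergence_extended(p, q, f, f_prime_inf):
    """Extended f-divergence: sum_{q_i > 0} q_i f(p_i / q_i) + f'(inf) * P[q = 0]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = q > 0
    regular = float(np.sum(q[support] * f(p[support] / q[support])))
    mass_outside = float(np.sum(p[~support]))          # P[q = 0]
    singular = f_prime_inf * mass_outside if mass_outside > 0 else 0.0
    return regular + singular

# Total variation: f(t) = |t - 1| / 2, so f'(inf) = 1/2.
tv_generator = lambda t: 0.5 * np.abs(t - 1.0)
p = np.array([0.6, 0.4, 0.0])
q = np.array([0.5, 0.0, 0.5])        # here P is not absolutely continuous w.r.t. Q
print(f_divergence_extended(p, q, tv_generator, f_prime_inf=0.5))
print(0.5 * np.sum(np.abs(p - q)))   # total variation computed directly: also 0.5
</syntaxhighlight>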

Properties

= Basic relations between f-divergences =

  • Linearity: D_{\sum_i a_i f_i} = \sum_i a_i D_{f_i} given a finite sequence of nonnegative real numbers a_i and generators f_i.
  • D_f = D_g iff f(x) = g(x) + c(x-1) for some c\in \R; a numerical illustration follows the proof below.

{{Math proof|drop=hidden|proof=

If f(x) = g(x) + c(x-1), then D_f = D_g by definition.

Conversely, if D_f - D_g = 0, then let h = f-g. For any two probability measures P, Q on the set \{0, 1\}, since D_f(P\| Q) - D_g(P\|Q) = 0, we get

h(P_1/Q_1) = -\frac{Q_0}{Q_1}h(P_0/Q_0)

Since each probability measure P, Q has one degree of freedom, we can solve \frac{P_0}{Q_0} = a, \frac{P_1}{Q_1} = x for every choice of 0 < a < 1 < x.

Linear algebra yields Q_0 = \frac{x-1}{x-a}, Q_1 = \frac{1-a}{x-a}, which is a valid probability measure. Then we obtain h(x) = \frac{h(a)}{a-1}(x-1), h(a) = \frac{h(x)}{x-1}(a-1).

Thus

h(x)=\begin{cases} c_1(x-1)\quad\text{if } x>1,\\ c_0(x-1)\quad\text{if } 0<x<1 \end{cases}

for some constants c_0, c_1. Plugging the formula into h(x) = \frac{h(a)}{a-1}(x-1) yields c_0 = c_1.

}}
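A quick numerical illustration of this equivalence (a sketch; the distributions and the constant c = 3 are arbitrary choices): adding c(t-1) to the KL generator t\ln t leaves the divergence unchanged, because \sum_i q_i\left(\frac{p_i}{q_i}-1\right) = 0.

<syntaxhighlight lang="python">
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

f = lambda t: t * np.log(t)          # generator of the KL-divergence
c = 3.0
g = lambda t: f(t) + c * (t - 1.0)   # shifted generator, same divergence

d_f = np.sum(q * f(p / q))
d_g = np.sum(q * g(p / q))
print(d_f, d_g)                      # identical up to floating-point error
assert np.isclose(d_f, d_g)
</syntaxhighlight>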

= Basic properties of f-divergences =

{{unordered list

|1= Non-negativity: the ƒ-divergence is always nonnegative; it is zero if the measures P and Q coincide. This follows immediately from Jensen’s inequality:

: D_f(P\!\parallel\!Q) = \int \!f\bigg(\frac{dP}{dQ}\bigg)dQ \geq f\bigg( \int\frac{dP}{dQ}dQ\bigg) = f(1) = 0.

|2= Data-processing inequality: if κ is an arbitrary transition probability that transforms measures P and Q into Pκ and Qκ correspondingly, then

: D_f(P\!\parallel\!Q) \geq D_f(P_\kappa\!\parallel\!Q_\kappa).

The equality here holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q}. A numerical check of this inequality for discrete distributions is sketched after this list.

|3= Joint convexity: for any {{nowrap|0 ≤ λ ≤ 1}},

: D_f\Big(\lambda P_1 + (1-\lambda)P_2 \parallel \lambda Q_1 + (1-\lambda)Q_2\Big) \leq \lambda D_f(P_1\!\parallel\!Q_1) + (1-\lambda)D_f(P_2\!\parallel\!Q_2).

This follows from the convexity of the mapping (p,q) \mapsto q f(p/q) on \mathbb{R}_+^2.

|4= Reversal by convex inversion: for any function f, its convex inversion is defined as g(t):= t f(1/t). When f satisfies the defining features of an f-divergence generator (f(x) is finite for all x > 0, f(1)=0, and f(0)=\lim_{t\to 0^+} f(t)), then g satisfies the same features, and thus defines an f-divergence D_g. This is the "reverse" of D_f, in the sense that D_g(P\|Q) = D_f(Q\|P) for all P, Q that are absolutely continuous with respect to each other.

In this way, every f-divergence D_f can be symmetrized by taking D_{\frac 1 2 (f + g)} = \frac 1 2 (D_f + D_g). For example, this symmetrization turns the KL-divergence into half of the Jeffreys divergence.

}}
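The data-processing inequality above can be checked numerically for discrete distributions, where a transition probability \kappa is a row-stochastic matrix and P_\kappa is the corresponding vector–matrix product. The following sketch (with an arbitrary kernel and the KL generator; nothing here is a standard API) verifies the inequality:

<syntaxhighlight lang="python">
import numpy as np

def f_div(p, q, f):
    return float(np.sum(q * f(p / q)))   # assumes all q_i > 0

f = lambda t: t * np.log(t)              # KL generator

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# Transition kernel: kappa[i, j] = probability of moving from state i to state j.
kappa = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

p_out = p @ kappa                        # P kappa
q_out = q @ kappa                        # Q kappa

print(f_div(p, q, f), f_div(p_out, q_out, f))
assert f_div(p, q, f) >= f_div(p_out, q_out, f)   # data-processing inequality
</syntaxhighlight>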

In particular, the data-processing inequality (monotonicity) implies that if a Markov process has a positive equilibrium probability distribution P^* then D_f(P(t)\parallel P^*) is a monotonic (non-increasing) function of time, where the probability distribution P(t) is a solution of the Kolmogorov forward equations (or Master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences D_f(P(t)\parallel P^*) are Lyapunov functions of the Kolmogorov forward equations. The converse statement is also true: if H(P) is a Lyapunov function for all Markov chains with positive equilibrium P^* and is of the trace form H(P)=\sum_{i}f(P_{i},P_{i}^{*}), then H(P)= D_f(P\parallel P^*) for some convex function f.{{cite journal |last1= Gorban|first1= Pavel A.| date= 15 October 2003|title= Monotonically equivalent entropies and solution of additivity equation|journal= Physica A|volume= 328|issue=3–4 |pages= 380–390|doi=10.1016/S0378-4371(03)00578-8 |arxiv= cond-mat/0304131|bibcode= 2003PhyA..328..380G|s2cid= 14975501}}{{cite conference |title= Divergence, Optimization, Geometry|first= Shun'ichi |last= Amari |author-link= Shun'ichi Amari |year= 2009|conference= The 16th International Conference on Neural Information Processing (ICONIP 2009), Bangkok, Thailand, 1–5 December 2009 |editor=Leung, C.S. |editor2=Lee, M. |editor3=Chan, J.H.|series= Lecture Notes in Computer Science, vol 5863 |publisher= Springer |location= Berlin, Heidelberg |pages= 185–193 |doi=10.1007/978-3-642-10677-4_21 }} For example, Bregman divergences in general do not have this property and can increase in Markov processes.{{cite journal |last1= Gorban|first1= Alexander N.| date= 29 April 2014|title= General H-theorem and Entropies that Violate the Second Law|journal= Entropy|volume= 16|issue=5|pages= 2408–2432|doi=10.3390/e16052408|arxiv=1212.6767|bibcode= 2014Entrp..16.2408G|doi-access= free}}

= Analytic properties =

The f-divergences can be expressed using Taylor series and rewritten using a weighted sum of chi-type distances ({{harvtxt|Nielsen|Nock|2013}}).

= Basic variational representation =

Let f^* be the convex conjugate of f. Let \mathrm{effdom}(f^*) be the effective domain of f^*, that is, \mathrm{effdom}(f^*) = \{y : f^*(y) < \infty\}. Then we have two variational representations of D_f, which we describe below.

Under the above setup,

{{Math theorem

| name = Theorem

| math_statement = D_f(P; Q) = \sup_{g: \Omega\to \mathrm{effdom}(f^*)} E_P[g] - E_Q[f^* \circ g].

}}

This is Theorem 7.24 of Polyanskiy & Wu (2022).

== Example applications ==

Using this theorem on total variation distance, with generator f(x)= \frac 1 2 |x-1|, its convex conjugate is f^*(x^*) = x^* on [-1/2, 1/2] and f^*(x^*) = +\infty elsewhere, and we obtain

TV(P\| Q) = \sup_{|g|\leq 1/2} E_P[g(X)] - E_Q[g(X)].

For the chi-squared divergence, defined by f(x) = (x-1)^2, the convex conjugate is f^*(y) = y^2/4 + y, and we obtain

\chi^2(P; Q) = \sup_g E_P[g(X)] - E_Q[g(X)^2/4 + g(X)].

The domain over which g varies is closed under affine reparametrizations g \mapsto ag+b, but the expression itself is not affine-invariant in g, so we can exploit this freedom to obtain a leaner expression.

Replacing g by a g + b and taking the maximum over a, b \in \R, we obtain

\chi^2(P; Q) = \sup_g \frac{(E_P[g(X)]-E_Q[g(X)])^2}{Var_Q[g(X)]},

which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound (Theorem 29.1 and its corollary in Polyanskiy & Wu (2022)).
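As a sanity check of this last representation (a sketch; the choice g = dP/dQ below is a known maximizer in the discrete case, not part of the theorem statement), the ratio attains the \chi^2-divergence at g = dP/dQ and stays below it for other choices of g:

<syntaxhighlight lang="python">
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

def ratio(g):
    """(E_P[g] - E_Q[g])^2 / Var_Q[g]."""
    gap = np.sum(p * g) - np.sum(q * g)
    var = np.sum(q * g**2) - np.sum(q * g)**2
    return gap**2 / var

chi2 = np.sum((p - q)**2 / q)          # (Pearson) chi-squared divergence

print(chi2, ratio(p / q))              # equal: g = dP/dQ attains the supremum
rng = np.random.default_rng(0)
for _ in range(5):
    g = rng.normal(size=3)
    assert ratio(g) <= chi2 + 1e-12    # any other g gives a smaller value
</syntaxhighlight>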

For the \alpha-divergence with \alpha \in (-\infty, 0)\cup(0, 1), we have the generator f_\alpha(x) = \frac{x^\alpha - \alpha x - (1-\alpha)}{\alpha(\alpha-1)}, defined on x\in [0, \infty). Its convex conjugate is f_\alpha^*(y)=\frac{1}{\alpha}(x(y)^\alpha - 1), with effective domain y\in(-\infty, (1-\alpha)^{-1}), where x(y) = ((\alpha-1)y + 1)^{\frac{1}{\alpha-1}}.

Applying this theorem yields, after substitution with h = ((\alpha-1)g+1)^{\frac{1}{\alpha-1}},

D_\alpha(P\| Q) = \frac{1}{\alpha(1-\alpha)} - \inf_{h: \Omega\to (0,\infty)}\left( E_Q\left[\frac{h^\alpha}{\alpha}\right] + E_P\left[\frac{h^{\alpha-1}}{1-\alpha}\right] \right),

or, releasing the constraint on h,

D_\alpha(P\| Q) = \frac{1}{\alpha(1-\alpha)} - \inf_{h: \Omega\to \R}\left( E_Q\left[\frac{|h|^\alpha}{\alpha}\right] + E_P\left[\frac{|h|^{\alpha-1}}{1-\alpha}\right] \right).

Setting \alpha=-1 yields the variational representation of \chi^2-divergence obtained above.

The domain over which h varies is not affine-invariant in general, unlike the \chi^2-divergence case. The \chi^2-divergence is special, since in that case, we can remove the |\cdot | from |h|.

For general \alpha \in (-\infty, 0)\cup(0, 1), the domain over which h varies is merely scale-invariant. Similarly to the above, we can replace h by ah and take the minimum over a>0 to obtain

D_\alpha(P\| Q) = \sup_{h >0} \left[\frac{1}{\alpha(1-\alpha)} \left( 1-\frac{E_P[h^{\alpha-1}]^\alpha}{E_Q[h^\alpha]^{\alpha-1}} \right) \right].

Setting \alpha=\frac 1 2 and performing another substitution by g=\sqrt h yields two variational representations of the squared Hellinger distance (here in the convention H^2(P\|Q) = \int \left(\sqrt{dP}-\sqrt{dQ}\right)^2, i.e. without the factor \tfrac 1 2 used in the table below):

H^2(P\|Q) = \frac 1 2 D_{1/2}(P\| Q) = 2 - \inf_{h>0}\left( E_Q\left[h(X)\right] + E_P\left[h(X)^{-1}\right] \right),

H^2(P\|Q) = 2 \sup_{h > 0} \left(1-\sqrt{E_P[h^{-1}]E_Q[h]}\right).
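A brief numerical check of the second representation (a sketch; H^2 is taken here in the convention \int(\sqrt{dP}-\sqrt{dQ})^2 of this subsection, and h = \sqrt{dP/dQ} is the standard optimal choice in the discrete case):

<syntaxhighlight lang="python">
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

h2 = np.sum((np.sqrt(p) - np.sqrt(q))**2)   # squared Hellinger distance, no 1/2 factor

def bound(h):
    """2 * (1 - sqrt(E_P[1/h] * E_Q[h])): a lower bound on H^2 for any h > 0."""
    return 2.0 * (1.0 - np.sqrt(np.sum(p / h) * np.sum(q * h)))

print(h2, bound(np.sqrt(p / q)))            # h = sqrt(dP/dQ) attains H^2
print(bound(np.ones(3)))                    # a sub-optimal h gives a smaller value
</syntaxhighlight>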

Applying this theorem to the KL-divergence, defined by f(x) = x\ln x, f^*(y) = e^{y-1}, yields

D_{KL}(P; Q) =\sup_g E_P[g(X)] - e^{-1}E_Q[e^{g(X)}].

For each fixed g, this provides a weaker lower bound on the KL-divergence than the Donsker–Varadhan representation

D_{KL}(P; Q) = \sup_g E_P[g(X)]- \ln E_Q[e^{g(X)}].

This defect is fixed by the next theorem.

= Improved variational representation =

Assume the setup described at the beginning of the section on the basic variational representation.

{{Math theorem

| name = Theorem

| math_statement = Suppose that f(x) = +\infty for x<0 (redefining f if necessary). Then

D_{f}(P \| Q)=f^{\prime}(\infty) P\left[S^{c}\right]+\sup _{g} \mathbb{E}_{P}\left[g 1_{S}\right]-\Psi_{Q, P}^{*}(g),

where \Psi_{Q, P}^{*}(g) := \inf _{a \in \mathbb{R}} \mathbb{E}_{Q}\left[f^{*}(g(X)-a)\right]+a P[S] and S:=\{q > 0\}, with q a probability density function of Q with respect to some underlying measure.

In the special case of f^{\prime}(\infty)=+\infty, we have

D_{f}(P \| Q)=\sup _{g} \mathbb{E}_{P}[g]-\Psi_{Q}^{*}(g), \quad \Psi_{Q}^{*}(g) := \inf _{a \in \mathbb{R}} \mathbb{E}_{Q}\left[f^{*}(g(X)-a)\right]+a.

}}

This is Theorem 7.25 of Polyanskiy & Wu (2022).

== Example applications ==

Applying this theorem to KL-divergence yields the Donsker–Varadhan representation.
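For intuition, the sketch below compares the two lower bounds on D_{KL} at the same test function g for discrete distributions (g = \ln(dP/dQ) is the standard maximizer of the Donsker–Varadhan objective; the function names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))

def basic_bound(g):
    # E_P[g] - e^{-1} E_Q[e^g], from the basic variational representation
    return np.sum(p * g) - np.exp(-1.0) * np.sum(q * np.exp(g))

def dv_bound(g):
    # E_P[g] - ln E_Q[e^g], the Donsker-Varadhan representation
    return np.sum(p * g) - np.log(np.sum(q * np.exp(g)))

g = np.log(p / q)                        # optimal for Donsker-Varadhan
print(kl, dv_bound(g), basic_bound(g))   # DV attains KL; the basic bound at this g is smaller
print(basic_bound(g + 1.0))              # shifting g by 1 recovers KL for the basic bound too
</syntaxhighlight>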

Attempting to apply this theorem to the general \alpha-divergence with \alpha \in (-\infty, 0)\cup(0, 1) does not yield a closed-form solution.

Common examples of ''f''-divergences

The following table lists many of the common divergences between probability distributions and the possible generating functions to which they correspond. Notably, except for total variation distance, all others are special cases of \alpha-divergence, or linear sums of \alpha-divergences.

For each f-divergence D_f, its generating function is not uniquely defined, but only up to c\cdot(t-1), where c is any real constant. That is, for any f that generates an f-divergence, we have D_{f(t)} = D_{f(t) + c\cdot(t-1)}. This freedom is not only convenient, but actually necessary.

class="wikitable"
Divergence

! Corresponding f(t)

! Discrete Form

\chi^{\alpha}-divergence, \alpha \ge 1 \,

| \frac12 |t - 1|^{\alpha} \,

| \frac12 \sum_i \left|\frac{p_i - q_i}{q_i}\right|^\alpha q_i \,

Total variation distance ( \alpha = 1 \,)

| \frac12|t - 1| \,

| \frac12 \sum_i |p_i - q_i| \,

α-divergence

| \begin{cases}

\frac{t^{\alpha} - \alpha t - \left( 1 - \alpha \right)}{\alpha \left(\alpha - 1 \right)} & \text{if}\ \alpha\neq 0,\, \alpha\neq 1, \\

t\ln t-t+1, & \text{if}\ \alpha=1, \\

- \ln t +t-1, & \text{if}\ \alpha=0

\end{cases}

KL-divergence (\alpha=1)

| t \ln t

| \sum_i p_i \ln \frac{p_i}{q_i}

reverse KL-divergence (\alpha=0)

| - \ln t

| \sum_i q_i \ln \frac{q_i}{p_i}

Jensen–Shannon divergence

| \frac{1}{2} \left(t \ln t -(t + 1)\ln \left(\frac{t + 1}{2}\right)\right)

| \frac{1}{2} \sum_i \left( p_i \ln \frac{p_i}{(p_i + q_i)/2} + q_i \ln \frac{q_i}{(p_i + q_i)/2} \right)

Jeffreys divergence (KL + reverse KL)

| (t - 1)\ln(t)

| \sum_i (p_i - q_i) \ln \frac{p_i}{q_i}

squared Hellinger distance (\alpha=\frac 1 2)

| \frac{1}{2}(\sqrt{t} - 1)^2,\,1-\sqrt{t}

| \frac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2; \; 1 - \sum_i \sqrt{p_i q_i}

Neyman \chi^2-divergence

| (t - 1)^2

| \sum_i \frac{(p_i - q_i)^2}{q_i}

Pearson \chi^2-divergence

| \frac{(t-1)^2}{t}

| \sum_i \frac{(p_i - q_i)^2}{p_i}

[[File:Alpha-divergence.svg|thumb]]

Let f_\alpha be the generator of \alpha-divergence, then f_\alpha and f_{1-\alpha} are convex inversions of each other, so D_{\alpha}(P\| Q) = D_{1-\alpha}(Q\| P) . In particular, this shows that the squared Hellinger distance and Jensen-Shannon divergence are symmetric.
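This relation is easy to check numerically; the following sketch uses the discrete form of the generator f_\alpha given above, with an arbitrary \alpha and arbitrary distributions:

<syntaxhighlight lang="python">
import numpy as np

def alpha_div(p, q, a):
    """D_alpha(P || Q) with f_alpha(t) = (t**a - a*t - (1 - a)) / (a*(a - 1)), a != 0, 1."""
    t = p / q
    return float(np.sum(q * (t**a - a * t - (1.0 - a)) / (a * (a - 1.0))))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
a = 0.3

print(alpha_div(p, q, a), alpha_div(q, p, 1.0 - a))   # D_alpha(P||Q) = D_{1-alpha}(Q||P)
assert np.isclose(alpha_div(p, q, a), alpha_div(q, p, 1.0 - a))
</syntaxhighlight>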

In the literature, the \alpha-divergences are sometimes parametrized as

\begin{cases} \frac{4}{1-\alpha^2}\big(1 - t^{(1+\alpha)/2}\big), & \text{if}\ \alpha\neq\pm1, \\ t \ln t, & \text{if}\ \alpha=1, \\ - \ln t, & \text{if}\ \alpha=-1 \end{cases}

which is equivalent to the parametrization in this page by substituting \alpha \leftarrow \frac{\alpha+1}{2}.

Relations to other statistical divergences

Here, we compare f-divergences with other statistical divergences.

= Rényi divergence =

The Rényi divergences are a family of divergences defined by

R_{\alpha} (P \| Q) = \frac{1}{\alpha-1}\log\Bigg( E_Q\left[\left(\frac{dP}{dQ}\right)^\alpha\right] \Bigg)

when \alpha \in (0, 1)\cup (1, +\infty). The definition is extended to the cases \alpha =0, 1, +\infty by taking the limit.

Simple algebra shows that R_\alpha(P\| Q) = \frac{1}{\alpha - 1}\ln (1+\alpha(\alpha-1)D_\alpha(P\|Q)), where D_\alpha is the \alpha-divergence defined above.
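A quick numerical verification of this identity (a sketch reusing the discrete \alpha-divergence formula; \alpha = 0.7 and the distributions are arbitrary choices):

<syntaxhighlight lang="python">
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
a = 0.7

renyi = np.log(np.sum(q * (p / q)**a)) / (a - 1.0)               # Renyi divergence R_alpha
d_alpha = np.sum(q * ((p / q)**a - a * (p / q) - (1 - a))) / (a * (a - 1.0))

print(renyi, np.log(1.0 + a * (a - 1.0) * d_alpha) / (a - 1.0))  # identical
assert np.isclose(renyi, np.log(1.0 + a * (a - 1.0) * d_alpha) / (a - 1.0))
</syntaxhighlight>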

= Bregman divergence =

The only f-divergence that is also a Bregman divergence is the KL divergence.{{Cite journal |last1=Jiao |first1=Jiantao |last2=Courtade |first2=Thomas |last3=No |first3=Albert |last4=Venkat |first4=Kartik |last5=Weissman |first5=Tsachy |date=December 2014 |title=Information Measures: the Curious Case of the Binary Alphabet |journal=IEEE Transactions on Information Theory |volume=60 |issue=12 |pages=7616–7626 |doi=10.1109/TIT.2014.2360184 |issn=0018-9448|arxiv=1404.6810 |s2cid=13108908 }}

= Integral probability metrics =

The only f-divergence that is also an integral probability metric is the total variation.{{cite arXiv |eprint=0901.2698 |last1=Sriperumbudur |first1=Bharath K. |last2=Fukumizu |first2=Kenji |last3=Gretton |first3=Arthur |last4=Schölkopf |first4=Bernhard |author-link4=Bernhard Schölkopf |last5=Lanckriet |first5=Gert R. G. |title=On integral probability metrics, φ-divergences and binary classification |year=2009 |class=cs.IT }}

Financial interpretation

A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players the expected profit rate has the same general form as the ƒ-divergence.{{cite journal |last1= Soklakov|first1= Andrei N.| year= 2020|title= Economics of Disagreement—Financial Intuition for the Rényi Divergence|journal= Entropy|volume= 22|issue=8|page= 860|doi=10.3390/e22080860|pmid= 33286632|pmc= 7517462|arxiv= 1811.08308|bibcode= 2020Entrp..22..860S|doi-access= free}}


References

{{Reflist}}

{{refbegin}}

  • {{cite journal | first = I. | last = Csiszár | year = 1963 | title = Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten | journal = Magyar. Tud. Akad. Mat. Kutato Int. Kozl | volume = 8 | pages = 85–108 }}
  • {{cite journal | doi = 10.1143/JPSJ.18.328 | first = T. | last = Morimoto | year = 1963 | title = Markov processes and the H-theorem | journal = J. Phys. Soc. Jpn. | volume = 18 | issue = 3 | pages = 328–331 | bibcode = 1963JPSJ...18..328M }}
  • {{cite journal | first1 = S. M. | last1 = Ali | first2 = S. D. | last2 = Silvey | year = 1966 | title = A general class of coefficients of divergence of one distribution from another | journal = Journal of the Royal Statistical Society, Series B | volume = 28 | issue = 1 | pages = 131–142 | jstor = 2984279 | mr = 0196777 }}
  • {{cite journal | first = I. | last = Csiszár | year = 1967 | title = Information-type measures of difference of probability distributions and indirect observation | journal = Studia Scientiarum Mathematicarum Hungarica | volume = 2 | pages = 229–318 | ref = CITEREFCsisz.C3.A1r1967 }}
  • {{cite journal | first1 = I. | last1 = Csiszár | author-link1 = Imre Csiszár | first2 = P. | last2 = Shields | year = 2004 | title = Information Theory and Statistics: A Tutorial | journal = Foundations and Trends in Communications and Information Theory | volume = 1 | issue = 4 | pages = 417–528 | doi = 10.1561/0100000004 | url = http://www.renyi.hu/~csiszar/Publications/Information_Theory_and_Statistics:_A_Tutorial.pdf | accessdate = 2009-04-08 }}
  • {{cite journal | first1 = F. | last1 = Liese | first2 = I. | last2 = Vajda | year = 2006 | title = On divergences and informations in statistics and information theory | journal = IEEE Transactions on Information Theory | volume = 52 | issue = 10 | pages = 4394–4412 | doi = 10.1109/TIT.2006.881731 | s2cid = 2720215 }}
  • {{cite journal | first1 = F. | last1 = Nielsen | first2 = R. | last2 = Nock | year = 2013 | title = On the Chi square and higher-order Chi distances for approximating f-divergences | journal = IEEE Signal Processing Letters | volume = 21 | issue = 1 | pages = 10–13 | doi = 10.1109/LSP.2013.2288355 | arxiv = 1309.3029 | bibcode = 2014ISPL...21...10N | s2cid = 4152365 }}
  • {{cite arXiv | first1 = J-F. | last1 = Coeurjolly | first2 = R. | last2 = Drouilhet | year = 2006 | title = Normalized information-based divergences | eprint = math/0604246 | ref = arXiv:math/0604246 }}

{{refend}}