Total variation distance of probability measures

In probability theory, the total variation distance is a statistical distance between probability distributions, and is sometimes called the statistical distance, statistical difference or variational distance.

Definition

Consider a measurable space $(\Omega, \mathcal{F})$ and probability measures $P$ and $Q$ defined on $(\Omega, \mathcal{F})$ .

The total variation distance between $P$ and $Q$ is defined as{{cite web|last=Chatterjee |first=Sourav |authorlink=Sourav Chatterjee |title=Distances between probability measures |url=http://www.stat.berkeley.edu/~sourav/Lecture2.pdf |publisher=UC Berkeley |access-date=21 June 2013 |url-status=dead |archive-url=https://web.archive.org/web/20080708205758/http://www.stat.berkeley.edu/%7Esourav/Lecture2.pdf |archive-date=July 8, 2008 }}

: $\delta(P,Q)=\sup_{ A\in \mathcal{F}}\left|P(A)-Q(A)\right|.$

This is the largest absolute difference between the probabilities that the two probability distributions assign to the same event.
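On a finite sample space every subset is an event, so the supremum can be evaluated by direct enumeration. The following is a minimal sketch under that assumption; the function name `tv_distance_bruteforce` and the example distributions are illustrative only, and the loop is exponential in the number of outcomes, so it is meant for checking small cases rather than for practical use.

```python
from itertools import chain, combinations


def tv_distance_bruteforce(p, q):
    """Return (sup_A |P(A) - Q(A)|, a maximizing event) on a finite sample space.

    p and q are dicts mapping outcomes to probabilities (illustrative format).
    """
    outcomes = sorted(set(p) | set(q))
    # Enumerate every subset of the sample space, including the empty set.
    events = chain.from_iterable(
        combinations(outcomes, r) for r in range(len(outcomes) + 1)
    )
    best_gap, best_event = 0.0, ()
    for event in events:
        prob_p = sum(p.get(x, 0.0) for x in event)
        prob_q = sum(q.get(x, 0.0) for x in event)
        gap = abs(prob_p - prob_q)
        if gap > best_gap:
            best_gap, best_event = gap, event
    return best_gap, best_event


# Hypothetical example distributions on three outcomes.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
gap, event = tv_distance_bruteforce(p, q)
print(gap, event)
```

The maximizing event is generally not unique: an event and its complement always give the same absolute difference.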

Properties

The total variation distance is an f-divergence and an integral probability metric.

= Relation to other distances =

The total variation distance is related to the Kullback–Leibler divergence by Pinsker’s inequality:

: $\delta(P,Q) \le \sqrt{\frac{1}{2} D_{\mathrm{KL}}(P\parallel Q)}.$

One also has the following inequality, due to Bretagnolle and Huber (Bretagnolle, J.; Huber, C., Estimation des densités: risque minimax, Séminaire de Probabilités, XII (Univ. Strasbourg, Strasbourg, 1976/1977), pp. 342–363, Lecture Notes in Math., 649, Springer, Berlin, 1978, Lemma 2.1 (French); see also Tsybakov, Alexandre B., Introduction to nonparametric estimation, Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer Series in Statistics. Springer, New York, 2009. xii+214 pp. {{ISBN|978-0-387-79051-0}}, Equation 2.25), which has the advantage of providing a non-vacuous bound even when $D_{\mathrm{KL}}(P\parallel Q)>2$, where the right-hand side of Pinsker's inequality exceeds 1:

: $\delta(P,Q) \le \sqrt{1-e^{ -D_{\mathrm{KL}}(P\parallel Q) }}.$
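The two bounds can be compared numerically on a small discrete example. This is a sketch rather than a reference implementation: the helper names are hypothetical, the Kullback–Leibler divergence is computed in nats (natural logarithm), and the code assumes $Q(x) > 0$ wherever $P(x) > 0$.

```python
import math


def total_variation(p, q):
    """Half the sum of absolute differences of two pmfs given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 whenever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def pinsker_bound(kl):
    return math.sqrt(kl / 2)


def bretagnolle_huber_bound(kl):
    return math.sqrt(1 - math.exp(-kl))


# Two nearly disjoint distributions, chosen so that the divergence exceeds 2.
p = [0.99, 0.01]
q = [0.01, 0.99]

kl = kl_divergence(p, q)
tv = total_variation(p, q)
print("TV:", tv)
print("Pinsker bound:", pinsker_bound(kl))
print("Bretagnolle-Huber bound:", bretagnolle_huber_bound(kl))
```

In this example the divergence is larger than 2, so the Pinsker bound is greater than 1 and carries no information, while the Bretagnolle–Huber bound stays below 1 and above the actual distance.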

The total variation distance is half of the L¹ distance between the probability functions. On discrete domains, this is half the distance between the probability mass functions (David A. Levin, Yuval Peres, Elizabeth L. Wilmer, Markov Chains and Mixing Times, 2nd. rev. ed. (AMS, 2017), Proposition 4.2, p. 48):

: $\delta(P, Q) = \frac12 \sum_{x} |P(x) - Q(x)|,$

and when the distributions have standard probability density functions $p$ and $q$,{{cite book |last1=Tsybakov |first1=Aleksandr B. |title=Introduction to nonparametric estimation |date=2009 |publisher=Springer |location=New York, NY |isbn=978-0-387-79051-0 |edition=rev. and extended version of the French Book |at=Lemma 2.1}}

: $\delta(P, Q) = \frac12 \int | p(x) - q(x) | \, \mathrm{d}x$

(or the analogous distance between Radon–Nikodym derivatives with any common dominating measure). This result can be shown by noticing that the supremum in the definition is achieved exactly at the set where one distribution dominates the other.{{Cite book |last=Devroye |first=Luc |authorlink=Luc Devroye |url=https://www.amazon.com/Probabilistic-Recognition-Stochastic-Modelling-Probability/dp/0387946187 |title=A Probabilistic Theory of Pattern Recognition |last2=Györfi |first2=Laszlo |last3=Lugosi |first3=Gabor |date=1996-04-04 |publisher=Springer |isbn=978-0-387-94618-4 |edition=Corrected |location=New York |language=en}}
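One way to spell out this argument, sketched here for densities $p$ and $q$ with respect to Lebesgue measure (the symbol $A^*$ is notation introduced only for this sketch): let $A^* = \{x : p(x) > q(x)\}$. Since both densities integrate to 1, the positive and negative parts of $p - q$ carry equal mass, so

: $\int_{A^*} (p - q) \, \mathrm{d}x = \int_{(A^*)^c} (q - p) \, \mathrm{d}x = \frac12 \int |p(x) - q(x)| \, \mathrm{d}x.$

For any event $A$, discarding the part of $A$ where $p - q \le 0$ and then enlarging to all of $A^*$ can only increase the integral:

: $P(A) - Q(A) = \int_A (p - q) \, \mathrm{d}x \le \int_{A \cap A^*} (p - q) \, \mathrm{d}x \le \int_{A^*} (p - q) \, \mathrm{d}x.$

The same reasoning with the roles of $p$ and $q$ exchanged gives

: $Q(A) - P(A) \le \int_{(A^*)^c} (q - p) \, \mathrm{d}x.$

Hence $|P(A) - Q(A)| \le \frac12 \int |p - q| \, \mathrm{d}x$ for every $A$, with equality at $A = A^*$.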

The total variation distance is related to the Hellinger distance $H(P,Q)$ as follows:{{cite web |url=https://www.tcs.tifr.res.in/~prahladh/teaching/2011-12/comm/lectures/l12.pdf |title=Lecture notes on communication complexity |date=September 23, 2011 |first=Prahladh |last=Harsha }}

: $H^2(P,Q) \leq \delta(P,Q) \leq \sqrt 2 H(P,Q).$

These inequalities follow immediately from the inequalities between the 1-norm and the 2-norm.
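A quick numerical check of the two-sided bound on a discrete example is sketched below. It assumes the normalization $H^2(P,Q) = \frac12 \sum_x \big(\sqrt{P(x)} - \sqrt{Q(x)}\big)^2$, under which $0 \le H \le 1$; with a different normalization of the Hellinger distance the constants in the inequality change. The function names and distributions are illustrative.

```python
import math


def total_variation(p, q):
    """Half the sum of absolute differences of two pmfs given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger(p, q):
    """Hellinger distance, assuming H^2 = 1/2 * sum (sqrt(p) - sqrt(q))^2."""
    squared = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared)


# Hypothetical example distributions on three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

tv = total_variation(p, q)
h = hellinger(p, q)
print("H^2:", h ** 2)
print("TV:", tv)
print("sqrt(2) * H:", math.sqrt(2) * h)
```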

= Connection to transportation theory =

The total variation distance (or half the $L^1$ norm of $P - Q$) arises as the optimal transportation cost when the cost function is $c(x,y) = {\mathbf{1}}_{x \neq y}$ , that is,

: $\frac{1}{2} \| P - Q \|_1 = \delta(P,Q) = \inf\big\{ \mathbb{P}(X\neq Y ) : \text{Law}(X) = P , \text{Law}(Y) = Q\big\} = \inf_\pi \operatorname{E}_{\pi}[{\mathbf{1}}_{x\neq y}],$

where the expectation is taken with respect to the probability measure $\pi$ on the space where $(x,y)$ lives, and the infimum is taken over all such $\pi$ with marginals $P$ and $Q$ , respectively.{{Cite book|title=Optimal Transport, Old and New|volume = 338|last=Villani|first=Cédric|authorlink=Cédric Villani|publisher=Springer-Verlag Berlin Heidelberg|year=2009|isbn=978-3-540-71049-3|pages=10|language=en|doi=10.1007/978-3-540-71050-9|series = Grundlehren der mathematischen Wissenschaften|url = https://cds.cern.ch/record/1621563}}
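For discrete distributions the infimum is attained by a coupling that keeps $X = Y$ on as much mass as possible. The sketch below builds one such coupling: it places $\min(P(x), Q(x))$ on the diagonal and spreads the remaining mass as a product of the two excess parts. This is one standard construction, not necessarily the one used in the cited reference; the function name and data format are hypothetical.

```python
def maximal_coupling(p, q):
    """Return (pi, delta): a joint pmf pi[(x, y)] with marginals p and q,
    and delta = P(X != Y) under pi, which equals the total variation distance.

    p and q are dicts mapping outcomes to probabilities (illustrative format).
    """
    support = sorted(set(p) | set(q))
    # Mass on which both distributions agree stays on the diagonal.
    overlap = {x: min(p.get(x, 0.0), q.get(x, 0.0)) for x in support}
    excess_p = {x: p.get(x, 0.0) - overlap[x] for x in support}
    excess_q = {x: q.get(x, 0.0) - overlap[x] for x in support}
    delta = sum(excess_p.values())

    pi = {(x, x): overlap[x] for x in support if overlap[x] > 0}
    if delta > 0:
        # The excess parts have disjoint supports, so this mass is off-diagonal.
        for x in support:
            for y in support:
                mass = excess_p[x] * excess_q[y] / delta
                if mass > 0:
                    pi[(x, y)] = pi.get((x, y), 0.0) + mass
    return pi, delta


# Hypothetical example distributions on three outcomes.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
pi, delta = maximal_coupling(p, q)
mismatch = sum(m for (x, y), m in pi.items() if x != y)
print(delta, mismatch)
```

Under this coupling the probability of the event $X \neq Y$ is exactly the off-diagonal mass, which matches half the L¹ distance between the two probability mass functions.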
