Sanov's theorem

{{Short description|Mathematical theorem}}


In mathematics and information theory, '''Sanov's theorem''' gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution. In the language of large deviations theory, Sanov's theorem identifies the rate function for large deviations of the empirical measure of a sequence of i.i.d. random variables.

Let A be a set of probability distributions over an alphabet X, and let q be an arbitrary distribution over X (where q may or may not be in A). Suppose we draw n i.i.d. samples from q, represented by the vector x^n = (x_1, x_2, \ldots, x_n). Then, we have the following bound on the probability that the empirical measure \hat{p}_{x^n} of the samples falls within the set A:

:q^n(\hat{p}_{x^n}\in A) \le (n+1)^{|X|} 2^{-nD_{\mathrm{KL}}(p^*\|q)},

where p^* = \arg\min_{p \in A} D_{\mathrm{KL}}(p\|q) is the information projection of q onto A, and

:D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}

is the Kullback–Leibler divergence.

In words, the probability of drawing an atypical distribution is bounded by a function of the KL divergence from the true distribution to the atypical one; in the case that we consider a set of possible atypical distributions, there is a dominant atypical distribution, given by the information projection.
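For concreteness, the bound can be checked numerically. The following sketch is illustrative only: the binary alphabet, the fair-coin distribution q, the threshold defining A = \{p : p(1) \ge 0.6\}, and the sample size n are assumptions made for the example, not part of the theorem.

<syntaxhighlight lang="python">
import math

def kl_div(p, q):
    """Kullback-Leibler divergence D(p||q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative setup: X = {0, 1}, q a fair coin, A = {p : p(1) >= 0.6}.
# The information projection of q onto A is p* = (0.4, 0.6),
# the distribution in A closest to q in KL divergence.
n = 1000
q = (0.5, 0.5)
p_star = (0.4, 0.6)
bound = (n + 1) ** 2 * 2 ** (-n * kl_div(p_star, q))  # (n+1)^{|X|} 2^{-n D(p*||q)}

# Exact value of q^n(p_hat in A): the probability that at least 60%
# of the n fair-coin flips come up 1, from the binomial distribution.
exact = sum(math.comb(n, k) for k in range(math.ceil(0.6 * n), n + 1)) / 2 ** n

print(f"exact q^n(p_hat in A) = {exact:.3e}")
print(f"Sanov upper bound     = {bound:.3e}")
</syntaxhighlight>

Because of the polynomial prefactor (n+1)^{|X|}, the bound is loose for moderate n, but it captures the correct exponential decay rate.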

Furthermore, if A is the closure of its interior, then

:\lim_{n\to\infty}\frac{1}{n}\log q^n(\hat{p}_{x^n}\in A) = -D_{\mathrm{KL}}(p^*\|q).
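Continuing the same illustrative coin example (again an assumption of the sketch, not part of the statement), this limit can be observed directly: A = \{p : p(1) \ge 0.6\} is the closure of its interior, and the normalized log-probability approaches -D_{\mathrm{KL}}(p^*\|q).

<syntaxhighlight lang="python">
import math

def kl_div(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Limit predicted by the theorem: -D(p*||q), with p* = (0.4, 0.6) and q fair.
target = -kl_div((0.4, 0.6), (0.5, 0.5))

for n in (100, 1000, 10000):
    # Exact q^n(p_hat in A) from the binomial distribution.
    prob = sum(math.comb(n, k) for k in range(math.ceil(0.6 * n), n + 1)) / 2 ** n
    print(f"n = {n:5d}: (1/n) log2 q^n(p_hat in A) = {math.log2(prob) / n:+.5f}"
          f"  (limit {target:+.5f})")
</syntaxhighlight>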

Technical statement

Define:

  • \Sigma is a finite set with |\Sigma| \geq 2, understood as the alphabet.
  • \Delta(\Sigma) is the simplex of probability distributions on \Sigma; it is a subset of \R^\Sigma.
  • L_n is a random variable taking values in \Delta(\Sigma): given n i.i.d. samples drawn from the distribution \mu, L_n is the vector of relative frequencies of the letters of \Sigma in the sample (the empirical measure).
  • \mathcal L_n is the set of values that L_n can take. In other words, it is

:\{(a_1/n, \dots, a_{|\Sigma|}/n) : \textstyle\sum_i a_i = n,\ a_i \in \N\}.

Then, Sanov's theorem states:{{Cite journal |last1=Dembo |first1=Amir |last2=Zeitouni |first2=Ofer |date=2010 |title=Large Deviations Techniques and Applications |url=https://link.springer.com/book/10.1007/978-3-642-03311-7 |journal=Stochastic Modelling and Applied Probability |volume=38 |language=en |pages=16–17 |doi=10.1007/978-3-642-03311-7 |isbn=978-3-642-03310-0 |issn=0172-4568|url-access=subscription }}

  • For every measurable subset S \subseteq \Delta(\Sigma), -\inf_{\nu \in int(S)} D(\nu\|\mu) \leq \liminf_n \frac 1n \ln P_\mu(L_n \in S) \leq \limsup_n \frac 1n \ln P_\mu(L_n \in S) \leq -\inf_{\nu \in cl(S)} D(\nu\|\mu)
  • For every open subset U \subseteq \Delta(\Sigma), -\lim_n \inf_{\nu \in U \cap \mathcal L_n} D(\nu\|\mu) = \lim_n \frac 1n \ln P_\mu(L_n \in U) = -\inf_{\nu \in U} D(\nu\|\mu)

Here, int(S) means the interior, and cl(S) means the closure.
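Since \mathcal L_n is finite, the statement can be checked by brute-force enumeration of types. The sketch below is illustrative only: the three-letter alphabet, the distribution \mu, and the set S = \{\nu : \nu_1 \le 0.3\} are assumptions chosen for the example. It sums the exact multinomial probabilities of the types lying in S and compares \frac 1n \ln P_\mu(L_n \in S) with -\inf_{\nu \in S \cap \mathcal L_n} D(\nu\|\mu).

<syntaxhighlight lang="python">
import math

def kl_nats(p, q):
    """D(p||q) in nats, with the convention 0 * log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu = (0.5, 0.3, 0.2)          # sampling distribution on a 3-letter alphabet

def in_S(nu):                 # S = {nu : nu_1 <= 0.3}, a closed convex set
    return nu[0] <= 0.3

for n in (20, 100, 500):
    prob, inf_D = 0.0, math.inf
    # Enumerate the lattice L_n = {(a1/n, a2/n, a3/n) : a1 + a2 + a3 = n}.
    for a1 in range(n + 1):
        for a2 in range(n - a1 + 1):
            a3 = n - a1 - a2
            nu = (a1 / n, a2 / n, a3 / n)
            if not in_S(nu):
                continue
            # Smallest divergence among lattice types in S; as n grows this
            # approaches the infimum of D(nu||mu) over S.
            inf_D = min(inf_D, kl_nats(nu, mu))
            # Exact multinomial probability that L_n equals this type.
            log_p = (math.lgamma(n + 1)
                     - sum(math.lgamma(a + 1) for a in (a1, a2, a3))
                     + sum(a * math.log(m) for a, m in zip((a1, a2, a3), mu)))
            prob += math.exp(log_p)
    print(f"n = {n:4d}: (1/n) ln P(L_n in S) = {math.log(prob) / n:+.5f},"
          f"  -inf D(nu||mu) = {-inf_D:+.5f}")
</syntaxhighlight>

As n grows, the gap between the two printed quantities shrinks at rate O((log n)/n), reflecting the polynomially many types at each n.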

References

{{Reflist}}

  • {{Cite book |last1=Cover |first1=Thomas M. |last2=Thomas |first2=Joy A. |title=Elements of Information Theory |url=https://archive.org/details/elementsinformat00cove_577 |url-access=limited |publisher=Wiley Interscience |edition=2 |date=2006 |location=Hoboken, New Jersey |pages=[https://archive.org/details/elementsinformat00cove_577/page/n387 362] |isbn=9780471241959}}

  • Sanov, I. N. (1957) "On the probability of large deviations of random variables". Mat. Sbornik 42(84), No. 1, 11–44.
  • Санов, И. Н. (1957) "О вероятности больших отклонений случайных величин". ''Математический сборник'' 42(84), No. 1, 11–44. (Russian original of the preceding entry.)

Category:Information theory

Category:Probabilistic inequalities

{{probability-stub}}