Information projection

In information theory, the information projection or I-projection of a probability distribution q onto a set of distributions P is

:p^* = \underset{p \in P}{\arg\min} \operatorname{D}_{\mathrm{KL}}(p||q),

where D_{\mathrm{KL}} is the Kullback–Leibler divergence from q to p. When the Kullback–Leibler divergence is viewed as a measure of distance, the I-projection p^* is the "closest" distribution to q among all the distributions in P.
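When the alphabet is finite and P is specified by constraints, the I-projection can be computed with a generic constrained optimizer. The following is a minimal illustrative sketch (not a standard algorithm for this problem; the alphabet, the reference distribution q, and the moment constraint defining P are arbitrary choices, and SciPy's SLSQP solver is assumed to be available):

<syntaxhighlight lang="python">
# Minimal sketch: I-projection of a discrete distribution q onto the convex set
# P = {p : E_p[X] = 1} over the alphabet {0, 1, 2, 3}, via generic constrained
# optimization (SLSQP).  All numbers here are arbitrary illustrative choices.
import numpy as np
from scipy.optimize import minimize

x = np.arange(4)                        # finite alphabet
q = np.array([0.1, 0.2, 0.3, 0.4])      # reference distribution q (mean 2.0)
target_mean = 1.0                       # linear constraint defining P

def kl(p, r):
    """D_KL(p || r) for discrete distributions; 0*log(0/r) is taken as 0."""
    p = np.asarray(p)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / r[mask]))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},             # p sums to 1
    {"type": "eq", "fun": lambda p: np.dot(p, x) - target_mean},  # p lies in P
]
result = minimize(kl, x0=np.full(4, 0.25), args=(q,),
                  bounds=[(0.0, 1.0)] * 4, constraints=constraints,
                  method="SLSQP")
p_star = result.x                       # numerical I-projection of q onto P
print(p_star, kl(p_star, q))
</syntaxhighlight>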

The I-projection is useful in setting up information geometry, notably because of the following inequality, valid for all p \in P when P is convex:{{Cite book |last1 = Cover|first1 = Thomas M.|last2 = Thomas|first2 = Joy A.|title = Elements of Information Theory|publisher = Wiley Interscience|edition = 2|date = 2006|location = Hoboken, New Jersey|page = 367 (Theorem 11.6.1)}}

:\operatorname{D}_{\mathrm{KL}}(p||q) \geq \operatorname{D}_{\mathrm{KL}}(p||p^*) + \operatorname{D}_{\mathrm{KL}}(p^*||q).

This inequality can be interpreted as an information-geometric analogue of the Pythagorean theorem, with the KL divergence playing the role of squared Euclidean distance.
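The inequality can be checked numerically on a small example. The sketch below is illustrative only: the convex set P = \{p : E_p[X] \le 1\}, the reference distribution q, and the use of SciPy's generic solver are all arbitrary choices made for the demonstration.

<syntaxhighlight lang="python">
# Minimal sketch: numerical check of the Pythagorean inequality
#   D_KL(p||q) >= D_KL(p||p*) + D_KL(p*||q)   for p in the convex set P,
# here P = {p : E_p[X] <= 1} over the alphabet {0, 1, 2, 3}.  Illustrative only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.arange(4)
q = np.array([0.1, 0.2, 0.3, 0.4])

def kl(p, r):
    """D_KL(p || r) for discrete distributions; 0*log(0/r) is taken as 0."""
    p = np.asarray(p)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / r[mask]))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},        # p sums to 1
    {"type": "ineq", "fun": lambda p: 1.0 - np.dot(p, x)},   # E_p[X] <= 1
]
p_star = minimize(kl, x0=np.full(4, 0.25), args=(q,),
                  bounds=[(0.0, 1.0)] * 4, constraints=constraints,
                  method="SLSQP").x

# Draw random distributions, keep those in P, and record the gap
#   D(p||q) - D(p||p*) - D(p*||q).
gaps = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))
    if np.dot(p, x) <= 1.0:
        gaps.append(kl(p, q) - kl(p, p_star) - kl(p_star, q))
print(min(gaps))   # non-negative (up to solver tolerance), as the theorem predicts
</syntaxhighlight>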

Since \operatorname{D}_{\mathrm{KL}}(p||q) \geq 0 and is continuous in p, if P is closed and non-empty, then at least one minimizer of the optimization problem framed above exists. Furthermore, if P is convex, then the minimizer is unique.

The reverse I-projection, also known as the moment projection or M-projection, is

:p^* = \underset{p \in P}{\arg\min} \operatorname{D}_{\mathrm{KL}}(q||p).

Since the KL divergence is not symmetric in its arguments, the I-projection and the M-projection exhibit different behavior. The I-projection p(x) typically under-estimates the support of q(x) and locks onto one of its modes, because p(x) must equal 0 wherever q(x)=0 in order to keep the KL divergence finite (zero-forcing). The M-projection p(x) typically over-estimates the support of q(x), because p(x) must be positive wherever q(x) > 0 in order to keep the KL divergence finite (zero-avoiding).
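The contrast is easiest to see by projecting a bimodal distribution q onto the family of single Gaussians. The sketch below is a rough illustration under simplifying assumptions (densities discretized on a finite grid, a generic derivative-free optimizer, arbitrarily chosen mixture parameters): the I-projection collapses onto one mode with a small variance, while the M-projection matches the overall mean and variance of q and spreads over both modes.

<syntaxhighlight lang="python">
# Illustrative sketch: I-projection vs. M-projection of a bimodal mixture q onto
# the family P of single Gaussians, using a discretized grid and Nelder-Mead.
import numpy as np
from scipy.optimize import minimize

grid = np.linspace(-8, 8, 801)

def normal(mu, sigma):
    """Discretized, normalized Gaussian on the grid."""
    w = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    return w / w.sum()

# Bimodal target q: mixture of two well-separated Gaussians (arbitrary parameters).
q = 0.5 * normal(-3.0, 0.7) + 0.5 * normal(3.0, 0.7)

def kl(p, r):
    if np.any((p > 0) & (r == 0)):
        return np.inf                   # D_KL(p||r) is infinite outside supp(r)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / r[mask]))

def fit(objective):
    """Fit (mu, log sigma) of a single Gaussian by derivative-free minimization."""
    res = minimize(lambda th: objective(normal(th[0], np.exp(th[1]))),
                   x0=np.array([0.5, 0.3]), method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

mu_i, sigma_i = fit(lambda p: kl(p, q))   # I-projection: argmin_p D(p||q), mode-seeking
mu_m, sigma_m = fit(lambda p: kl(q, p))   # M-projection: argmin_p D(q||p), mass-covering
print(f"I-projection: mu={mu_i:.2f}, sigma={sigma_i:.2f}")   # near one mode, narrow
print(f"M-projection: mu={mu_m:.2f}, sigma={sigma_m:.2f}")   # near 0, wide
</syntaxhighlight>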

The reverse I-projection plays a fundamental role in the construction of optimal e-variables.

The concept of information projection can be extended to arbitrary f-divergences and other divergences.{{Cite journal |last1 = Nielsen|first1 = Frank | title = What is... an information projection?|journal =Notices of the American Mathematical Society |volume =65 | number =3|year = 2018|pages = 321–324|doi = 10.1090/noti1647 |url = https://www.ams.org/journals/notices/201803/rnoti-p321.pdf}}

References

{{Reflist}}

* K. Murphy, ''Machine Learning: A Probabilistic Perspective'', The MIT Press, 2012.

Category:Information theory

{{probability-stub}}