Kullback's inequality
In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function.{{cite journal |first1=Aimé |last1=Fuchs |first2=Giorgio |last2=Letta |title=L'inégalité de Kullback. Application à la théorie de l'estimation |journal=Séminaire de Probabilités de Strasbourg |series=Séminaire de probabilités |location=Strasbourg |volume=4 |pages=108–131 |year=1970 |url=http://www.numdam.org/item?id=SPS_1970__4__108_0 }} If P and Q are probability distributions on the real line, such that P is absolutely continuous with respect to Q, i.e. P << Q, and whose first moments exist, then
D_{KL}(P \parallel Q) \ge \Psi_Q^*(\mu'_1(P)),
where \Psi_Q^* is the rate function, i.e. the convex conjugate of the cumulant-generating function, of Q, and \mu'_1(P) is the first moment of P.
The Cramér–Rao bound is a corollary of this result.
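As a simple illustration (a standard computation, not part of the statement above), take Q to be the standard normal distribution and P normal with mean μ and unit variance. Then
\Psi_Q(\theta) = \tfrac{1}{2}\theta^2, \qquad \Psi_Q^*(\mu) = \sup_\theta\left\{\mu\theta - \tfrac{1}{2}\theta^2\right\} = \tfrac{1}{2}\mu^2,
while D_{KL}(P \parallel Q) = \tfrac{1}{2}\mu^2, so Kullback's inequality holds with equality in this case.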
Proof
Let P and Q be probability distributions (measures) on the real line, whose first moments exist, and such that P << Q. Consider the natural exponential family of Q given by
Q_\theta(A) = \frac{\int_A e^{\theta x}Q(dx)}{\int_{-\infty}^\infty e^{\theta x}Q(dx)} = \frac{1}{M_Q(\theta)} \int_A e^{\theta x}Q(dx)
for every measurable set A, where M_Q is the moment-generating function of Q. (Note that Q_0 = Q.)
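For instance, if Q is the exponential distribution with unit rate, then M_Q(\theta) = \frac{1}{1-\theta} for \theta < 1, and the tilted measure Q_\theta has density
\frac{e^{\theta x}e^{-x}}{M_Q(\theta)} = (1-\theta)e^{-(1-\theta)x}, \qquad x > 0,
so Q_\theta is again exponential, with rate 1 − θ.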
Then
D_{KL}(P\parallel Q) = D_{KL}(P\parallel Q_\theta) + \int_{\operatorname{supp}P}\left(\log\frac{\mathrm dQ_\theta}{\mathrm dQ}\right)\mathrm dP.
By Gibbs' inequality we have D_{KL}(P\parallel Q_\theta) \ge 0, so that
D_{KL}(P\parallel Q) \ge \int_{\operatorname{supp}P}\left(\log\frac{\mathrm dQ_\theta}{\mathrm dQ}\right)\mathrm dP
= \int_{\operatorname{supp}P}\left(\log\frac{e^{\theta x}}{M_Q(\theta)}\right) P(dx)
Simplifying the right side, we have, for every real θ for which M_Q(\theta) is finite:
D_{KL}(P\parallel Q) \ge \mu'_1(P)\theta - \Psi_Q(\theta),
where \mu'_1(P) is the first moment, or mean, of P, and \Psi_Q = \log M_Q is called the cumulant-generating function. Taking the supremum completes the process of convex conjugation and yields the rate function:
D_{KL}(P\parallel Q) \ge \sup_\theta\left\{\mu'_1(P)\theta - \Psi_Q(\theta)\right\} = \Psi_Q^*(\mu'_1(P)).
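If Q is the unit-rate exponential distribution, for example, then \Psi_Q(\theta) = -\log(1-\theta) for \theta < 1, and the supremum in the convex conjugate is attained at \theta = 1 - 1/\mu, giving
\Psi_Q^*(\mu) = \mu - 1 - \log\mu, \qquad \mu > 0,
so any distribution P on the positive half-line with mean \mu'_1(P) = \mu and P \ll Q satisfies D_{KL}(P\parallel Q) \ge \mu - 1 - \log\mu.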
Corollary: the Cramér–Rao bound
{{main|Cramér–Rao bound}}
=Start with Kullback's inequality=
Let X_θ be a family of probability distributions on the real line indexed by the real parameter θ, and satisfying certain regularity conditions. Then
\lim_{h\to 0} \frac {D_{KL}(X_{\theta+h}\parallel X_\theta)}{h^2} \ge \lim_{h\to 0} \frac {\Psi^*_\theta (\mu_{\theta+h})}{h^2},
where \Psi^*_\theta is the convex conjugate of the cumulant-generating function of X_\theta, and \mu_{\theta+h} is the first moment of X_{\theta+h}.
=Left side=
The left side of this inequality can be simplified as follows:
\begin{align}
\lim_{h\to 0} \frac {D_{KL}(X_{\theta+h}\parallel X_\theta)} {h^2} &=\lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \log \left( \frac{\mathrm dX_{\theta+h}}{\mathrm dX_\theta} \right) \mathrm dX_{\theta+h} \\
&=-\lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \log \left( \frac{\mathrm dX_{\theta}}{\mathrm dX_{\theta+h}} \right) \mathrm dX_{\theta+h} \\
&=-\lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \log\left( 1- \left (1-\frac{\mathrm dX_{\theta}}{\mathrm dX_{\theta+h}} \right ) \right) \mathrm dX_{\theta+h} \\
&= \lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[ \left( 1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right) +\frac 1 2 \left( 1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right) ^ 2
+ o \left( \left( 1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right) ^ 2 \right) \right]\mathrm dX_{\theta+h} && \text{Taylor series for } \log(1-t) \\
&= \lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[ \frac 1 2 \left( 1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right)^2 \right]\mathrm dX_{\theta+h} \\
&= \lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[ \frac 1 2 \left( \frac{\mathrm dX_{\theta+h} - \mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right)^2 \right]\mathrm dX_{\theta+h} \\
&= \frac 1 2 \mathcal I_X(\theta)
\end{align}
which is half the Fisher information of the parameter θ.
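For the Gaussian location family X_\theta = N(\theta, \sigma^2), for example,
D_{KL}(X_{\theta+h}\parallel X_\theta) = \frac{h^2}{2\sigma^2},
so the limit above equals \frac{1}{2\sigma^2}, which is indeed half the Fisher information \mathcal I_X(\theta) = 1/\sigma^2 of a location parameter.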
=Right side=
The right side of the inequality can be developed as follows:
\lim_{h\to 0} \frac {\Psi^*_\theta (\mu_{\theta+h})}{h^2}
= \lim_{h\to 0} \frac 1 {h^2} {\sup_t \{\mu_{\theta+h}t - \Psi_\theta(t)\} }.
This supremum is attained at a value of t = τ where the first derivative of the cumulant-generating function is \Psi'_\theta(\tau) = \mu_{\theta+h}, but we have \Psi'_\theta(0) = \mu_\theta, so that
\Psi''_\theta(0) = \frac{d\mu_\theta}{d\theta}\lim_{h\to 0}\frac{h}{\tau}.
Moreover,
\lim_{h\to 0} \frac {\Psi^*_\theta (\mu_{\theta+h})}{h^2}
= \frac 1 {2\Psi''_\theta(0)}\left(\frac {d\mu_\theta}{d\theta}\right)^2
= \frac 1 {2\operatorname{Var}(X_\theta)}\left(\frac {d\mu_\theta}{d\theta}\right)^2.
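For the Gaussian location family X_\theta = N(\theta, \sigma^2), for example, \Psi_\theta(t) = \theta t + \tfrac{1}{2}\sigma^2 t^2, so \Psi^*_\theta(m) = \frac{(m-\theta)^2}{2\sigma^2} and \Psi^*_\theta(\mu_{\theta+h}) = \frac{h^2}{2\sigma^2}, in agreement with the expression above since \frac{d\mu_\theta}{d\theta} = 1 and \operatorname{Var}(X_\theta) = \sigma^2.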
=Putting both sides back together=
We have:
\lim_{h\to 0} \frac {D_{KL}(X_{\theta+h}\parallel X_\theta)} {h^2} \ge \frac 1 {2\operatorname{Var}(X_\theta)}\left(\frac {d\mu_\theta}{d\theta}\right)^2,
which can be rearranged as:
\operatorname{Var}(X_\theta) \ge \frac{\left(\frac {d\mu_\theta}{d\theta}\right)^2}{2\lim_{h\to 0} \frac {D_{KL}(X_{\theta+h}\parallel X_\theta)} {h^2}} = \frac{\left(\frac {d\mu_\theta}{d\theta}\right)^2}{\mathcal I_X(\theta)},
which is the Cramér–Rao bound.
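For the Gaussian location family X_\theta = N(\theta, \sigma^2), for instance, \operatorname{Var}(X_\theta) = \sigma^2, \frac{d\mu_\theta}{d\theta} = 1 and \mathcal I_X(\theta) = 1/\sigma^2, so the bound is attained with equality.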