The information inequality is a lower bound on the variance of a statistic; it is closely related to the Cramér-Rao lower bound.

See also: Gibbs inequality, which is unrelated but also sometimes referred to as the “information inequality”.

One dimension

Theorem (Information inequality): Suppose $X \sim p( \cdot | \theta^*)$, and let $\gh$ be a statistic with finite expectation $g(\theta) = \Ex \gh$. Suppose $d/d\theta$ can be passed under the integral sign with respect to both $\int dP$ and $\int \gh \, dP$, and that $\mathscr{I}(\theta^*) > 0$. Then

\[\Var \gh \ge \frac{\dot{g}(\theta^*)^2}{\mathscr{I}(\theta^*)}.\]

Proof: Special case of the multivariate proof given below.

The information inequality pertains to a single observed $X$; for the case of repeated sampling from a distribution, see the Cramér-Rao lower bound. That distinction is perhaps slightly misleading, since $X$ could itself be a “single observation” from the $n$-dimensional joint distribution of all the observations; the information inequality still applies in that case, but is expressed more succinctly in the language of the CRLB.
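
As a concrete example where the bound is attained, take $X \sim \mathrm{N}(\theta^*, \sigma^2)$ with $\sigma^2$ known and $\gh = X$. Then $g(\theta) = \theta$, so $\dot{g} \equiv 1$, and $\mathscr{I}(\theta^*) = 1/\sigma^2$, giving

\[\Var \gh = \sigma^2 = \frac{\dot{g}(\theta^*)^2}{\mathscr{I}(\theta^*)},\]

i.e. the inequality holds with equality.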

Multiple dimensions

Theorem (Information inequality): Suppose $X \sim p(x | \bts)$, with $\fI(\bts)$ positive definite. Let $\bgh$ be an estimator with finite expected value $g(\bt) = \Ex \bgh$. If $\nabla_{\bt}^2 p(x|\bts)$ exists and can be passed under the integral sign with respect to $\int dP$, and $\nabla_{\bt}$ can be passed under the integral sign with respect to $\int \bgh \, dP$, then

\[\Var \bgh \gge \nabla g(\bts) \Tr \fI(\bts)^{-1} \nabla g(\bts)\]

(the symbol $\A \gge \B$ is defined here).
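
In the special case $k = d = 1$ the matrix quantities collapse to scalars: $\fI(\bts) = \mathscr{I}(\theta^*)$, $\nabla g(\bts) = \dot{g}(\theta^*)$, and $\gge$ between $1 \times 1$ matrices is just $\ge$, so the display above becomes

\[\Var \gh \ge \dot{g}(\theta^*) \, \mathscr{I}(\theta^*)^{-1} \, \dot{g}(\theta^*) = \frac{\dot{g}(\theta^*)^2}{\mathscr{I}(\theta^*)},\]

which is exactly the one-dimensional statement, justifying the “special case” remark in its proof.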

Proof: Let’s begin by labeling our assumptions for easy reference later:

  i. $\fI(\bts)$ is positive definite
  ii. $g(\bt) = \Ex \bgh$ exists
  iii. $\nabla^2_{\bt}$ passable with respect to $\int \, dP$
  iv. $\nabla_{\bt}$ passable with respect to $\int \bgh \, dP$
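
Two facts about the score $\u(\bt) = \nabla_\bt \log p(x|\bt)$ are used repeatedly below: it has mean zero, and its variance is the Fisher information. Both follow from iii. (read as also allowing the first-order gradient to be passed under $\int \, dP$):

\[\Ex \u = \int \frac{\nabla_\bt p}{p} \, dP = \int \nabla_\bt p \, dx = \nabla_\bt \int p \, dx = \nabla_\bt 1 = \zero, \qquad \Var \u = \Ex [\u \u \Tr] = \fI(\bt).\]

The second chain uses the first: once $\Ex \u = \zero$, the variance of the score is $\Ex[\u \u \Tr]$, which is (one standard characterization of) the Fisher information.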

Furthermore, let $k$ denote the dimension of $\bgh$ and $d$ denote the dimension of $\bt$. Now,

\[\begin{alignat*}{2} \tag*{$\tcirc{1}$}
\nabla g(\bt) &= \nabla_\bt \int \bgh \, dP &\hspace{4em}& \text{ii.} \\
&= \int \nabla_\bt \left( \bgh \, dP \right) && \text{iv.} \\
&= \int \nabla_\bt L(\bt) \, \bgh \Tr \, dx && \href{vector-calculus.html}{\text{Chain rule}}: \pf{}{\bt} \left( \bgh \, dP \right) = \pf{p(x)}{\bt} \pf{\bgh \, p(x)}{p(x)} \, dx \\
&= \int \u(\bt) \, \bgh \Tr \, dP && \u = \nabla \log p = \nabla L / p \\
&= \cov(\u, \bgh) && \Ex \u = \zero \text{ by iii.}
\end{alignat*}\]
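
As a quick dimension check on $\tcirc{1}$: with $\u$ a $d \times 1$ vector and $\bgh$ a $k \times 1$ vector,

\[\cov(\u, \bgh) = \Ex \left[ (\u - \Ex \u)(\bgh - \Ex \bgh) \Tr \right] = \Ex \left[ \u \, \bgh \Tr \right]\]

is a $d \times k$ matrix, which is the shape the derivation implicitly assigns to $\nabla g(\bt)$ (the matrix with entries $\partial g_j / \partial \theta_i$); the second equality uses $\Ex \u = \zero$ again.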

Now, let us consider the variance of $\bgh - \nabla g \Tr \fI^{-1} \u$; note that this quantity is only defined under assumption i., which guarantees that $\fI^{-1}$ exists:

\[\begin{alignat*}{2}
\Var(\bgh - \nabla g \Tr \fI^{-1} \u) &= \Var \bgh - \cov(\bgh, \nabla g \Tr \fI^{-1} \u) - \cov(\nabla g \Tr \fI^{-1} \u, \bgh) + \Var(\nabla g \Tr \fI^{-1} \u) &\hspace{2em}& \text{i.} \\
&= \Var \bgh - \cov(\bgh, \u) \fI^{-1} \nabla g - \nabla g \Tr \fI^{-1} \cov(\u, \bgh) + \nabla g \Tr \fI^{-1} \fI \fI^{-1} \nabla g && \text{iii.} \\
&= \Var \bgh - \nabla g \Tr \fI^{-1} \nabla g && \tcirc{1} \\
&\gge 0
\end{alignat*}\]

where the last line follows because the left-hand side is a variance and is therefore a positive semidefinite matrix. Rearranging gives $\Var \bgh \gge \nabla g(\bts) \Tr \fI(\bts)^{-1} \nabla g(\bts)$, as claimed. See here for additional justification of the variance/covariance manipulations.
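
As a concrete multivariate example where the bound is attained (the analogue of the normal-mean example above), take $X \sim \mathrm{N}(\bts, \Sigma)$ with $\Sigma$ known and $\bgh = X$. Then $g(\bt) = \bt$, so $\nabla g = I_d$, and $\fI(\bts) = \Sigma^{-1}$, giving

\[\nabla g(\bts) \Tr \fI(\bts)^{-1} \nabla g(\bts) = I_d \, \Sigma \, I_d = \Sigma = \Var \bgh,\]

so the inequality holds with equality.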