The following are the classical conditions ensuring that the likelihood is “regular”, meaning that, at least asymptotically, the likelihood resembles that of a normal distribution. Note: saying “the classical conditions” is perhaps misleading, as there is some flexibility in how these conditions are stated. Some additional commentary on this point is given below.
Core conditions
(A) IID: \(X_1, \ldots, X_n\) are iid with density \(p(x \vert \bt^*)\).
(B) Interior point: There exists an open set \(\bT^* \subset \bT \subset \real^d\) that contains \(\bt^*\).
(C) Smoothness: For all \(x\), \(p(x \vert \bt)\) is continuously differentiable with respect to \(\bt\) up to third order on \(\bT^*\), and satisfies the following conditions:
- (i) Derivatives up to second order can be passed under the integral sign in \(\int dP(x \vert \bt)\).
- (ii) The Fisher information \(\fI(\bt^*)\) is positive definite.
- (iii) The third derivatives \(\nabla^3 \ell(\bt \vert x)\) are bounded by \(M(x)\) on \(\bT^*\): \(\sup_{\bt \in \bT^*} \abs{\nabla^3 \ell(\bt \vert x)_{jkm}} \le M(x)\) for all \(j, k, m\), with \(\Ex M(X) < \infty\).
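As a concrete illustration (my example, not from the text), these conditions can be checked directly for the Exponential(\(\theta\)) model, where \(\ell(\theta \vert x) = \log\theta - \theta x\), the score is \(1/\theta - x\), the observed information is \(1/\theta^2\), and \(\ell'''(\theta \vert x) = 2/\theta^3\). On \(\Theta^* = (a, b)\) with \(a > 0\), the third derivative is bounded by the constant \(M = 2/a^3\), so C(iii) holds trivially; the sketch below checks C(ii) and the mean-zero score numerically:

```python
import numpy as np

# Sketch for the Exponential(theta) model, p(x | theta) = theta*exp(-theta*x):
#   log-likelihood:   l(theta | x) = log(theta) - theta*x
#   score:            u(theta | x) = 1/theta - x
#   observed info:    -l''(theta | x) = 1/theta**2   (nonrandom here)
#   third derivative: l'''(theta | x) = 2/theta**3, bounded on (a, b), a > 0
# theta_star and the sample size are arbitrary choices for the demo.

rng = np.random.default_rng(1)
theta_star = 2.0
x = rng.exponential(scale=1 / theta_star, size=200_000)

score = 1 / theta_star - x      # u(theta* | x_i) for each observation
fisher = 1 / theta_star**2      # I(theta*) = 1/theta^2

print(np.mean(score))           # approximately 0
print(fisher > 0)               # C(ii): Fisher information is positive
```

The sample mean of the score is close to zero, as C(i) guarantees, and the Fisher information is strictly positive on any such interval.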
Note #1: Condition (C) describes what happens for a single observation. What happens in a random sample of \(n\) observations is governed by condition (A).
Note #2: C(ii) applies to the Fisher information, while C(iii) applies to the derivative of the observed information. This is important! The observed information might randomly fail to be positive definite, but we don’t need to worry about that (asymptotically). Meanwhile, we need a bound on the observed derivatives, which may depend on \(x\). This means that our bound \(M(x)\) must be allowed to be random.
Note #3: Although not explicitly stated, the above conditions also ensure that both the observed information and Fisher information are continuous functions of \(\bt\).
- All differentiable functions are continuous. Thus, by requiring the third derivative to exist, we require that the second derivative (the observed information) is continuous; likewise, the score must be continuous.
- In fact, these conditions ensure that the observed information is uniformly continuous over $\bTs$. For any $\eps>0$, choose $\delta < \eps/\Ex M(X)$. Then for any $\theta, \theta_0 \in \bTs$ satisfying $\abs{\theta-\theta_0}<\delta$ we have (for each observation)
\[\as{\mathcal{I}_i(\theta) = \mathcal{I}_i(\theta_0) + \dot{\mathcal{I}}_i(\bar{\theta}_i)(\theta-\theta_0),}\]where $\bar{\theta}_i$ is on the line segment connecting $\theta$ and $\theta_0$ and therefore also in $\bTs$. We therefore have
\[\as{\abs{\tfrac{1}{n}\mathcal{I}_n(\theta) - \tfrac{1}{n}\mathcal{I}_n(\theta_0)} &\le \tfrac{1}{n} \sum_i \abs{\dot{\oI}_i(\bar{\theta}_i) (\theta-\theta_0)} \\ &\le \tfrac{1}{n} \sum_i M(X_i) \delta \\ &\inP \Ex M(X) \delta \\ &< \eps;}\]note that here we can choose a single value of \(\delta\) that works for all \(\theta \in \Theta^*\). In the above, we assumed a single parameter for the sake of simplicity, but the argument is effectively the same in higher dimensions.
Uniform continuity is important because it provides uniform convergence of the observed information to the Fisher information: \(\tfrac{1}{n}\oI(\bth) \inP \fI(\bts)\) as \(\bth \inP \bts\). Note that this is more complex than ordinary convergence – we can’t simply use the law of large numbers or the continuous mapping theorem here because both the information and the point at which the information is being evaluated are changing simultaneously.
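A small simulation (my example, not from the text) suggests how this uniform convergence looks in practice. For the Poisson(\(\theta\)) model, \(\ell(\theta \vert x) = x\log\theta - \theta - \log x!\), so the per-observation observed information is \(x/\theta^2\) and \(\tfrac{1}{n}\oI_n(\theta) = \bar{x}/\theta^2\), while its expectation under the true \(\theta^*\) is \(\theta^*/\theta^2\). Over \(\Theta^* = [1, 3]\), the supremum gap factors as \(\abs{\bar{x} - \theta^*}\sup_\theta \theta^{-2} = \abs{\bar{x} - \theta^*}\), which shrinks with \(n\):

```python
import numpy as np

# Sketch: uniform convergence of the scaled observed information for the
# Poisson(theta) model. Per observation, -l''(theta | x) = x/theta**2, so
#   (1/n) I_n(theta) = xbar/theta**2,
# with expectation (under the true theta*) equal to theta*/theta**2.
# theta_star, the grid, and the sample sizes are arbitrary demo choices.

rng = np.random.default_rng(0)
theta_star = 2.0
grid = np.linspace(1.0, 3.0, 201)           # a grid over Theta* = [1, 3]

for n in (100, 10_000, 1_000_000):
    x = rng.poisson(theta_star, size=n)
    avg_obs_info = x.mean() / grid**2       # (1/n) I_n(theta) on the grid
    expected_info = theta_star / grid**2    # E[(1/n) I_n(theta)] under theta*
    sup_gap = np.max(np.abs(avg_obs_info - expected_info))
    print(n, sup_gap)                       # supremum gap shrinks with n
```

Because the gap is controlled uniformly over the whole neighborhood, plugging in any consistent estimate \(\bth\) still yields \(\tfrac{1}{n}\oI(\bth) \inP \fI(\bts)\).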
- Similar arguments apply to the Fisher information. Just as the observed information (the second derivative of the log-likelihood) is uniformly continuous over $\bTs$, the score (the first derivative) is also uniformly continuous over $\bTs$, and the dominated convergence theorem applies:
\[\begin{alignat*}{2} \lim_{\bt \to \bt_0} \fI(\bt) &= \lim_{\bt \to \bt_0} \int \{\u(\bt) - \Ex\u(\bt)\} \{\u(\bt) - \Ex\u(\bt)\}\Tr \, dP &\hspace{4em}& \text{Definition of } \fI(\bt) \\ &= \int \lim_{\bt \to \bt_0} \{\u(\bt) - \Ex\u(\bt)\} \{\u(\bt) - \Ex\u(\bt)\}\Tr \, dP && \href{dominated-convergence-theorem.html}{\text{DCT}}\\ &= \int \{\u(\bt_0) - \Ex\u(\bt_0)\} \{\u(\bt_0) - \Ex\u(\bt_0)\}\Tr \, dP && \textnormal{Score is continuous} \\ &= \fI(\bt_0) && \text{Definition of } \fI(\bt_0) \\ \end{alignat*}\]for any \(\bt_0 \in \bT^*\).
Log-concavity
The above conditions apply only locally (within a neighborhood of \(\bt^*\)) and thus do not guarantee anything about the MLE, only about a local maximum near \(\bt^*\). To guarantee consistency and asymptotic normality of the MLE, the following stronger condition is needed, replacing C(ii).
(D) Log-concavity: The Fisher information \(\fI(\bt)\) is positive definite for all \(\bt \in \bT\), and \(\bT\) is a convex set.
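As a concrete case (my example, not from the text), logistic regression satisfies condition (D): the Fisher information is \(\fI(\bb) = \X\Tr \W \X\) with \(\W = \mathrm{diag}\{p_i(1-p_i)\}\), which is positive definite at every \(\bb \in \real^d\) whenever the design matrix has full column rank, and \(\real^d\) is convex. A quick numerical check of positive definiteness at a few arbitrary parameter values:

```python
import numpy as np

# Sketch: for logistic regression, I(beta) = X^T W X with
# W = diag(p_i * (1 - p_i)). Since every weight is strictly positive, I(beta)
# is positive definite for all beta whenever X has full column rank, so
# condition (D) holds with Theta = R^d. The design matrix and the beta values
# below are arbitrary choices for the demo.

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])

def fisher_info(beta):
    p = 1 / (1 + np.exp(-X @ beta))     # fitted probabilities
    w = p * (1 - p)                     # diagonal of W
    return X.T @ (w[:, None] * X)       # I(beta) = X^T W X

for beta in ([0.0, 0.0, 0.0], [1.5, -2.0, 0.5], [-1.0, 0.5, 2.0]):
    eigmin = np.linalg.eigvalsh(fisher_info(np.array(beta, float))).min()
    print(eigmin > 0)                   # positive definite at each beta
```

This global positive definiteness is what upgrades the local conclusions above to statements about the MLE itself.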
Alternative statements
The above conditions are one way to state the requirements for asymptotic normality of the MLE, but there is some flexibility. For example, the only role of condition C(i) is to ensure that \(\Ex \u(\bts) = \zero\) and \(\Ex \oI(\bts) = \fI(\bts)\). Thus, some authors simply state these properties of the expected score and information directly, rather than stating conditions that imply them.
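These two identities are easy to check by simulation; here is a sketch (my example, not from the text) for the Poisson(\(\theta\)) model, where \(\u(\theta \vert x) = x/\theta - 1\) and the per-observation observed information is \(x/\theta^2\), with \(\fI(\theta) = 1/\theta\):

```python
import numpy as np

# Monte Carlo check of the identities C(i) delivers, for Poisson(theta):
#   E[u(theta*)] = 0,           with u(theta | x) = x/theta - 1
#   E[-l''(theta* | X)] = I(theta*) = 1/theta*,  here -l'' = x/theta**2
# theta_star and the sample size are arbitrary demo choices.

rng = np.random.default_rng(7)
theta_star = 2.0
x = rng.poisson(theta_star, size=500_000)

mean_score = np.mean(x / theta_star - 1)      # should be near 0
mean_obs_info = np.mean(x / theta_star**2)    # should be near 1/theta*
print(mean_score, mean_obs_info, 1 / theta_star)
```

Both sample averages land close to their theoretical values, which is exactly what passing derivatives under the integral sign buys us.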
Similarly, the purpose of C(iii) is to establish a uniform bound on the observed information over \(\bT^*\) – the core idea of the condition is that we have a single bound \(M(x)\) that works for all values of \(\bt \in \bT^*\). Bounding the third derivative is one way of accomplishing that. It is possible to relax this condition and require a uniform bound only on the second derivatives (plus some other conditions), although the resulting proofs of consistency and asymptotic normality become more complicated. For this reason, the simpler regularity conditions presented above are more common in practice.
Finally, the IID assumption merely presents a basic, standard case in which likelihood theory holds. Certainly, likelihood estimates are asymptotically normal in all manner of non-IID settings (multiple groups, regression, etc.) as well – likelihood theory would not be terribly useful if this were not true. If one understands how the theory works in the IID case, it is typically relatively straightforward to extend theoretical results to other cases.