Definition: For a function \(f: \real^d \to \real\), its gradient is the \(d \times 1\) column vector

\[\nabla f(\x) = \left[\tfrac{\partial f}{\partial x_1} \cdots \tfrac{\partial f}{\partial x_d} \right] \Tr.\]

For a function \(\f: \real^d \to \real^k\), its gradient is the \(d \times k\) matrix with \(ij\)th element

\[\nabla \f(\x)_{ij} = \frac{\partial f_j(\x)}{\partial x_i}.\]
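As a sanity check on this layout, here is a small finite-difference sketch (an illustration, not part of the source; the helper `gradient_fd` is mine) that builds the \(d \times k\) gradient entry by entry as \(\partial f_j / \partial x_i\):

```python
import numpy as np

def gradient_fd(f, x, eps=1e-6):
    """Finite-difference gradient of f: R^d -> R^k, laid out as the
    d x k matrix with (i, j) entry  df_j / dx_i  (derivatives as columns)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    k = np.atleast_1d(f(x)).size
    G = np.empty((d, k))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        # Central difference along coordinate i fills row i of the gradient.
        G[i, :] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * eps)
    return G

# Example: f(x) = (x1^2, x1*x2). At x = (1, 2) the gradient is
# [[df1/dx1, df2/dx1], [df1/dx2, df2/dx2]] = [[2, 2], [0, 1]].
f = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
print(gradient_fd(f, [1.0, 2.0]))  # approx [[2., 2.], [0., 1.]]
```

Note that this matrix is the transpose of the Jacobian as it is usually written (rows indexed by outputs).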

Identities

\[\begin{alignat*}{2}
&\text{Inner product:} \hspace{8em} & \nabla_{\x} (\A \Tr \x) &= \A \\
&\text{Quadratic form:} & \nabla_{\x} (\x \Tr \A \Tr \x) &= (\A + \A \Tr)\x \\
&\text{Chain rule:} & \nabla_{\x} \f(\y) &= \nabla_{\x}\y \, \nabla_{\y} \f \\
&\text{Product rule:} & \nabla (\f \Tr \g) &= (\nabla\f)\g + (\nabla\g)\f \\
&\text{Inverse function theorem:} & \nabla_{\x}\y &= \left(\nabla_{\y}\x\right)^{-1}
\end{alignat*}\]

Note that for the inverse function theorem to apply, the gradient \(\nabla_{\y}\x\) must be invertible; in particular, it must be square, so \(\x\) and \(\y\) must have the same dimension.
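The first two identities are easy to verify numerically. The sketch below (my own check, assuming NumPy; it uses a vector \(\a\) for the inner-product case, i.e. the \(k = 1\) special case of the identity) compares each closed-form gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))   # general (non-symmetric) matrix
a = rng.standard_normal(d)        # vector for the inner-product case
x = rng.standard_normal(d)

def grad_fd(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar-valued f."""
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Inner product: grad_x(a^T x) = a
print(np.allclose(grad_fd(lambda x: a @ x, x), a, atol=1e-5))

# Quadratic form: grad_x(x^T A^T x) = (A + A^T) x
print(np.allclose(grad_fd(lambda x: x @ A.T @ x, x), (A + A.T) @ x, atol=1e-4))
```

Both comparisons should report `True`; the quadratic-form identity holds for any \(\A\), symmetric or not.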

Usage

The above convention (that taking derivatives produces column vectors) is the one most often encountered in the statistics literature. The choice is arbitrary, however, and it is also possible to adopt the alternative convention that derivatives produce row vectors. Most authors stick to one convention and use the terms “gradient” and “derivative” interchangeably, although some reserve “gradient” for the above convention and “derivative” for the alternative one.
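The two conventions differ only by a transpose. A short sketch (my illustration; the helper name `jacobian_fd` is an assumption, not from the source) of the alternative, row-vector convention, which yields the \(k \times d\) Jacobian:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Row-vector convention: the k x d Jacobian, with row j holding the
    partials of f_j. This is the transpose of the d x k gradient above."""
    x = np.asarray(x, dtype=float)
    k = np.atleast_1d(f(x)).size
    J = np.empty((k, x.size))
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        J[:, i] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * eps)
    return J

# Same example as before: f(x) = (x1^2, x1*x2) at x = (1, 2).
f = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
J = jacobian_fd(f, [1.0, 2.0])
print(J)  # approx [[2., 0.], [2., 1.]] -- the transpose of the d x k gradient
```

Under the row-vector convention the chain rule composes in the opposite order, which is one practical reason authors are careful to fix a single convention.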