This article describes the row-orientation or “derivative” convention; see here for the column-orientation or “gradient” convention most often used in statistics.

Definition: For a function \(f: \real^d \to \real\), its derivative is the \(1 \times d\) row vector

\[\dot{f}(\x) = \left[\tfrac{\partial f}{\partial x_1} \cdots \tfrac{\partial f}{\partial x_d} \right].\]

For a function \(f: \real^d \to \real^k\), its derivative is the \(k \times d\) matrix with \(ij\)th element

\[\dot{\f}(\x)_{ij} = \frac{\partial f_i(\x)}{\partial x_j}.\]
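As a concrete sketch of this definition (assuming NumPy; the function `f` and step size `h` are illustrative), a central-difference approximation builds the \(k \times d\) derivative matrix entry by entry, with row \(i\) holding the partials of \(f_i\):

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Finite-difference estimate of the k x d derivative matrix,
    with (i, j) entry ∂f_i/∂x_j, following the row convention above."""
    x = np.asarray(x, dtype=float)
    k = np.atleast_1d(f(x)).size
    d = x.size
    J = np.zeros((k, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        # Central difference in coordinate j fills column j.
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

# Example: f maps R^2 -> R^3, so its derivative is 3 x 2.
f = lambda x: np.array([x[0] * x[1], x[0] ** 2, np.sin(x[1])])
J = jacobian(f, np.array([1.0, 2.0]))
print(J.shape)  # (3, 2)
```

At \(\x = (1, 2)\) the exact derivative is \(\begin{bmatrix} x_2 & x_1 \\ 2x_1 & 0 \\ 0 & \cos x_2 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 2 & 0 \\ 0 & \cos 2 \end{bmatrix}\), which the estimate matches to within the step size.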

Identities

\[\begin{alignat*}{2} &\text{Linear map:} \hspace{8em} & D_{\x}(\A \x) &= \A \\ &\text{Quadratic form:} & D_{\x}(\x \Tr \A \x) &= \x \Tr (\A + \A \Tr)\\ &\text{Chain rule:} & D_{\x} \f(\y) &= D_{\y}\f \, D_{\x}\y \\ &\text{Product rule:} & D (\f \Tr \g) &= \g \Tr \dot{\f} + \f \Tr \dot{\g} \\ &\text{Inverse function theorem:} & D_{\x}\y &= \left(D_{\y}\x\right)^{-1} \end{alignat*}\]

Note that for the inverse function theorem to apply, the derivative \(D_{\y}\x\) must be invertible; in particular, it must be square.
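The quadratic-form identity can be checked numerically. A minimal sketch, assuming NumPy (the matrix \(\A\) and point \(\x\) are random, and the finite-difference step `h` is illustrative): the row derivative of the scalar \(\x \Tr \A \x\) should equal \(\x \Tr (\A + \A \Tr)\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

def q(v):
    # Scalar quadratic form v^T A v.
    return v @ A @ v

# Finite-difference row derivative (a length-3 row) of q at x.
h = 1e-6
D = np.array([(q(x + h * e) - q(x - h * e)) / (2 * h)
              for e in np.eye(3)])

# Matches the identity D_x(x^T A x) = x^T (A + A^T).
assert np.allclose(D, x @ (A + A.T), atol=1e-5)
```

Note that `x @ (A + A.T)` is a row vector here, consistent with the row-orientation convention used throughout.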

Relation to gradient form

The derivatives given above are the transposes of the gradients defined here:

\[\nabla f(\x) = \dot{f}(\x) \Tr .\]

Most authors stick to one convention and use the terms “gradient” and “derivative” interchangeably, although some authors reserve “derivative” specifically for the above convention and “gradient” for the column-oriented convention.
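As a minimal numerical illustration of the relation (assuming NumPy; the function `f` is illustrative), the column gradient is simply the transpose of the \(1 \times d\) row derivative:

```python
import numpy as np

f = lambda v: v[0] ** 2 + 3.0 * v[1]   # f: R^2 -> R
x = np.array([2.0, 5.0])

# Row derivative [∂f/∂x1, ∂f/∂x2] by central differences.
h = 1e-6
df = np.array([[(f(x + h * e) - f(x - h * e)) / (2 * h)
                for e in np.eye(2)]])

grad = df.T  # column gradient, per ∇f(x) = derivative(x)^T
print(df.shape, grad.shape)  # (1, 2) (2, 1)
```

Here the exact row derivative at \(\x = (2, 5)\) is \([2x_1 \;\; 3] = [4 \;\; 3]\), and transposing it gives the gradient in the column-oriented convention.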