Definition: For a function $f: \mathbb{R}^d \to \mathbb{R}$, its gradient is the $d \times 1$ column vector

$$\nabla f(x) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_d} \end{bmatrix}.$$

For a function $f: \mathbb{R}^d \to \mathbb{R}^k$, its gradient is the $d \times k$ matrix with $ij$th element

$$\nabla f(x)_{ij} = \frac{\partial f_j(x)}{\partial x_i}.$$

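These shape conventions are easy to sanity-check with automatic differentiation. Below is a minimal sketch in JAX (the example functions are arbitrary illustrations, not from the text); note that `jax.jacobian` returns the $k \times d$ Jacobian, which is transposed here to match the $d \times k$ convention of this section.

```python
# Minimal sketch (JAX; example functions are arbitrary) checking the shapes above.
import jax
import jax.numpy as jnp

d, k = 3, 2
x = jnp.array([0.5, -1.0, 2.0])                # a point in R^d

# Scalar-valued case: f(x) = sum(x_i^2), whose gradient is the d-vector 2x.
f_scalar = lambda x: jnp.sum(x ** 2)
g = jax.grad(f_scalar)(x)                      # shape (d,)
print(g.shape, jnp.allclose(g, 2 * x))         # (3,) True

# Vector-valued case: f(x) = A^T x for a d x k matrix A, mapping R^d -> R^k.
A = jnp.arange(float(d * k)).reshape(d, k)
f_vec = lambda x: A.T @ x

# jax.jacobian gives the k x d matrix; its transpose is the d x k gradient
# defined above, whose ij-th entry is df_j/dx_i.
grad_f = jax.jacobian(f_vec)(x).T              # shape (d, k)
print(grad_f.shape, jnp.allclose(grad_f, A))   # (3, 2) True
```
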
Identities

$$
\begin{aligned}
\text{Inner product:} &\quad \nabla_x (A^\top x) = A \\
\text{Quadratic form:} &\quad \nabla_x (x^\top A x) = (A + A^\top) x \\
\text{Chain rule:} &\quad \nabla_x f(y) = \nabla_x y \, \nabla_y f \\
\text{Product rule:} &\quad \nabla (fg) = (\nabla f) g + (\nabla g) f \\
\text{Inverse function theorem:} &\quad \nabla_x y = (\nabla_y x)^{-1}
\end{aligned}
$$

Note that for the inverse function theorem to apply, the gradient $\nabla_y x$ must be invertible; in particular, it must be square, so $x$ and $y$ must have the same dimension.
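
As a numerical sanity check, here is a short JAX sketch (the matrix $A$ and the map $y = Ax$ are arbitrary illustrative choices) verifying the quadratic-form identity and the inverse function theorem under this convention.

```python
# Sketch (JAX; A and the map y = Ax are arbitrary examples) checking two identities.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)      # double precision for tight comparisons

A = jnp.array([[2.0, 0.5, 0.0],
               [0.1, 1.5, 0.3],
               [0.0, 0.2, 3.0]])               # an invertible 3 x 3 matrix
x = jnp.array([0.2, -0.7, 1.5])

# Quadratic form: grad_x (x^T A x) = (A + A^T) x.
quad = lambda x: x @ A @ x
print(jnp.allclose(jax.grad(quad)(x), (A + A.T) @ x))          # True

# Inverse function theorem: grad_x y = (grad_y x)^{-1}, for the invertible map
# y = A x with inverse x = A^{-1} y.  Jacobians are transposed into this
# section's d x k convention before comparing.
grad_x_y = jax.jacobian(lambda x: A @ x)(x).T                  # equals A^T
grad_y_x = jax.jacobian(lambda y: jnp.linalg.solve(A, y))(A @ x).T
print(jnp.allclose(grad_x_y, jnp.linalg.inv(grad_y_x)))        # True
```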

Usage

The above convention (that taking derivatives produces column vectors) is the one most often encountered in the statistics literature. However, this convention is arbitrary, and it is equally possible to adopt the alternative convention that derivatives produce row vectors. Most authors stick to one convention and use the terms “gradient” and “derivative” interchangeably, although some reserve “gradient” for the column-vector convention above and “derivative” for the alternative row-vector convention.
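
The distinction matters mainly when composing results: the two conventions are transposes of one another, and the chain rule multiplies its factors in opposite orders under them. The sketch below (JAX, with arbitrary example maps $g$ and $h$) illustrates this; note that `jax.jacobian` itself returns the row-vector (numerator-layout) Jacobian.

```python
# Sketch (JAX; g and h are arbitrary example maps) contrasting the two conventions.
import jax
import jax.numpy as jnp

g = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[2]), x[0] + x[2]])   # R^3 -> R^3
h = lambda y: jnp.array([jnp.exp(y[0]), y[1] * y[2]])                # R^3 -> R^2

x = jnp.array([0.3, -1.2, 0.7])
y = g(x)

J_g = jax.jacobian(g)(x)                    # 3 x 3, row-vector convention
J_h = jax.jacobian(h)(y)                    # 2 x 3
J_f = jax.jacobian(lambda x: h(g(x)))(x)    # 2 x 3, Jacobian of the composition

# Row-vector convention: J_f = J_h J_g.  Column-vector convention (this text):
# grad_x f = grad_x y grad_y f, i.e. the same factors transposed and reversed.
print(jnp.allclose(J_f, J_h @ J_g))          # True
print(jnp.allclose(J_f.T, J_g.T @ J_h.T))    # True
```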