Exponential Families

Exponential families are a particularly tractable, yet broad, class of probability distributions. They are tractable because of a particularly nice [Fenchel] duality relationship between natural parameters and moment parameters. Moment parameters can be estimated by taking the empirical mean of the sufficient statistics, and the duality relationship then recovers an estimate of the distribution's natural parameters.

Def [Exponential Family of Distributions] An exponential family is a set of probability distributions of the form

$$p(x | \theta) = \exp\{ \theta^\top \phi(x) - \Phi(\theta) \}$$

where

$$\Phi(\theta) = \log \int \exp\{ \theta^\top \phi(x) \} \mu(dx)$$

for $\theta \in \Theta \subset \mathbb R^p$. Here $\mu$ is some measure [e.g. counting measure or the Lebesgue measure] and $p(x | \theta)$ is a density with respect to $\mu$. The functions $\phi_j(x)$, $j=1,...,p$, are called sufficient statistics. The special case $\phi(x) = x$ gives the natural exponential family of distributions. Finally, we define the moment parameters by

$$\eta(\theta) = \mathbb E_\theta [ \phi(X) ]$$

for $X\sim p(\cdot |\theta)$ and we let $\theta( \eta)$ be the inverse of $\eta( \theta)$.

A happy family! A large number of probability distributions: Normal, Poisson, Geometric, Binomial, Gamma, Exponential… are in the exponential family.
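As a quick worked example (a sketch added for illustration, using the standard Poisson parametrization rather than anything specific to these notes): the Poisson distribution with rate $\lambda > 0$ has probability mass function

$$\frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp\{ x \log \lambda - \lambda \}, \qquad x = 0, 1, 2, \dots$$

So, if we take $\mu$ to be the measure that puts mass $1/x!$ on each $x$, the sufficient statistic is $\phi(x) = x$, the natural parameter is $\theta = \log \lambda$, and the log-normalizing function is $\Phi(\theta) = \log \sum_x e^{\theta x}/x! = e^{\theta}$. The moment parameter is $\eta(\theta) = \mathbb E_\theta[X] = e^{\theta} = \lambda$, with inverse $\theta(\eta) = \log \eta$.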

The really nice thing about exponential families is the relationship between the moments of the sufficient statistics and the natural parameters: there is a [Legendre-Fenchel] duality between them.

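MLEs. A consequence of this is that we can often calculate a maximum likelihood estimator (MLE). Suppose we know the function $\theta( \eta)$. If we get data $x^{(i)}$ for $i=1,...,n$ and we assume the data is IID from an exponential family with $\theta$ unknown, then we can get the MLE $\hat{ \theta}$, i.e. we can solve

$$\hat{\theta} \in \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log p( x^{(i)} | \theta )$$

by calculating

$$\hat{\eta} = \frac{1}{n} \sum_{i=1}^n \phi( x^{(i)} )$$

and taking

$$\hat{\theta} = \theta( \hat{\eta} ).$$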
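The recipe above can be sketched numerically. The following is a minimal illustration, assuming the Poisson family (where $\phi(x) = x$, $\Phi(\theta) = e^\theta$ and $\theta(\eta) = \log \eta$); the function names, the true rate of 3.0 and the sample size are all hypothetical choices made for this example, and only the Python standard library is used.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def log_likelihood(theta, xs):
    """Sum of log p(x | theta) for the Poisson family with theta = log(rate)."""
    # log p(x | theta) = theta * x - exp(theta) - log(x!)
    return sum(theta * x - math.exp(theta) - math.lgamma(x + 1) for x in xs)

def theta_of_eta(eta):
    """Inverse of eta(theta) = exp(theta) for the Poisson family."""
    return math.log(eta)

rng = random.Random(0)                       # fixed seed for reproducibility
xs = [poisson_sample(3.0, rng) for _ in range(2000)]

eta_hat = sum(xs) / len(xs)                  # empirical mean of phi(x) = x
theta_hat = theta_of_eta(eta_hat)            # MLE via the duality map

# Crude sanity check: theta_hat should beat nearby values of theta.
candidates = [theta_hat + d for d in (-0.1, -0.01, 0.0, 0.01, 0.1)]
best_on_grid = max(candidates, key=lambda t: log_likelihood(t, xs))

print("eta_hat:", eta_hat)
print("theta_hat:", theta_hat)
print("theta_hat is best on grid:", best_on_grid == theta_hat)
```

No optimizer is needed here: the only numerical work is an average and one call to $\theta(\eta)$. The grid comparison is just a check, not part of the estimator.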

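Cross Entropy. We can generalize this MLE statement slightly as follows. Suppose that $q(x)$ is some other probability distribution and we wish to minimize the relative entropy between $q$ and $p$:

$$\min_{\theta \in \Theta} \; D( q \| p(\cdot | \theta) ), \qquad D( q \| p(\cdot | \theta) ) = \mathbb E_q \Big[ \log \frac{ q(X) }{ p(X | \theta) } \Big],$$

which we note is equivalent to maximizing the cross entropy

$$H( q, p(\cdot | \theta) ) = \mathbb E_q [ \log p(X | \theta) ]$$

[written here without the usual minus sign, so that it is maximized rather than minimized]. Then the optimal $\theta^\star$ is the parameter such that the moments match:

$$\mathbb E_{\theta^\star} [ \phi(X) ] = \mathbb E_q [ \phi(X) ].$$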
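To illustrate the moment-matching statement, here is a small sketch assuming the geometric family on $\{0,1,2,\dots\}$, for which $\phi(x) = x$, $\Phi(\theta) = -\log(1 - e^\theta)$ with $\theta < 0$, $\eta(\theta) = e^\theta/(1-e^\theta)$ and $\theta(\eta) = \log(\eta/(1+\eta))$. The target distribution `q` and all function names are hypothetical, chosen only for this example.

```python
import math

def log_partition(theta):
    """Phi(theta) = -log(1 - e^theta) for the geometric family, theta < 0."""
    return -math.log(1.0 - math.exp(theta))

def eta_of_theta(theta):
    """Moment parameter: mean of the geometric distribution with parameter theta."""
    r = math.exp(theta)
    return r / (1.0 - r)

def theta_of_eta(eta):
    """Inverse of eta_of_theta."""
    return math.log(eta / (1.0 + eta))

def cross_entropy(theta, q):
    """E_q[log p(X | theta)], using the sign convention where larger is better."""
    return sum(prob * (theta * x - log_partition(theta)) for x, prob in q.items())

# A hypothetical target distribution q that is NOT geometric.
q = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.2, 4: 0.1}

eta_q = sum(x * prob for x, prob in q.items())   # E_q[phi(X)]
theta_star = theta_of_eta(eta_q)                 # moment-matching parameter

print("E_q[X]:", eta_q)
print("mean under theta_star:", eta_of_theta(theta_star))
print("cross entropy at theta_star:", cross_entropy(theta_star, q))
print("cross entropy nearby:",
      cross_entropy(theta_star - 0.05, q), cross_entropy(theta_star + 0.05, q))
```

Even though `q` is not in the family, the best-fitting member is found by matching a single moment, and perturbing `theta_star` in either direction lowers the cross entropy.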

Some Results. Most of this can be verified through the following sequence of results, which we only sketch a proof of here; they are, for the most part, an application of Legendre-Fenchel duality to the log-likelihood function.

Suppose that $X\sim p(\cdot | \theta)$ belongs to an exponential family. Then:

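a) [MGF] The moment generating function of $\phi(X)$ is

$$\mathbb E_\theta \big[ e^{ s^\top \phi(X) } \big] = \exp\{ \Phi(\theta + s) - \Phi(\theta) \}$$

for $s$ such that $\theta + s \in \Theta$.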

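b) [Moments]

$$\nabla \Phi(\theta) = \mathbb E_\theta [ \phi(X) ] = \eta(\theta), \qquad \nabla^2 \Phi(\theta) = \mathbb V_\theta ( \phi(X) ).$$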

c) [Convex] $\Phi(\theta)$ is strictly convex. We write its Legendre-Fenchel transform as $\Phi^*( \eta) := \max_{\theta}\{ \eta^\top \theta - \Phi( \theta) \}$ and define

$$\theta(\eta) := \arg\max_{\theta} \{ \eta^\top \theta - \Phi(\theta) \}.$$

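d) [Duality of Moments] If we define $p( x | \eta) = p(x | \theta ( \eta))$, and write $\mathbb E_\eta$ for expectation under $p(\cdot | \eta)$, then

$$\mathbb E_\eta [ \phi(X) ] = \eta.$$

Hence

$$\theta(\eta) = \nabla \Phi^*(\eta) \quad \text{is the inverse of} \quad \eta(\theta) = \nabla \Phi(\theta).$$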

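e) [Relative Entropy] As discussed above, for any distribution $q$ the cross entropy $H(q, p(\cdot | \theta))$ is maximized by $\theta^\star$ such that

$$\mathbb E_{\theta^\star} [ \phi(X) ] = \mathbb E_q [ \phi(X) ], \quad \text{i.e.} \quad \theta^\star = \theta\big( \mathbb E_q [ \phi(X) ] \big).$$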

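f) [MLE] For data $x^{(i)}$ for $i=1,...,n$ the MLE

$$\hat{\theta} \in \arg\max_{\theta} \sum_{i=1}^n \log p( x^{(i)} | \theta )$$

is given by

$$\hat{\theta} = \theta( \hat{\eta} ), \qquad \hat{\eta} = \frac{1}{n} \sum_{i=1}^n \phi( x^{(i)} ).$$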
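Results a) to d) can be checked numerically on a concrete family. The sketch below assumes the Poisson family, for which $\Phi(\theta) = e^\theta$ and $\Phi^*(\eta) = \eta \log \eta - \eta$; the truncation point `cutoff`, the step size `h` and the particular values of `theta` and `s` are arbitrary choices for illustration, and derivatives are approximated by finite differences.

```python
import math

def log_partition(theta):
    """Phi(theta) = exp(theta) for the Poisson family."""
    return math.exp(theta)

def conjugate(eta):
    """Legendre-Fenchel transform Phi^*(eta) = eta * log(eta) - eta."""
    return eta * math.log(eta) - eta

def pmf(x, theta):
    """Poisson probability of x with natural parameter theta = log(rate)."""
    return math.exp(theta * x - log_partition(theta) - math.lgamma(x + 1))

def expect(f, theta, cutoff=200):
    """E_theta[f(X)], truncating the sum at `cutoff` (the tail is negligible here)."""
    return sum(f(x) * pmf(x, theta) for x in range(cutoff))

theta, s, h = 0.7, 0.3, 1e-4

# a) MGF: E[exp(s * phi(X))] = exp(Phi(theta + s) - Phi(theta))
mgf_direct = expect(lambda x: math.exp(s * x), theta)
mgf_formula = math.exp(log_partition(theta + s) - log_partition(theta))

# b) Moments: first and second derivatives of Phi give the mean and variance
mean = expect(lambda x: x, theta)
var = expect(lambda x: (x - mean) ** 2, theta)
grad = (log_partition(theta + h) - log_partition(theta - h)) / (2 * h)
hess = (log_partition(theta + h) - 2 * log_partition(theta)
        + log_partition(theta - h)) / h ** 2

# c), d) Duality: the gradient of Phi^* at eta(theta) recovers theta
grad_conj = (conjugate(mean + h) - conjugate(mean - h)) / (2 * h)

print("MGF direct vs formula:", mgf_direct, mgf_formula)
print("mean vs grad Phi:", mean, grad)
print("variance vs Hessian of Phi:", var, hess)
print("theta vs grad Phi^*(eta):", theta, grad_conj)
```

Each printed pair should agree up to finite-difference and truncation error.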

A Brief Proof.
a) Follows from the definition of $\Phi(\theta)$.
b) Differentiate the MGF.
c) By b), the Hessian of $\Phi$ is $\mathbb V_{\theta} ( \phi(X))$, which is positive semi-definite, so $\Phi$ is convex; it is strictly convex when this covariance matrix is positive definite. $\theta(\eta)$ is a definition at this point, so there is nothing to prove until d).
d) For LF-transforms, if the function is strictly convex then gradients [and their inverses] are unique.
e) Note this is just part c) with $\eta = \mathbb E_q [\phi(X)]$, since $\mathbb E_q[ \log p(X|\theta) ] = \theta^\top \mathbb E_q[\phi(X)] - \Phi(\theta)$. Note Fenchel transforms satisfy $\nabla f^*(\nabla f(x))=x$ and $\nabla f(\nabla f^*(x^*))=x^*$.
f) This is just part e) for the empirical distribution $q(x) = \frac{1}{n}\sum_i \delta_{x^{(i)}}(x)$. $\square$