# Cross Entropy Method

The Cross Entropy Method (CEM) is a generic optimization technique. It is a zeroth-order method, i.e. you do not need gradients.1 So, for instance, it works well on combinatorial optimization problems, as well as in reinforcement learning.

The Basic Idea. You want to maximize a function $S(x)$ over $x\in\mathcal X$. We assume you can sample random variables $X_1,...,X_N$ from $\mathcal X$ according to some parameterized distribution $p(x|\theta)$. We also assume that you can fit the parameter $\theta$ to data $(X^{(1)},...,X^{(M)})$; we let $\hat \theta$ denote this fit. Finally, $\rho \in (0,1)$ is an importance parameter used by the cross entropy method. The basic CEM routine is now as follows:

1. Sample $X_1,...,X_N$ from $p(\cdot |\theta)$ and evaluate $S(X_1),...,S(X_N)$.
2. Order the sample $X^{(1)},...,X^{(N)}$ so $S(X^{(1)}) \geq S(X^{(2)}) \geq ... \geq S(X^{(N)})$.
3. Fit $\hat \theta$ to the top $M = \lfloor \rho N\rfloor$ samples, $X^{(1)},...,X^{(M)}$ and set $\theta \leftarrow \hat \theta$.

The above three steps are repeated until convergence. The fit step of the cross entropy method uses a specific optimization to find $\hat \theta$: in particular, $\hat \theta$ minimizes the cross entropy between the empirical distribution of the sample $X^{(1)},...,X^{(M)}$ and $p(\cdot|\theta)$.2 We define this shortly. But essentially $\hat \theta$ is just the maximum likelihood estimator for the data $X^{(1)},...,X^{(M)}$ [assuming it is IID from the distribution $p(\cdot|\theta)$ for some $\theta$].
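The three steps above can be sketched in a few lines. The following is a minimal illustration (the function name and hyper-parameters are my own choices for the sketch), using a Gaussian sampling distribution whose mean and standard deviation play the role of $\theta$:

```python
import numpy as np

def cem_maximize(S, dim, n_samples=100, rho=0.1, n_iters=50, seed=0):
    """Cross Entropy Method sketch with a Gaussian sampling distribution.

    theta = (mu, sigma): sample X_1,...,X_N ~ N(mu, diag(sigma^2)), keep the
    top rho-fraction of samples, and refit (mu, sigma) by maximum likelihood.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)     # initial theta
    n_elite = max(1, int(rho * n_samples))      # M = floor(rho * N)
    for _ in range(n_iters):
        X = rng.normal(mu, sigma, size=(n_samples, dim))  # 1. sample
        scores = np.array([S(x) for x in X])              #    and evaluate
        elite = X[np.argsort(scores)[::-1][:n_elite]]     # 2. order, keep top M
        mu = elite.mean(axis=0)                           # 3. MLE fit of theta
        sigma = elite.std(axis=0) + 1e-8
    return mu

# Usage: maximize S(x) = -||x - (1, 2)||^2, which is maximized at x = (1, 2).
x_star = cem_maximize(lambda x: -np.sum((x - np.array([1.0, 2.0]))**2), dim=2)
```

Note that only evaluations of $S$ are used; no gradient of $S$ appears anywhere in the loop.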

The update for $\theta$ can be smoothed out. Here we replace the update with

$$
\theta \leftarrow \alpha \hat\theta + (1-\alpha)\theta
$$

for some $\alpha \in (0,1]$. Sometimes we might want to smooth parameters differently, e.g. the mean and the variance can be updated with different smoothing rates.
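As a small illustration of the smoothed update (the function name and the value $\alpha = 0.7$ are illustrative, not prescribed by the method):

```python
import numpy as np

def smoothed_update(theta, theta_hat, alpha=0.7):
    """Blend the freshly fitted theta_hat with the previous theta."""
    return alpha * theta_hat + (1 - alpha) * theta

# Previous mean (0, 0), refit mean (1, 2): the smoothed mean is (0.7, 1.4).
mu = smoothed_update(np.array([0.0, 0.0]), np.array([1.0, 2.0]))
```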

The Cross Entropy Method is a heuristic algorithm. There are only very limited results on its convergence. That said, empirically it has enjoyed a lot of success and is well worth considering as a straightforward method for a variety of optimization problems.

## Cross Entropy and Maximum Likelihood Estimation

The CEM recommends the update $\theta\leftarrow \hat{\theta}$, where $\hat\theta$ minimizes a cross entropy. We define this here. However, really it is just a fancy way of saying that $\hat{ \theta}$ should be the maximum likelihood estimate (MLE) of our data. Often MLEs can be found by direct calculation, e.g. for multivariate Gaussian distributions or a categorical probability distribution. There is a reasonably large family of probability distributions – namely, exponential families – for which we can minimize the cross entropy and find MLEs in closed form.

If a closed-form solution cannot be found, then numerical techniques, such as Newton's method or stochastic gradient descent, could be used. However, in RL, this is often the point where people start to use other gradient-based methods directly on the MDP objective (e.g. TD, Q-learning or actor-critic algorithms) rather than expend computation on a secondary MLE parameter optimization. CEM is meant to be a quick, easy-to-implement method to get things working before going the whole hog into some more complex method [e.g. deep RL].

Def [Relative Entropy and Entropy] For two probability distributions $p(x)$ and $q(x)$, the relative entropy between $p$ and $q$ is

$$
D(p \| q) = \int p \log \frac{p}{q} \, d\mu \, .
$$

Here $\mu$ is some reference measure.3 The entropy of a distribution is

$$
H(p) = - \int p \log p \, d\mu \, .
$$

Relative entropy is a very common measure of the distance between two distributions; it arises, for instance, in Sanov's Theorem. The relative entropy is non-negative; is convex in $p(x)$ and $q(x)$; and is minimized [equal to zero] when $p = q$ [almost everywhere w.r.t. $\mu$]. The entropy is, up to a constant, the negative relative entropy when $q$ is the uniform distribution with respect to $\mu$. When we add the entropy to the relative entropy [which gets rid of the $\int p \log p \, d\mu$ term], we get the cross entropy:

Def [Cross Entropy] For two probability distributions $p(x)$ and $q(x)$, the cross entropy is

$$
H(p, q) = - \int p \log q \, d\mu \, .
$$

Here $\mu$ is some reference measure.

Cross Entropy and MLEs. Notice that if $p(x)$ is the empirical distribution of some data $X^{(1)},...,X^{(M)}$, i.e. $p(x) =\frac{1}{M} \sum_{i=1}^M \mathbb I [X^{(i)}=x]$, and if $q(x)$ belongs to some parameterized family of distributions $p(x|\theta)$, then the cross entropy is

$$
H\big(p, p(\cdot|\theta)\big) = -\frac{1}{M} \sum_{i=1}^M \log p(X^{(i)} | \theta) \, .
$$

When we apply the Cross Entropy Method, we set

$$
\hat\theta \in \arg\min_\theta \left\{ -\frac{1}{M}\sum_{i=1}^M \log p(X^{(i)}|\theta) \right\} = \arg\max_\theta \sum_{i=1}^M \log p(X^{(i)}|\theta) \, .
$$

Notice that here we minimize the cross entropy, which is the same as maximizing the log-likelihood of our data. So we can interpret $\hat\theta$ as the maximum likelihood estimator (MLE) of the data $X^{(1)},...,X^{(M)}$, under the assumption that the data is independent and from the distribution $p(\cdot | \theta)$ for some $\theta$.
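A quick numerical sanity check of this equivalence, using a categorical distribution (the data here is made up for illustration): the probability vector minimizing the empirical cross entropy is exactly the vector of empirical frequencies, i.e. the MLE.

```python
import numpy as np

# Made-up data from a categorical distribution over K = 3 classes.
data = np.array([0, 0, 1, 2, 1, 0, 0, 2])
K, M = 3, len(data)

# Empirical frequencies: the closed-form MLE.
p_mle = np.bincount(data, minlength=K) / M   # [0.5, 0.25, 0.25]

def cross_entropy(p_theta):
    # H(p_empirical, p_theta) = -(1/M) * sum_i log p(X_i | theta)
    return -np.mean(np.log(p_theta[data]))

# Any other probability vector gives a larger (or equal) cross entropy.
rng = np.random.default_rng(1)
for _ in range(100):
    q = rng.dirichlet(np.ones(K))
    assert cross_entropy(q) >= cross_entropy(p_mle)
```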

Categorical. Let $X\sim \mathrm{Cat}( p)$. That is, $\mathbb P(X=k) = p_k$ for $k=1,...,K$, where $p \in \mathbb R_+^K$ is such that $\sum_k p_k =1$. Then the MLEs are

$$
\hat p_k = \frac{1}{M} \sum_{i=1}^M \mathbb I[X^{(i)} = k], \qquad k = 1,...,K \, .
$$

Multivariate Gaussian. If $X\sim \mathcal N(\mu , \Sigma)$, then the MLEs for $\mu$ and $\Sigma$ are

$$
\hat\mu = \frac{1}{M}\sum_{i=1}^M X^{(i)}, \qquad \hat\Sigma = \frac{1}{M}\sum_{i=1}^M \big(X^{(i)} - \hat\mu\big)\big(X^{(i)} - \hat\mu\big)^\top \, .
$$
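The Gaussian MLE is a one-liner in numpy (the synthetic data below is for illustration only); note that `np.cov` with `bias=True` uses the same $1/M$ normalization as the MLE:

```python
import numpy as np

# Synthetic data: X = Z A + b with Z ~ N(0, I), so the true mean is (1, -2).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [0.5, 0.5]])
X = rng.normal(size=(1000, 2)) @ A + np.array([1.0, -2.0])
M = len(X)

mu_hat = X.mean(axis=0)                        # (1/M) sum_i X_i
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / M  # (1/M) sum_i (X_i - mu)(X_i - mu)^T

# np.cov with bias=True matches the (1/M)-normalized MLE exactly.
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```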

These are the two most important examples, as they cover discrete and continuous data respectively. We leave the proofs of these examples as an exercise. However, we can also find MLEs for more general distributions.

Cross Entropy, MLEs and Exponential Families. We cover exponential families in more detail in this post. However, the main points are that $p(x|\theta)$ defines an exponential family when

$$
p(x|\theta) = \exp\Big( \sum_{j=1}^p \theta_j \phi_j(x) - \Phi(\theta) \Big)
$$

for natural parameters $\theta \in \Theta \subset \mathbb R^p$, where $\Phi(\theta)$ is the normalizing constant. The functions $\phi(x) =( \phi_j(x): j=1,..,p)$ are called sufficient statistics. We define the moment parameters by

$$
\eta(\theta) = \mathbb E_\theta \big[ \phi(X) \big] \, .
$$

We let $\theta( \eta)$ be the inverse of $\eta( \theta)$.4 We can solve

$$
\max_\theta \; \sum_{i=1}^M \log p(X^{(i)} | \theta)
$$

by calculating

$$
\hat\eta = \frac{1}{M} \sum_{i=1}^M \phi(X^{(i)}) \qquad \text{and setting} \qquad \hat\theta = \theta(\hat\eta) \, .
$$

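As a concrete instance of this moment-matching recipe (the Poisson parameterization below is standard, but the code is a sketch with made-up data): the Poisson family has natural parameter $\theta = \log \lambda$, sufficient statistic $\phi(x) = x$, and moment parameter $\eta(\theta) = \mathbb E_\theta[X] = e^\theta$, so the MLE is the log of the sample mean.

```python
import numpy as np

# Synthetic Poisson data with true rate lambda = 3, i.e. theta = log(3).
rng = np.random.default_rng(0)
data = rng.poisson(lam=3.0, size=10_000)

# Poisson as an exponential family: p(x|theta) = exp(theta*x - e^theta) / x!
# with phi(x) = x and eta(theta) = e^theta.
eta_hat = data.mean()        # match moments: eta_hat = (1/M) sum_i phi(X_i)
theta_hat = np.log(eta_hat)  # invert eta(theta) = e^theta, i.e. theta(eta) = log(eta)
```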
1. Caveat: though you may [or may not] need gradient methods in the sub-routine that minimizes the cross entropy.
2. Note that one could instead [and indeed people do] use other alternatives to the cross entropy.
3. Think of $\mu$ as the counting measure for a sum, or Lebesgue measure, which corresponds to the usual integral taught in school.
4. $\theta( \eta)$ can be calculated by finding the dual of $\Phi(\theta)$ and differentiating. See Section [ExpFamilies].