Machine Learning – Applied Probability Notes

Zero-Order Stochastic Optimization: Keifer-Wolfowitz

We want to optimize the expected value of some random function. This is the problem we solved with Stochastic Gradient Descent. However, we assume that we no longer have access to unbiased estimate of the gradient. We only can obtain estimates of the function itself. In this case we can apply the Kiefer-Wolfowitz procedure.

The idea here is to replace the random gradient estimate used in stochastic gradient descent with a finite difference. If the increments used for these finite differences are sufficiently small, then over time convergence can be achieved. The approximation error for the finite difference has some impact on the rate of convergence.

Continue reading “Zero-Order Stochastic Optimization: Keifer-Wolfowitz”

Cross Entropy Method

The Cross Entropy Method (CEM) is a generic optimization technique. It is a zero-th order method, i.e. you don’t gradients.¹ So, for instance, it works well on combinatorial optimization problems, as well as reinforcement learning.

Exponential Families

The exponential family of distributions are a particularly tractable, yet broad, class of probability distributions. They are tractable because of a particularly nice [Fenchel] duality relationship between natural parameters and moment parameters. Moment parameters can be estimated by taking the empirical mean of sufficient statistics and the duality relationship can then recover an estimate of the distributions natural parameters.

Temporal Difference Learning – Linear Function Approximation

For a Markov chain $\hat{x} = (\hat x_t : t\in\mathbb Z_+)$ , consider the reward function

$\label{TD:Reward} R(x) := \mathbb E_{x} \left[ \sum_{t=0}^\infty r(\hat x_t) \right]$

associated with rewards given by $r = (r(x) : x\in\mathcal X)$ . We approximate the reward function $R(x)$ with a linear approximation,

$R(x;\bm w) = \bm w^\top \bm \phi (x) = \sum_{j\in \mathcal J} w_j \phi_j(x) .$

Continue reading “Temporal Difference Learning – Linear Function Approximation”

Bayesian Online Learning

We briefly describe an Online Bayesian Framework which is sometimes referred to as Assumed Density Filer (ADF). And we review a heuristic proof of its convergence in the Gaussian case.

Continue reading “Bayesian Online Learning”