The Cross Entropy Method (CEM) is a generic optimization technique. It is a zero-th order method, i.e. you don’t gradients.^{1} So, for instance, it works well on combinatorial optimization problems, as well as reinforcement learning.

# Category: Uncategorized

## A Short Discussion on Policy Gradients in Bandits

This is a quick note on policy gradients in bandit problems. There are a number of excellent papers coming out on the convergence of policy gradients [1 ,2, 3]. I wrote this sketch argument a few months ago. Lockdown and homeschooling started and I found I did not have time to pursue it. Then yesterday morning I came across the following paper Mei et al. [1], which uses a different Lyapunov argument to draw an essentially the same conclusion [first for bandits then for general RL]. Although most of the time I don’t talk about my own research on this blog, I wrote this short note here just in case this different Lyapunov argument is of interest.

Further, all these results on policy gradient are [thus far] for deterministic models. I wrote a short paper that deals with a stochastic regret bound for a policy gradient algorithm much closer to projected gradient descent, again in the bandit setting. I’ll summarize this, as it might be possible to use it to derive stochastic regret bounds for policy gradients more generally. At the end, I discuss some issues around getting some proofs to work in the stochastic case and emphasize the point that is a good choice of lyapunov function [where $latex $q^\star$ is the probability of *not* choosing the optimal action.]

## Merton Portfolio Optimization

Here are the slides from Lectures

11_Merton Portfolio Optimization

Please read Section 2.4 of the notes

## Diffusion Control

Here are the slides from Lectures

Please read Section 2.3 of the notes

## Stochastic Integration (a quick intro)

Here are the slides from Lectures

Please read Section 2.2 of the notes

## Continuous Time Dynamic Programming

Here are the slides from Lectures

8_Continuous Time Dynamic Programming

Please read Section 2.1 of the notes

## LQR and Kalman Filter

Here are the slides from Lectures

Please read these notes [which will be later added to the main set of notes]:

## ODE method for Stochastic Approximation

We consider the Robbins-Monro update

and argue that this can be approximated by the o.d.e.:

## Optimal Stopping

Here are the slides from Lectures

Please read Section 1.6 from the notes:

Please attempt Ex53, Ex54, Ex56, Ex57.

## Algorithms for MDPs

Here are the slides from Lectures

Please read Section 1.5 from the notes:

Please attempt Ex39, 40 & 41 [if you can code], 42 and 43.