We consider the problem where there is an amount of stock at time . You can perform the action to order units of stock. Further the demand at time is . We assume is independent over . The change in the amount of stock follows the dynamic:

## Lai Robbins Lower-Bound

We continue the earlier post on finite-arm stochastic multi-arm bandits. The results so far have suggest that, for independent identically distributed arms, the correct size of the regret is of the order . We now more formally prove this with the Lai and Robbins lower-bound .

Continue reading “Lai Robbins Lower-Bound”## Euler-Maruyama

The Euler-Maruyama scheme is a method of approximating an stochastic differential equation. Here we investigate two forms of error the scheme: the weak error and the strong error. The aim is to later we will cover Multi-Level Monte Carlo (MLMC) and related topics.

Continue reading “Euler-Maruyama”## Stochastic Finite Arm Bandits

We discuss a canonical multi-arm bandit setting. The stochastic multi-arm bandit with finite arms and bounded rewards. We let index the set of arms. We let be the set of arms. If you play the arm at time , you receive rewards which are independent and identically distributed in . However, the distribution between arms may change.

We let be the mean of arm . We want to find the machine with highest mean reward.

Continue reading “Stochastic Finite Arm Bandits”## Perturbation of Markov Matrices

We consider the impact on the stationary distribution of perturbing a Markov chain’s transition matrix. This argument due to Seneta. We show if and are probability matrices of a irreducible finite state Markov chains with stationary distributions and then we can show that

**Theorem.**

where is the total variation between the rows of .

Continue reading “Perturbation of Markov Matrices”## Cross Entropy Method

The Cross Entropy Method (CEM) is a generic optimization technique. It is a zero-th order method, i.e. you don’t gradients.^{1} So, for instance, it works well on combinatorial optimization problems, as well as reinforcement learning.

## Exponential Families

The exponential family of distributions are a particularly tractable, yet broad, class of probability distributions. They are tractable because of a particularly nice [Fenchel] duality relationship between natural parameters and moment parameters. Moment parameters can be estimated by taking the empirical mean of sufficient statistics and the duality relationship can then recover an estimate of the distributions natural parameters.

Continue reading “Exponential Families”## A Short Discussion on Policy Gradients in Bandits

This is a quick note on policy gradients in bandit problems. There are a number of excellent papers coming out on the convergence of policy gradients [1 ,2, 3]. I wrote this sketch argument a few months ago. Lockdown and homeschooling started and I found I did not have time to pursue it. Then yesterday morning I came across the following paper Mei et al. [1], which uses a different Lyapunov argument to draw an essentially the same conclusion [first for bandits then for general RL]. Although most of the time I don’t talk about my own research on this blog, I wrote this short note here just in case this different Lyapunov argument is of interest.

Further, all these results on policy gradient are [thus far] for deterministic models. I wrote a short paper that deals with a stochastic regret bound for a policy gradient algorithm much closer to projected gradient descent, again in the bandit setting. I’ll summarize this, as it might be possible to use it to derive stochastic regret bounds for policy gradients more generally. At the end, I discuss some issues around getting some proofs to work in the stochastic case and emphasize the point that is a good choice of lyapunov function [where $latex $q^\star$ is the probability of *not* choosing the optimal action.]

## Stochastic Control Notes Update

I’ve updated the stochastic control notes here:

## Stochastic_Control_2020_May.pdf

These still remain a work in progress.

(Typos/errors/mishaps found are welcome)

A quick summary of what is new:

- I’ve updated a section on the ODE method for stochastic approximation.
- I’ve improved the discussion around Temporal Difference methods and included some proofs.
- I’ve added a proof of convergence for SARSA.
- I’ve added a section on Lyapunov functions in continuous time
- La Salle, exponential convergence, and online convex optimization…

- I’ve started a section on Policy Gradients but there is more recent proofs to include
- I’ve started a section on Deep Learning for RL.

## Merton Portfolio Optimization

Here are the slides from Lectures

11_Merton Portfolio Optimization

Please read Section 2.4 of the notes