Neural Tangent Kernel

The Neural Tangent Kernel is a way of understanding the training performance of neural networks by relating them to kernel methods. Here we give an overview of the results of the paper by Jacot et al. [here]. The paper considers a deep neural network with a fixed data set and a fixed depth, whose weights are initially independent and normally distributed. We take a limit where the width of each layer tends to infinity.
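As a concrete illustration, here is a minimal numpy sketch (an assumption of mine, not the paper's exact architecture or parameterisation) of the empirical, finite-width tangent kernel \Theta(x,x') = \nabla_\theta f(x)\cdot\nabla_\theta f(x') for a one-hidden-layer network with 1/\sqrt{m} output scaling. As the width m grows, the kernel values computed at random initialisation concentrate.

```python
import numpy as np

def empirical_ntk(x1, x2, width=10000, seed=0):
    """Empirical NTK of f(x) = (1/sqrt(m)) * sum_j v_j * tanh(w_j * x),
    with w_j, v_j i.i.d. standard normal at initialization (illustrative setup)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(width)   # first-layer weights
    v = rng.standard_normal(width)   # output-layer weights

    def grad(x):
        # Gradient of the scalar output with respect to all parameters (w, v).
        h = np.tanh(w * x)
        dv = h / np.sqrt(width)                        # df/dv_j
        dw = v * (1.0 - h ** 2) * x / np.sqrt(width)   # df/dw_j
        return np.concatenate([dw, dv])

    # NTK entry = inner product of parameter gradients at the two inputs.
    return grad(x1) @ grad(x2)

# With large width the value barely changes across random initializations.
print(empirical_ntk(0.5, 1.0, seed=0))
print(empirical_ntk(0.5, 1.0, seed=1))
```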

Continue reading “Neural Tangent Kernel”

Stochastic Control Notes Update – 2021

I’ve updated the notes for this year’s stochastic control course, here:

Aside from general tidying, new material includes:

  • Equilibrium distributions of Markov chains
  • Occupancy measure of infinite time horizon MDPs
  • Linear programming as an algorithm for solving MDPs
  • Convergence of Asynchronous Value Iteration
  • (s,S) Inventory Control
  • POMDP (though there is still more to add)
  • Calculus of Variations (though there is still more to add)
  • Pontryagin’s Maximum Principle
  • Linear-Quadratic Lyapunov functions (Sylvester’s equation and Hurwitz matrices)
  • (some) Online Convex Optimization
  • Stochastic Bandits (UCB and Lai-Robbins Lower bound)
  • Gittins’ Index Theorem
  • Sequential/Stochastic Linear Regression (Lai and Wei)
  • More discussion on TD methods
  • Discussion on double Q-learning and Dueling/Advantage updating
  • Convergence proof for SARSA
  • Policy Gradients (some convergence arguments from Bhandari and Russo, but still more to do)
  • Cross Entropy Method (but still more to do)
  • Several new appendices (but mostly from old notes)

Like last year, I will likely update the notes further (and correct typos) towards the end of the course.

Stochastic Finite Arm Bandits

We discuss a canonical multi-arm bandit setting: the stochastic multi-arm bandit with finitely many arms and bounded rewards. We let \mathcal A=\{1,...,N\} be the set of arms, indexed by a=1,...,N. If you play arm a at time t\in\mathbb N, you receive a reward r_t(a), where the rewards r_t(a) are independent and identically distributed in t. However, the reward distribution may differ between arms.

We let \bar r(a) be the mean reward of arm a. We want to find the arm with the highest mean reward.
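For concreteness, here is a small Python sketch of this setting with Bernoulli arms, played with the standard UCB1 index rule; the Bernoulli rewards and the exploration constant are illustrative choices, not necessarily the ones analysed in the post.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])   # \bar r(a) for each arm (unknown to the learner)
N, T = len(means), 10000

counts = np.zeros(N)    # number of times each arm has been played
totals = np.zeros(N)    # cumulative reward per arm

for t in range(T):
    if t < N:
        a = t                                            # play every arm once to initialize
    else:
        # UCB1 index: empirical mean plus an exploration bonus.
        ucb = totals / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    r = rng.binomial(1, means[a])                        # bounded (Bernoulli) reward
    counts[a] += 1
    totals[a] += r

print("empirical means:", totals / counts)
print("plays per arm:  ", counts)   # plays concentrate on the best arm
```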

Continue reading “Stochastic Finite Arm Bandits”

Perturbation of Markov Matrices

We consider the impact on the stationary distribution of perturbing a Markov chain’s transition matrix. This argument is due to Seneta. We show that if P and \tilde P are the transition matrices of irreducible finite-state Markov chains with stationary distributions \pi and \tilde \pi, then we have the following bound.

Theorem.

\displaystyle || \pi - \tilde \pi ||_{TV} \leq \frac{1}{1- \tau(P)} \sum_{x} \left\| P_{x, \cdot} -\tilde P_{x, \cdot} \right\|_{TV}

where \tau(P) is the maximum total variation distance between the rows of P.
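The bound is easy to check numerically. Below is a small Python sketch on a 3-state chain; it assumes the convention \|p-q\|_{TV} = \tfrac12\sum_i|p_i-q_i| and takes \tau(P) to be the maximum total variation distance between pairs of rows of P.

```python
import numpy as np

def stationary(P):
    # Solve pi P = pi, sum(pi) = 1, via the left eigenvector for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# A small perturbation of P that is still a transition matrix.
Pt = P + np.array([[ 0.02, -0.01, -0.01],
                   [-0.01,  0.02, -0.01],
                   [ 0.00, -0.02,  0.02]])

pi, pit = stationary(P), stationary(Pt)
tau = max(tv(P[x], P[y]) for x in range(3) for y in range(3))   # ergodicity coefficient
lhs = tv(pi, pit)
rhs = sum(tv(P[x], Pt[x]) for x in range(3)) / (1.0 - tau)
print(f"||pi - pi~||_TV = {lhs:.4f} <= {rhs:.4f}")
```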

Continue reading “Perturbation of Markov Matrices”

Exponential Families

Exponential families are a particularly tractable, yet broad, class of probability distributions. They are tractable because of a particularly nice [Fenchel] duality relationship between natural parameters and moment parameters. Moment parameters can be estimated by taking the empirical mean of the sufficient statistics, and the duality relationship can then recover an estimate of the distribution’s natural parameters.
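As a minimal worked example, take the Bernoulli family with sufficient statistic T(x)=x, natural parameter \theta = \log(p/(1-p)) and log-partition A(\theta) = \log(1+e^\theta). The moment parameter \mu = A'(\theta) is estimated by the empirical mean of the sufficient statistic, and the natural parameter is recovered by inverting the gradient map, here \theta = \log(\mu/(1-\mu)). A short Python sketch of this recipe:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli(p) as an exponential family:
#   T(x) = x, theta = log(p/(1-p)), A(theta) = log(1 + e^theta), mu = A'(theta) = p.
theta_true = 1.2
p_true = 1.0 / (1.0 + np.exp(-theta_true))

x = rng.binomial(1, p_true, size=100000)

mu_hat = x.mean()                           # estimate of the moment parameter
theta_hat = np.log(mu_hat / (1 - mu_hat))   # duality: invert mu = A'(theta)

print(f"true theta = {theta_true:.3f}, estimated theta = {theta_hat:.3f}")
```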

Continue reading “Exponential Families”

A Short Discussion on Policy Gradients in Bandits

This is a quick note on policy gradients in bandit problems. There are a number of excellent papers coming out on the convergence of policy gradients [1, 2, 3]. I wrote this sketch argument a few months ago; lockdown and homeschooling started and I found I did not have time to pursue it. Then yesterday morning I came across the paper of Mei et al. [1], which uses a different Lyapunov argument to draw essentially the same conclusion [first for bandits and then for general RL]. Although most of the time I don’t talk about my own research on this blog, I wrote this short note here just in case this different Lyapunov argument is of interest.

Further, all these results on policy gradients are [thus far] for deterministic models. I wrote a short paper that deals with a stochastic regret bound for a policy gradient algorithm much closer to projected gradient descent, again in the bandit setting. I’ll summarize this, as it might be possible to use it to derive stochastic regret bounds for policy gradients more generally. At the end, I discuss some issues around getting the proofs to work in the stochastic case and emphasize the point that 1/q^\star is a good choice of Lyapunov function [where q^\star is the probability of not choosing the optimal action].
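For reference, here is a generic sketch of a softmax (REINFORCE-style) policy gradient on a stochastic bandit; this is not the projected-gradient algorithm from the paper, just the standard score-function update, with q^\star tracked at the end as the probability of not choosing the optimal arm. The arm means and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.8])    # arm means; the last arm is optimal
theta = np.zeros(len(means))         # softmax policy parameters
eta = 0.05                           # step size

for t in range(20000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                   # softmax policy
    a = rng.choice(len(means), p=pi)
    r = rng.binomial(1, means[a])    # stochastic (Bernoulli) reward
    # Score-function gradient: grad_theta log pi(a) = e_a - pi.
    grad_log = -pi.copy()
    grad_log[a] += 1.0
    theta += eta * r * grad_log

q_star = 1.0 - pi[np.argmax(means)]  # probability of not choosing the optimal arm
print("policy:", np.round(pi, 3), " q* =", round(q_star, 4))
```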

Continue reading “A Short Discussion on Policy Gradients in Bandits”