- High-level idea: Policy Improvement and Policy Evaluation.
- Value Iteration; Policy Iteration.
- Temporal Differences; Q-factors.
- Positive Programming, Negative Programming & Discounted Programming.
- Optimality Conditions.
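As an illustration of the first two topics above, here is a minimal sketch of value iteration on a small hypothetical MDP (the transition probabilities, rewards, and discount factor are made up for illustration, not taken from the notes):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a][s][s'] = transition probability, R[a][s] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9  # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality update: V(s) = max_a [ R(a,s) + gamma * sum_s' P(a,s,s') V(s') ]
    Q = R + gamma * (P @ V)        # shape (actions, states)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=0)  # greedy policy with respect to the converged values
```

Policy iteration differs in that it alternates a full policy-evaluation solve with a greedy policy-improvement step, rather than folding both into one Bellman update.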
We are interested in solving the constrained optimization problem
We prove a powerful inequality which provides very tight Gaussian-type tail bounds for probabilities on product state spaces. Talagrand's inequality has found numerous applications in probability and combinatorial optimization and, when it can be applied, it generally outperforms bounds such as the Azuma-Hoeffding inequality.
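For reference, the two bounds being compared can be stated in their standard forms (these statements are the textbook versions, not reproduced from the notes):

```latex
% Azuma--Hoeffding: for a martingale (X_k) with |X_k - X_{k-1}| \le c_k,
\mathbb{P}\big(|X_n - X_0| \ge t\big)
  \le 2\exp\!\left(-\frac{t^2}{2\sum_{k=1}^{n} c_k^2}\right).

% Talagrand's convex distance inequality: for a product measure \mathbb{P}
% on a product space and any measurable set A,
\mathbb{P}(A)\,\mathbb{P}\big(d_T(X, A) \ge t\big) \le e^{-t^2/4},
% where d_T denotes Talagrand's convex distance to the set A.
```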
We show that relative entropy decreases along the evolution of a continuous-time Markov chain.
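A minimal statement of this fact, under standard assumptions (finite state space, chain with generator $Q$), is the following:

```latex
% For two probability flows solving the Kolmogorov forward equation
% \dot p_t = p_t Q and \dot q_t = q_t Q, the relative entropy
D(p_t \,\|\, q_t) = \sum_{x} p_t(x) \log \frac{p_t(x)}{q_t(x)}
% is nonincreasing:
\qquad \frac{d}{dt}\, D(p_t \,\|\, q_t) \le 0 .
% In particular, taking q_0 = \pi stationary gives that D(p_t \,\|\, \pi)
% decreases over time.
```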
In the Cross-Entropy Method, we wish to estimate the likelihood
$$\ell = \mathbb{P}\big(S(X) \geq \gamma\big) = \mathbb{E}\big[\mathbb{1}[S(X) \geq \gamma]\big].$$
Here $X$ is a random variable whose distribution is known and belongs to a parametrized family of densities $f(\cdot\,; v)$. Further, the level $\gamma$ is often a solution to an optimization problem.
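A sketch of the method for a toy rare-event problem: estimating $\ell = \mathbb{P}(X \geq \gamma)$ for a standard Gaussian $X$ with a large threshold. The Gaussian family, the threshold $\gamma = 4$, the elite fraction, and all numerical choices below are illustrative assumptions, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 4.0           # rare-event threshold (hypothetical)
N, rho = 10_000, 0.1  # sample size and elite fraction

# Tilting parameter v of the proposal N(v, 1); the true density is N(0, 1).
v = 0.0
for _ in range(20):
    x = rng.normal(v, 1.0, N)
    elite = np.sort(x)[-int(rho * N):]   # best rho-fraction of samples
    gamma_t = min(gamma, elite[0])       # adaptive level, raised each round
    # CE update: for a Gaussian family, the likelihood-ratio-weighted mean
    # of the samples above the current level solves the cross-entropy program.
    w = np.exp(-x * v + v**2 / 2)        # likelihood ratio N(0,1) / N(v,1)
    mask = x >= gamma_t
    v = np.sum(w[mask] * x[mask]) / np.sum(w[mask])
    if gamma_t >= gamma:
        break

# Final importance-sampling estimate of ell = P(X >= gamma)
x = rng.normal(v, 1.0, N)
w = np.exp(-x * v + v**2 / 2)
ell_hat = np.mean(w * (x >= gamma))
```

Plain Monte Carlo would need on the order of $10^5$ samples just to see one such event; the tilted proposal makes the event common and reweights by the likelihood ratio.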
We consider the setting of sequentially optimizing the average of a sequence of convex functions, the so-called online convex optimization problem.
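A minimal sketch of one standard algorithm in this setting, online gradient descent, run against a hypothetical sequence of quadratic losses $f_t(x) = (x - z_t)^2$ (the targets $z_t$ and step sizes are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(1.0, 0.1, 100)   # adversary's sequence of targets (toy data)

x = 0.0
total_loss = 0.0
for t, z_t in enumerate(z, start=1):
    total_loss += (x - z_t) ** 2       # suffer the loss first...
    grad = 2.0 * (x - z_t)             # ...then observe the gradient
    x -= grad / (2.0 * np.sqrt(t))     # step size ~ 1/sqrt(t) gives O(sqrt(T)) regret

# Regret against the best fixed decision in hindsight (here, the mean of z).
best_loss = float(np.sum((z - z.mean()) ** 2))
regret = total_loss - best_loss
```

The quantity of interest is exactly this regret: the gap between the losses accumulated online and those of the best single decision chosen with full hindsight.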