Probability – Applied Probability Notes

Markov chains and matrices — A Quick Summary

Probabilities for a Markov chain can be expressed in terms of the probability vector $\boldsymbol{\lambda}$ and the transition matrix $P$.

For example, for $x, x_0, \ldots, x_t \in \mathcal{X}$ and a function $r : \mathcal{X} \rightarrow \mathbb{R}$,

$\displaystyle \mathbb{P}(X_0 = x_0, \ldots, X_t = x_t) = \lambda_{x_0} P_{x_0 x_1} \cdots P_{x_{t-1} x_t}$

$\displaystyle \mathbb{P}(X_t = x) = [\boldsymbol{\lambda} P^t]_x$

$\displaystyle \mathbb{E}_x[r(X_t)] = (P^t \boldsymbol{r})_x$

So essentially, multiplying $P^t$ on the left (by a row vector) gives probabilities, and multiplying it on the right (by a column vector) gives expectations.

Notation explanation. If the above is unclear, some notation may need explaining. Here $P^t = P \times P \times \cdots \times P$ is the product of $t$ copies of the matrix $P$.¹ Interpreting $\boldsymbol{\lambda}$ as a row vector, $\boldsymbol{\lambda} P^t$ is a row vector, and $[\boldsymbol{\lambda} P^t]_x$ is the $x$-th component of that row vector.

Similarly, we can think of the function $r : \mathcal{X} \rightarrow \mathbb{R}$ as a column vector $\boldsymbol{r} = (r(x) : x \in \mathcal{X})$. Then $P^t \boldsymbol{r}$ is a column vector under matrix multiplication, and $(P^t \boldsymbol{r})_x$ denotes its $x$-th component.

Throughout these notes, $\mathbb{P}_x(\cdot) = \mathbb{P}(\cdot \mid X_0 = x)$ and $\mathbb{E}_x[\cdot] = \mathbb{E}[\cdot \mid X_0 = x]$.
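As a small illustration of the left/right multiplication rule, here is a sketch in plain Python. The two-state chain, its transition probabilities, and the helper names (`row_times_matrix`, `matrix_times_col`, `distribution_at`, `expectation_at`) are all assumptions made up for this example; they do not come from the notes.

```python
# Hypothetical two-state chain on X = {0, 1}; the numbers are illustrative only.
P = [[0.9, 0.1],
     [0.5, 0.5]]   # transition matrix, each row sums to 1
lam = [1.0, 0.0]   # initial distribution as a row vector: start in state 0
r = [0.0, 1.0]     # the function r as a column vector: r(0) = 0, r(1) = 1


def row_times_matrix(v, M):
    """Row vector times matrix: (v M)_y = sum_x v_x M_xy."""
    return [sum(v[x] * M[x][y] for x in range(len(v))) for y in range(len(M[0]))]


def matrix_times_col(M, c):
    """Matrix times column vector: (M c)_x = sum_y M_xy c_y."""
    return [sum(M[x][y] * c[y] for y in range(len(c))) for x in range(len(M))]


def distribution_at(lam, P, t):
    """Return lambda P^t, whose x-th entry is P(X_t = x)."""
    dist = list(lam)
    for _ in range(t):
        dist = row_times_matrix(dist, P)
    return dist


def expectation_at(P, r, t):
    """Return P^t r, whose x-th entry is E_x[r(X_t)]."""
    vals = list(r)
    for _ in range(t):
        vals = matrix_times_col(P, vals)
    return vals


t = 2
dist = distribution_at(lam, P, t)   # [P(X_2 = 0), P(X_2 = 1)] started from lam
expect = expectation_at(P, r, t)    # [E_0[r(X_2)], E_1[r(X_2)]]
print(dist, expect)
```

Because $r$ here is the indicator of state 1 and the chain starts in state 0, the entry `expect[0]` should agree with `dist[1]` up to floating-point error, which is a quick sanity check that the two formulas are consistent.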

¹ We use $\top$ for transpose, not $t$ .

Zero-Order Stochastic Optimization: Kiefer-Wolfowitz

We want to optimize the expected value of some random function. This is the problem we solved with Stochastic Gradient Descent. However, we now assume that we no longer have access to an unbiased estimate of the gradient; we can only obtain estimates of the function itself. In this case we can apply the Kiefer-Wolfowitz procedure.

The idea here is to replace the random gradient estimate used in stochastic gradient descent with a finite difference. If the increments used for these finite differences are sufficiently small, then over time convergence can be achieved. The approximation error for the finite difference has some impact on the rate of convergence.
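The finite-difference idea can be sketched in one dimension as follows. This is only a minimal illustration under stated assumptions: the noisy objective `noisy_f`, the step sizes $a_n = 1/n$, and the increments $c_n = n^{-1/3}$ are choices made for this example, not taken from the post.

```python
import random


def noisy_f(x, rng):
    """Hypothetical noisy objective: (x - 2)^2 observed with Gaussian noise."""
    return (x - 2.0) ** 2 + rng.gauss(0.0, 0.1)


def kiefer_wolfowitz(f, x0, n_steps, rng):
    """One-dimensional Kiefer-Wolfowitz iteration for minimisation (sketch)."""
    x = x0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n              # step size (assumed schedule)
        c_n = 1.0 / n ** (1 / 3)   # finite-difference increment (assumed schedule)
        # Central finite difference built from two noisy function evaluations;
        # this replaces the gradient estimate used in stochastic gradient descent.
        grad_est = (f(x + c_n, rng) - f(x - c_n, rng)) / (2.0 * c_n)
        x -= a_n * grad_est
    return x


rng = random.Random(0)
x_hat = kiefer_wolfowitz(noisy_f, x0=0.0, n_steps=5000, rng=rng)
print(x_hat)  # should land close to the minimiser x = 2
```

The increments shrink more slowly than the step sizes, so that the noise in the difference quotient (which is divided by $2 c_n$) stays under control while the finite-difference bias still vanishes.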

The Law of Large Numbers and Central Limit Theorem

Let’s explain why the normal distribution is so important.

(This is a section in the notes here.)

Continuous Probability Distributions

We consider distributions that have a continuous range of values. Discrete probability distributions were defined by a probability mass function. Analogously, continuous probability distributions are defined by a probability density function.

(This is a section in the notes here.)

Discrete Probability Distributions

There are some probability distributions that occur frequently, either because they have a particularly natural or simple construction, or because they arise as the limit of some simpler distribution. Here we cover:

Bernoulli random variables
Binomial distribution
Geometric distribution
Poisson distribution.
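
One standard instance of a distribution arising as a limit is the Poisson distribution as the limit of Binomial$(n, \mu/n)$ when $n$ grows. The sketch below checks this numerically at a single point; the mean `mu = 2.0`, the point `k = 3`, and the function names are arbitrary choices for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu)."""
    return exp(-mu) * mu ** k / factorial(k)


mu = 2.0  # illustrative mean, chosen arbitrarily
k = 3     # point at which the two mass functions are compared
gaps = []
for n in (10, 100, 1000, 10000):
    # Keep the mean n * p = mu fixed while n grows.
    gap = abs(binomial_pmf(k, n, mu / n) - poisson_pmf(k, mu))
    gaps.append(gap)
    print(n, gap)
```

The printed gap should shrink as `n` increases, which is the sense in which the Poisson distribution is a limit of binomial distributions.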

(This is a section in the notes here.)

Random Variables and Expectation

Often we are interested in the magnitude of an outcome as well as its probability. E.g. in a gambling game, the amount you win or lose is as important as the probability of each outcome.

(This is a section in the notes here.)

Conditional Probability

(This is a section in the notes here.)

Conditional probabilities are probabilities where we have assumed that another event has occurred.

Counting Principles

(This is a section in the notes here.)

Counting in Probability. If each outcome is equally likely, i.e. $\mathbb P( \omega ) = p$ for all $\omega \in \Omega$, then since

$\displaystyle 1 = \sum_{\omega \in \Omega } \mathbb P( \omega ) = \sum_{\omega \in \Omega } p = |\Omega | p$

(where $|\Omega|$ is the number of outcomes in the set $\Omega$), it must be that

$\displaystyle \mathbb P(\omega) = \frac{1}{| \Omega |} \qquad \text{for all } \omega \in \Omega .$
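
To make the counting rule concrete, here is a short sketch with two fair six-sided dice. The choice of sample space and of the event "the dice sum to 7" is purely illustrative; the only extra step beyond the formula above is summing the equal outcome probabilities over the event, which gives $|A| / |\Omega|$.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs from two fair six-sided dice (illustrative choice).
omega = list(product(range(1, 7), repeat=2))

# Each outcome is equally likely, so it has probability 1 / |Omega|.
p_outcome = Fraction(1, len(omega))

# Event A: the two dice sum to 7. Summing p_outcome over A gives |A| / |Omega|.
event = [w for w in omega if sum(w) == 7]
p_event = len(event) * p_outcome

print(p_outcome, p_event)  # 1/36 1/6
```

Using `Fraction` keeps the probabilities exact, so the result can be compared directly with a hand count.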

Probability and Set Operations

(This is a section in the notes here.)

We want to calculate probabilities for different events. Events are sets of outcomes, and we recall that there are various ways of combining sets. The current section is a bit abstract but will become more useful for concrete calculations later.
