Bayesian Online Learning

We briefly describe an Online Bayesian Framework which is sometimes referred to as Assumed Density Filer (ADF). And we review a heuristic proof of its convergence in the Gaussian case.

Bayes Rule gives

For data $D_t$, parameter $\theta$ and new data point $y_t$.

ADF suggests projecting data at time $t$ to a parameter (vector) $par(t)$. This gives a routine that consists of the following two steps. (See [Opper] for the main reference article)

Update:

Project:

Here $D( p|| q)$ is the KL-divergence of distributions $p$ and $q$

Remark. Note that for exponential families of distributions:

then matching moments of $\mathbb E_{\theta} f_k(\theta)$ gives the minimization of the above.

Let’s assumes that $p(\theta,par)$ is a normally distributed with mean $\hat \theta$ and covariance matrix $C$.

Under this one can argue that $\hat \theta$ obeys the recursion

(1)

and $C(t)$ obeys the recursion:

(2)

Here $u$ is normal with mean zero and covariance $C(t)$. The partial derivative, $\partial_j$, above is taken with respect to the $j$th component of $\hat \theta(t)$ .

Quick Justification of (1) and (2)

Note that

A similar calculation gives the other expression on $C$.

For

This gives the differential equation

This implies

because

We assumes $y$ is drawn IID from a distribution $Q(y)$. We assumes there is an attractive fixed point $\theta^*$ satisfying

(3)

So

The last approximation that removes the normal distribution error needs justifying. The inequality with $J(\theta^*)$ assumes that $Q(y) = P(y | \theta^*)$ (in the case where they are not equal – i.e. when the model is miss specified – we just puts in some matrix $A$ instead of $J(\theta^*)$)

In principle $J(\theta^*)$ should not be too far from $J(\theta^*+u)$, because

imply that

so $C(t)$ the variance of $u$ goes to zero at rate $\frac{1}{t}$ justifying the approximation for $u=0$. From the above we see that “const” is $A$ (or $J(\theta^*)$ if the $Q(y) =P(y|\theta^*)$)). So

Next we start to analyse the error:

He notes that by and then a Taylor expansion that

Next we see that using

The sum on the right-hand side goes to zero because of . So we get

It is also possible to analyze

$E_{ij} = \mathbb E_{Q} [ \epsilon_i(t) \epsilon_j(t) ]$. The above expressions give

(again using ) which is solved by

This is actually the same convergence as expected by MLE estimates.

Literature

This is based on reading Opper and Winther:

Opper, Manfred, and Ole Winther. “A Bayesian approach to on-line learning.” On-line learning in neural networks (1998): 363-378.