Bayesian Online Learning

We briefly describe an Online Bayesian Framework which is sometimes referred to as Assumed Density Filer (ADF). And we review a heuristic proof of its convergence in the Gaussian case.

Bayes Rule gives

For data D_t, parameter \theta and new data point y_t.

ADF suggests projecting data at time t to a parameter (vector) par(t). This gives a routine that consists of the following two steps. (See [Opper] for the main reference article)



Here D( p|| q) is the KL-divergence of distributions p and q

Remark. Note that for exponential families of distributions:

then matching moments of \mathbb E_{\theta} f_k(\theta) gives the minimization of the above.

Let’s assumes that p(\theta,par) is a normally distributed with mean \hat \theta and covariance matrix C.

Under this one can argue that \hat \theta obeys the recursion


and C(t) obeys the recursion:


Here u is normal with mean zero and covariance C(t). The partial derivative, \partial_j, above is taken with respect to the jth component of \hat \theta(t) .

Quick Justification of (1) and (2)

Note that

A similar calculation gives the other expression on C.


This gives the differential equation

This implies


We assumes y is drawn IID from a distribution Q(y). We assumes there is an attractive fixed point \theta^* satisfying



The last approximation that removes the normal distribution error needs justifying. The inequality with J(\theta^*) assumes that Q(y) = P(y | \theta^*) (in the case where they are not equal – i.e. when the model is miss specified – we just puts in some matrix A instead of J(\theta^*))

In principle J(\theta^*) should not be too far from J(\theta^*+u), because

imply that

so C(t) the variance of u goes to zero at rate \frac{1}{t} justifying the approximation for u=0. From the above we see that “const” is A (or J(\theta^*) if the Q(y) =P(y|\theta^*))). So

Next we start to analyse the error:

He notes that by and then a Taylor expansion that

Screenshot 2019-03-22 at 11.25.10.png

Next we see that using

The sum on the right-hand side goes to zero because of . So we get

It is also possible to analyze

E_{ij} = \mathbb E_{Q} [ \epsilon_i(t) \epsilon_j(t) ] . The above expressions give

(again using ) which is solved by

This is actually the same convergence as expected by MLE estimates.


This is based on reading Opper and Winther:

Opper, Manfred, and Ole Winther. “A Bayesian approach to on-line learning.” On-line learning in neural networks (1998): 363-378.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: