# Bayesian Online Learning

We briefly describe an online Bayesian framework, sometimes referred to as Assumed Density Filtering (ADF), and review a heuristic proof of its convergence in the Gaussian case.

For data $D_t = (y_1, \dots, y_t)$, parameter $\theta$, and new data point $y_t$, Bayes' rule gives

$$p(\theta \mid D_t) = \frac{P(y_t \mid \theta)\, p(\theta \mid D_{t-1})}{\int P(y_t \mid \theta')\, p(\theta' \mid D_{t-1})\, d\theta'}.$$

ADF suggests projecting the posterior at time $t$ onto a family of distributions indexed by a parameter (vector) $par(t)$. This gives a routine consisting of the following two steps (see [Opper] for the main reference article).

Update:

$$\hat p_t(\theta) = \frac{P(y_t \mid \theta)\, p(\theta, par(t))}{\int P(y_t \mid \theta')\, p(\theta', par(t))\, d\theta'}.$$

Project:

$$par(t+1) = \operatorname{arg\,min}_{par} D\left(\hat p_t \,\|\, p(\cdot, par)\right).$$

Here $D(p \| q) = \int p(\theta) \ln \frac{p(\theta)}{q(\theta)}\, d\theta$ is the KL-divergence of distributions $p$ and $q$.

Remark. Note that for exponential families of distributions, $p(\theta, par) \propto \exp\left(\sum_k \lambda_k f_k(\theta)\right)$, the minimization above reduces to matching the moments $\mathbb E f_k(\theta)$ of $\hat p_t$ and $p(\cdot, par(t+1))$.
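For the Gaussian family the moments to match are the mean and covariance. The following sketch (my own illustration, not from the paper) checks numerically that the moment-matched Gaussian minimizes $D(\hat p \| q)$ for a bimodal target:

```python
import math

def normal_pdf(x, m, v):
    """Density of N(m, v) at x."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def kl_to_gaussian(p, xs, dx, m, v):
    """Numerically evaluate D(p || N(m, v)) on the grid xs with spacing dx."""
    kl = 0.0
    for x, px in zip(xs, p):
        if px > 0:
            log_q = -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            kl += px * (math.log(px) - log_q) * dx
    return kl

# Target density: an equal mixture of N(-1, 0.25) and N(2, 0.25).
dx = 0.01
xs = [-8.0 + i * dx for i in range(1601)]
p = [0.5 * normal_pdf(x, -1.0, 0.25) + 0.5 * normal_pdf(x, 2.0, 0.25) for x in xs]

# Moment matching: mean 0.5, variance 0.25 + 0.5*1.5**2 + 0.5*1.5**2 = 2.5.
m_star = sum(px * x * dx for x, px in zip(xs, p))
v_star = sum(px * (x - m_star) ** 2 * dx for x, px in zip(xs, p))
kl_star = kl_to_gaussian(p, xs, dx, m_star, v_star)
# Perturbing either moment away from its matched value only increases the KL.
```

Note the direction of the divergence: minimizing $D(\hat p \| q)$ over $q$ spreads the Gaussian to cover both modes, unlike $D(q \| \hat p)$, which would tend to lock onto a single mode.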

Let's assume that $p(\theta, par)$ is normally distributed with mean $\hat \theta$ and covariance matrix $C$, so that $par = (\hat\theta, C)$.

Under this assumption one can argue that $\hat \theta$ obeys the recursion

$$\hat\theta_i(t+1) = \hat\theta_i(t) + \sum_j C_{ij}(t)\, \partial_j \ln \mathbb E_u\!\left[P(y_t \mid \hat\theta(t) + u)\right] \qquad (1)$$

and $C(t)$ obeys the recursion

$$C_{ij}(t+1) = C_{ij}(t) + \sum_{k,l} C_{ik}(t)\, \partial_k \partial_l \ln \mathbb E_u\!\left[P(y_t \mid \hat\theta(t) + u)\right] C_{lj}(t). \qquad (2)$$

Here $u$ is normal with mean zero and covariance $C(t)$. The partial derivative $\partial_j$ above is taken with respect to the $j$th component of $\hat\theta(t)$.
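As an illustration (my own sketch, not from the paper), the recursions (1) and (2) can be run in closed form for a one-dimensional Gaussian likelihood, since then $\mathbb E_u P(y \mid \hat\theta + u) = N(y;\, \hat\theta,\, \sigma^2 + C)$:

```python
import random

def adf_gaussian_mean(ys, sigma2=1.0, theta0=0.0, c0=10.0):
    """Run recursions (1)-(2) for the 1-D model P(y | theta) = N(y; theta, sigma2).

    For this likelihood the Gaussian average is available in closed form:
    E_u[P(y | theta_hat + u)] = N(y; theta_hat, sigma2 + C), whose log has
    first derivative (y - theta_hat) / (sigma2 + C) and second derivative
    -1 / (sigma2 + C) with respect to theta_hat.
    """
    theta_hat, C = theta0, c0
    for y in ys:
        s = sigma2 + C
        theta_hat = theta_hat + C * (y - theta_hat) / s  # recursion (1)
        C = C + C * (-1.0 / s) * C                       # recursion (2)
    return theta_hat, C

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(500)]
theta_hat, C = adf_gaussian_mean(data)
# theta_hat approaches the true mean 2.0 while C shrinks toward zero.
```

For this conjugate model the Gaussian family is closed under the Bayes update, so the projection step is exact and ADF reproduces the exact posterior.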

### Quick Justification of (1) and (2)

Note that the mean after the update step is

$$\hat\theta_i(t+1) = \frac{\mathbb E_u\!\left[(\hat\theta_i(t) + u_i)\, P(y_t \mid \hat\theta(t) + u)\right]}{\mathbb E_u\!\left[P(y_t \mid \hat\theta(t) + u)\right]} = \hat\theta_i(t) + \sum_j C_{ij}(t)\, \partial_j \ln \mathbb E_u\!\left[P(y_t \mid \hat\theta(t) + u)\right],$$

where the second equality uses the Gaussian integration-by-parts identity $\mathbb E_u[u_i\, g(u)] = \sum_j C_{ij}\, \mathbb E_u[\partial_j g(u)]$. A similar calculation for the second moment gives the other expression, (2), on $C$.
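The integration-by-parts identity is easy to check by Monte Carlo. The sketch below (my own, one-dimensional, with the arbitrary test function $g(u) = \sin u$) is only an illustration:

```python
import math
import random

# Monte Carlo check of the Gaussian integration-by-parts (Stein) identity
# E[u g(u)] = C * E[g'(u)] for u ~ N(0, C), with the test function g(u) = sin(u).
random.seed(3)
C = 2.0
n = 500_000
lhs = rhs = 0.0
for _ in range(n):
    u = random.gauss(0.0, math.sqrt(C))
    lhs += u * math.sin(u)   # estimates E[u g(u)]
    rhs += C * math.cos(u)   # estimates C * E[g'(u)]
lhs /= n
rhs /= n
# Both sides estimate C * exp(-C/2) = 2/e ≈ 0.7358.
```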

For large $t$, averaging the update (2) over $y$ and passing to a continuum approximation gives the differential equation

$$\frac{dC}{dt} = C\; \mathbb E_{Q}\!\left[\nabla\nabla^{\top} \ln \mathbb E_u\, P(y \mid \hat\theta(t) + u)\right] C.$$

We assume $y$ is drawn IID from a distribution $Q(y)$, and that there is an attractive fixed point $\theta^*$ satisfying

$$\mathbb E_{Q}\!\left[\partial_j \ln P(y \mid \theta^*)\right] = 0 \quad \text{for all } j. \qquad (3)$$

So, writing $\hat\theta(t) \approx \theta^*$ and dropping the Gaussian perturbation $u$,

$$\frac{dC}{dt} \approx -\,C\, J(\theta^*)\, C, \qquad J(\theta^*) := -\,\mathbb E_{Q}\!\left[\nabla\nabla^{\top} \ln P(y \mid \theta^*)\right].$$

The last approximation, which removes the normally distributed perturbation $u$, needs justifying. Identifying $J(\theta^*)$ with the Fisher information assumes that $Q(y) = P(y \mid \theta^*)$; in the case where they are not equal, i.e. when the model is misspecified, we just put some matrix $A$ in place of $J(\theta^*)$.

In principle $J(\theta^* + u)$ should not be too far from $J(\theta^*)$, because the differential equation above implies

$$\frac{d}{dt} C^{-1} = -\,C^{-1}\, \frac{dC}{dt}\, C^{-1} \approx \text{const},$$

so that $C(t) \approx \frac1t\, \text{const}^{-1}$. Thus $C(t)$, the covariance of $u$, goes to zero at rate $\frac1t$, justifying the approximation at $u = 0$. From the above we see that "const" is $A$ (or $J(\theta^*)$ if $Q(y) = P(y \mid \theta^*)$). So

$$C(t) \approx \frac1t\, A^{-1}.$$

Next we start to analyse the error $\epsilon(t) = \hat\theta(t) - \theta^*$. Opper notes that by (1) and then a Taylor expansion around $\theta^*$,

$$\epsilon(t+1) \approx \epsilon(t) + C(t)\left[\nabla \ln P(y_t \mid \theta^*) - A\, \epsilon(t)\right].$$

Next we see that, using $C(t) \approx \frac1t A^{-1}$,

$$\epsilon(t+1) \approx \left(1 - \tfrac1t\right)\epsilon(t) + \tfrac1t\, A^{-1}\, \nabla \ln P(y_t \mid \theta^*),$$

which telescopes to

$$\epsilon(t) \approx \frac1t \sum_{s \le t} A^{-1}\, \nabla \ln P(y_s \mid \theta^*).$$

The sum on the right-hand side goes to zero because of (3): its expectation vanishes, and the law of large numbers drives the empirical average to that expectation. So we get $\epsilon(t) \to 0$.

It is also possible to analyze $E_{ij}(t) = \mathbb E_{Q}\left[\epsilon_i(t)\, \epsilon_j(t)\right]$. The above expressions give (again using (3), which kills the cross term)

$$E(t+1) \approx \left(1 - \tfrac2t\right) E(t) + \tfrac1{t^2}\, A^{-1} B A^{-1}, \qquad B := \mathbb E_{Q}\!\left[\nabla \ln P(y \mid \theta^*)\, \nabla \ln P(y \mid \theta^*)^{\top}\right],$$

which is solved by

$$E(t) \approx \frac1t\, A^{-1} B A^{-1}.$$

This is actually the same convergence as expected of MLE estimates; when $Q(y) = P(y \mid \theta^*)$ we have $A = B = J(\theta^*)$, so $E(t) \approx \frac1t J(\theta^*)^{-1}$.
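As a numerical sanity check of the heuristic (my own sketch, using a 1-D Gaussian likelihood for which the recursions are available in closed form), both $C(t)$ and the averaged squared error should shrink like $\frac1t J(\theta^*)^{-1}$, with $J(\theta^*) = 1/\sigma^2 = 1$ here:

```python
import random

def adf_gaussian_mean(ys, sigma2=1.0, theta0=0.0, c0=10.0):
    # Recursions (1)-(2) for P(y | theta) = N(y; theta, sigma2), where
    # E_u[P(y | theta_hat + u)] = N(y; theta_hat, sigma2 + C) in closed form.
    theta_hat, C = theta0, c0
    for y in ys:
        s = sigma2 + C
        theta_hat += C * (y - theta_hat) / s
        C += C * (-1.0 / s) * C
    return theta_hat, C

random.seed(2)
theta_star, t, runs = 1.5, 2000, 300
eps2 = []
for _ in range(runs):
    ys = [random.gauss(theta_star, 1.0) for _ in range(t)]
    theta_hat, C = adf_gaussian_mean(ys)
    eps2.append((theta_hat - theta_star) ** 2)
mean_eps2 = sum(eps2) / runs
# With J(theta*) = 1/sigma2 = 1, the heuristic predicts both
# t * C -> 1 and t * E[eps^2] -> 1.
```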

### Literature

This is based on reading Opper and Winther:

Opper, Manfred, and Ole Winther. “A Bayesian approach to on-line learning.” On-line learning in neural networks (1998): 363-378.