Neural Tangent Kernel

The Neural Tangent Kernel is a way of understanding the training performance of Neural Networks by relating them to Kernel methods. Here we overview the results of the paper of Jacot, Gabriel and Hongler (https://arxiv.org/abs/1806.07572). The paper considers a deep neural network with a fixed amount of data and a fixed depth. The weights applied to neurons are initially independent and normally distributed. We take a limit where the width of each layer tends to infinity.

The main observations of the paper are the following:

  1. The function represented by a neural network changes according to a kernel when undergoing gradient descent training. This is called the Neural Tangent Kernel (NTK).
  2. Under a random normally distributed initialization, the NTK is random. However, in the infinite-width limit, this kernel is non-random and the output of each neuron is an independent Gaussian with zero mean and a fixed covariance.
  3. In the limit, these weights and thus this kernel will not change during training. This is called “Lazy training”.
  4. In this limiting regime, the neural network parameters converge quickly to a global minimum whose weights are arbitrarily close to those of the initialized neural network.

These results are significant as they give a way of understanding why neural networks converge to an optimal solution. (Neural networks are known to be highly non-convex objects and so understanding their convergence under training is highly non-trivial.) What the following argument does not (fully) explain is why neural networks are so expressive, why they generalize well to unseen data, or why in practice neural networks outperform kernel methods.

Why Lazy Training Helps Convergence.

We give a short heuristic explanation as to why lazy training helps us understand the convergence of neural networks. Suppose that the weights of a neural network remain close to their initial values. That is,

W(t) \approx W_0 := W(0) , \qquad \text{for all } t \geq 0 .
Now suppose we wish to solve

\min_W \; \frac{1}{2N} \sum_{i=1}^N \big\| y^{(i)} - F(W, x^{(i)}) \big\|^2 ,
where here (x^{(i)},y^{(i)}), i=1,...,N is our data.

Notice that under stochastic gradient descent W evolves (approximately) according to the o.d.e.

\frac{dW_p}{dt}(t) = \hat{\mathbb E}\Big[ \partial_{W_p} F\big(W(t),\hat x\big)\, \big(\hat y - F(W(t),\hat x)\big) \Big] ,

where here \hat{\mathbb E} denotes expectation with respect to the empirical distribution of our data. (I.e. (\hat x,\hat y) is selected uniformly at random from (x^{(i)},y^{(i)}), i=1,...,N.) Therefore, by the chain rule, we expect F({W(t)},x) to evolve as

\frac{d}{dt} F\big(W(t),x\big) = \hat{\mathbb E}\Big[ \Big\{ \sum_{p} \partial_{W_p} F\big(W(t),x\big)\, \partial_{W_p} F\big(W(t),\hat x\big) \Big\}\, \big(\hat y - F(W(t),\hat x)\big) \Big] ,

where the sum is over all the parameters p of the network.
The term in curly brackets defines a kernel:

K^{W}(x,\hat x) := \sum_{p} \partial_{W_p} F(W,x)\, \partial_{W_p} F(W,\hat x) = \nabla_W F(W,x) \cdot \nabla_W F(W,\hat x) .

This kernel is the Neural Tangent Kernel; it's the kernel that you get from the tangent of a neural network, \nabla_W F({W},x). Under the assumption that training is lazy this kernel should be constant, i.e.

K^{W(t)}(x,\hat x) \approx K^{W_0}(x,\hat x) \qquad \text{for all } t \geq 0 .
Now if the kernel defines a positive definite matrix on the data (i.e. if \hat K = (K^{W_0}(x^{(i)},x^{(j)})/N : i,j=1,...,N) is a positive definite matrix), then for x=x^{(i)} the evolution above becomes

\frac{d}{dt} F\big(W(t), x^{(i)}\big) = \frac{1}{N}\sum_{j=1}^N K^{W_0}\big(x^{(i)},x^{(j)}\big)\, \big( y^{(j)} - F(W(t),x^{(j)}) \big) . \qquad (*)

We can analyze the error between the neural network's estimate and the data, \epsilon(t) = ( F(W(t),x^{(i)}) - y^{(i)} : i =1,...,N ), for which (*) becomes

\frac{d\epsilon}{dt}(t) = -\hat K\, \epsilon(t) .

Since \hat K is positive definite,

\| \epsilon(t) \| \leq e^{-\lambda_{\min} t}\, \| \epsilon(0) \| ,

where \lambda_{\min}>0 is the smallest eigenvalue of \hat K.
Now we see that if the weights, and thus the Neural Tangent Kernel, remain (approximately) constant, then a neural network trained by gradient descent converges exponentially fast to a state with zero loss.
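To make the heuristic concrete, here is a minimal NumPy sketch (not from the paper) of the linearized dynamics \frac{d\epsilon}{dt} = -\hat K \epsilon: an arbitrary positive definite matrix stands in for the NTK Gram matrix \hat K, and the error norm decays at least as fast as e^{-\lambda_{\min} t}.

```python
import numpy as np

# A minimal sketch of the linearized training dynamics d(eps)/dt = -K_hat @ eps.
# K_hat below is an arbitrary positive definite matrix standing in for the NTK
# Gram matrix on the data; it is not computed from an actual network.
rng = np.random.default_rng(0)
N = 10
A = rng.normal(size=(N, N))
K_hat = A @ A.T / N + 0.1 * np.eye(N)       # positive definite by construction

lam_min = np.linalg.eigvalsh(K_hat)[0]      # smallest eigenvalue
eps = rng.normal(size=N)                    # initial error F(W(0), x_i) - y_i
eps0 = np.linalg.norm(eps)

dt, T = 1e-3, 5.0
for _ in range(int(T / dt)):
    eps = eps - dt * K_hat @ eps            # Euler step of d(eps)/dt = -K_hat eps

print("||eps(T)||                =", np.linalg.norm(eps))
print("exp(-lam_min T) ||eps(0)|| =", np.exp(-lam_min * T) * eps0)
# The first number should be no larger than the second (up to Euler error).
```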


Next we need to argue that weights and the neural tangent kernel remain approximately constant during training. But first we define a bit more notation.

Neural Network Model.

We consider a dense L-layer neural network. Here n_0,...,n_L give the number of activations in each layer. The activation functions on each layer are given by the same Lipschitz continuous function \sigma:\mathbb R \rightarrow \mathbb R.

We let w^{(l)}\in \mathbb R^{n_{l+1}\times (n_l+1)} be the weights (including a bias column) mapping layer l to layer l+1, and we let W=(w^{(0)},...,w^{(L-1)}) be the weights applied at each layer. We let P=\sum_{l=0}^{L-1} n_{l+1}(n_l+1) be the total number of weights.

Given the activations a^{(l)} = (a^{(l)}_j : j =1,...,n_l) for layer l, we define the pre-activations

z^{(l+1)}_i = \frac{1}{\sqrt{n_l}} \sum_{j=1}^{n_l} w^{(l)}_{ij}\, a^{(l)}_j + \beta\, w^{(l)}_{i0} , \qquad i=1,...,n_{l+1},

and then we define

a^{(l+1)}_i = \sigma\big( z^{(l+1)}_i \big) .

(Notice that above we rescale the weights by a factor of \frac{1}{\sqrt{n_l}} and the bias weight by \beta.)

Starting with inputs x= (x_i : i = 1,...,n_0) and defining a^{(0)}=x, we can recursively apply the above equations to get the output z^{(L)}. We define F(W,x) to be the function that maps inputs to outputs, i.e. for an input x\in\mathbb R^{n_0}, we recursively apply the above equations and output F(W,x) = z^{(L)} \in \mathbb R^{n_L}. 1
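For concreteness, here is a minimal NumPy sketch of the forward pass F(W,x) under this parameterization. The helper names (init_weights, forward), the widths, \beta=0.1 and the choice \sigma=\tanh are illustrative assumptions for this sketch, not anything fixed by the paper.

```python
import numpy as np

def init_weights(widths, rng):
    """Standard Gaussian weights; w[l] has shape (n_{l+1}, n_l + 1),
    the last column being the bias weight."""
    return [rng.normal(size=(widths[l + 1], widths[l] + 1))
            for l in range(len(widths) - 1)]

def forward(W, x, beta=0.1, sigma=np.tanh):
    """F(W, x): pre-activations z^{(l+1)} = w a / sqrt(n_l) + beta * w_bias,
    activations a^{(l+1)} = sigma(z^{(l+1)}), with a linear final layer."""
    a = np.asarray(x, dtype=float)
    for l, w in enumerate(W):
        n_l = a.shape[0]
        z = w[:, :-1] @ a / np.sqrt(n_l) + beta * w[:, -1]
        a = z if l == len(W) - 1 else sigma(z)   # linear activation on the output
    return a

rng = np.random.default_rng(0)
widths = [3, 500, 500, 1]                        # n_0, n_1, n_2, n_L
W = init_weights(widths, rng)
x = rng.normal(size=widths[0])
print(forward(W, x))
```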

We recall the definition of the Neural Tangent Kernel as given above

Definition [Neural Tangent Kernel] For a neural network with weights W, the neural tangent kernel K^W=(K^W_{ij}(x,x') \in \mathbb R : i,j=1,...,n_L,\ x,x'\in \mathbb R^{n_0}) is defined by

K^W_{ij}(x,x') = \sum_{p=1}^P \partial_{W_p} F_i(W,x)\, \partial_{W_p} F_j(W,x') ,

where here p=1,...,P indexes all the weights of our neural network.
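Continuing the sketch above (and reusing its init_weights and forward), here is one way to evaluate the kernel K^W_{ij}(x,x') numerically. The gradients \partial_{W_p} F_i are approximated by central finite differences purely to keep the example dependency-free; in practice one would use automatic differentiation, and the helper names here (flatten, grad_F, ntk) are just for this sketch.

```python
import numpy as np

# Assumes init_weights and forward from the previous sketch are in scope.

def flatten(W):
    return np.concatenate([w.ravel() for w in W])

def unflatten(theta, W):
    out, k = [], 0
    for w in W:
        out.append(theta[k:k + w.size].reshape(w.shape))
        k += w.size
    return out

def grad_F(W, x, eps=1e-5, **kw):
    """Jacobian dF_i/dW_p (shape n_L x P), by central finite differences."""
    theta = flatten(W)
    cols = []
    for p in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[p] += eps
        tm[p] -= eps
        cols.append((forward(unflatten(tp, W), x, **kw)
                     - forward(unflatten(tm, W), x, **kw)) / (2 * eps))
    return np.stack(cols, axis=1)

def ntk(W, x, xp, **kw):
    """K^W_{ij}(x, x') = sum_p dF_i(x)/dW_p * dF_j(x')/dW_p."""
    return grad_F(W, x, **kw) @ grad_F(W, xp, **kw).T

rng = np.random.default_rng(1)
W = init_weights([3, 50, 1], rng)          # keep widths small: 2P forward passes
x, xp = rng.normal(size=3), rng.normal(size=3)
print(ntk(W, x, xp))
```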


Neural Networks with Gaussian Weights.

The following shows how the initialization of the neural network affects the distribution of its output. We show that, in a multi-layer neural network initialized with Gaussian weights, each neuron has an output that is independent and Gaussian distributed when the width of each layer tends to infinity.

In this limit, each layer affects the covariance of the next layer in a deterministic way. We can use this to show that the initial Neural Tangent Kernel (which is random in the prelimit) tends to a non-random kernel.

Proposition 1. In the limit as n_1,...,n_{L-1}\rightarrow \infty, the output F(W,x) converges to a normal distribution where in the limit each component F_i(W,\cdot) is independent over i, has mean zero and has covariance \Sigma^{(L)}, which satisfies the recursion

\Sigma^{(1)}(x,x') = \frac{1}{n_0}\, x^\top x' + \beta^2 , \qquad \Sigma^{(l+1)}(x,x') = \mathbb E_{f \sim \mathcal N(0,\,\Sigma^{(l)})}\big[ \sigma(f(x))\, \sigma(f(x')) \big] + \beta^2 .
The Neural Tangent Kernel is such that K^W_{ij}(x,x') converges to zero for i\neq j, whereas for i=j the NTK converges to a non-random real-valued kernel K^{(L)} satisfying the recursion

K^{(1)}(x,x') = \Sigma^{(1)}(x,x') , \qquad K^{(l+1)}(x,x') = \Sigma^{(l+1)}(x,x') + K^{(l)}(x,x')\, \dot\Sigma^{(l+1)}(x,x') ,

where

\dot\Sigma^{(l+1)}(x,x') = \mathbb E_{f \sim \mathcal N(0,\,\Sigma^{(l)})}\big[ \dot\sigma(f(x))\, \dot\sigma(f(x')) \big] .
Proof. The result follows by induction on the number of layers L. For L=1 it is not too hard to check the above conditions, since each component \frac{1}{\sqrt{n_0}}\sum_j w^{(0)}_{ij} x_j + \beta\, w^{(0)}_{i0} is Gaussian and there is no limit to be taken.

Let's assume the induction hypothesis that, in the limit where n_1,...,n_{L-1}\rightarrow \infty, the output from level L, F^{(L)}(W,x), is independent mean-zero Gaussian with the covariance between inputs x and x' given by \Sigma^{(L)}(x,x'). Also assume that, with probability 1, the NTK satisfies

K^W_{ij}(x,x') \;\xrightarrow[\;n_1,...,n_{L-1}\rightarrow\infty\;]{}\; \delta_{ij}\, K^{(L)}(x,x') ,
where K^{(L)} is some deterministic kernel.

Knowing what happens at layer L, let's consider what happens at layer L+1 when n_L \rightarrow\infty. Recall that the output at level L+1 is

F_i^{(L+1)}(W,x) = \frac{1}{\sqrt{n_L}} \sum_{j=1}^{n_L} w^{(L)}_{ij}\, \sigma\big(F_j^{(L)}(W,x)\big) + \beta\, w^{(L)}_{i0} ,
and, if we differentiate with respect to a parameter indexed by p from one of the lower layers (those on which F^{(L)} depends), then (note that this is just backpropagation)

\partial_{W_p} F_i^{(L+1)}(W,x) = \frac{1}{\sqrt{n_L}} \sum_{j=1}^{n_L} w^{(L)}_{ij}\, \dot\sigma\big(F_j^{(L)}(W,x)\big)\, \partial_{W_p} F_j^{(L)}(W,x) .
Let's first analyse the output. Notice that if we condition on the weights up to layer L then the activations \sigma(F^{(L)}_j(W,x)) are fixed. So then, conditionally, F_i^{(L+1)}(W,x) is normally distributed,

\big( F_i^{(L+1)}(W,x),\, F_i^{(L+1)}(W,x') \big) \sim \mathcal N\big( 0,\, \hat\Sigma^{(L+1)} \big) ,

where

\hat\Sigma^{(L+1)}(x,x') = \frac{1}{n_L} \sum_{j=1}^{n_L} \sigma\big(F^{(L)}_j(W,x)\big)\, \sigma\big(F^{(L)}_j(W,x')\big) + \beta^2 ,
and the outputs are (conditionally) independent over i. By the induction hypothesis, the terms in the sum above are i.i.d. as n_1,...,n_{L-1} \rightarrow\infty. So the strong law of large numbers applies to the sum as we let n_L \rightarrow \infty:

\hat\Sigma^{(L+1)}(x,x') \;\xrightarrow[\;n_L\rightarrow\infty\;]{}\; \mathbb E_{f\sim\mathcal N(0,\Sigma^{(L)})}\big[ \sigma(f(x))\,\sigma(f(x')) \big] + \beta^2 = \Sigma^{(L+1)}(x,x') ,
where \Sigma^{(L+1)}(x,x') is as stated in Proposition 1. Thus, given this deterministic limit for the covariance, the distribution of F_i^{(L+1)}(W,\cdot) has the same limit: it is normally distributed with mean zero and covariance \Sigma^{(L+1)}(x,x') and is independent over i, as required.

Next, let's analyse the NTK. From its definition, we can see that there are two cases depending on whether the weight W_p belongs to the final layer or not, i.e. the NTK is

K^W_{ij}(x,x') = \underbrace{\sum_{p\,\in\,\text{layer } L} \partial_{W_p} F_i^{(L+1)}(W,x)\, \partial_{W_p} F_j^{(L+1)}(W,x')}_{(A)} \;+\; \underbrace{\sum_{p\,\in\,\text{layers} < L} \partial_{W_p} F_i^{(L+1)}(W,x)\, \partial_{W_p} F_j^{(L+1)}(W,x')}_{(B)} .
We deal with the two terms (A) and (B) separately.

Notice that \partial_{w^{(L)}_{kj}} F_i^{(L+1)} = 0 unless k=i, so the terms in (A) are all zero unless i=j. If i=j then (A) is exactly \hat \Sigma^{(L+1)}(x,x') above, and so converges to \Sigma^{(L+1)}(x,x') as n_L\rightarrow\infty.

For term (B), by the backpropagation identity above,

\partial_{W_p} F_i^{(L+1)}(W,x) = \frac{1}{\sqrt{n_L}} \sum_{i'=1}^{n_L} w^{(L)}_{ii'}\, \dot\sigma\big(F^{(L)}_{i'}(W,x)\big)\, \partial_{W_p} F^{(L)}_{i'}(W,x) .

So

(B) = \frac{1}{n_L} \sum_{i',j'=1}^{n_L} w^{(L)}_{ii'}\, w^{(L)}_{jj'}\, \dot\sigma\big(F^{(L)}_{i'}(W,x)\big)\, \dot\sigma\big(F^{(L)}_{j'}(W,x')\big)\, \Big[ \sum_{p\,\in\,\text{layers} < L} \partial_{W_p} F^{(L)}_{i'}(W,x)\, \partial_{W_p} F^{(L)}_{j'}(W,x') \Big] .
In the above we note that the term in square brackets is the NTK for the depth-L network. Thus we can apply the induction hypothesis: the only terms that are non-zero after taking the limit n_1,...,n_{L-1}\rightarrow\infty are those where i'=j'. We are then left with an average over i.i.d. random variables (indexed by i'), so the strong law of large numbers gives convergence, as n_L\rightarrow\infty, to \delta_{ij}\, K^{(L)}(x,x')\, \dot\Sigma^{(L+1)}(x,x'), which together with the limit of (A) gives the limiting NTK as stated. This completes the proof. QED.
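As a numerical sanity check of Proposition 1, here is a NumPy sketch (reusing init_weights, forward and ntk from the sketches above, with the same illustrative \beta=0.1 and \sigma=\tanh) that evaluates the limiting recursions for \Sigma^{(l)} and K^{(l)} by Monte Carlo, and compares the resulting limit with the random finite-width kernel for a single hidden layer; the two agree up to fluctuations of order 1/\sqrt{n_1}.

```python
import numpy as np

def limit_sigma_and_ntk(x, xp, depth, beta=0.1, sigma=np.tanh,
                        dsigma=lambda u: 1.0 - np.tanh(u) ** 2,
                        n_mc=200_000, seed=0):
    """Monte Carlo evaluation of the limiting recursions for Sigma^{(l)} and K^{(l)}.
    Tracks the 2x2 covariance of (f(x), f(x')) so that the Gaussian expectations
    can be estimated by sampling rather than in closed form."""
    rng = np.random.default_rng(seed)
    n0 = len(x)
    Sig = np.array([[x @ x, x @ xp],
                    [xp @ x, xp @ xp]]) / n0 + beta ** 2      # Sigma^{(1)}
    K = Sig[0, 1]                                             # K^{(1)} = Sigma^{(1)}
    for _ in range(depth - 1):
        f = rng.multivariate_normal(np.zeros(2), Sig, size=n_mc)
        s0, s1 = sigma(f[:, 0]), sigma(f[:, 1])
        Sig_next = np.array([[np.mean(s0 * s0), np.mean(s0 * s1)],
                             [np.mean(s1 * s0), np.mean(s1 * s1)]]) + beta ** 2
        Sig_dot = np.mean(dsigma(f[:, 0]) * dsigma(f[:, 1]))  # Sigma_dot^{(l+1)}
        K = Sig_next[0, 1] + K * Sig_dot   # K^{(l+1)} = Sigma^{(l+1)} + K^{(l)} Sigma_dot^{(l+1)}
        Sig = Sig_next
    return Sig[0, 1], K

# Compare the random finite-width NTK with its deterministic limit (one hidden layer).
# Assumes init_weights, forward and ntk from the sketches above are in scope.
rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)
_, K_limit = limit_sigma_and_ntk(x, xp, depth=2)
for n_hidden in [50, 200, 800]:
    W = init_weights([3, n_hidden, 1], rng)
    K_emp = ntk(W, x, xp)[0, 0]            # scalar output, so the kernel is 1x1
    print(f"width {n_hidden:4d}: empirical NTK = {K_emp:.4f}, limit = {K_limit:.4f}")
```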


Lazy Weights.

We now sketch out why the weights remain approximately constant during training. Recall that the weights change according to

\frac{dW_p}{dt}(t) = \hat{\mathbb E}\Big[ \partial_{W_p} F\big(W(t),\hat x\big) \cdot d^{(L+1)}(t;\hat x) \Big] .
Previously we took d^{(L+1)}(t; x^{(i)}) = y^{(i)} - F({W(t)}, x^{(i)}) for the data i=1,...,N. Here we leave this training direction somewhat general. (Though it is important that it stays bounded.)

In the limit where n_1,...,n_L \rightarrow \infty, it holds that w^{(l)}(0)/\sqrt{n_l} and a^{(l)}(0)/\sqrt{n_l} are (stochastically) bounded and that, for each fixed training horizon T,

\sup_{t \leq T}\, \frac{1}{\sqrt{n_l}} \big\| w^{(l)}(t) - w^{(l)}(0) \big\|_{op} \;\rightarrow\; 0 \qquad\text{and}\qquad \sup_{t \leq T}\, \frac{1}{\sqrt{n_l}} \big\| a^{(l)}(t) - a^{(l)}(0) \big\|_{\hat{\mathbb E}} \;\rightarrow\; 0 .
(Here || \cdot ||_{op} is the operator norm and ||\cdot ||_{\hat{\mathbb E}} is the L^2 norm with respect to the empirical distribution of the data.)

We give a sketch proof, as otherwise we would spend too much time defining norms etc… To make notation a bit shorter we write \bar a, \bar F and \bar w for a/\sqrt{n_l}, F/\sqrt{n_l} and w/\sqrt{n_l}.

Proof Sketch. (See the original paper for the full proof: https://arxiv.org/abs/1806.07572.) First we show that ||\bar w^{(l)}(0)||_{op} is bounded. It is not hard to show (using Cauchy-Schwarz) that any n_{l+1} \times n_l matrix w satisfies

\| w \|_{op}^2 \;\leq\; \sum_{i=1}^{n_{l+1}} \sum_{j=1}^{n_l} w_{ij}^2 .
If the components w_{ij} are i.i.d. with finite variance, then dividing w by \sqrt{n_l} and applying the strong law of large numbers gives a finite upper bound as n_l \rightarrow \infty. 2
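A quick numerical check (not from the paper) of this step: for an n \times n matrix with i.i.d. standard Gaussian entries, the rescaled operator norm \|w\|_{op}/\sqrt{n} stays bounded as n grows; in fact it concentrates near 2, consistent with the tighter bounds referenced in footnote 2.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [50, 200, 800, 1600]:
    w = rng.normal(size=(n, n))
    print(n, np.linalg.norm(w, ord=2) / np.sqrt(n))   # operator (spectral) norm / sqrt(n)
# The ratio stays close to 2 as n grows, so ||w(0)||_op / sqrt(n) remains bounded.
```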

Next we argue that the scaled activations \bar a^{(l)}(0) remain bounded. Notice from Proposition 1 that, for any input x, the preactivations F^{(l)}_i are independent identically distributed Gaussians. So a_i^{(l)} = \sigma(F_i^{(l)}) is independent over i. If we consider the (scaled) Euclidean norm for any fixed input x, then the strong law of large numbers gives a finite limit for

\big\| \bar a^{(l)}(0) \big\|^2 = \frac{1}{n_l} \sum_{i=1}^{n_l} \sigma\big( F^{(l)}_i(W(0),x) \big)^2 \;\xrightarrow[\;n_l \rightarrow\infty\;]{}\; \mathbb E_{f\sim\mathcal N(0,\Sigma^{(l)})}\big[ \sigma(f(x))^2 \big] .
Now let's analyze the change in w^{(l)}(t). From the evolution above,

\frac{d w^{(l)}_{ij}}{dt}(t) = \hat{\mathbb E}\Big[ \partial_{w^{(l)}_{ij}} F\big(W(t),\hat x\big)\cdot d^{(L+1)}(t;\hat x) \Big] .
We know that for k > l backpropagation holds, i.e. recall that

\partial_{w^{(l)}_{ij}} F^{(k+1)}(W,x) = \frac{1}{\sqrt{n_k}}\, w^{(k)}\, \mathrm{diag}\Big(\dot\sigma\big(F^{(k)}(W,x)\big)\Big)\, \partial_{w^{(l)}_{ij}} F^{(k)}(W,x) ,

and \partial_{w^{(l)}_{ij}} F_i^{(l+1)}(W,x) = {a_j^{(l)}(x)}/{\sqrt{n_l}}. We can repeatedly apply this to the above expression so that

\frac{d w^{(l)}_{ij}}{dt}(t) = \frac{1}{\sqrt{n_l}}\, \hat{\mathbb E}\Big[ a^{(l)}_j(t;\hat x)\, d^{(l+1)}_i(t;\hat x) \Big] ,
where we inductively define

d^{(k)}(t;\hat x) = \frac{1}{\sqrt{n_k}}\, \mathrm{diag}\Big(\dot\sigma\big(F^{(k)}(W(t),\hat x)\big)\Big)\, \big(w^{(k)}(t)\big)^\top\, d^{(k+1)}(t;\hat x) .
(Already at this point, the division by \sqrt{n_l} above should make us think that things are not going to grow when we let n_l \rightarrow \infty.)

Now it is not hard to check that \partial_t || g || \leq || \partial_t g || for any function g. Applying this to \partial_t w_{ij}^{(l)} above (dividing by \sqrt{n_l} and applying Cauchy-Schwarz) we get that

\partial_t \big\| \bar w^{(l)}(t) \big\|_{op} \;\leq\; \frac{c}{\sqrt{n_l}}\, \big\| \bar F^{(l)}(t) \big\|_{\hat{\mathbb E}}\, \big\| d^{(l+1)}(t) \big\|_{\hat{\mathbb E}} ,
where c is the Lipschitz constant for the \sigma in a^{(l)} = \sigma(F^{(l)}).

We now have two terms to deal with, \|\bar F^{(l)}\| and \|d^{(l)}\|. For \|\bar F^{(l)}\| we know that F^{(l)} changes according to the analogue of the rule (*) above. In our case this says that

\partial_t F^{(l)}\big(W(t), x\big) = \hat{\mathbb E}\Big[ K^{(l)}_{W(t)}(x,\hat x)\, d^{(l)}(t;\hat x) \Big] ,

where K_{{W(t)}}^{(l)}(x,\hat x) is the NTK of the depth-l network (at time t).

So to understand F^{(l)} we need to understand d^{(l)} and K^{(l)}_{W(t)}. Splitting the parameters of the depth-l network into its top layer of weights and the layers below, note that K^{(l)} has the form

K^{(l)}_{W}(x,\hat x) = \Big( \bar a^{(l-1)}(x)\cdot \bar a^{(l-1)}(\hat x) + \beta^2 \Big)\, I \;+\; \bar w^{(l-1)}\, \mathrm{diag}\Big(\dot\sigma\big(F^{(l-1)}(x)\big)\Big)\, K^{(l-1)}_{W}(x,\hat x)\, \mathrm{diag}\Big(\dot\sigma\big(F^{(l-1)}(\hat x)\big)\Big)\, \big(\bar w^{(l-1)}\big)^\top .

So

\big\| K^{(l)}_{W(t)} \big\| \;\leq\; c^2\, \big\|\bar F^{(l-1)}(t)\big\|^2 + \beta^2 + c^2\, \big\|\bar w^{(l-1)}(t)\big\|_{op}^2\, \big\| K^{(l-1)}_{W(t)} \big\| .
Also, analyzing d^{(l)} from its recursive definition above, we see that

\big\| d^{(l)}(t) \big\| \;\leq\; c\, \big\| \bar w^{(l)}(t) \big\|_{op}\, \big\| d^{(l+1)}(t) \big\| \;\leq\; \cdots \;\leq\; \big\| d^{(L+1)}(t) \big\|\, \prod_{k=l}^{L} c\, \big\| \bar w^{(k)}(t) \big\|_{op} .
Thus both \| K^{(l)}\| and \|d^{(l)}\| are bounded above by increasing polynomials in the \|\bar w^{(l)}(t)\| and \|\bar F^{(l)}(t)\| (but, importantly, not depending on n_l, l=1,...,L). Thus we can bound both by the same polynomial in \sum_l \|\bar w^{(l)}(t)\|+\|\bar F^{(l)}(t)\|. Applying this to the bounds on \partial_t \|\bar w^{(l)}(t)\|_{op} and \partial_t F^{(l)} above, we see that

\partial_t \Big( \sum_l \big\|\bar w^{(l)}(t)\big\|_{op} + \big\|\bar F^{(l)}(t)\big\|_{\hat{\mathbb E}} \Big) \;\leq\; \frac{1}{\sqrt{n_{\min}}}\; \mathcal P\Big( \sum_l \big\|\bar w^{(l)}(t)\big\|_{op} + \big\|\bar F^{(l)}(t)\big\|_{\hat{\mathbb E}} \Big)
for some increasing polynomial \mathcal P (not depending on the widths), where n_{\min} = \min_l n_l. By (a polynomial version of) Grönwall's lemma it can be argued that the solution to this differential inequality remains bounded for suitably large n_{\min}, and that its change vanishes as n_{\min}\rightarrow\infty. For this reason the norms of both \bar w and \bar F do not change over time in the limit. QED
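To see lazy training numerically, here is a small self-contained NumPy experiment; all of the choices (the data, widths, step size, \beta=0.1 and \sigma=\tanh) are illustrative rather than taken from the paper. A one-hidden-layer network in the above parameterization is trained by full-batch gradient descent, and the relative change of the hidden weights shrinks as the width grows, roughly like 1/\sqrt{n}.

```python
import numpy as np

def lazy_training_experiment(n, steps=4000, lr=0.5, beta=0.1, seed=0):
    """Train a width-n one-hidden-layer network (NTK scaling, sigma = tanh) by
    full-batch gradient descent on a tiny 1-d regression problem and report the
    relative change of the hidden weights together with the final training error."""
    rng = np.random.default_rng(seed)
    X = np.linspace(-1.0, 1.0, 8)          # toy data
    Y = np.sin(np.pi * X)
    N = len(X)
    u = rng.normal(size=n)                 # hidden weights (input is 1-d, so n_0 = 1)
    b = rng.normal(size=n)                 # hidden bias weights
    v = rng.normal(size=n)                 # output weights
    u0 = u.copy()

    def predict(X):
        Z = np.outer(X, u) + beta * b                  # (N, n) pre-activations
        A = np.tanh(Z)
        return A, A @ v / np.sqrt(n)                   # F(x_i) with 1/sqrt(n) scaling

    for _ in range(steps):
        A, F = predict(X)
        err = F - Y                                    # dLoss/dF, up to the 1/N factor
        dZ = (err[:, None] * (1.0 - A ** 2)) * v[None, :] / np.sqrt(n)
        v -= lr * (A.T @ err) / (np.sqrt(n) * N)       # gradient w.r.t. output weights
        u -= lr * (X @ dZ) / N                         # gradient w.r.t. hidden weights
        b -= lr * beta * dZ.sum(axis=0) / N            # gradient w.r.t. bias weights

    rel_change = np.linalg.norm(u - u0) / np.linalg.norm(u0)
    final_mse = np.mean((predict(X)[1] - Y) ** 2)
    return rel_change, final_mse

for n in [100, 1000, 10000]:
    rel, mse = lazy_training_experiment(n)
    print(f"width {n:6d}: relative change in hidden weights = {rel:.4f}, final mse = {mse:.5f}")
```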


  1. Note that we apply a linear activation in the final layer of this neural network. We will use the notation F^{(l)} (the output of an l-layer network) and z^{(l)} (the output of the l-th layer) somewhat interchangeably.
  2. Theorem 4.4.5 of Vershynin, High-Dimensional Probability (2018), gives tighter concentration bounds here.
