Neural Tangent Kernel

The Neural Tangent Kernel is a way of understanding the training performance of Neural Networks by relating them to Kernel methods. Here we overview the results of the paper of Jacot, Gabriel and Hongler (https://arxiv.org/abs/1806.07572). The paper considers a deep neural network with a fixed amount of data and a fixed depth. The weights applied to neurons are initially independent and normally distributed. We take a limit where the width of each layer tends to infinity.

The main observations of the paper are the following:

  1. The function represented by a neural network changes according to a kernel when undergoing gradient descent training. This is called the Neural Tangent Kernel (NTK).
  2. Under a random normally distributed initialization, the NTK is random. However, in the infinite-width limit, this kernel is non-random and the output of each neuron is an independent Gaussian with zero mean and a fixed covariance.
  3. In the limit, these weights and thus this kernel will not change during training. This is called “Lazy training”.
  4. In this limiting regime, the neural network parameters converge quickly to a global minimum whose weights are arbitrarily close to those of the initialized neural network.

These results are significant as they give a way of understanding why neural networks converge to an optimal solution. (Neural networks are known to be highly non-convex objects and so understanding their convergence under training is highly non-trivial.) What the following argument does not (fully) explain is why neural networks are so expressive, why they generalize well to unseen data, or why in practice neural networks outperform kernel methods.

Why Lazy Training Helps Convergence.

We give a short heuristic explanation as to why lazy training helps us understand the convergence of neural networks. Suppose that the weights of a neural network remain close to their initial values. That is,

W(t) \approx W_0 := W(0) , \qquad \text{for all } t \geq 0 .
Now suppose we wish to solve

\min_W \; \frac{1}{2N} \sum_{i=1}^N \big\| y^{(i)} - F(W, x^{(i)}) \big\|^2 ,
where here (x^{(i)},y^{(i)}), i=1,...,N is our data.

Notice that under stochastic gradient descent W evolves (approximately) according to the o.d.e.

\frac{dW_p}{dt}(t) = \hat{\mathbb E}\Big[ \partial_{W_p} F\big(W(t),\hat x\big)\, \big(\hat y - F(W(t),\hat x)\big) \Big] ,

where here \hat{\mathbb E} denotes expectation with respect to the empirical distribution of our data. (I.e. (\hat x,\hat y) is selected uniformly at random from (x^{(i)},y^{(i)}), i=1,...,N.) Therefore, by the chain rule, we expect F({W(t)},x) to evolve as

\frac{d}{dt} F\big(W(t),x\big) = \hat{\mathbb E}\Big[ \Big\{ \sum_{p} \partial_{W_p} F\big(W(t),x\big)\, \partial_{W_p} F\big(W(t),\hat x\big) \Big\}\, \big(\hat y - F(W(t),\hat x)\big) \Big] ,

where the sum is over all the parameters p of the network.
The term in curly brackets defines a kernel:

K^{W}(x,\hat x) := \sum_{p} \partial_{W_p} F(W,x)\, \partial_{W_p} F(W,\hat x) = \nabla_W F(W,x) \cdot \nabla_W F(W,\hat x) .

This kernel is the Neural Tangent Kernel; it's the kernel that you get from the tangent of a neural network, \nabla_W F({W},x). Under the assumption that training is lazy this kernel should be constant, i.e.

K^{W(t)}(x,\hat x) \approx K^{W_0}(x,\hat x) \qquad \text{for all } t \geq 0 .
Now if the kernel defines a positive definite matrix on the data (i.e. if \hat K = (K^{W_0}(x^{(i)},x^{(j)})/N : i,j=1,...,N) is a positive definite matrix), then for x=x^{(i)} the evolution above becomes

\frac{d}{dt} F\big(W(t), x^{(i)}\big) = \frac{1}{N}\sum_{j=1}^N K^{W_0}\big(x^{(i)},x^{(j)}\big)\, \big( y^{(j)} - F(W(t),x^{(j)}) \big) . \qquad (*)

We can analyze the error between the neural network's estimate and the data, \epsilon(t) = ( F(W(t),x^{(i)}) - y^{(i)} : i =1,...,N ), for which (*) becomes

\frac{d\epsilon}{dt}(t) = -\hat K\, \epsilon(t) .

Since \hat K is positive definite,

\| \epsilon(t) \| \leq e^{-\lambda_{\min} t}\, \| \epsilon(0) \| ,

where \lambda_{\min}>0 is the smallest eigenvalue of \hat K.
Now we see that if the weights, and thus the Neural Tangent Kernel, remain (approximately) constant, then a neural network trained by gradient descent converges exponentially fast to a state with zero loss.
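To make the heuristic concrete, here is a minimal NumPy sketch (not from the paper) of the linearized dynamics \frac{d\epsilon}{dt} = -\hat K \epsilon: an arbitrary positive definite matrix stands in for the NTK Gram matrix \hat K, and the error norm decays at least as fast as e^{-\lambda_{\min} t}.

```python
import numpy as np

# A minimal sketch of the linearized training dynamics d(eps)/dt = -K_hat @ eps.
# K_hat below is an arbitrary positive definite matrix standing in for the NTK
# Gram matrix on the data; it is not computed from an actual network.
rng = np.random.default_rng(0)
N = 10
A = rng.normal(size=(N, N))
K_hat = A @ A.T / N + 0.1 * np.eye(N)       # positive definite by construction

lam_min = np.linalg.eigvalsh(K_hat)[0]      # smallest eigenvalue
eps = rng.normal(size=N)                    # initial error F(W(0), x_i) - y_i
eps0 = np.linalg.norm(eps)

dt, T = 1e-3, 5.0
for _ in range(int(T / dt)):
    eps = eps - dt * K_hat @ eps            # Euler step of d(eps)/dt = -K_hat eps

print("||eps(T)||                =", np.linalg.norm(eps))
print("exp(-lam_min T) ||eps(0)|| =", np.exp(-lam_min * T) * eps0)
# The first number should be no larger than the second (up to Euler error).
```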


Next we need to argue that weights and the neural tangent kernel remain approximately constant during training. But first we define a bit more notation.

Neural Network Model.

We consider a dense L-layer neural network. Here n_0,...,n_L give the number of activations in each layer. The activation functions on each layer are given by the same Lipschitz continuous function \sigma:\mathbb R \rightarrow \mathbb R.

We let w^{(l)}\in \mathbb R^{n_{l+1}\times (n_l+1)} be the weights (including a bias column) mapping layer l to layer l+1, and we let W=(w^{(0)},...,w^{(L-1)}) be the weights applied at each layer. We let P=\sum_{l=0}^{L-1} n_{l+1}(n_l+1) be the total number of weights.

Given the activations a^{(l)} = (a^{(l)}_j : j =1,...,n_l) for layer l, we define the pre-activations

z^{(l+1)}_i = \frac{1}{\sqrt{n_l}} \sum_{j=1}^{n_l} w^{(l)}_{ij}\, a^{(l)}_j + \beta\, w^{(l)}_{i0} , \qquad i=1,...,n_{l+1},

and then we define

a^{(l+1)}_i = \sigma\big( z^{(l+1)}_i \big) .

(Notice that above we rescale the weights by a factor of \frac{1}{\sqrt{n_l}} and the bias weight by \beta.)

Starting with inputs x= (x_i : i = 1,...,n_0) and defining a^{(0)}=x, we can recursively apply the above equations to get the output z^{(L)}. We define F(W,x) to be the function that maps inputs to outputs, i.e. for an input x\in\mathbb R^{n_0}, we recursively apply the above equations and output F(W,x) = z^{(L)} \in \mathbb R^{n_L}. 1
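For concreteness, here is a minimal NumPy sketch of the forward pass F(W,x) under this parameterization. The helper names (init_weights, forward), the widths, \beta=0.1 and the choice \sigma=\tanh are illustrative assumptions for this sketch, not anything fixed by the paper.

```python
import numpy as np

def init_weights(widths, rng):
    """Standard Gaussian weights; w[l] has shape (n_{l+1}, n_l + 1),
    the last column being the bias weight."""
    return [rng.normal(size=(widths[l + 1], widths[l] + 1))
            for l in range(len(widths) - 1)]

def forward(W, x, beta=0.1, sigma=np.tanh):
    """F(W, x): pre-activations z^{(l+1)} = w a / sqrt(n_l) + beta * w_bias,
    activations a^{(l+1)} = sigma(z^{(l+1)}), with a linear final layer."""
    a = np.asarray(x, dtype=float)
    for l, w in enumerate(W):
        n_l = a.shape[0]
        z = w[:, :-1] @ a / np.sqrt(n_l) + beta * w[:, -1]
        a = z if l == len(W) - 1 else sigma(z)   # linear activation on the output
    return a

rng = np.random.default_rng(0)
widths = [3, 500, 500, 1]                        # n_0, n_1, n_2, n_L
W = init_weights(widths, rng)
x = rng.normal(size=widths[0])
print(forward(W, x))
```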

We recall the definition of the Neural Tangent Kernel as given above

Definition [Neural Tangent Kernel] For a neural network with weights W, the neural tangent kernel K^W=(K^W_{ij}(x,x') \in \mathbb R : i,j=1,...,n_L,\ x,x'\in \mathbb R^{n_0}) is defined by

K^W_{ij}(x,x') = \sum_{p=1}^P \partial_{W_p} F_i(W,x)\, \partial_{W_p} F_j(W,x') ,

where here p=1,...,P indexes all the weights of our neural network.
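Continuing the sketch above (and reusing its init_weights and forward), here is one way to evaluate the kernel K^W_{ij}(x,x') numerically. The gradients \partial_{W_p} F_i are approximated by central finite differences purely to keep the example dependency-free; in practice one would use automatic differentiation, and the helper names here (flatten, grad_F, ntk) are just for this sketch.

```python
import numpy as np

# Assumes init_weights and forward from the previous sketch are in scope.

def flatten(W):
    return np.concatenate([w.ravel() for w in W])

def unflatten(theta, W):
    out, k = [], 0
    for w in W:
        out.append(theta[k:k + w.size].reshape(w.shape))
        k += w.size
    return out

def grad_F(W, x, eps=1e-5, **kw):
    """Jacobian dF_i/dW_p (shape n_L x P), by central finite differences."""
    theta = flatten(W)
    cols = []
    for p in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[p] += eps
        tm[p] -= eps
        cols.append((forward(unflatten(tp, W), x, **kw)
                     - forward(unflatten(tm, W), x, **kw)) / (2 * eps))
    return np.stack(cols, axis=1)

def ntk(W, x, xp, **kw):
    """K^W_{ij}(x, x') = sum_p dF_i(x)/dW_p * dF_j(x')/dW_p."""
    return grad_F(W, x, **kw) @ grad_F(W, xp, **kw).T

rng = np.random.default_rng(1)
W = init_weights([3, 50, 1], rng)          # keep widths small: 2P forward passes
x, xp = rng.normal(size=3), rng.normal(size=3)
print(ntk(W, x, xp))
```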


Neural Networks with Gaussian Weights.

The following shows how the initialization of the neural network affects the distribution of its output. We show that, in a multi-layer neural network initialized with Gaussian weights, each neuron has an output that is independent and Gaussian distributed when the width of each layer tends to infinity.

In this limit, each layer affects the covariance of the next layer in a deterministic way. We can use this to show that the initial Neural Tangent Kernel (which is random in the prelimit) tends to a non-random kernel.

Proposition 1. In the limit as n_1,...,n_{L-1}\rightarrow \infty, the output F(W,x) converges to a normal distribution where in the limit each component F_i(W,\cdot) is independent over i, has mean zero and has covariance \Sigma^{(L)}, which satisfies the recursion

\Sigma^{(1)}(x,x') = \frac{1}{n_0}\, x^\top x' + \beta^2 , \qquad \Sigma^{(l+1)}(x,x') = \mathbb E_{f \sim \mathcal N(0,\,\Sigma^{(l)})}\big[ \sigma(f(x))\, \sigma(f(x')) \big] + \beta^2 .
The Neural Tangent Kernel is such that K^W_{ij}(x,x') converges to zero for i\neq j, whereas for i=j the NTK converges to a non-random real-valued kernel K^{(L)} satisfying the recursion

K^{(1)}(x,x') = \Sigma^{(1)}(x,x') , \qquad K^{(l+1)}(x,x') = \Sigma^{(l+1)}(x,x') + K^{(l)}(x,x')\, \dot\Sigma^{(l+1)}(x,x') ,

where

\dot\Sigma^{(l+1)}(x,x') = \mathbb E_{f \sim \mathcal N(0,\,\Sigma^{(l)})}\big[ \dot\sigma(f(x))\, \dot\sigma(f(x')) \big] .
Proof. The result follows by induction on the number of layers L. For L=1 it is not too hard to check the above conditions, since each component \frac{1}{\sqrt{n_0}}\sum_j w^{(0)}_{ij} x_j + \beta\, w^{(0)}_{i0} is Gaussian and there is no limit to be taken.

Let's assume the induction hypothesis that, in the limit where n_1,...,n_{L-1}\rightarrow \infty, the output from level L, F^{(L)}(W,x), is independent mean-zero Gaussian with the covariance between inputs x and x' given by \Sigma^{(L)}(x,x'). Also assume that, with probability 1, the NTK satisfies

K^W_{ij}(x,x') \;\xrightarrow[\;n_1,...,n_{L-1}\rightarrow\infty\;]{}\; \delta_{ij}\, K^{(L)}(x,x') ,
where K^{(L)} is some deterministic kernel.

Knowing what happens at layer L, let's consider what happens at layer L+1 when n_L \rightarrow\infty. Recall that the output at level L+1 is

F_i^{(L+1)}(W,x) = \frac{1}{\sqrt{n_L}} \sum_{j=1}^{n_L} w^{(L)}_{ij}\, \sigma\big(F_j^{(L)}(W,x)\big) + \beta\, w^{(L)}_{i0} ,
and, if we differentiate with respect to a parameter indexed by p from one of the lower layers (those on which F^{(L)} depends), then (note that this is just backpropagation)

\partial_{W_p} F_i^{(L+1)}(W,x) = \frac{1}{\sqrt{n_L}} \sum_{j=1}^{n_L} w^{(L)}_{ij}\, \dot\sigma\big(F_j^{(L)}(W,x)\big)\, \partial_{W_p} F_j^{(L)}(W,x) .
Let's first analyse the output. Notice that if we condition on the weights up to layer L then the activations \sigma(F^{(L)}_j(W,x)) are fixed. So then, conditionally, F_i^{(L+1)}(W,x) is normally distributed,

\big( F_i^{(L+1)}(W,x),\, F_i^{(L+1)}(W,x') \big) \sim \mathcal N\big( 0,\, \hat\Sigma^{(L+1)} \big) ,

where

\hat\Sigma^{(L+1)}(x,x') = \frac{1}{n_L} \sum_{j=1}^{n_L} \sigma\big(F^{(L)}_j(W,x)\big)\, \sigma\big(F^{(L)}_j(W,x')\big) + \beta^2 ,
and the outputs are (conditionally) independent over i. By the induction hypothesis, the terms in the sum above are i.i.d. as n_1,...,n_{L-1} \rightarrow\infty. So the strong law of large numbers applies to the sum as we let n_L \rightarrow \infty:

\hat\Sigma^{(L+1)}(x,x') \;\xrightarrow[\;n_L\rightarrow\infty\;]{}\; \mathbb E_{f\sim\mathcal N(0,\Sigma^{(L)})}\big[ \sigma(f(x))\,\sigma(f(x')) \big] + \beta^2 = \Sigma^{(L+1)}(x,x') ,
where \Sigma^{(L+1)}(x,x') is as stated in Proposition 1. Thus, given this deterministic limit for the covariance, the distribution of F_i^{(L+1)}(W,\cdot) has the same limit: it is normally distributed with mean zero and covariance \Sigma^{(L+1)}(x,x') and is independent over i, as required.

Next, let's analyse the NTK. From its definition, we can see that there are two cases depending on whether the weight W_p belongs to the final layer or not, i.e. the NTK is

K^W_{ij}(x,x') = \underbrace{\sum_{p\,\in\,\text{layer } L} \partial_{W_p} F_i^{(L+1)}(W,x)\, \partial_{W_p} F_j^{(L+1)}(W,x')}_{(A)} \;+\; \underbrace{\sum_{p\,\in\,\text{layers} < L} \partial_{W_p} F_i^{(L+1)}(W,x)\, \partial_{W_p} F_j^{(L+1)}(W,x')}_{(B)} .
We deal with the two terms (A) and (B) separately.

Notice that \partial_{w^{(L)}_{kj}} F_i^{(L+1)} = 0 unless k=i, so the terms in (A) are all zero unless i=j. If i=j then (A) is exactly \hat \Sigma^{(L+1)}(x,x') above, and so converges to \Sigma^{(L+1)}(x,x') as n_L\rightarrow\infty.

For term (B), by the backpropagation identity above,

\partial_{W_p} F_i^{(L+1)}(W,x) = \frac{1}{\sqrt{n_L}} \sum_{i'=1}^{n_L} w^{(L)}_{ii'}\, \dot\sigma\big(F^{(L)}_{i'}(W,x)\big)\, \partial_{W_p} F^{(L)}_{i'}(W,x) .

So

(B) = \frac{1}{n_L} \sum_{i',j'=1}^{n_L} w^{(L)}_{ii'}\, w^{(L)}_{jj'}\, \dot\sigma\big(F^{(L)}_{i'}(W,x)\big)\, \dot\sigma\big(F^{(L)}_{j'}(W,x')\big)\, \Big[ \sum_{p\,\in\,\text{layers} < L} \partial_{W_p} F^{(L)}_{i'}(W,x)\, \partial_{W_p} F^{(L)}_{j'}(W,x') \Big] .
In the above we note that the term in square brackets is the NTK for the depth-L network. Thus we can apply the induction hypothesis: the only terms that are non-zero after taking the limit n_1,...,n_{L-1}\rightarrow\infty are those where i'=j'. We are then left with an average over i.i.d. random variables (indexed by i'), so the strong law of large numbers gives convergence, as n_L\rightarrow\infty, to \delta_{ij}\, K^{(L)}(x,x')\, \dot\Sigma^{(L+1)}(x,x'), which together with the limit of (A) gives the limiting NTK as stated. This completes the proof. QED.
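As a numerical sanity check of Proposition 1, here is a NumPy sketch (reusing init_weights, forward and ntk from the sketches above, with the same illustrative \beta=0.1 and \sigma=\tanh) that evaluates the limiting recursions for \Sigma^{(l)} and K^{(l)} by Monte Carlo, and compares the resulting limit with the random finite-width kernel for a single hidden layer; the two agree up to fluctuations of order 1/\sqrt{n_1}.

```python
import numpy as np

def limit_sigma_and_ntk(x, xp, depth, beta=0.1, sigma=np.tanh,
                        dsigma=lambda u: 1.0 - np.tanh(u) ** 2,
                        n_mc=200_000, seed=0):
    """Monte Carlo evaluation of the limiting recursions for Sigma^{(l)} and K^{(l)}.
    Tracks the 2x2 covariance of (f(x), f(x')) so that the Gaussian expectations
    can be estimated by sampling rather than in closed form."""
    rng = np.random.default_rng(seed)
    n0 = len(x)
    Sig = np.array([[x @ x, x @ xp],
                    [xp @ x, xp @ xp]]) / n0 + beta ** 2      # Sigma^{(1)}
    K = Sig[0, 1]                                             # K^{(1)} = Sigma^{(1)}
    for _ in range(depth - 1):
        f = rng.multivariate_normal(np.zeros(2), Sig, size=n_mc)
        s0, s1 = sigma(f[:, 0]), sigma(f[:, 1])
        Sig_next = np.array([[np.mean(s0 * s0), np.mean(s0 * s1)],
                             [np.mean(s1 * s0), np.mean(s1 * s1)]]) + beta ** 2
        Sig_dot = np.mean(dsigma(f[:, 0]) * dsigma(f[:, 1]))  # Sigma_dot^{(l+1)}
        K = Sig_next[0, 1] + K * Sig_dot   # K^{(l+1)} = Sigma^{(l+1)} + K^{(l)} Sigma_dot^{(l+1)}
        Sig = Sig_next
    return Sig[0, 1], K

# Compare the random finite-width NTK with its deterministic limit (one hidden layer).
# Assumes init_weights, forward and ntk from the sketches above are in scope.
rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)
_, K_limit = limit_sigma_and_ntk(x, xp, depth=2)
for n_hidden in [50, 200, 800]:
    W = init_weights([3, n_hidden, 1], rng)
    K_emp = ntk(W, x, xp)[0, 0]            # scalar output, so the kernel is 1x1
    print(f"width {n_hidden:4d}: empirical NTK = {K_emp:.4f}, limit = {K_limit:.4f}")
```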


Lazy Weights.

We now sketch out why the weights remain approximately constant during training. Recall that the weights change according to

\frac{dW_p}{dt}(t) = \hat{\mathbb E}\Big[ \partial_{W_p} F\big(W(t),\hat x\big) \cdot d^{(L+1)}(t;\hat x) \Big] .
Previously we took d^{(L+1)}(t; x^{(i)}) = y^{(i)} - F({W(t)}, x^{(i)}) for the data i=1,...,N. Here we leave this training direction somewhat general. (Though it is important that it stays bounded.)

In the limit where n_1,...,n_L \rightarrow \infty, it holds that w^{(l)}(0)/\sqrt{n_l} and a^{(l)}(0)/\sqrt{n_l} are (stochastically) bounded and that, for each fixed training horizon T,

\sup_{t \leq T}\, \frac{1}{\sqrt{n_l}} \big\| w^{(l)}(t) - w^{(l)}(0) \big\|_{op} \;\rightarrow\; 0 \qquad\text{and}\qquad \sup_{t \leq T}\, \frac{1}{\sqrt{n_l}} \big\| a^{(l)}(t) - a^{(l)}(0) \big\|_{\hat{\mathbb E}} \;\rightarrow\; 0 .
(Here || \cdot ||_{op} is the operator norm and ||\cdot ||_{\hat{\mathbb E}} is the L^2 norm with respect to the empirical distribution of the data.)

We give a sketch proof, as otherwise we would spend too much time defining norms etc… To make notation a bit shorter we write \bar a, \bar F and \bar w for a/\sqrt{n_l}, F/\sqrt{n_l} and w/\sqrt{n_l}.

Proof Sketch. (See the original paper for the full proof: https://arxiv.org/abs/1806.07572.) First we show that ||\bar w^{(l)}(0)||_{op} is bounded. It is not hard to show (using Cauchy-Schwarz) that any n_{l+1} \times n_l matrix w satisfies

\| w \|_{op}^2 \;\leq\; \sum_{i=1}^{n_{l+1}} \sum_{j=1}^{n_l} w_{ij}^2 .
If the components w_{ij} are i.i.d. with finite variance, then dividing w by \sqrt{n_l} and applying the strong law of large numbers gives a finite upper bound as n_l \rightarrow \infty. 2
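A quick numerical check (not from the paper) of this step: for an n \times n matrix with i.i.d. standard Gaussian entries, the rescaled operator norm \|w\|_{op}/\sqrt{n} stays bounded as n grows; in fact it concentrates near 2, consistent with the tighter bounds referenced in footnote 2.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [50, 200, 800, 1600]:
    w = rng.normal(size=(n, n))
    print(n, np.linalg.norm(w, ord=2) / np.sqrt(n))   # operator (spectral) norm / sqrt(n)
# The ratio stays close to 2 as n grows, so ||w(0)||_op / sqrt(n) remains bounded.
```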

Next we argue that the scaled activations \bar a^{(l)}(0) remain bounded. Notice from Proposition 1 that, for any input x, the preactivations F^{(l)}_i are independent identically distributed Gaussians. So a_i^{(l)} = \sigma(F_i^{(l)}) is independent over i. If we consider the (scaled) Euclidean norm for any fixed input x, then the strong law of large numbers gives a finite limit for

\big\| \bar a^{(l)}(0) \big\|^2 = \frac{1}{n_l} \sum_{i=1}^{n_l} \sigma\big( F^{(l)}_i(W(0),x) \big)^2 \;\xrightarrow[\;n_l \rightarrow\infty\;]{}\; \mathbb E_{f\sim\mathcal N(0,\Sigma^{(l)})}\big[ \sigma(f(x))^2 \big] .
Now let's analyze the change in w^{(l)}(t). From the evolution above,

\frac{d w^{(l)}_{ij}}{dt}(t) = \hat{\mathbb E}\Big[ \partial_{w^{(l)}_{ij}} F\big(W(t),\hat x\big)\cdot d^{(L+1)}(t;\hat x) \Big] .
We know that for k > l backpropagation holds, i.e. recall that

\partial_{w^{(l)}_{ij}} F^{(k+1)}(W,x) = \frac{1}{\sqrt{n_k}}\, w^{(k)}\, \mathrm{diag}\Big(\dot\sigma\big(F^{(k)}(W,x)\big)\Big)\, \partial_{w^{(l)}_{ij}} F^{(k)}(W,x) ,

and \partial_{w^{(l)}_{ij}} F_i^{(l+1)}(W,x) = {a_j^{(l)}(x)}/{\sqrt{n_l}}. We can repeatedly apply this to the above expression so that

\frac{d w^{(l)}_{ij}}{dt}(t) = \frac{1}{\sqrt{n_l}}\, \hat{\mathbb E}\Big[ a^{(l)}_j(t;\hat x)\, d^{(l+1)}_i(t;\hat x) \Big] ,
where we inductively define

d^{(k)}(t;\hat x) = \frac{1}{\sqrt{n_k}}\, \mathrm{diag}\Big(\dot\sigma\big(F^{(k)}(W(t),\hat x)\big)\Big)\, \big(w^{(k)}(t)\big)^\top\, d^{(k+1)}(t;\hat x) .
(Already at this point, the division by \sqrt{n_l} above should make us think that things are not going to grow when we let n_l \rightarrow \infty.)

Now it is not hard to check that \partial_t || g || \leq || \partial_t g || for any function g. Applying this to \partial_t w_{ij}^{(l)} above (dividing by \sqrt{n_l} and applying Cauchy-Schwarz) we get that

\partial_t \big\| \bar w^{(l)}(t) \big\|_{op} \;\leq\; \frac{c}{\sqrt{n_l}}\, \big\| \bar F^{(l)}(t) \big\|_{\hat{\mathbb E}}\, \big\| d^{(l+1)}(t) \big\|_{\hat{\mathbb E}} ,
where c is the Lipschitz constant for the \sigma in a^{(l)} = \sigma(F^{(l)}).

We now have two terms to deal with, \|\bar F^{(l)}\| and \|d^{(l)}\|. For \|\bar F^{(l)}\| we know that F^{(l)} changes according to the analogue of the rule (*) above. In our case this says that

\partial_t F^{(l)}\big(W(t), x\big) = \hat{\mathbb E}\Big[ K^{(l)}_{W(t)}(x,\hat x)\, d^{(l)}(t;\hat x) \Big] ,

where K_{{W(t)}}^{(l)}(x,\hat x) is the NTK of the depth-l network (at time t).

So to understand F^{(l)} we need to understand d^{(l)} and K^{(l)}_{W(t)}. Splitting the parameters of the depth-l network into its top layer of weights and the layers below, note that K^{(l)} has the form

K^{(l)}_{W}(x,\hat x) = \Big( \bar a^{(l-1)}(x)\cdot \bar a^{(l-1)}(\hat x) + \beta^2 \Big)\, I \;+\; \bar w^{(l-1)}\, \mathrm{diag}\Big(\dot\sigma\big(F^{(l-1)}(x)\big)\Big)\, K^{(l-1)}_{W}(x,\hat x)\, \mathrm{diag}\Big(\dot\sigma\big(F^{(l-1)}(\hat x)\big)\Big)\, \big(\bar w^{(l-1)}\big)^\top .

So

\big\| K^{(l)}_{W(t)} \big\| \;\leq\; c^2\, \big\|\bar F^{(l-1)}(t)\big\|^2 + \beta^2 + c^2\, \big\|\bar w^{(l-1)}(t)\big\|_{op}^2\, \big\| K^{(l-1)}_{W(t)} \big\| .
Also, analyzing d^{(l)} from its recursive definition above, we see that

\big\| d^{(l)}(t) \big\| \;\leq\; c\, \big\| \bar w^{(l)}(t) \big\|_{op}\, \big\| d^{(l+1)}(t) \big\| \;\leq\; \cdots \;\leq\; \big\| d^{(L+1)}(t) \big\|\, \prod_{k=l}^{L} c\, \big\| \bar w^{(k)}(t) \big\|_{op} .
Thus both \| K^{(l)}\| and \|d^{(l)}\| are bounded above by increasing polynomials in the \|\bar w^{(l)}(t)\| and \|\bar F^{(l)}(t)\| (but, importantly, not depending on n_l, l=1,...,L). Thus we can bound both by the same polynomial in \sum_l \|\bar w^{(l)}(t)\|+\|\bar F^{(l)}(t)\|. Applying this to the bounds on \partial_t \|\bar w^{(l)}(t)\|_{op} and \partial_t F^{(l)} above, we see that

\partial_t \Big( \sum_l \big\|\bar w^{(l)}(t)\big\|_{op} + \big\|\bar F^{(l)}(t)\big\|_{\hat{\mathbb E}} \Big) \;\leq\; \frac{1}{\sqrt{n_{\min}}}\; \mathcal P\Big( \sum_l \big\|\bar w^{(l)}(t)\big\|_{op} + \big\|\bar F^{(l)}(t)\big\|_{\hat{\mathbb E}} \Big)
for some increasing polynomial \mathcal P (not depending on the widths), where n_{\min} = \min_l n_l. By (a polynomial version of) Grönwall's lemma it can be argued that the solution to this differential inequality remains bounded for suitably large n_{\min}, and that its change vanishes as n_{\min}\rightarrow\infty. For this reason the norms of both \bar w and \bar F do not change over time in the limit. QED
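To see lazy training numerically, here is a small self-contained NumPy experiment; all of the choices (the data, widths, step size, \beta=0.1 and \sigma=\tanh) are illustrative rather than taken from the paper. A one-hidden-layer network in the above parameterization is trained by full-batch gradient descent, and the relative change of the hidden weights shrinks as the width grows, roughly like 1/\sqrt{n}.

```python
import numpy as np

def lazy_training_experiment(n, steps=4000, lr=0.5, beta=0.1, seed=0):
    """Train a width-n one-hidden-layer network (NTK scaling, sigma = tanh) by
    full-batch gradient descent on a tiny 1-d regression problem and report the
    relative change of the hidden weights together with the final training error."""
    rng = np.random.default_rng(seed)
    X = np.linspace(-1.0, 1.0, 8)          # toy data
    Y = np.sin(np.pi * X)
    N = len(X)
    u = rng.normal(size=n)                 # hidden weights (input is 1-d, so n_0 = 1)
    b = rng.normal(size=n)                 # hidden bias weights
    v = rng.normal(size=n)                 # output weights
    u0 = u.copy()

    def predict(X):
        Z = np.outer(X, u) + beta * b                  # (N, n) pre-activations
        A = np.tanh(Z)
        return A, A @ v / np.sqrt(n)                   # F(x_i) with 1/sqrt(n) scaling

    for _ in range(steps):
        A, F = predict(X)
        err = F - Y                                    # dLoss/dF, up to the 1/N factor
        dZ = (err[:, None] * (1.0 - A ** 2)) * v[None, :] / np.sqrt(n)
        v -= lr * (A.T @ err) / (np.sqrt(n) * N)       # gradient w.r.t. output weights
        u -= lr * (X @ dZ) / N                         # gradient w.r.t. hidden weights
        b -= lr * beta * dZ.sum(axis=0) / N            # gradient w.r.t. bias weights

    rel_change = np.linalg.norm(u - u0) / np.linalg.norm(u0)
    final_mse = np.mean((predict(X)[1] - Y) ** 2)
    return rel_change, final_mse

for n in [100, 1000, 10000]:
    rel, mse = lazy_training_experiment(n)
    print(f"width {n:6d}: relative change in hidden weights = {rel:.4f}, final mse = {mse:.5f}")
```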


  1. Note that we apply a linear activation in the final layer of this neural network. We will use the notation F^{(l)} (the output of an l-layer network) and z^{(l)} (the output of the l-th layer) somewhat interchangeably.
  2. Theorem 4.4.5 of Vershynin, High-Dimensional Probability (2018), gives tighter concentration bounds here.
