Neural network Gaussian process

From HandWiki
Short description: The distribution over functions corresponding to an infinitely wide Bayesian neural network.

A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit of a certain type of sequence of neural networks. Specifically, a wide variety of network architectures converges to a GP in the infinitely wide limit, in the sense of distribution.[1][2][3][4][5][6][7][8] The concept constitutes an intensional definition, i.e., a NNGP is just a GP, but distinguished by how it is obtained.

Motivation

Bayesian networks are a modeling tool for assigning probabilities to events, and thereby characterizing the uncertainty in a model's predictions. Deep learning and artificial neural networks are approaches used in machine learning to build computational models which learn from training examples. Bayesian neural networks merge these fields. They are a type of neural network whose parameters and predictions are both probabilistic.[9][10] While standard neural networks often assign high confidence even to incorrect predictions,[11] Bayesian neural networks can more accurately evaluate how likely their predictions are to be correct.

File:Infinitely wide neural network.webm

Computation in artificial neural networks is usually organized into sequential layers of artificial neurons. The number of neurons in a layer is called the layer width. When we consider a sequence of Bayesian neural networks with increasingly wide layers (see figure), they converge in distribution to a NNGP. This large width limit is of practical interest, since the networks often improve as layers get wider.[12][13][4][14] And the process may give a closed form way to evaluate networks.

NNGPs also appears in several other contexts: It describes the distribution over predictions made by wide non-Bayesian artificial neural networks after random initialization of their parameters, but before training; it appears as a term in neural tangent kernel prediction equations; it is used in deep information propagation to characterize whether hyperparameters and architectures will be trainable.[15] It is related to other large width limits of neural networks.

Scope

The first correspondence result had been established in the 1995 PhD thesis of Radford M. Neal,[16] then supervised by Geoffrey Hinton at University of Toronto. Neal cites David J. C. MacKay as inspiration, who worked in Bayesian learning.

Today the correspondence is proven for: Single hidden layer Bayesian neural networks;[16] deep[2][3] fully connected networks as the number of units per layer is taken to infinity; convolutional neural networks as the number of channels is taken to infinity;[4][5][6] transformer networks as the number of attention heads is taken to infinity;[17] recurrent networks as the number of units is taken to infinity.[8] In fact, this NNGP correspondence holds for almost any architecture: Generally, if an architecture can be expressed solely via matrix multiplication and coordinatewise nonlinearities (i.e., a tensor program), then it has an infinite-width GP.[8] This in particular includes all feedforward or recurrent neural networks composed of multilayer perceptron, recurrent neural networks (e.g., LSTMs, GRUs), (nD or graph) convolution, pooling, skip connection, attention, batch normalization, and/or layer normalization.

Illustration

When parameters [math]\displaystyle{ \theta }[/math] of an infinite width network are sampled repeatedly from their prior [math]\displaystyle{ p(\theta) }[/math], the resulting distribution over network outputs is described by a Gaussian process.

Every setting of a neural network's parameters [math]\displaystyle{ \theta }[/math] corresponds to a specific function computed by the neural network. A prior distribution [math]\displaystyle{ p(\theta) }[/math] over neural network parameters therefore corresponds to a prior distribution over functions computed by the network. As neural networks are made infinitely wide, this distribution over functions converges to a Gaussian process for many architectures.

The notation used in this section is the same as the notation used below to derive the correspondence between NNGPs and fully connected networks, and more details can be found there.

The figure to the right plots the one-dimensional outputs [math]\displaystyle{ z^L(\cdot;\theta) }[/math] of a neural network for two inputs [math]\displaystyle{ x }[/math] and [math]\displaystyle{ x^* }[/math] against each other. The black dots show the function computed by the neural network on these inputs for random draws of the parameters from [math]\displaystyle{ p(\theta) }[/math]. The red lines are iso-probability contours for the joint distribution over network outputs [math]\displaystyle{ z^L(x;\theta) }[/math] and [math]\displaystyle{ z^L(x^*;\theta) }[/math] induced by [math]\displaystyle{ p(\theta) }[/math]. This is the distribution in function space corresponding to the distribution [math]\displaystyle{ p(\theta) }[/math] in parameter space, and the black dots are samples from this distribution. For infinitely wide neural networks, since the distribution over functions computed by the neural network is a Gaussian process, the joint distribution over network outputs is a multivariate Gaussian for any finite set of network inputs.

Discussion

Infinitely wide fully connected network

This section expands on the correspondence between infinitely wide neural networks and Gaussian processes for the specific case of a fully connected architecture. It provides a proof sketch outlining why the correspondence holds, and introduces the specific functional form of the NNGP for fully connected networks. The proof sketch closely follows the approach by Novak and coauthors.[4]

Network architecture specification

An NNGP is derived which is equivalent to a Bayesian neural network with this fully connected architecture.

Consider a fully connected artificial neural network with inputs [math]\displaystyle{ x }[/math], parameters [math]\displaystyle{ \theta }[/math] consisting of weights [math]\displaystyle{ W^l }[/math] and biases [math]\displaystyle{ b^l }[/math] for each layer [math]\displaystyle{ l }[/math] in the network, pre-activations (pre-nonlinearity) [math]\displaystyle{ z^l }[/math], activations (post-nonlinearity) [math]\displaystyle{ y^l }[/math], pointwise nonlinearity [math]\displaystyle{ \phi(\cdot) }[/math], and layer widths [math]\displaystyle{ n^l }[/math]. For simplicity, the width [math]\displaystyle{ n^{L+1} }[/math] of the readout vector [math]\displaystyle{ z^L }[/math] is taken to be 1. The parameters of this network have a prior distribution [math]\displaystyle{ p(\theta) }[/math], which consists of an isotropic Gaussian for each weight and bias, with the variance of the weights scaled inversely with layer width. This network is illustrated in the figure to the right, and described by the following set of equations:

[math]\displaystyle{ \begin{align} x &\equiv \text{input} \\ y^l(x) &= \left\{\begin{array}{lcl} x & & l = 0 \\ \phi\left(z^{l-1}(x)\right) & & l \gt 0 \end{array}\right. \\ z^l_i(x) &= \sum_j W^l_{ij} y^l_j(x) + b^l_i \\ W^l_{ij} &\sim \mathcal N\left( 0, \frac{\sigma^2_w}{n^l} \right) \\ b^l_i &\sim \mathcal N\left( 0,\sigma^2_b \right) \\ \phi(\cdot) &\equiv \text{nonlinearity} \\ y^l(x), z^{l-1}(x) &\in \mathbb R^{n^l \times 1} \\ n^{L+1} &= 1 \\ \theta &= \left\{ W^0, b^0, \dots, W^L, b^L \right\} \end{align} }[/math]

[math]\displaystyle{ z^l | y^l }[/math] is a Gaussian process

We first observe that the pre-activations [math]\displaystyle{ z^l }[/math] are described by a Gaussian process conditioned on the preceding activations [math]\displaystyle{ y^l }[/math]. This result holds even at finite width. Each pre-activation [math]\displaystyle{ z^l_i }[/math] is a weighted sum of Gaussian random variables, corresponding to the weights [math]\displaystyle{ W^l_{ij} }[/math] and biases [math]\displaystyle{ b^l_i }[/math], where the coefficients for each of those Gaussian variables are the preceding activations [math]\displaystyle{ y^l_j }[/math]. Because they are a weighted sum of zero-mean Gaussians, the [math]\displaystyle{ z^l_i }[/math] are themselves zero-mean Gaussians (conditioned on the coefficients [math]\displaystyle{ y^l_j }[/math]). Since the [math]\displaystyle{ z^l }[/math] are jointly Gaussian for any set of [math]\displaystyle{ y^l }[/math], they are described by a Gaussian process conditioned on the preceding activations [math]\displaystyle{ y^l }[/math]. The covariance or kernel of this Gaussian process depends on the weight and bias variances [math]\displaystyle{ \sigma_w^2 }[/math] and [math]\displaystyle{ \sigma_b^2 }[/math], as well as the second moment matrix [math]\displaystyle{ K^l }[/math] of the preceding activations [math]\displaystyle{ y^l }[/math],

[math]\displaystyle{ \begin{align} z^l_i \mid y^l &\sim \mathcal{GP}\left( 0, \sigma^2_w K^l + \sigma^2_b \right) \\ K^l(x, x') &= \frac{1}{n^l} \sum_i y_i^l(x) y_i^l(x') \end{align} }[/math]

The effect of the weight scale [math]\displaystyle{ \sigma^2_w }[/math] is to rescale the contribution to the covariance matrix from [math]\displaystyle{ K^l }[/math], while the bias is shared for all inputs, and so [math]\displaystyle{ \sigma_b^2 }[/math] makes the [math]\displaystyle{ z^l_i }[/math] for different datapoints more similar and makes the covariance matrix more like a constant matrix.

[math]\displaystyle{ z^l | K^l }[/math] is a Gaussian process

The pre-activations [math]\displaystyle{ z^l }[/math] only depend on [math]\displaystyle{ y^l }[/math] through its second moment matrix [math]\displaystyle{ K^l }[/math]. Because of this, we can say that [math]\displaystyle{ z^l }[/math] is a Gaussian process conditioned on [math]\displaystyle{ K^l }[/math], rather than conditioned on [math]\displaystyle{ y^l }[/math],

[math]\displaystyle{ \begin{align} z^l_i \mid K^l &\sim \mathcal{GP}\left( 0, \sigma^2_w K^l + \sigma^2_b \right). \end{align} }[/math]

As layer width [math]\displaystyle{ n^l \rightarrow \infty }[/math], [math]\displaystyle{ K^l \mid K^{l-1} }[/math] becomes deterministic

As previously defined, [math]\displaystyle{ K^l }[/math] is the second moment matrix of [math]\displaystyle{ y^l }[/math]. Since [math]\displaystyle{ y^l }[/math] is the activation vector after applying the nonlinearity [math]\displaystyle{ \phi }[/math], it can be replaced by [math]\displaystyle{ \phi\left(z^{l-1}\right) }[/math], resulting in a modified equation expressing [math]\displaystyle{ K^l }[/math] for [math]\displaystyle{ l\gt 0 }[/math] in terms of [math]\displaystyle{ z^{l-1} }[/math],

[math]\displaystyle{ \begin{align} K^l(x, x') &= \frac{1}{n^l} \sum_i \phi\left( z^{l-1}_i(x) \right) \phi\left( z^{l-1}_i(x') \right) . \end{align} }[/math]

We have already determined that [math]\displaystyle{ z^{l-1} | K^{l-1} }[/math] is a Gaussian process. This means that the sum defining [math]\displaystyle{ K^l }[/math] is an average over [math]\displaystyle{ n^l }[/math] samples from a Gaussian process which is a function of [math]\displaystyle{ K^{l-1} }[/math],

[math]\displaystyle{ \begin{align} \left\{ z^{l-1}_i(x), z^{l-1}_i(x') \right\} &\sim \mathcal{GP}\left( 0, \sigma^2_w K^{l-1} + \sigma^2_b \right) . \end{align} }[/math]

As the layer width [math]\displaystyle{ n^l }[/math] goes to infinity, this average over [math]\displaystyle{ n^l }[/math] samples from the Gaussian process can be replaced with an integral over the Gaussian process:

[math]\displaystyle{ \begin{align} \lim_{n^l \rightarrow \infty} K^l(x, x') &= \int dz\, dz'\, \phi( z )\, \phi( z' )\, \mathcal{N}\left( \left[\begin{array}{c} z \\ z' \end{array}\right]; 0, \sigma^2_w \left[ \begin{array}{cc} K^{l-1}(x, x) & K^{l-1}(x, x') \\ K^{l-1}(x', x) & K^{l-1}(x', x') \end{array} \right] + \sigma^2_b \right) \end{align} }[/math]

So, in the infinite width limit the second moment matrix [math]\displaystyle{ K^l }[/math] for each pair of inputs [math]\displaystyle{ x }[/math] and [math]\displaystyle{ x' }[/math] can be expressed as an integral over a 2d Gaussian, of the product of [math]\displaystyle{ \phi(z) }[/math] and [math]\displaystyle{ \phi(z') }[/math]. There are a number of situations where this has been solved analytically, such as when [math]\displaystyle{ \phi(\cdot) }[/math] is a ReLU,[18] ELU, GELU,[19] or error function[1] nonlinearity. Even when it can't be solved analytically, since it is a 2d integral it can generally be efficiently computed numerically.[2] This integral is deterministic, so [math]\displaystyle{ K^l | K^{l-1} }[/math] is deterministic.

For shorthand, we define a functional [math]\displaystyle{ F }[/math], which corresponds to computing this 2d integral for all pairs of inputs, and which maps [math]\displaystyle{ K^{l-1} }[/math] into [math]\displaystyle{ K^l }[/math],

[math]\displaystyle{ \begin{align} \lim_{n^l \rightarrow \infty} K^l &= F\left( K^{l-1} \right) . \end{align} }[/math]

[math]\displaystyle{ z^L \mid x }[/math] is an NNGP

By recursively applying the observation that [math]\displaystyle{ K^l \mid K^{l-1} }[/math] is deterministic as [math]\displaystyle{ n^l \rightarrow \infty }[/math], [math]\displaystyle{ K^L }[/math] can be written as a deterministic function of [math]\displaystyle{ K^0 }[/math],

[math]\displaystyle{ \begin{align} \lim_{\min\left( n^1, \dots, n^L\right) \rightarrow \infty} K^L &= F \circ F \cdots \left( K^{0} \right) = F^L\left(K^0\right) , \end{align} }[/math]

where [math]\displaystyle{ F^L }[/math] indicates applying the functional [math]\displaystyle{ F }[/math] sequentially [math]\displaystyle{ L }[/math] times. By combining this expression with the further observations that the input layer second moment matrix [math]\displaystyle{ K^0(x,x')=\tfrac{1}{n^0} \sum_i x_i x'_i }[/math] is a deterministic function of the input [math]\displaystyle{ x }[/math], and that [math]\displaystyle{ z^L | K^L }[/math] is a Gaussian process, the output of the neural network can be expressed as a Gaussian process in terms of its input,

[math]\displaystyle{ \begin{align} z^L_i(x) &\sim \mathcal{GP}\left( 0, \sigma^2_w F^L\left(K^0\right) + \sigma^2_b \right) . \end{align} }[/math]

Software libraries

Neural Tangents is a free and open-source Python library used for computing and doing inference with the NNGP and neural tangent kernel corresponding to various common ANN architectures.[20]

References

  1. 1.0 1.1 Williams, Christopher K. I. (1997). "Computing with infinite networks". Neural Information Processing Systems. 
  2. 2.0 2.1 2.2 Lee, Jaehoon; Bahri, Yasaman; Novak, Roman; Schoenholz, Samuel S.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2017). "Deep Neural Networks as Gaussian Processes". International Conference on Learning Representations. Bibcode2017arXiv171100165L. 
  3. 3.0 3.1 G. de G. Matthews, Alexander; Rowland, Mark; Hron, Jiri; Turner, Richard E.; Ghahramani, Zoubin (2017). "Gaussian Process Behaviour in Wide Deep Neural Networks". International Conference on Learning Representations. Bibcode2018arXiv180411271M. 
  4. 4.0 4.1 4.2 4.3 Novak, Roman; Xiao, Lechao; Lee, Jaehoon; Bahri, Yasaman; Yang, Greg; Abolafia, Dan; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018). "Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes". International Conference on Learning Representations. Bibcode2018arXiv181005148N. 
  5. 5.0 5.1 Garriga-Alonso, Adrià; Aitchison, Laurence; Rasmussen, Carl Edward (2018). "Deep Convolutional Networks as shallow Gaussian Processes". International Conference on Learning Representations. Bibcode2018arXiv180805587G. 
  6. 6.0 6.1 Borovykh, Anastasia (2018). "A Gaussian Process perspective on Convolutional Neural Networks". arXiv:1810.10798 [stat.ML].
  7. Tsuchida, Russell; Pearce, Tim; van der Heide, Christopher; Roosta, Fred; Gallagher, Marcus (2020). "Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks". arXiv:2002.08517 [cs.LG].
  8. 8.0 8.1 8.2 Yang, Greg (2019). "Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes". Advances in Neural Information Processing Systems. Bibcode2019arXiv191012478Y. https://papers.nips.cc/paper/9186-wide-feedforward-or-recurrent-neural-networks-of-any-architecture-are-gaussian-processes.pdf. 
  9. MacKay, David J. C. (1992). "A Practical Bayesian Framework for Backpropagation Networks". Neural Computation 4 (3): 448–472. doi:10.1162/neco.1992.4.3.448. ISSN 0899-7667. https://resolver.caltech.edu/CaltechAUTHORS:MACnc92b. 
  10. Neal, Radford M. (2012). Bayesian Learning for Neural Networks. Springer Science and Business Media. 
  11. Guo, Chuan; Pleiss, Geoff; Sun, Yu; Weinberger, Kilian Q. (2017). "On calibration of modern neural networks". Proceedings of the 34th International Conference on Machine Learning-Volume 70. 
  12. Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018-02-15). "Sensitivity and Generalization in Neural Networks: an Empirical Study". International Conference on Learning Representations. Bibcode2018arXiv180208760N. https://openreview.net/forum?id=HJC2SzZCW. 
  13. Canziani, Alfredo; Paszke, Adam; Culurciello, Eugenio (2016-11-04). An Analysis of Deep Neural Network Models for Practical Applications. Bibcode2016arXiv160507678C. https://openreview.net/forum?id=Bygq-H9eg. 
  14. Neyshabur, Behnam; Li, Zhiyuan; Bhojanapalli, Srinadh; LeCun, Yann; Srebro, Nathan (2019). "Towards understanding the role of over-parametrization in generalization of neural networks". International Conference on Learning Representations. Bibcode2018arXiv180512076N. 
  15. Schoenholz, Samuel S.; Gilmer, Justin; Ganguli, Surya; Sohl-Dickstein, Jascha (2016). "Deep information propagation". International Conference on Learning Representations. 
  16. 16.0 16.1 Neal, Radford M. (1996), "Priors for Infinite Networks", Bayesian Learning for Neural Networks, Lecture Notes in Statistics, 118, Springer New York, pp. 29–53, doi:10.1007/978-1-4612-0745-0_2, ISBN 978-0-387-94724-2 
  17. Hron, Jiri; Bahri, Yasaman; Sohl-Dickstein, Jascha; Novak, Roman (2020-06-18). "Infinite attention: NNGP and NTK for deep attention networks". International Conference on Machine Learning 2020. Bibcode2020arXiv200610540H. 
  18. Cho, Youngmin; Saul, Lawrence K. (2009). "Kernel Methods for Deep Learning". Neural Information Processing Systems 22: 342–350. http://papers.nips.cc/paper/3628-kernel-methods-for-deep-. 
  19. Tsuchida, Russell; Pearce, Tim; van der Heide, Christopher; Roosta, Fred; Gallagher, Marcus (2020). "Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks". arXiv:2002.08517 [cs.LG].
  20. Novak, Roman; Xiao, Lechao; Hron, Jiri; Lee, Jaehoon; Alemi, Alexander A.; Sohl-Dickstein, Jascha; Schoenholz, Samuel S. (2019-12-05), "Neural Tangents: Fast and Easy Infinite Neural Networks in Python", International Conference on Learning Representations (ICLR) 2020, Bibcode2019arXiv191202803N