Radial basis function kernel

From HandWiki

Latest revision as of 10:03, 14 April 2026

Short description: Machine learning kernel function

In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification.[1]

The RBF kernel on two samples $\mathbf{x}, \mathbf{x'} \in \mathbb{R}^{k}$, represented as feature vectors in some input space, is defined as[2]

$$K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x'}\|^2}{2\sigma^2}\right)$$

$\|\mathbf{x} - \mathbf{x'}\|^2$ may be recognized as the squared Euclidean distance between the two feature vectors. $\sigma$ is a free parameter. An equivalent definition involves a parameter $\gamma = \tfrac{1}{2\sigma^2}$:

$$K(\mathbf{x}, \mathbf{x'}) = \exp(-\gamma\|\mathbf{x} - \mathbf{x'}\|^2)$$
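As a concrete illustration (a NumPy sketch, not part of the original article; function names are ours), the two parameterizations can be computed directly and agree exactly when $\gamma = 1/(2\sigma^2)$:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def rbf_kernel_gamma(x, y, gamma=0.5):
    """Equivalent form K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
sigma = 0.8
# The two forms coincide when gamma = 1 / (2 * sigma^2).
print(rbf_kernel(x, y, sigma), rbf_kernel_gamma(x, y, 1.0 / (2.0 * sigma ** 2)))
```

Note that `rbf_kernel(x, x)` is exactly 1 and the value decays toward 0 as the points move apart, matching the similarity-measure interpretation below.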

Since the value of the RBF kernel decreases with distance and ranges between zero (in the infinite-distance limit) and one (when $\mathbf{x} = \mathbf{x'}$), it has a ready interpretation as a similarity measure.[2] The feature space of the kernel has an infinite number of dimensions; for $\sigma = 1$, its expansion using the multinomial theorem is:[3]

$$\begin{align}
\exp\left(-\frac{1}{2}\|\mathbf{x} - \mathbf{x'}\|^2\right)
&= \exp\left(\frac{2}{2}\mathbf{x}^\top \mathbf{x'} - \frac{1}{2}\|\mathbf{x}\|^2 - \frac{1}{2}\|\mathbf{x'}\|^2\right)\\
&= \exp\left(\mathbf{x}^\top \mathbf{x'}\right) \exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right) \exp\left(-\frac{1}{2}\|\mathbf{x'}\|^2\right)\\
&= \sum_{j=0}^\infty \frac{(\mathbf{x}^\top \mathbf{x'})^j}{j!} \exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right) \exp\left(-\frac{1}{2}\|\mathbf{x'}\|^2\right)\\
&= \sum_{j=0}^\infty \sum_{n_1+n_2+\dots+n_k=j} \exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right) \frac{x_1^{n_1}\cdots x_k^{n_k}}{\sqrt{n_1!\cdots n_k!}}\, \exp\left(-\frac{1}{2}\|\mathbf{x'}\|^2\right) \frac{{x'_1}^{n_1}\cdots {x'_k}^{n_k}}{\sqrt{n_1!\cdots n_k!}}\\
&= \langle \varphi(\mathbf{x}), \varphi(\mathbf{x'}) \rangle
\end{align}$$

$$\varphi(\mathbf{x}) = \exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right) \left(a^{(0)}_{\ell_0}, a^{(1)}_1, \dots, a^{(1)}_{\ell_1}, \dots, a^{(j)}_1, \dots, a^{(j)}_{\ell_j}, \dots\right)$$
where $\ell_j = \tbinom{k+j-1}{j}$ and
$$a^{(j)}_{\ell} = \frac{x_1^{n_1}\cdots x_k^{n_k}}{\sqrt{n_1!\cdots n_k!}} \quad\Big|\quad n_1+n_2+\dots+n_k = j \,\wedge\, 1 \leq \ell \leq \ell_j$$
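The series derivation above can be checked numerically: truncating the sum $\sum_j (\mathbf{x}^\top\mathbf{x'})^j/j! \cdot e^{-\|\mathbf{x}\|^2/2}\, e^{-\|\mathbf{x'}\|^2/2}$ after a handful of terms already reproduces the kernel value. A small NumPy sketch (illustrative only; names are ours):

```python
import math
import numpy as np

def rbf_sigma1(x, y):
    """RBF kernel with sigma = 1."""
    return np.exp(-0.5 * np.sum((x - y) ** 2))

def truncated_expansion(x, y, terms=20):
    """Partial sum of the series form of the kernel:
    sum_{j < terms} (x.y)^j / j! * exp(-||x||^2 / 2) * exp(-||y||^2 / 2)."""
    dot = np.dot(x, y)
    series = sum(dot ** j / math.factorial(j) for j in range(terms))
    return series * np.exp(-0.5 * np.dot(x, x)) * np.exp(-0.5 * np.dot(y, y))

x = np.array([0.5, 0.3])
y = np.array([0.2, -0.4])
print(rbf_sigma1(x, y), truncated_expansion(x, y, terms=20))
```

The rapid agreement reflects the factorial decay of the series coefficients; the fourth line of the derivation just re-expands each $(\mathbf{x}^\top\mathbf{x'})^j$ with the multinomial theorem to expose the feature map $\varphi$.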

Approximations

Because support vector machines and other models employing the kernel trick do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been introduced.[4] Typically, these take the form of a function z that maps a single vector to a vector of higher dimensionality, approximating the kernel:

$$\langle z(\mathbf{x}), z(\mathbf{x'}) \rangle \approx \langle \varphi(\mathbf{x}), \varphi(\mathbf{x'}) \rangle = K(\mathbf{x}, \mathbf{x'})$$

where $\varphi$ is the implicit mapping embedded in the RBF kernel.

Fourier random features

One way to construct such a z is to randomly sample from the Fourier transformation of the kernel:[5]

$$\varphi(\mathbf{x}) = \frac{1}{\sqrt{D}} \left[\cos\langle w_1, x\rangle, \sin\langle w_1, x\rangle, \dots, \cos\langle w_D, x\rangle, \sin\langle w_D, x\rangle\right]^T$$

where $w_1, \dots, w_D$ are independent samples from the normal distribution $N(0, \sigma^{-2}I)$.

Theorem: $\operatorname{E}[\langle \varphi(x), \varphi(y)\rangle] = e^{-\|x-y\|^2/(2\sigma^2)}$.

Proof: It suffices to prove the case of $D=1$. Use the trigonometric identity $\cos(a-b) = \cos(a)\cos(b) + \sin(a)\sin(b)$ and the spherical symmetry of the Gaussian distribution, then evaluate the integral

$$\int_{-\infty}^\infty \frac{\cos(kx)\, e^{-x^2/2}}{\sqrt{2\pi}}\, dx = e^{-k^2/2}.$$
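This integral identity is easy to sanity-check numerically with a plain Riemann sum (an illustrative check, not part of the article; the test frequency is arbitrary):

```python
import numpy as np

k = 1.7                                 # arbitrary test frequency
x = np.linspace(-10.0, 10.0, 200_001)   # the Gaussian tail beyond |x|=10 is negligible
dx = x[1] - x[0]
integrand = np.cos(k * x) * np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
lhs = np.sum(integrand) * dx            # Riemann-sum approximation of the integral
rhs = np.exp(-k ** 2 / 2.0)
print(lhs, rhs)
```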

Theorem: $\operatorname{Var}[\langle \varphi(x), \varphi(y)\rangle] = O(D^{-1})$. (Appendix A.2[6]).
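Putting the two theorems together: with enough sampled frequencies, the random-feature inner product concentrates around the exact kernel value. A NumPy sketch under the definitions above (the function name and feature ordering are our choices; grouping the cosines and sines rather than interleaving them does not change the inner product):

```python
import numpy as np

def random_fourier_features(X, D, sigma=1.0, seed=0):
    """Map each row x of X to
    z(x) = (1/sqrt(D)) [cos<w_1,x>, ..., cos<w_D,x>, sin<w_1,x>, ..., sin<w_D,x>]
    with w_i drawn i.i.d. from N(0, sigma^{-2} I)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(D, X.shape[1]))
    proj = X @ W.T                      # (n, D) matrix of inner products <w_i, x>
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

X = np.array([[0.0, 0.0], [1.0, -1.0]])
Z = random_fourier_features(X, D=5000)
approx = Z[0] @ Z[1]                    # <z(x), z(y)>
exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / 2.0)   # K(x, y) with sigma = 1
print(approx, exact)
```

By the variance theorem, the approximation error shrinks like $O(D^{-1/2})$, so doubling the accuracy costs roughly four times as many features.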

Nyström method

Another approach uses the Nyström method to approximate the eigendecomposition of the Gram matrix K, using only a random sample of the training set.[7]
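A compact sketch of the idea (our own illustrative NumPy code, not the cited paper's implementation): build features from a landmark subset of the data so that $z(X)\, z(X)^\top = K_{nm} K_{mm}^{-1} K_{mn}$ approximates the full Gram matrix.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Pairwise RBF kernel matrix between rows of A and rows of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma ** 2))

def nystrom_features(X, landmarks, sigma=1.0):
    """Nystrom feature map z(X) = K_nm @ K_mm^{-1/2}, so that
    z(X) @ z(X).T = K_nm @ K_mm^{-1} @ K_mn approximates the Gram matrix."""
    K_mm = rbf_gram(landmarks, landmarks, sigma)
    K_nm = rbf_gram(X, landmarks, sigma)
    U, s, _ = np.linalg.svd(K_mm)
    keep = s > 1e-10 * s[0]             # drop near-null directions for stability
    inv_sqrt = (U[:, keep] / np.sqrt(s[keep])) @ U[:, keep].T
    return K_nm @ inv_sqrt

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
landmarks = X[:10]                      # in practice: a random sample of the training set
Z = nystrom_features(X, landmarks)
K_approx = Z @ Z.T                      # 50x50 approximation to the full Gram matrix
```

When the landmarks are the whole training set the approximation is exact; using only m ≪ n landmarks reduces the cost from working with an n × n Gram matrix to an n × m feature matrix.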

References

  1. Chang, Yin-Wen; Hsieh, Cho-Jui; Chang, Kai-Wei; Ringgaard, Michael; Lin, Chih-Jen (2010). "Training and testing low-degree polynomial data mappings via linear SVM". Journal of Machine Learning Research 11: 1471–1490. https://jmlr.org/papers/v11/chang10a.html. 
  2. Jean-Philippe Vert, Koji Tsuda, and Bernhard Schölkopf (2004). "A primer on kernel methods". Kernel Methods in Computational Biology.
  3. Shashua, Amnon (2009). "Introduction to Machine Learning: Class Notes 67577". arXiv:0904.3664v1 [cs.LG].
  4. Andreas Müller (2012). Kernel Approximations for Efficient SVMs (and other feature extraction methods).
  5. Rahimi, Ali; Recht, Benjamin (2007). "Random Features for Large-Scale Kernel Machines". Advances in Neural Information Processing Systems (Curran Associates, Inc.) 20. https://proceedings.neurips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html. 
  6. Peng, Hao; Pappas, Nikolaos; Yogatama, Dani; Schwartz, Roy; Smith, Noah A.; Kong, Lingpeng (2021-03-19). "Random Feature Attention". arXiv:2103.02143 [cs.CL].
  7. C.K.I. Williams; M. Seeger (2001). "Using the Nyström method to speed up kernel machines". Advances in Neural Information Processing Systems 13. https://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-kernel-machines.