Restricted Boltzmann machine
A restricted Boltzmann machine (RBM) (also called a restricted Sherrington–Kirkpatrick model with external field or restricted stochastic Ising–Lenz–Little model) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.[1]
RBMs were initially proposed under the name Harmonium by Paul Smolensky in 1986,[2] and rose to prominence after Geoffrey Hinton and collaborators used fast learning algorithms for them in the mid-2000s. RBMs have found applications in dimensionality reduction,[3] classification,[4] collaborative filtering,[5] feature learning,[6] topic modelling[7] and even many-body quantum mechanics.[8][9] They can be trained in either supervised or unsupervised ways, depending on the task.
As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes, one drawn from each of the two groups of units (commonly referred to as the "visible" and "hidden" units, respectively), may have a symmetric connection between them, and there are no connections between nodes within a group. By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.[10]
Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by "stacking" RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.[11]
Structure
The standard type of RBM has binary-valued (Boolean) hidden and visible units, and consists of a matrix of weights [math]\displaystyle{ W }[/math] of size [math]\displaystyle{ m\times n }[/math]. Each weight element [math]\displaystyle{ (w_{i,j}) }[/math] of the matrix is associated with the connection between the visible (input) unit [math]\displaystyle{ v_i }[/math] and the hidden unit [math]\displaystyle{ h_j }[/math]. In addition, there are bias weights (offsets) [math]\displaystyle{ a_i }[/math] for [math]\displaystyle{ v_i }[/math] and [math]\displaystyle{ b_j }[/math] for [math]\displaystyle{ h_j }[/math]. Given the weights and biases, the energy of a configuration (pair of boolean vectors) (v,h) is defined as
- [math]\displaystyle{ E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j -\sum_i \sum_j v_i w_{i,j} h_j }[/math]
or, in matrix notation,
- [math]\displaystyle{ E(v,h) = -a^{\mathrm{T}} v - b^{\mathrm{T}} h -v^{\mathrm{T}} W h. }[/math]
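As an illustration, the energy can be computed directly from the matrix form above. The following is a minimal NumPy sketch; the names W, a, b, v, h mirror the notation above, and the code is illustrative rather than taken from any cited implementation:

```python
import numpy as np

def energy(v, h, W, a, b):
    """E(v, h) = -a^T v - b^T h - v^T W h for binary vectors v (visible) and h (hidden)."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Tiny example: m = 3 visible units, n = 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))   # weights w_{i,j}, shape (m, n)
a = np.zeros(3)                          # visible biases a_i
b = np.zeros(2)                          # hidden biases b_j
v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
print(energy(v, h, W, a, b))
```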
This energy function is analogous to that of a Hopfield network. As with general Boltzmann machines, the joint probability distribution for the visible and hidden vectors is defined in terms of the energy function as follows,[12]
- [math]\displaystyle{ P(v,h) = \frac{1}{Z} e^{-E(v,h)} }[/math]
where [math]\displaystyle{ Z }[/math] is a partition function defined as the sum of [math]\displaystyle{ e^{-E(v,h)} }[/math] over all possible configurations, which can be interpreted as a normalizing constant to ensure that the probabilities sum to 1. The marginal probability of a visible vector is the sum of [math]\displaystyle{ P(v,h) }[/math] over all possible hidden layer configurations,[12]
- [math]\displaystyle{ P(v) = \frac{1}{Z} \sum_{\{h\}} e^{-E(v,h)} }[/math],
and vice versa. Since the underlying graph structure of the RBM is bipartite (meaning there are no intra-layer connections), the hidden unit activations are mutually independent given the visible unit activations. Conversely, the visible unit activations are mutually independent given the hidden unit activations.[10] That is, for m visible units and n hidden units, the conditional probability of a configuration of the visible units v, given a configuration of the hidden units h, is
- [math]\displaystyle{ P(v|h) = \prod_{i=1}^m P(v_i|h) }[/math].
Conversely, the conditional probability of h given v is
- [math]\displaystyle{ P(h|v) = \prod_{j=1}^n P(h_j|v) }[/math].
The individual activation probabilities are given by
- [math]\displaystyle{ P(h_j=1|v) = \sigma \left(b_j + \sum_{i=1}^m w_{i,j} v_i \right) }[/math] and [math]\displaystyle{ \,P(v_i=1|h) = \sigma \left(a_i + \sum_{j=1}^n w_{i,j} h_j \right) }[/math]
where [math]\displaystyle{ \sigma }[/math] denotes the logistic sigmoid.
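Because both conditionals factorize, an entire layer can be sampled in one vectorized step. A minimal NumPy sketch, continuing the notation above (the helper names are illustrative and not from any particular library):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """Sample h ~ P(h|v); each h_j is Bernoulli with p_j = sigma(b_j + sum_i w_{i,j} v_i)."""
    p = sigmoid(b + v @ W)                              # shape (n,)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, a, rng):
    """Sample v ~ P(v|h); each v_i is Bernoulli with p_i = sigma(a_i + sum_j w_{i,j} h_j)."""
    p = sigmoid(a + W @ h)                              # shape (m,)
    return (rng.random(p.shape) < p).astype(float), p
```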
The visible units of a restricted Boltzmann machine can also be multinomial (each taking one of K discrete values), while the hidden units remain Bernoulli. In this case, the logistic function for the visible units is replaced by the softmax function
- [math]\displaystyle{ P(v_i^k = 1|h) = \frac{\exp(a_i^k + \Sigma_j W_{ij}^k h_j)} {\Sigma_{k'=1}^K \exp(a_i^{k'} + \Sigma_j W_{ij}^{k'} h_j)} }[/math]
where K is the number of discrete values that the visible units can take. This variant is applied in topic modeling[7] and recommender systems.[5]
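As a rough sketch of this multinomial case, each visible unit becomes a one-hot variable over K categories and its weights gain an extra index k; the conditional is then a softmax over the K values. The code below assumes a weight tensor of shape (m, K, n), which is an illustrative layout rather than one prescribed by the cited sources:

```python
import numpy as np

def softmax_visible_probs(h, W, a):
    """P(v_i = k | h) = softmax_k( a_i^k + sum_j W_{ij}^k h_j ) for each visible unit i.

    W has shape (m, K, n): m multinomial visible units, K categories, n hidden units.
    a has shape (m, K); h has shape (n,).
    """
    logits = a + W @ h                                   # shape (m, K)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```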
Relation to other models
Restricted Boltzmann machines are a special case of Boltzmann machines and Markov random fields.[13][14] Their graphical model corresponds to that of factor analysis.[15]
Training algorithm
Restricted Boltzmann machines are trained to maximize the product of probabilities assigned to some training set [math]\displaystyle{ V }[/math] (a matrix, each row of which is treated as a visible vector [math]\displaystyle{ v }[/math]),
- [math]\displaystyle{ \arg\max_W \prod_{v \in V} P(v) }[/math]
or equivalently, to maximize the expected log probability of a training sample [math]\displaystyle{ v }[/math] selected randomly from [math]\displaystyle{ V }[/math]:[13][14]
- [math]\displaystyle{ \arg\max_W \mathbb{E} \left[ \log P(v)\right] }[/math]
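For very small models this objective can be evaluated exactly by enumerating all configurations, which is useful for sanity-checking a trainer. An illustrative sketch reusing the energy function above (brute force is exponential in the number of units, so it is only viable for toy sizes):

```python
import itertools
import numpy as np

def log_prob_visible(v, W, a, b):
    """Exact log P(v) for a tiny binary RBM, by summing over all hidden and visible configurations."""
    m, n = W.shape
    # Unnormalized term: log sum_h exp(-E(v, h)).
    log_num = np.logaddexp.reduce(
        [-energy(v, np.array(h, dtype=float), W, a, b)
         for h in itertools.product([0, 1], repeat=n)])
    # Log partition function: log sum_{v', h} exp(-E(v', h)).
    log_Z = np.logaddexp.reduce(
        [-energy(np.array(vp, dtype=float), np.array(h, dtype=float), W, a, b)
         for vp in itertools.product([0, 1], repeat=m)
         for h in itertools.product([0, 1], repeat=n)])
    return log_num - log_Z
```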
The algorithm most often used to train RBMs, that is, to optimize the weight matrix [math]\displaystyle{ W }[/math], is the contrastive divergence (CD) algorithm due to Hinton, originally developed to train PoE (product of experts) models.[16][17] The algorithm performs Gibbs sampling and is used inside a gradient descent procedure (similar to the way backpropagation is used inside such a procedure when training feedforward neural nets) to compute weight updates.
The basic, single-step contrastive divergence (CD-1) procedure for a single sample can be summarized as follows (a code sketch of these steps appears after the list):
- Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.
- Compute the outer product of v and h and call this the positive gradient.
- From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this. (Gibbs sampling step)
- Compute the outer product of v' and h' and call this the negative gradient.
- Let the update to the weight matrix [math]\displaystyle{ W }[/math] be the positive gradient minus the negative gradient, times some learning rate: [math]\displaystyle{ \Delta W = \epsilon (vh^\mathsf{T} - v'h'^\mathsf{T}) }[/math].
- Update the biases a and b analogously: [math]\displaystyle{ \Delta a = \epsilon (v - v') }[/math], [math]\displaystyle{ \Delta b = \epsilon (h - h') }[/math].
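A minimal NumPy sketch of one CD-1 update for a single training vector, reusing the sampling helpers above. This is an illustrative outline in the spirit of Hinton's practical guide, not code from any cited source; a practical implementation would work on mini-batches and typically add momentum and weight decay:

```python
import numpy as np

def cd1_update(v0, W, a, b, lr, rng):
    """One contrastive-divergence (CD-1) update on a single binary training vector v0."""
    # Positive phase: sample a hidden vector h0 from P(h|v0).
    h0, _ = sample_hidden(v0, W, b, rng)
    # Negative phase (one Gibbs step): reconstruct v1 from h0, then resample h1 from P(h|v1).
    v1, _ = sample_visible(h0, W, a, rng)
    h1, _ = sample_hidden(v1, W, b, rng)
    # Positive gradient minus negative gradient, scaled by the learning rate.
    W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
    a += lr * (v0 - v1)
    b += lr * (h0 - h1)
    return W, a, b
```

In practice the hidden probabilities are often used in place of the sampled binary vectors in the outer products to reduce sampling noise, and the updates are averaged over mini-batches.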
A Practical Guide to Training RBMs, written by Hinton, can be found on his homepage.[12]
Stacked Restricted Boltzmann Machine
- A stacked restricted Boltzmann machine differs from a single RBM in its architecture: in an RBM, lateral connections within a layer are prohibited, which keeps analysis tractable, whereas the stacked Boltzmann machine combines an unsupervised three-layer network with symmetric weights and a supervised, fine-tuned top layer for recognizing three classes.
- Stacked Boltzmann machines are used for tasks such as natural language understanding, document retrieval, image generation, and classification, trained with unsupervised pre-training and/or supervised fine-tuning. Unlike the undirected, symmetric top layer, the lower connections form a two-way asymmetric layer; the stack thus consists of three layers with asymmetric weights, and the two networks are combined into one.
- The stacked Boltzmann machine shares similarities with the RBM: its neurons are stochastic binary Hopfield neurons, the same units used in a restricted Boltzmann machine, and the energy of both models is given by Gibbs' probability measure: [math]\displaystyle{ E = -\frac12\sum_{i,j}{w_{ij}{s_i}{s_j}}+\sum_i{\theta_i}{s_i} }[/math]. The training process is also similar: the stacked Boltzmann machine is trained one layer at a time and approximates the equilibrium state with a 3-segment pass, without performing backpropagation. It uses both supervised and unsupervised learning on different RBMs for pre-training, for classification and recognition. Training uses contrastive divergence with Gibbs sampling, [math]\displaystyle{ \Delta w_{ij} = \epsilon (p_{ij} - p'_{ij}) }[/math] (see the sketch following this list).
- The stacked Boltzmann machine's strength is that it performs a non-linear transformation, so it is easy to extend and can yield a hierarchy of feature layers. Its weaknesses are that calculations become complicated for integer- and real-valued neurons, and that contrastive divergence does not follow the gradient of any function, so its approximation to maximum likelihood is heuristic.[12]
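Greedy layer-wise pretraining of such a stack can be outlined as follows. This is an illustrative sketch built on the CD-1 and sampling helpers above (its function name and hyperparameters are assumptions, not taken from the cited sources): each RBM is trained on the hidden activations produced by the layer below it, and the resulting weights can then be fine-tuned, for example with backpropagation.

```python
import numpy as np

def pretrain_stack(data, layer_sizes, lr=0.05, epochs=5, seed=0):
    """Greedy layer-wise pretraining: train one RBM per layer on the activations of the previous layer.

    data: array of shape (num_samples, num_visible) with binary entries.
    layer_sizes: number of hidden units for each RBM in the stack, bottom to top.
    """
    rng = np.random.default_rng(seed)
    layers, x = [], data
    for n_hidden in layer_sizes:
        m = x.shape[1]
        W = rng.normal(scale=0.01, size=(m, n_hidden))
        a, b = np.zeros(m), np.zeros(n_hidden)
        for _ in range(epochs):
            for v0 in x:
                W, a, b = cd1_update(v0, W, a, b, lr, rng)   # CD-1 step from the sketch above
        layers.append((W, a, b))
        # Feed the hidden probabilities upward as "data" for the next RBM.
        x = sigmoid(b + x @ W)
    return layers
```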
Literature
- Fischer, Asja; Igel, Christian (2012), "An Introduction to Restricted Boltzmann Machines", Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Lecture Notes in Computer Science (Berlin, Heidelberg: Springer Berlin Heidelberg) 7441: pp. 14–36, doi:10.1007/978-3-642-33275-3_2, ISBN 978-3-642-33274-6
References
- ↑ Sherrington, David; Kirkpatrick, Scott (1975), "Solvable Model of a Spin-Glass", Physical Review Letters 35 (35): 1792–1796, doi:10.1103/PhysRevLett.35.1792, Bibcode: 1975PhRvL..35.1792S
- ↑ Smolensky, Paul (1986). "Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory". in Rumelhart, David E.; McLelland, James L.. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press. pp. 194–281. ISBN 0-262-68053-X. https://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap6_PDP86.pdf.
- ↑ Hinton, G. E.; Salakhutdinov, R. R. (2006). "Reducing the Dimensionality of Data with Neural Networks". Science 313 (5786): 504–507. doi:10.1126/science.1127647. PMID 16873662. Bibcode: 2006Sci...313..504H. http://www.cs.toronto.edu/~hinton/science.pdf. Retrieved 2015-12-02.
- ↑ Larochelle, H.; Bengio, Y. (2008). "Classification using discriminative restricted Boltzmann machines". Proceedings of the 25th international conference on Machine learning - ICML '08. pp. 536. doi:10.1145/1390156.1390224. ISBN 9781605582054. http://machinelearning.org/archive/icml2008/papers/601.pdf.
- ↑ 5.0 5.1 Salakhutdinov, R.; Mnih, A.; Hinton, G. (2007). "Restricted Boltzmann machines for collaborative filtering". Proceedings of the 24th international conference on Machine learning - ICML '07. pp. 791. doi:10.1145/1273496.1273596. ISBN 9781595937933.
- ↑ Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). "An analysis of single-layer networks in unsupervised feature learning". International Conference on Artificial Intelligence and Statistics (AISTATS). http://cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf. Retrieved 2014-12-19.
- ↑ 7.0 7.1 Ruslan Salakhutdinov and Geoffrey Hinton (2010). Replicated softmax: an undirected topic model . Neural Information Processing Systems 23.
- ↑ Carleo, Giuseppe; Troyer, Matthias (2017-02-10). "Solving the quantum many-body problem with artificial neural networks" (in en). Science 355 (6325): 602–606. doi:10.1126/science.aag2302. ISSN 0036-8075. PMID 28183973. Bibcode: 2017Sci...355..602C.
- ↑ Melko, Roger G.; Carleo, Giuseppe; Carrasquilla, Juan; Cirac, J. Ignacio (September 2019). "Restricted Boltzmann machines in quantum physics" (in en). Nature Physics 15 (9): 887–892. doi:10.1038/s41567-019-0545-1. ISSN 1745-2481. Bibcode: 2019NatPh..15..887M.
- ↑ 10.0 10.1 Miguel Á. Carreira-Perpiñán and Geoffrey Hinton (2005). On contrastive divergence learning. Artificial Intelligence and Statistics.
- ↑ Hinton, G. (2009). "Deep belief networks". Scholarpedia 4 (5): 5947. doi:10.4249/scholarpedia.5947. Bibcode: 2009SchpJ...4.5947H.
- ↑ 12.0 12.1 12.2 12.3 Geoffrey Hinton (2010). A Practical Guide to Training Restricted Boltzmann Machines. UTML TR 2010–003, University of Toronto.
- ↑ 13.0 13.1 Sutskever, Ilya; Tieleman, Tijmen (2010). "On the convergence properties of contrastive divergence". Proc. 13th Int'l Conf. On AI and Statistics (AISTATS). http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_SutskeverT10.pdf.
- ↑ 14.0 14.1 Asja Fischer and Christian Igel. Training Restricted Boltzmann Machines: An Introduction . Pattern Recognition 47, pp. 25-39, 2014
- ↑ María Angélica Cueto; Jason Morton; Bernd Sturmfels (2010). "Geometry of the restricted Boltzmann machine". Algebraic Methods in Statistics and Probability (American Mathematical Society) 516. Bibcode: 2009arXiv0908.4425A.
- ↑ Geoffrey Hinton (1999). Products of Experts. ICANN 1999.
- ↑ Hinton, G. E. (2002). "Training Products of Experts by Minimizing Contrastive Divergence". Neural Computation 14 (8): 1771–1800. doi:10.1162/089976602760128018. PMID 12180402. http://www.cs.toronto.edu/~hinton/absps/tr00-004.pdf.
External links
- Introduction to Restricted Boltzmann Machines. Edwin Chen's blog, July 18, 2011.
- "A Beginner's Guide to Restricted Boltzmann Machines". https://deeplearning4j.org/restrictedboltzmannmachine.html.. Deeplearning4j Documentation
- "Understanding RBMs". http://deeplearning4j.org/understandingRBMs.html.. Deeplearning4j Documentation
- Python implementation of Bernoulli RBM and tutorial
- SimpleRBM is a very small RBM code (24kB) useful for you to learn about how RBMs learn and work.
- Julia implementation of Restricted Boltzmann machines: https://github.com/cossio/RestrictedBoltzmannMachines.jl
Original source: https://en.wikipedia.org/wiki/Restricted Boltzmann machine.