Highway network
In machine learning, the Highway Network was the first working very deep feedforward neural network with hundreds of layers, much deeper than previous artificial neural networks.[1][2][3]
It uses skip connections modulated by learned gating mechanisms to regulate information flow, inspired by Long Short-Term Memory (LSTM) recurrent neural networks.[4][5]
The advantage of a Highway Network over conventional deep neural networks is that it solves or partially prevents the vanishing gradient problem,[6] which makes the network easier to optimize.
The gating mechanisms facilitate information flow across many layers ("information highways").[1][2]
Highway Networks have been used as part of text sequence labeling and speech recognition tasks.[7][8] An open-gated or gateless Highway Network variant called the residual neural network[9] was used to win the ImageNet 2015 competition. The residual neural network has since become the most cited neural network of the 21st century.[3]
Model
The model combines a transform function [math]\displaystyle{ H(x,W_{H}) }[/math] with two gates: the transform gate [math]\displaystyle{ T(x,W_{T}) }[/math] and the carry gate [math]\displaystyle{ C(x,W_{C}) }[/math]. Both gates are non-linear transfer functions (by convention the sigmoid function), whereas [math]\displaystyle{ H(x,W_{H}) }[/math] can be any desired transfer function.
The carry gate is defined as [math]\displaystyle{ C(x,W_{C}) = 1 - T(x,W_{T}) }[/math], where the transform gate [math]\displaystyle{ T(x,W_{T}) }[/math] is a gate with a sigmoid transfer function.
Structure
The structure of a hidden layer follows the equation:
[math]\displaystyle{ \begin{align} y &= H(x,W_{H}) \cdot T(x,W_{T}) + x \cdot C(x,W_{C}) \\ &= H(x,W_{H}) \cdot T(x,W_{T}) + x \cdot (1 - T(x,W_{T})) \end{align} }[/math]
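The layer equation can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation. It assumes that H is an affine map followed by tanh, that the gate T is an affine map followed by a sigmoid, and that input and output have the same dimension so the element-wise sum is defined. The helper names (`affine`, `highway_layer`, `random_matrix`) and the bias values are hypothetical choices made for the example.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid, the conventional gate non-linearity."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Return weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Compute y = H(x) * T(x) + x * (1 - T(x)) element-wise."""
    h = [math.tanh(z) for z in affine(w_h, b_h, x)]  # H(x, W_H); tanh is assumed
    t = [sigmoid(z) for z in affine(w_t, b_t, x)]    # T(x, W_T)
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]


def random_matrix(n, rng):
    """Square matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]


rng = random.Random(0)
n = 4
x = [0.5, -1.0, 0.25, 2.0]
w_h, w_t = random_matrix(n, rng), random_matrix(n, rng)
zeros = [0.0] * n

# Strongly negative gate bias: T is close to 0, so the layer carries x through.
y_carry = highway_layer(x, w_h, zeros, w_t, [-30.0] * n)
# Strongly positive gate bias: T is close to 1, so the layer outputs H(x).
y_transform = highway_layer(x, w_h, zeros, w_t, [30.0] * n)
```

The two calls exercise the limiting cases of the equation: with the transform gate nearly closed the output is approximately x, and with it nearly open the output is approximately H(x, W_H).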
Related Work
Sepp Hochreiter analyzed the vanishing gradient problem in 1991 and identified it as the reason why deep learning did not work well.[6] To overcome this problem, Long Short-Term Memory (LSTM) recurrent neural networks[4] have residual connections with a weight of 1.0 in every LSTM cell (called the constant error carrousel) to compute [math]\displaystyle{ y_{t+1} = F(x_{t}) + x_t }[/math]. When the network is unfolded for backpropagation through time, this becomes the residual formula [math]\displaystyle{ y = F(x) + x }[/math] for feedforward neural networks. This enables training very deep recurrent neural networks over a very long time span t. A later LSTM version published in 2000[5] modulates the identity LSTM connections with so-called "forget gates", so that their weights are not fixed to 1.0 but can be learned. In experiments, the forget gates were initialized with positive bias weights,[5] so they start out open, which addresses the vanishing gradient problem. As long as the forget gates of the 2000 LSTM are open, it behaves like the 1997 LSTM.
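The effect of the identity connection on gradients can be illustrated with a scalar toy chain. The sketch below compares the end-to-end derivative of a 50-layer chain y = F(x) with that of a chain y = F(x) + x, using the chain rule. The choice F(x) = 0.5 tanh(x), the depth, and the starting value are assumptions made purely for illustration. They are not taken from the cited papers.

```python
import math


def f(x, w=0.5):
    """Toy layer function F(x) = w * tanh(x); w is an arbitrary choice."""
    return w * math.tanh(x)


def f_prime(x, w=0.5):
    """Derivative of F with respect to x."""
    return w * (1.0 - math.tanh(x) ** 2)


def end_to_end_gradient(x0, depth, residual):
    """Return d(output)/d(input) of a scalar chain of `depth` layers."""
    x, grad = x0, 1.0
    for _ in range(depth):
        local = f_prime(x)
        if residual:
            grad *= 1.0 + local  # y = F(x) + x  =>  dy/dx = F'(x) + 1
            x = f(x) + x
        else:
            grad *= local        # y = F(x)      =>  dy/dx = F'(x)
            x = f(x)
    return grad


plain_grad = end_to_end_gradient(0.3, depth=50, residual=False)
residual_grad = end_to_end_gradient(0.3, depth=50, residual=True)
print(f"plain: {plain_grad:.3e}, residual: {residual_grad:.3e}")
```

In the plain chain every factor is at most 0.5, so the product shrinks geometrically with depth. In the residual chain every factor is at least 1 because of the identity term, so the gradient cannot vanish in this toy setting.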
The Highway Network of May 2015[1] applies these principles to feedforward neural networks. It was reported to be "the first very deep feedforward network with hundreds of layers".[10] It is like a 2000 LSTM with forget gates unfolded in time,[5] while the later Residual Nets have no equivalent of forget gates and are like the unfolded original 1997 LSTM.[4] If the skip connections in Highway Networks are "without gates," or if their gates are kept open (activation 1.0), they become Residual Networks.
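The last statement can be checked against the general layer equation. As an assumption of this sketch, treat the transform gate and the carry gate as independent (that is, drop the coupling C = 1 - T) and fix both at activation 1.0:

[math]\displaystyle{ y = H(x,W_{H}) \cdot 1 + x \cdot 1 = H(x,W_{H}) + x }[/math]

This is the residual form [math]\displaystyle{ y = F(x) + x }[/math] with F = H. Under the coupled definition, setting T = 1 would instead close the carry gate and give [math]\displaystyle{ y = H(x,W_{H}) }[/math], a plain layer without a skip connection.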
The original Highway Network paper[1] not only introduced the basic principle for very deep feedforward networks, but also included experimental results with 20-, 50-, and 100-layer networks, and mentioned ongoing experiments with up to 900 layers.
References
- [1] Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
- [2] Srivastava, Rupesh K.; Greff, Klaus; Schmidhuber, Jürgen (2015). "Training Very Deep Networks". Advances in Neural Information Processing Systems (Curran Associates, Inc.) 28: 2377–2385. http://papers.nips.cc/paper/5850-training-very-deep-networks.
- [3] Schmidhuber, Jürgen (2021). "The most cited neural networks all build on work done in my labs". AI Blog (IDSIA, Switzerland). https://people.idsia.ch/~juergen/most-cited-neural-nets.html.
- [4] Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. https://www.researchgate.net/publication/13853244.
- [5] Gers, Felix A.; Schmidhuber, Jürgen; Cummins, Fred (2000). "Learning to Forget: Continual Prediction with LSTM". Neural Computation 12 (10): 2451–2471. doi:10.1162/089976600300015015. PMID 11032042.
- [6] Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (diploma thesis). Technical University Munich, Institute of Computer Science, advisor: J. Schmidhuber.
- [7] Liu, Liyuan; Shang, Jingbo; Xu, Frank F.; Ren, Xiang; Gui, Huan; Peng, Jian; Han, Jiawei (12 September 2017). "Empower Sequence Labeling with Task-Aware Neural Language Model". arXiv:1709.04109 [cs.CL].
- [8] Kurata, Gakuto; Ramabhadran, Bhuvana; Saon, George; Sethy, Abhinav (19 September 2017). "Language Modeling with Highway LSTM". arXiv:1709.06436 [cs.CL].
- [9] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1. https://ieeexplore.ieee.org/document/7780459.
- [10] Schmidhuber, Jürgen (2015). "Highway Networks (May 2015): First Working Really Deep Feedforward Neural Networks With Over 100 Layers". https://people.idsia.ch/~juergen/highway-networks.html.
Original source: https://en.wikipedia.org/wiki/Highway network.