Attention (machine learning)
Machine learning-based attention is a mechanism that intuitively mimics cognitive attention. It calculates "soft" weights for each word, more precisely for its embedding, in the context window. These weights can be computed either in parallel (as in transformers) or sequentially (as in recurrent neural networks). "Soft" weights can change at each runtime, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards.
Attention was developed to address the weaknesses of leveraging information from the hidden outputs of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence is expected to be attenuated. Attention gives the calculation of a token's hidden representation equal, direct access to any part of a sentence, rather than access only through the previous hidden state.
Earlier uses attached this mechanism to a serial recurrent neural network language translation system (below), but later uses in transformer-based large language models removed the recurrent neural network and relied heavily on the faster parallel attention scheme.
Predecessors
Predecessors of the mechanism were used in recurrent neural networks which, however, calculated "soft" weights sequentially and, at each step, considered the current word and other words within the context window. They were known as multiplicative modules, sigma-pi units,[1] and hyper-networks.[2] They have been used in long short-term memory (LSTM) networks, multi-sensory data processing (sound, images, video, and text) in perceivers, the memory of fast weight controllers,[3] reasoning tasks in differentiable neural computers, and neural Turing machines.[4][5][6][7][8]
Core calculations
The attention network was designed to identify the highest correlations amongst words within a sentence, assuming that it has learned those patterns from the training corpus. This correlation is captured in neuronal weights through back-propagation either from self-supervised pretraining or supervised fine-tuning.
The example below shows how correlations are identified once a network has been trained and has the right weights. When looking at the word "that" in the sentence "see that girl run", the network should be able to identify "girl" as a highly correlated word. For simplicity this example focuses on the word "that", but in reality all words receive this treatment in parallel and the resulting soft-weights and context vectors are stacked into matrices for further task-specific use.
The sentence is sent through 3 parallel streams (left), which emerge at the end as the context vector (right). The word embedding size is 300 and the neuron count is 100 in each sub-network of the attention head.
- The capital letter X denotes a matrix sized 4 × 300, consisting of the embeddings of all four words.
- The small underlined letter x denotes the embedding vector (sized 300) of the word "that".
- The attention head includes three (vertically arranged in the illustration) sub-networks, each having 100 neurons with a weight matrix sized 300 × 100.
- The asterisk within parentheses "(*)" denotes softmax( qK^T / √100 ), i.e. not yet multiplied by the matrix V.
- Rescaling by √100 prevents a high variance in qK^T that would allow a single word to excessively dominate the softmax, resulting in attention to only one word, as a discrete hard max would do.
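The effect of the √100 rescaling can be checked numerically. The following is a minimal sketch, assuming query and key components are independent with zero mean and unit variance (an idealisation not stated in the text); the random vectors and the helper `softmax` are illustrative stand-ins, not trained values.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)   # fixed seed, illustrative only
d_k = 100                        # neuron count per sub-network, as in the example

# Empirical variance of dot products between random unit-variance vectors.
q_samples = rng.standard_normal((10_000, d_k))
k_samples = rng.standard_normal((10_000, d_k))
raw = np.sum(q_samples * k_samples, axis=1)   # unscaled q . k
scaled = raw / np.sqrt(d_k)                   # rescaled by sqrt(100)
print("variance unscaled:", raw.var())        # close to d_k
print("variance scaled:  ", scaled.var())     # close to 1

# Effect on the softmax over four keys (one per word of the sentence).
q = rng.standard_normal(d_k)
K = rng.standard_normal((4, d_k))
scores = K @ q
w_raw = softmax(scores)                       # no rescaling
w_scaled = softmax(scores / np.sqrt(d_k))     # with rescaling
print("largest weight unscaled:", w_raw.max())
print("largest weight scaled:  ", w_scaled.max())
```

Dividing the scores by a constant greater than one can only lower the largest softmax weight, so the rescaled distribution is never more peaked than the unscaled one.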
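Notation: the commonly written row-wise softmax formula above assumes that vectors are rows, which contradicts the standard math notation of column vectors. More correctly, we should take the transpose of the context vector and use the column-wise softmax, resulting in the more correct form [math]\displaystyle{ \text{context}^T = (XV_w)^T \, \text{softmax}\left(\frac{(XK_w)(xQ_w)^T}{\sqrt{100}}\right) }[/math], where [math]\displaystyle{ Q_w, K_w, V_w }[/math] are the weight matrices of the three sub-networks.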
The query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words.[9]
The structure of the input data is captured in the Qw and Kw weights, and the Vw weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query (Q), Key (K), and Value (V)—a loose and possibly misleading analogy with relational database systems.
Note that the context vector for "that" does not rely on context vectors for the other words; therefore the context vectors of all words can be calculated using the whole matrix X, which includes all the word embeddings, instead of a single word's embedding vector x in the formula above, thus parallelizing the calculations. Now, the softmax can be interpreted as a matrix softmax acting on separate rows. This is a huge advantage over recurrent networks which must operate sequentially.
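The calculation described above can be sketched in NumPy as follows. This is an illustrative sketch only: the embeddings and the weight matrices `Qw`, `Kw`, `Vw` are random stand-ins (an assumption; a trained network would supply them), so the resulting weights will not single out "girl". The shapes follow the example: four words, embedding size 300, and 100 neurons per sub-network.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
words = ["see", "that", "girl", "run"]

X = rng.standard_normal((4, 300))             # stand-in embeddings, one row per word
Qw = rng.standard_normal((300, 100)) * 0.05   # stand-in query weights
Kw = rng.standard_normal((300, 100)) * 0.05   # stand-in key weights
Vw = rng.standard_normal((300, 100)) * 0.05   # stand-in value weights

# Single word: context vector for "that".
x = X[words.index("that")]                    # embedding vector, size 300
q = x @ Qw                                    # query, size 100
K = X @ Kw                                    # keys, 4 x 100
V = X @ Vw                                    # values, 4 x 100
weights = softmax(q @ K.T / np.sqrt(100))     # soft weights, size 4, sum to 1
context = weights @ V                         # context vector, size 100

# All words at once: replace x by the whole matrix X.
Q = X @ Qw                                    # queries, 4 x 100
all_weights = softmax(Q @ K.T / np.sqrt(100), axis=-1)   # row-wise softmax, 4 x 4
all_contexts = all_weights @ V                # one context vector per row, 4 x 100
```

Row 1 of `all_contexts` equals the context vector computed for "that" alone, which is why the per-word calculations can be batched into a single matrix product.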
A language translation example
To build a machine that translates English to French, an attention unit is grafted to the basic Encoder-Decoder (diagram below). In the simplest case, the attention unit consists of dot products of the recurrent encoder states and does not need training. In practice, the attention unit consists of 3 trained, fully-connected neural network layers called query, key, and value.
Figure (attention-1-sn.png): Encoder-decoder with attention.[10] The left part (black lines) is the encoder-decoder, the middle part (orange lines) is the attention unit, and the right part (in grey & colors) is the computed data. Grey regions in H matrix and w vector are zero values. Numerical subscripts indicate vector sizes while lettered subscripts i and i − 1 indicate time steps.
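The simplest, untrained case mentioned above can be sketched as follows. The sizes, the random hidden states, and the choice of the current decoder state as the query are assumptions made for illustration; they are not taken from the figure.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
n_source, hidden = 3, 8                       # hypothetical sizes

H = rng.standard_normal((n_source, hidden))   # encoder hidden states, one row per source word
s = rng.standard_normal(hidden)               # current decoder hidden state (assumed query)

scores = H @ s          # dot product of the decoder state with each encoder state
w = softmax(scores)     # attention weights over the source words, sum to 1
c = w @ H               # context vector: weighted sum of encoder states
```

No parameters are learned in this variant; the weights depend only on how similar each encoder state is to the decoder state.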
Viewed as a matrix, the attention weights show how the network adjusts its focus according to context [12].
| | I | love | you |
|---|---|---|---|
| je | 0.94 | 0.02 | 0.04 |
| t' | 0.11 | 0.01 | 0.88 |
| aime | 0.03 | 0.95 | 0.02 |
This view of the attention weights addresses the neural network "explainability" problem. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word "I", so the network offers the word "je". On the second pass of the decoder, 88% of the attention weight is on the third English word "you", so it offers "t'". On the last pass, 95% of the attention weight is on the second English word "love", so it offers "aime".
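The reading of the matrix given above can be reproduced with a few lines of NumPy. The sketch below uses the weights from the table; the variable names are illustrative.

```python
import numpy as np

source = ["I", "love", "you"]      # English words (columns)
target = ["je", "t'", "aime"]      # French words (rows, one per decoder pass)

A = np.array([
    [0.94, 0.02, 0.04],   # je
    [0.11, 0.01, 0.88],   # t'
    [0.03, 0.95, 0.02],   # aime
])

# Each row is a probability distribution over the English words.
row_sums = A.sum(axis=1)

# The most attended English word for each French word.
alignment = {t: source[int(np.argmax(row))] for t, row in zip(target, A)}
print(alignment)   # {'je': 'I', "t'": 'you', 'aime': 'love'}

# A verbatim, order-preserving translation would have its maxima on the diagonal.
on_diagonal = [int(np.argmax(row)) == i for i, row in enumerate(A)]
print(on_diagonal)  # [True, False, False]
```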
Variants
Many variants of attention implement soft weights, such as
- "internal spotlights of attention"[13] generated by fast weight programmers or fast weight controllers (1992)[3] (also known as transformers with "linearized self-attention"[14][15]). A slow neural network learns by gradient descent to program the fast weights of another neural network through outer products of self-generated activation patterns called "FROM" and "TO" which in transformer terminology are called "key" and "value." This fast weight "attention mapping" is applied to queries.
- Bahdanau-style attention,[12] also referred to as additive attention,
- Luong-style attention,[16] which is known as multiplicative attention,
- highly parallelizable self-attention introduced in 2016 as decomposable attention [17] and successfully used in transformers a year later.
For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention,[18] channel attention,[19] or combinations.[20][21]
These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients.
1. encoder-decoder dot product | 2. encoder-decoder QKV | 3. encoder-only dot product | 4. encoder-only QKV | 5. PyTorch tutorial |
---|---|---|---|---|
Both encoder & decoder are needed to calculate attention.[16] | Both encoder & decoder are needed to calculate attention.[22] | Decoder is not used to calculate attention. With only 1 input into corr, W is an auto-correlation of dot products, w_ij = x_i · x_j.[23] | Decoder is not used to calculate attention.[24] | A fully-connected layer is used to calculate attention instead of dot product correlation.[25] |
Mathematical representation
Standard Scaled Dot-Product Attention
[math]\displaystyle{ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)*V }[/math] where [math]\displaystyle{ Q, K, V }[/math] are the query, key, and value matrices, [math]\displaystyle{ d_k }[/math] is the dimension of the keys, and [math]\displaystyle{ * }[/math] denotes matrix multiplication. The softmax is applied row-wise, and the value vectors (rows of [math]\displaystyle{ V }[/math]) are weighted by the resulting softmax weights.
Multi-Head Attention
[math]\displaystyle{ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O }[/math] where each head is computed as [math]\displaystyle{ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) }[/math], and [math]\displaystyle{ W_i^Q, W_i^K, W_i^V }[/math] and [math]\displaystyle{ W^O }[/math] are parameter matrices.
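A minimal NumPy sketch of the multi-head formula is given below. The sizes (`n`, `d_model`, `h`) and the random parameter matrices are hypothetical, and the function names `attention` and `multi_head` are chosen for illustration rather than taken from any library.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention with row-wise softmax."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """Concat(head_1, ..., head_h) W_O, with head_i = Attention(Q W_Q[i], K W_K[i], V W_V[i])."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4          # hypothetical: 4 tokens, model width 16, 4 heads
d_head = d_model // h             # width of each head

X = rng.standard_normal((n, d_model))
W_Q = rng.standard_normal((h, d_model, d_head))   # one projection per head
W_K = rng.standard_normal((h, d_model, d_head))
W_V = rng.standard_normal((h, d_model, d_head))
W_O = rng.standard_normal((h * d_head, d_model))  # output projection

out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)     # self-attention: Q = K = V = X
print(out.shape)   # (4, 16)
```

Each head attends in its own projected subspace; the concatenation followed by `W_O` mixes the heads back into the model width.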
Bahdanau (Additive) Attention
[math]\displaystyle{ \text{Attention}(Q, K, V) = \text{softmax}(e)V }[/math] where [math]\displaystyle{ e = \tanh(W_QQ + W_KK) }[/math] and [math]\displaystyle{ W_Q }[/math] and [math]\displaystyle{ W_K }[/math] are learnable weight matrices.[12]
Luong Attention (General)
[math]\displaystyle{ \text{Attention}(Q, K, V) = \text{softmax}(QW_aK^T)V }[/math] where [math]\displaystyle{ W_a }[/math] is a learnable weight matrix.[26]
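The general (multiplicative) form above can be sketched in NumPy as follows. The shapes and the random matrices are hypothetical stand-ins, and `luong_general` is an illustrative name, not a library function.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def luong_general(Q, K, V, W_a):
    """softmax(Q W_a K^T) V, with the softmax taken over the keys (row-wise)."""
    scores = Q @ W_a @ K.T
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 5))     # 2 queries of size 5 (hypothetical)
K = rng.standard_normal((4, 6))     # 4 keys of size 6
V = rng.standard_normal((4, 3))     # 4 values of size 3
W_a = rng.standard_normal((5, 6))   # stand-in for the learnable matrix

out = luong_general(Q, K, V, W_a)
print(out.shape)   # (2, 3)
```

Because `W_a` maps query space to key space, queries and keys may have different sizes; with `W_a` set to the identity the expression reduces to unscaled dot-product attention.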
References
- ↑ Rumelhart, David E.; McClelland, James L.; Group, PDP Research (1987-07-29) (in en). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2. Cambridge, Mass: Bradford Books. ISBN 978-0-262-68053-0. https://stanford.edu/~jlmcc/papers/PDP/Chapter2.pdf.
- ↑ Yann LeCun (2020). Deep Learning course at NYU, Spring 2020, video lecture Week 6. Event occurs at 53:00. Retrieved 2022-03-08.
- ↑ Schmidhuber, Jürgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets.". Neural Computation 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131.
- ↑ Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward et al. (2016-10-12). "Hybrid computing using a neural network with dynamic external memory" (in en). Nature 538 (7626): 471–476. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574. Bibcode: 2016Natur.538..471G. https://ora.ox.ac.uk/objects/uuid:dd8473bd-2d70-424d-881b-86d9c9c66b51.
- ↑ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". Advances in Neural Information Processing Systems (Curran Associates, Inc.) 30. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
- ↑ Ramachandran, Prajit; Parmar, Niki; Vaswani, Ashish; Bello, Irwan; Levskaya, Anselm; Shlens, Jonathon (2019-06-13). "Stand-Alone Self-Attention in Vision Models". arXiv:1906.05909 [cs.CV].
- ↑ Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (2021-06-22). "Perceiver: General Perception with Iterative Attention". arXiv:2103.03206 [cs.CV].
- ↑ Ray, Tiernan. "Google's Supermodel: DeepMind Perceiver is a step on the road to an AI machine that could process anything and everything" (in en). https://www.zdnet.com/article/googles-supermodel-deepmind-perceiver-is-a-step-on-the-road-to-an-ai-machine-that-could-process-everything/.
- ↑ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention Is All You Need". NeurIPS: 4. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html. Retrieved August 11, 2023.
- ↑ Britz, Denny; Goldie, Anna; Luong, Minh-Thang; Le, Quoc (2017-03-21). "Massive Exploration of Neural Machine Translation Architectures". arXiv:1703.03906 [cs.CV].
- ↑ "Pytorch.org seq2seq tutorial". https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.
- ↑ Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473.
- ↑ Schmidhuber, Jürgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". Springer. pp. 460–463.
- ↑ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". Springer. pp. 9355–9366.
- ↑ Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz; Kaiser, Lukasz; Belanger, David; Colwell, Lucy; Weller, Adrian (2020). "Rethinking Attention with Performers". arXiv:2009.14794 [cs.CL].
- ↑ Luong, Minh-Thang (2015-09-20). "Effective Approaches to Attention-Based Neural Machine Translation". arXiv:1508.04025v5 [cs.CL].
- ↑ "Papers with Code - A Decomposable Attention Model for Natural Language Inference". https://paperswithcode.com/paper/a-decomposable-attention-model-for-natural.
- ↑ Zhu, Xizhou; Cheng, Dazhi; Zhang, Zheng; Lin, Stephen; Dai, Jifeng (2019). "An Empirical Study of Spatial Attention Mechanisms in Deep Networks". 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6687–6696. doi:10.1109/ICCV.2019.00679. ISBN 978-1-7281-4803-8. https://ieeexplore.ieee.org/document/9009578.
- ↑ Hu, Jie; Shen, Li; Sun, Gang (2018). "Squeeze-and-Excitation Networks". 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7132–7141. doi:10.1109/CVPR.2018.00745. ISBN 978-1-5386-6420-9. https://ieeexplore.ieee.org/document/8578843.
- ↑ Woo, Sanghyun; Park, Jongchan; Lee, Joon-Young; Kweon, In So (2018-07-18). "CBAM: Convolutional Block Attention Module". arXiv:1807.06521 [cs.CV].
- ↑ Georgescu, Mariana-Iuliana; Ionescu, Radu Tudor; Miron, Andreea-Iuliana; Savencu, Olivian; Ristea, Nicolae-Catalin; Verga, Nicolae; Khan, Fahad Shahbaz (2022-10-12). "Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution". arXiv:2204.04218 [eess.IV].
- ↑ Neil Rhodes (2021). CS 152 NN—27: Attention: Keys, Queries, & Values. Event occurs at 06:30. Retrieved 2021-12-22.
- ↑ Alfredo Canziani & Yann LeCun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 05:30. Retrieved 2021-12-22.
- ↑ Alfredo Canziani & Yann LeCun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 20:15. Retrieved 2021-12-22.
- ↑ Robertson, Sean. "NLP From Scratch: Translation With a Sequence To Sequence Network and Attention". https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.
- ↑ Luong, T., Pham, H., & Manning, C.D. (2015). Effective Approaches to Attention-based Neural Machine Translation. ArXiv, abs/1508.04025.
External links
- Dan Jurafsky and James H. Martin (2022) Speech and Language Processing (3rd ed. draft, January 2022), ch. 10.4 Attention and ch. 9.7 Self-Attention Networks: Transformers
- Alex Graves (4 May 2020), Attention and Memory in Deep Learning (video lecture), DeepMind / UCL, via YouTube
- Rasa Algorithm Whiteboard - Attention via YouTube
Original source: https://en.wikipedia.org/wiki/Attention (machine learning).