Attention (machine learning)
In neural networks, attention is a technique that mimics cognitive attention. It enhances some parts of the input data while diminishing others, the motivation being that the network should devote more focus to the small but important parts of the data. Which parts of the data matter most depends on the context, and the network learns this through training by gradient descent.
Attention-like mechanisms were introduced in the 1990s under names like multiplicative modules, sigma-pi units, and hypernetworks.^{[1]} The flexibility of attention comes from its role as "soft weights" that can change during runtime, in contrast to standard weights that must remain fixed at runtime. Uses of attention include memory in neural Turing machines, reasoning tasks in differentiable neural computers,^{[2]} language processing in transformers, and multisensory data processing (sound, images, video, and text) in perceivers.^{[3]}^{[4]}^{[5]}^{[6]}
General idea
Given a sequence of tokens labeled by the index [math]\displaystyle{ i }[/math], a neural network computes a soft weight [math]\displaystyle{ w_i }[/math] for each token [math]\displaystyle{ i }[/math] with the property that [math]\displaystyle{ w_i }[/math] is nonnegative and [math]\displaystyle{ \sum_i w_i=1 }[/math]. Each token is assigned a value vector [math]\displaystyle{ v_i }[/math], which is computed from the word embedding of the [math]\displaystyle{ i }[/math]-th token. The weighted average [math]\displaystyle{ \sum_i w_i v_i }[/math] is the output of the attention mechanism.
The query-key mechanism computes the soft weights. From the word embedding of each token, it computes a corresponding query vector [math]\displaystyle{ q_i }[/math] and key vector [math]\displaystyle{ k_i }[/math]. The weights are obtained by applying the softmax function to the dot products [math]\displaystyle{ q_i \cdot k_j }[/math], where [math]\displaystyle{ i }[/math] represents the current token and [math]\displaystyle{ j }[/math] represents the token that is being attended to.
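The two paragraphs above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the projection matrices `W_q`, `W_k`, `W_v`, the dimensions, and the random inputs are all assumptions made for the example, and the scores are left unscaled, as in the description above.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(embeddings, W_q, W_k, W_v):
    """Return (output, weights) for a single attention head.

    embeddings: (n_tokens, d_model) array, one row per token.
    W_q, W_k, W_v: hypothetical projection matrices of shape (d_model, d_head).
    """
    Q = embeddings @ W_q                # query vector q_i for every token
    K = embeddings @ W_k                # key vector k_j for every token
    V = embeddings @ W_v                # value vector v_j for every token
    scores = Q @ K.T                    # scores[i, j] = q_i . k_j
    weights = softmax(scores, axis=-1)  # row i: nonnegative weights summing to 1
    output = weights @ V                # row i = sum_j weights[i, j] * v_j
    return output, weights

# Random stand-ins for word embeddings and trained parameters (assumptions).
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 5
embeddings = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = attention(embeddings, W_q, W_k, W_v)
print(weights.sum(axis=1))  # every row sums to 1, up to rounding
print(output.shape)         # (4, 5)
```

Each row of `weights` holds the soft weights for one current token, so the corresponding row of `output` is that token's weighted average of value vectors.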
In some architectures, there are multiple heads of attention, each operating independently with its own queries, keys, and values.
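A rough sketch of independent heads, assuming each head has its own hypothetical projection matrices. The text does not say how head outputs are combined; concatenating them along the feature axis, as done here, is one common choice and should be read as an assumption.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(embeddings, heads):
    """Run several attention heads independently and concatenate the results.

    embeddings: (n_tokens, d_model) array, one row per token.
    heads: list of (W_q, W_k, W_v) tuples, one per head (hypothetical parameters).
    """
    outputs = []
    for W_q, W_k, W_v in heads:
        # Each head uses only its own queries, keys, and values.
        Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
        weights = softmax(Q @ K.T, axis=-1)
        outputs.append(weights @ V)
    # Assumption: combine heads by concatenation along the feature axis.
    return np.concatenate(outputs, axis=-1)

# Random stand-ins for embeddings and per-head parameters (assumptions).
rng = np.random.default_rng(0)
n_tokens, d_model, d_head, n_heads = 4, 8, 5, 3
embeddings = rng.normal(size=(n_tokens, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]

combined = multi_head_attention(embeddings, heads)
print(combined.shape)  # (4, 15)
```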
A language translation example
To build a machine that translates English to French, one takes a basic encoder-decoder and grafts an attention unit onto it. In the simplest case, the attention unit consists of dot products of the recurrent encoder states and does not need training. In practice, the attention unit consists of three fully connected neural network layers, called query, key, and value, that need to be trained. See the Variants section below.
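The untrained "simplest case" can be sketched for a single decoder step as follows. The function name `dot_product_context`, the toy encoder states, and the choice to score each encoder state against the current decoder state are assumptions for illustration; note that the function has no trainable parameters.

```python
import numpy as np

def dot_product_context(H, s):
    """Attention with plain dot products and no trainable parameters.

    H: (d, n_words) encoder hidden states, one source word per column.
    s: (d,) current decoder hidden state.
    Returns the context vector and the attention weights over source words.
    """
    scores = H.T @ s                      # one dot product h_i . s per source word
    exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
    weights = exp / exp.sum()             # softmax: nonnegative, sums to 1
    context = H @ weights                 # weighted average of encoder states
    return context, weights

# Toy example: three orthogonal encoder states; the decoder state matches the second.
H = 3.0 * np.eye(3)
s = H[:, 1]
context, weights = dot_product_context(H, s)
print(weights.round(4))
print(context.round(4))
```

Because the decoder state here lines up with the second encoder state, almost all of the weight falls on the second source word.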
Viewed as a matrix, the attention weights show how the network adjusts its focus according to context.
      I    love  you
je    .94  .02   .04
t'    .11  .01   .88
aime  .03  .95   .02
This view of the attention weights helps address the lack of "explainability" for which neural networks are criticized. Networks that perform verbatim translation without regard to word order would have a diagonally dominant matrix if they were analyzable in these terms. The off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word "I", so the network offers the word "je". On the second pass of the decoder, 88% of the attention weight is on the third English word "you", so it offers "t'". On the last pass, 95% of the attention weight is on the second English word "love", so it offers "aime".
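The reading of the matrix described above can be reproduced with a short script. The weights are copied from the table; the variable names are chosen for this illustration only.

```python
import numpy as np

source_words = ["I", "love", "you"]    # columns of the table
target_words = ["je", "t'", "aime"]    # rows of the table, one per decoder pass

weights = np.array([
    [0.94, 0.02, 0.04],
    [0.11, 0.01, 0.88],
    [0.03, 0.95, 0.02],
])

# Each row is a set of attention weights over the source sentence.
row_sums = weights.sum(axis=1)

# For every decoder pass, find the source word with the largest weight.
strongest = weights.argmax(axis=1)
alignment = [source_words[j] for j in strongest]
print(alignment)  # ['I', 'you', 'love']

# A word-for-word translation would put every maximum on the diagonal.
is_diagonal = bool(np.array_equal(strongest, np.arange(len(target_words))))
print(is_diagonal)  # False
```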
Variants
There are many variants of attention: dot product, query-key-value,^{[3]} hard, soft, self, cross, Luong,^{[8]} and Bahdanau,^{[9]} to name a few. These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the reweighting coefficients (see legend).
1. encoder-decoder dot product  2. encoder-decoder QKV  3. encoder-only dot product  4. encoder-only QKV  5. PyTorch tutorial

label  description
X, H, S, T  Upper-case variables represent the entire sentence, not just the current word. For example, H is a matrix of the encoder hidden states, with one word per column.
S, T  S = decoder hidden state, T = target word embedding. In the training phase of the PyTorch tutorial variant, T alternates between two sources depending on the level of teacher forcing used. T could be the embedding of the network's output word, i.e. embedding(argmax(FC output)). Alternatively, with teacher forcing, T could be the embedding of the known correct word, which can occur with a constant forcing probability, say 1/2.
X, H  H = encoder hidden state, X = input word embeddings
W  attention coefficients
Qw, Kw, Vw, FC  Qw, Kw, and Vw are the weight matrices for query, key, and value respectively. FC is a fully connected weight matrix.
circled +, circled x  circled + = vector concatenation; circled x = matrix multiplication
corr  Column-wise softmax(matrix of all combinations of dot products). The dot products are x_{i} * x_{j} in variant 3, h_{i} * s_{j} in variant 1, column_{i}(Kw*H) * column_{j}(Qw*S) in variant 2, and column_{i}(Kw*X) * column_{j}(Qw*X) in variant 4. Variant 5 uses a fully connected layer to determine the coefficients. If the variant is QKV, then the dot products are normalized by sqrt(d), where d is the height of the QKV matrices.
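As an illustration of the corr entry for variant 2, the following sketch computes a column-wise softmax over all scaled dot products between columns of Kw*H and Qw*S. The function name `corr_qkv`, the shapes, and the random matrices are placeholders chosen for this example, not values from any particular model.

```python
import numpy as np

def corr_qkv(H, S, Kw, Qw):
    """Column-wise softmax of scaled dot products (variant 2 in the legend).

    H:  (d_h, n_src) encoder hidden states, one word per column.
    S:  (d_s, n_tgt) decoder hidden states, one word per column.
    Kw: (d, d_h) key weight matrix; Qw: (d, d_s) query weight matrix.
    """
    K = Kw @ H                        # keys, shape (d, n_src)
    Q = Qw @ S                        # queries, shape (d, n_tgt)
    d = K.shape[0]                    # height of the key/query matrices
    # scores[i, j] = column_i(Kw*H) . column_j(Qw*S) / sqrt(d)
    scores = (K.T @ Q) / np.sqrt(d)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=0, keepdims=True)          # each column sums to 1

# Random placeholders standing in for hidden states and trained weights.
rng = np.random.default_rng(0)
d_h, d_s, d, n_src, n_tgt = 6, 6, 5, 3, 4
H = rng.normal(size=(d_h, n_src))
S = rng.normal(size=(d_s, n_tgt))
Kw = rng.normal(size=(d, d_h))
Qw = rng.normal(size=(d, d_s))

W = corr_qkv(H, S, Kw, Qw)
print(W.shape)        # (3, 4)
print(W.sum(axis=0))  # every column sums to 1, up to rounding
```

Setting Kw and Qw to identity matrices and dropping the sqrt(d) factor would give the plain dot products h_{i} * s_{j} of variant 1.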
See also
 Transformer (machine learning model) § Scaled dot-product attention
 Perceiver § Components for query-key-value (QKV) attention
References
 1. Yann LeCun (2020). Deep Learning course at NYU, Spring 2020, video lecture Week 6. Event occurs at 53:00. Retrieved 2022-03-08.
 2. Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward et al. (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature 538 (7626): 471–476. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574. Bibcode: 2016Natur.538..471G. https://ora.ox.ac.uk/objects/uuid:dd8473bd2d70424d881b86d9c9c66b51.
 3. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017-12-05). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
 4. Ramachandran, Prajit; Parmar, Niki; Vaswani, Ashish; Bello, Irwan; Levskaya, Anselm; Shlens, Jonathon (2019-06-13). "Stand-Alone Self-Attention in Vision Models". arXiv:1906.05909 [cs.CV].
 5. Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (2021-06-22). "Perceiver: General Perception with Iterative Attention". arXiv:2103.03206 [cs.CV].
 6. Ray, Tiernan. "Google's Supermodel: DeepMind Perceiver is a step on the road to an AI machine that could process anything and everything". https://www.zdnet.com/article/googlessupermodeldeepmindperceiverisastepontheroadtoanaimachinethatcouldprocesseverything/.
 7. "Pytorch.org seq2seq tutorial". https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.
 8. Luong, Minh-Thang (2015-09-20). "Effective Approaches to Attention-based Neural Machine Translation". arXiv:1508.04025v5 [cs.CL].
 9. Bahdanau, Dzmitry (2016-05-19). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473.
 10. Neil Rhodes (2021). CS 152 NN—27: Attention: Keys, Queries, & Values. Event occurs at 06:30. Retrieved 2021-12-22.
 11. Alfredo Canziani & Yann LeCun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 05:30. Retrieved 2021-12-22.
 12. Alfredo Canziani & Yann LeCun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 20:15. Retrieved 2021-12-22.
 13. Robertson, Sean. "NLP From Scratch: Translation With a Sequence To Sequence Network and Attention". https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.
External links
 Dan Jurafsky and James H. Martin (2022), Speech and Language Processing (3rd ed. draft, January 2022), ch. 10.4 Attention and ch. 9.7 Self-Attention Networks: Transformers
 Alex Graves (4 May 2020), Attention and Memory in Deep Learning (video lecture), DeepMind / UCL, via YouTube
 Rasa Algorithm Whiteboard: Attention, via YouTube
Original source: https://en.wikipedia.org/wiki/Attention (machine learning).