## Concepts
---
- [[Self Attention]]
- [[Multi Head Attention]]
- [[Positional Encoding]]
- [[Masked Multi Head Attention]]

## Models
---
- [[Sequence Models]]
- [[Attention Model]]

## Intuition
---
- [[Recurrent Neural Networks|RNN]] & [[LSTM]] ingest input one word at a time; Transformers let us ingest all parts of the input in parallel
- the 2017 *Attention Is All You Need* paper introduced the Transformer
    - combines the use of [[Attention Model|Attention]] with a [[Convolutional Neural Networks|CNN]]-style of processing
    - i.e. uses the [[Convolutional Neural Networks|CNN]]'s style of parallel processing

## General
---
PRIMARY COMPONENTS:
- we have an encoder (below), which parses useful info from the input
    - 1st we get the [[Multi Head Attention]] context encodings
    - then a [[Deep Learning|FC Layer]] parses useful/interesting features from the encodings
    - we repeat this encoder $N$ times, typical value is $N = 6$
- then we feed the final encoder output to a decoder block
    - the decoder block's job is to output the correct output sequence
    - the 1st output is the $<SOS>$ start-of-sentence token
    - we repeatedly feed the decoder what it has previously generated
        - for the 1st step, the input is just $<SOS>$
    - in the 2nd [[Multi Head Attention]] block, the $Q$ queries come from what's been generated so far, & they query the encoder's output for the info from the input sequence that we need
        - to help generate the next word
    - we also repeat this decoder $N$ times, i.e. $N = 6$
    - we generate one word at a time, one [[Softmax Activation Function|Softmax]] for each

ADDITIONAL COMPONENTS:
- note we also use [[Positional Encoding]] on the inputs of both the encoder & the decoder (see the sketch at the end of this note)
- we also use Add & Norm layers, which are very similar to [[Batch Normalization|Batch Norm Layers]]
    - helps speed up learning
- we use [[Masked Multi Head Attention]] during training to hide future words in the labelled correct output sequences when predicting the current word (mask sketch at the end of this note)
- the Add & Norm layer is crucial for efficient training & performance of the model
    1. adds the sublayer's input to its output
        - uses a [[Residual Neural Networks|Residual Connection]]
    2. after adding them, we pass that result through [[Layer Normalization]] for our final output
- note the FF [[Deep Learning|NN]] layers are just a 2-layer [[Deep Learning|NN]] (sketch at the end of this note)
    - $FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$
    - [[ReLU Activation Function|ReLU Activation]] on the 1st layer
    - 2 layers: hidden, then output

![[CleanShot 2024-07-10 at [email protected]]]

![[IMG_5136.png]]
- note in the above image, the red line should be between the Add & Norm and MASKED layers

![[Pasted image 20240710121126.png|500]]

https://thenlpstudent.github.io/transformers-and-attention.html#what-is-the-add-and-norm-layernbsp
https://nlp.seas.harvard.edu/annotated-transformer/#position-wise-feed-forward-networks
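
SKETCHES:
- a minimal sketch of the sinusoidal [[Positional Encoding]] added to the embeddings at the bottom of both the encoder & decoder; assumes PyTorch, & the function name `positional_encoding` is just illustrative

```python
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len).unsqueeze(1).float()              # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))             # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

# usage: add to embeddings of shape (batch, seq_len, d_model)
emb = torch.randn(2, 10, 512)
emb = emb + positional_encoding(10, 512)   # broadcasts over the batch dimension
```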
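
- a minimal sketch of the look-ahead mask behind [[Masked Multi Head Attention]]: position $i$ may only attend to positions $\le i$, so future words in the target sequence stay hidden during training; PyTorch assumed, & `causal_mask` is an illustrative name, not a library function

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention is allowed, False where a future position must be blocked
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# usage inside scaled dot-product attention, where scores has shape (..., seq_len, seq_len)
scores = torch.randn(2, 8, 5, 5)                          # batch=2, heads=8, seq_len=5
scores = scores.masked_fill(~causal_mask(5), float("-inf"))
weights = torch.softmax(scores, dim=-1)                   # future positions get weight 0
```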
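
- a minimal sketch of the 2-layer position-wise FFN wrapped in Add & Norm, i.e. $LayerNorm(x + FFN(x))$ with $FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$; PyTorch assumed, the class name `FeedForwardSublayer` is illustrative, & $d_{model} = 512$, $d_{ff} = 2048$ are the defaults from the paper

```python
import torch
import torch.nn as nn

class FeedForwardSublayer(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # hidden layer
        self.w2 = nn.Linear(d_ff, d_model)   # output layer
        self.norm = nn.LayerNorm(d_model)    # the "Norm" in Add & Norm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently
        ffn_out = self.w2(torch.relu(self.w1(x)))
        # Add (residual connection), then Norm (layer normalization)
        return self.norm(x + ffn_out)

# usage: a batch of 2 sequences, 10 tokens each, 512-dim encodings
x = torch.randn(2, 10, 512)
out = FeedForwardSublayer()(x)   # shape stays (2, 10, 512)
```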