- $<SOS>$ start of sentence token
    - we repeatedly feed the decoder what's been previously generated
    - for the 1st step, the input is just $<SOS>$
- in the 2nd [[Multi Head Attention]] block, the $Q$ queries come from what's been generated so far, & we query the encoder's output for the input-sequence info we need
    - this helps generate the next word
- we also repeat this decoder $N$ times, e.g. $N = 6$
- we generate one word at a time, with one [[Softmax Activation Function|Softmax]] for each (see the decoding-loop sketch at the end of this note)

ADDITIONAL COMPONENTS
- note we also use [[Positional Encoding]] on the inputs of both the encoder & the decoder
- we also use Add & Norm layers, which are very similar to [[Batch Normalization|Batch Norm Layers]]
    - helps speed up learning
- we use [[Masked Multi Head Attention]] during training to hide future words in the labelled correct output sequence when predicting the current word (mask sketch at the end of this note)
- the Add & Norm layer is crucial for efficient training & performance of the model
    1. adds the input to the sub-layer's output
        - uses a [[Residual Neural Networks|Residual Connection]]
    2. after adding them, we pass that result through [[Layer Normalization]] for the final output
- note the FF [[Deep Learning|NN]] layers are just a 2-layer [[Deep Learning|NN]] (sketch at the end of this note)
    - $FFN(x)=\max(0,\,xW_1+b_1)W_2+b_2$
    - [[ReLU Activation Function|ReLU Activation]] on the 1st layer
    - 2 layers: hidden, then output

![[CleanShot 2024-07-10 at [email protected]]]
![[IMG_5136.png]]
- note in the above image, the red line should be between the Add & Norm and MASKED layers
![[Pasted image 20240710121126.png|500]]

https://thenlpstudent.github.io/transformers-and-attention.html#what-is-the-add-and-norm-layernbsp
https://nlp.seas.harvard.edu/annotated-transformer/#position-wise-feed-forward-networks
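
A minimal sketch of the one-word-at-a-time decoding loop described above, assuming PyTorch and greedy decoding: start from $<SOS>$, run the decoder, take one softmax over the vocabulary, append the chosen token, and feed the growing sequence back in. `model`, `sos_id`, `eos_id`, and `max_len` are illustrative placeholders, not anything from the paper.

```python
import torch

def greedy_decode(model, src, sos_id: int, eos_id: int, max_len: int = 50):
    # src: the encoded source sequence; model(src, tgt) is assumed to
    # return logits of shape (1, tgt_len, vocab_size)
    tgt = torch.tensor([[sos_id]])                        # 1st step: input is just <SOS>
    for _ in range(max_len):
        logits = model(src, tgt)                          # decoder sees everything generated so far
        probs = torch.softmax(logits[:, -1, :], dim=-1)   # one softmax per generated word
        next_token = probs.argmax(dim=-1, keepdim=True)   # greedy pick of the next word
        tgt = torch.cat([tgt, next_token], dim=1)         # feed the prediction back in next step
        if next_token.item() == eos_id:                   # stop once the end token appears
            break
    return tgt
```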
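
A sketch of the masking idea in [[Masked Multi Head Attention]], again assuming PyTorch: a causal (look-ahead) mask sets the attention scores for future positions to $-\infty$ before the softmax, so each position can only attend to itself and earlier tokens. Function names and the single-head shapes are just for illustration.

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len: int) -> torch.Tensor:
    # True strictly above the diagonal = future positions to hide
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def masked_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) for one head; scaled dot-product attention
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5                     # (seq_len, seq_len)
    scores = scores.masked_fill(causal_mask(Q.size(0)), float("-inf"))
    weights = F.softmax(scores, dim=-1)                               # rows sum to 1 over visible tokens
    return weights @ V                                                # (seq_len, d_k)

x = torch.randn(5, 64)           # 5 tokens, d_k = 64
out = masked_attention(x, x, x)  # self-attention: future tokens contribute nothing
```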
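
And a sketch of the Add & Norm step plus the 2-layer position-wise $FFN(x)=\max(0,\,xW_1+b_1)W_2+b_2$. The dimensions ($d_{model}=512$, $d_{ff}=2048$) follow the original paper; the class names and the post-norm ordering shown here are just one reasonable reading.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2: hidden layer with ReLU, then output layer."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W1, b1
        self.w2 = nn.Linear(d_ff, d_model)   # W2, b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))

class AddNorm(nn.Module):
    """Residual connection followed by LayerNorm: LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # 1. add the input to the sub-layer output (residual), 2. layer-normalize
        return self.norm(x + sublayer_out)

x = torch.randn(2, 5, 512)        # (batch, seq_len, d_model)
ffn, add_norm = FeedForward(), AddNorm()
y = add_norm(x, ffn(x))           # one "FFN + Add & Norm" sub-block
```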