## Intuition
---
- with super long sequences, like those possible in [[Recurrent Neural Networks|RNNs]], if you have a massive paragraph as input, a lot of its detail will get lost
	- this is because you are processing the entire input before generating any output
		- meaning the entire input gets squeezed into one hidden state, causing loss of information
- attention lets us look at different pieces of the massive input sequence at each output step, preventing loss of detail

## My Diagram
---
![[IMG_5133.png|450]]

## General
---
- this algorithm costs quadratic $O(n^2)$ time to run
- the bottom layer is just a [[Bidirectional RNN]] that creates an "info" node for each word
	- each "info" node contains an encoding of the input word & the words before + after it, contextually combined into one node
- we apply the attention weights $\alpha$ to each "info" node for each output
	- so each output is informed by the input words that matter most
- we create the "context" $C^{<i>}$ at each timestep $i$ by taking the attention-weighted sum of the "info" nodes
	- each output $y^{<i>}$ receives its own unique "context"
	- this is fed into an $S^{<i>}$ square block, used to produce the output

![[CleanShot 2024-07-09 at [email protected]]]
![[CleanShot 2024-07-09 at [email protected]]]

- as shown above, $t$ refers to the current output timestep, & $t^{\prime}$ refers to the timestep of an input word you are potentially retrieving info from
- there are connections above that aren't drawn, specifically for the $\alpha$ attention weights
	- each of these weights $\alpha^{<t,t^{\prime}>}$ has a small neural network behind it
	- the previous hidden state $S^{<t-1>}$ & the "info" node $a^{<t^{\prime}>}$ are fed into a [[Deep Learning|Neural Network]] layer, which produces $e^{<t, t^{\prime}>}$, an un-normalized attention weight
	- $\alpha^{<t,t^{\prime}>}$ is the normalized attention weight, obtained by normalizing across all the attention weights for the output at time $t$ so they sum to 1
		- [[Softmax Activation Function]]
- my thoughts, on a high intuitive level:
	- when producing the output at time $t$, & determining which input words to focus on:
		- by default you need info about the previous output predictions, $S^{<t-1>}$, to output the next word
		- you also need the word itself
	- these 2 together can then be fed into the previously described mini-network, to see how important that word is for the next prediction
		- (you know both the previous output sequence & what the word is, so you can judge how helpful it will be for the next prediction)

![[CleanShot 2024-07-09 at [email protected]]]

## Misc.
---
- below is an image from the programming assignment, which I drew some inspiration from for the whiteboard attention diagram above:

![[CleanShot 2024-07-09 at [email protected]]]
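The single attention step described above can be sketched in NumPy: score every "info" node $a^{<t^{\prime}>}$ against the previous decoder state $S^{<t-1>}$ with a small network to get $e^{<t,t^{\prime}>}$, softmax those scores into $\alpha^{<t,t^{\prime}>}$, and take the weighted sum as the context $C^{<t>}$. This is a minimal sketch, not the course's exact implementation: the one-hidden-layer `tanh` scoring network, the weight names `W`, `U`, `v`, and all dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability before exponentiating
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_context(s_prev, a, W, U, v):
    """One attention step: score each encoder "info" node a[t'] against the
    previous decoder state s_prev, normalize the scores, and return the
    attention-weighted sum of the info nodes (the context) plus the weights."""
    Tx = a.shape[0]
    e = np.empty(Tx)
    for t_prime in range(Tx):
        # small one-hidden-layer network producing the un-normalized score e<t,t'>
        # (the tanh hidden layer and weight shapes are illustrative assumptions)
        h = np.tanh(W @ s_prev + U @ a[t_prime])
        e[t_prime] = v @ h
    alpha = softmax(e)   # normalized attention weights alpha<t,t'>, sum to 1
    context = alpha @ a  # C<t> = sum over t' of alpha<t,t'> * a<t'>
    return context, alpha

# tiny example: 4 encoder "info" nodes of dim 6, decoder state of dim 5
rng = np.random.default_rng(0)
Tx, n_a, n_s, n_h = 4, 6, 5, 8
a = rng.standard_normal((Tx, n_a))
s_prev = rng.standard_normal(n_s)
W = rng.standard_normal((n_h, n_s))
U = rng.standard_normal((n_h, n_a))
v = rng.standard_normal(n_h)

context, alpha = attention_context(s_prev, a, W, U, v)
print(alpha.sum())  # sums to 1 (up to floating point)
```

Running this step once per output timestep $t$, each time with the fresh $S^{<t-1>}$, is what gives every output its own unique context, and scoring all $T_x$ inputs at each of the $T_y$ outputs is where the quadratic cost comes from.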