## Intuition
---
- with very long input sequences, like the massive paragraphs you might feed a [[Recurrent Neural Networks|RNN]], a lot of detail gets lost
- this is because the entire input is processed before any output is generated
- meaning the entire input is squeezed into one hidden state vector, causing loss of information
- attention lets each output step look at different pieces of the massive input sequence, preventing this loss of detail
## My Diagram
---
![[IMG_5133.png|450]]
## General
---
- running this algorithm costs quadratic $O(n^2)$ time, since every output step attends over every input step
- the bottom layer is just a [[Bidirectional RNN]] that creates an "info" node for each word
- each "info" node contains an encoding of the input word & the words before + after it, contextually combined into one vector
- we apply the attention weights $\alpha$ to each "info" node for each output
- so each output is informed by the input words that matter most
- we create the "context" $C^{<t>}$ at each output timestep $t$ by taking the attention-weighted sum of the "info" nodes
- each output $y^{<t>}$ receives its own unique "context"
- this is fed into the $S^{<t>}$ square block (the decoder state), used to produce the output
![[CleanShot 2024-07-09 at [email protected]]]
![[CleanShot 2024-07-09 at [email protected]]]
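To make the weighted-sum step concrete, here's a minimal NumPy sketch of building one context $C^{<t>}$ from the "info" nodes, assuming the attention weights $\alpha$ are already computed (all sizes and values below are illustrative, not from the course):

```python
import numpy as np

# Illustrative sizes: Tx input timesteps, each "info" node of size n_a
Tx, n_a = 5, 8
rng = np.random.default_rng(0)
a = rng.standard_normal((Tx, n_a))      # "info" nodes a^{<t'>} from the BiRNN

# attention weights alpha^{<t,t'>} for one output step t (they sum to 1)
alphas = np.array([0.1, 0.5, 0.2, 0.1, 0.1])

# context C^{<t>}: attention-weighted sum of the "info" nodes
context = (alphas[:, None] * a).sum(axis=0)   # shape (n_a,)
```

Repeating this for every one of the $T_y$ output steps over all $T_x$ input steps is where the quadratic $O(T_x \cdot T_y)$ cost comes from.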
- as shown above, $t$ refers to the current output timestep, & $t^{\prime}$ refers to the timestep of the input word you are potentially retrieving info from
- there are actually connections in the diagram that aren't drawn, specifically for the $\alpha$ attention weights
- each of these weights $\alpha^{<t,t^{\prime}>}$ is computed by a small neural network
- the previous hidden state $S^{<t-1>}$ & the "info" node $a^{<t^{\prime}>}$ are fed into a [[Deep Learning|Neural Network]] layer, which produces $e^{<t, t^{\prime}>}$, an un-normalized attention weight
- $\alpha^{<t,t^{\prime}>}$ is the normalized attention weight, obtained by normalizing the $e^{<t,t^{\prime}>}$ across all $t^{\prime}$ for output time $t$, so they sum to 1
- normalization is done via the [[Softmax Activation Function]]
- my thoughts, at a high intuitive level:
- when producing the output at time $t$, you must determine which input words to focus on
- by default you need info about the previous output predictions, carried in $S^{<t-1>}$, to output the next word
- you also need the candidate word's "info" node itself
- those 2 together are then fed into the previously described mini-network, to see how important that word is for the next prediction
- (you know both the previous output sequence & what the word is, so you can judge how helpful it will be for the next prediction)
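A minimal sketch of that mini-network, assuming a single tanh hidden layer scoring one $(S^{<t-1>}, a^{<t^{\prime}>})$ pair; the layer sizes and weight names are my own, not from the assignment:

```python
import numpy as np

# Illustrative sizes: decoder state n_s, "info" node n_a, hidden layer n_h
n_s, n_a, n_h = 10, 8, 4
rng = np.random.default_rng(1)
W1 = rng.standard_normal((n_h, n_s + n_a)) * 0.1  # hidden layer weights
W2 = rng.standard_normal((1, n_h)) * 0.1          # output layer: scalar score

def score(s_prev, a_tprime):
    """Un-normalized attention weight e^{<t,t'>} for one "info" node."""
    x = np.concatenate([s_prev, a_tprime])  # [S^{<t-1>}; a^{<t'>}]
    h = np.tanh(W1 @ x)
    return (W2 @ h).item()

s_prev = rng.standard_normal(n_s)       # previous decoder hidden state
info = rng.standard_normal((5, n_a))    # "info" nodes for 5 input words
e = np.array([score(s_prev, info[tp]) for tp in range(5)])
```

The same small network is reused for every $(t, t^{\prime})$ pair, so it stays cheap to train despite being applied $T_x \cdot T_y$ times.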
![[CleanShot 2024-07-09 at [email protected]]]
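The softmax normalization that turns the un-normalized scores $e^{<t,t^{\prime}>}$ into weights $\alpha^{<t,t^{\prime}>}$ can be sketched like this (my own sketch, not the assignment's code):

```python
import numpy as np

def attention_softmax(e):
    """Normalize scores e^{<t,t'>} into attention weights alpha^{<t,t'>}."""
    exp_e = np.exp(e - e.max())   # subtract max for numerical stability
    return exp_e / exp_e.sum()

e = np.array([2.0, 0.5, -1.0])    # example scores for 3 input words
alphas = attention_softmax(e)     # sums to 1; largest score -> largest weight
```

Because the weights sum to 1, the context $C^{<t>}$ is a convex combination of the "info" nodes, i.e. a soft selection over the input words.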
## Misc.
---
- below is an image from the programming assignment, which partly inspired the whiteboard attention diagram I drew above:
![[CleanShot 2024-07-09 at [email protected]]]