## Intuition
---
- most important idea for understanding [[Transformer|Transformers]]
- in below ex., Africa could be thought of as either historical place or happy vacation destination
- depending on its surrounding words, its word encoding/meaning might be very different
![[CleanShot 2024-07-10 at [email protected]|350]]
## General
---
- each input word has $q$ (query), $k$ (key), & $v$ (value), used to calculate the attention value for each word
- terms are analogous to query/key/value lookups in databases
- below example is for a single $A^{<3>}$ encoding of a single input word $x^{<3>}$
![[CleanShot 2024-07-10 at [email protected]]]
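- not written out in these notes, but assuming the standard formulation (learned projection matrices $W^Q$, $W^K$, $W^V$ shared across positions), each word's three vectors come from its embedding:
$$q^{<3>} = W^Q x^{<3>}, \quad k^{<3>} = W^K x^{<3>}, \quad v^{<3>} = W^V x^{<3>}$$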
- our goal is to convert an input word, e.g. $x^{<3>}$, into an encoded representation, e.g. $A^{<3>}$, containing all the joint input-sequence context required for that word
- intuitively:
- $Q$ = interesting questions about the word (what's happening in Africa)
- $K$ = qualities of the word (action, person)
- $V$ = specific representation of word (literal meaning of said word)
- query $q^{<3>}$ contains information like a question about the current word (e.g. what's happening in Africa?)
- key $k$ could say the word is an action or a person
- we multiply the query $q^{<3>}$ with each word's key $k$, which tells us how well that word answers the query question (what's happening in Africa?)
- in the above Africa example, "visit" could have a very good key for Africa's query
- we give more weight to words that answer the query question well
- we combine all the words' values $v$, weighted by the above method, to get the encoded representation $A^{<3>}$ from $x^{<3>}$
- the weights sum to 1 because we use a [[Softmax Activation Function|Softmax]] over the query-key scores
- final encoding of Africa would become like: there is a person visiting Africa
![[CleanShot 2024-07-10 at [email protected]|350]]
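- written out, the per-word version of the above (standard form, using this note's notation, unscaled for simplicity):
$$A^{<3>} = \sum_{i} \frac{\exp(q^{<3>} \cdot k^{<i>})}{\sum_{j} \exp(q^{<3>} \cdot k^{<j>})}\, v^{<i>}$$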
- below is just the [[Vectorization|Vectorized]] version of the above equation (see the numpy sketch after the figure):
- the denominator (the $\sqrt{d_k}$ scaling) is just there to prevent the dot products from exploding, don't need to worry about it
- right-side columns in holistic graphic above are $Q$, $K$, & $V$
- I already mentally verified it works: just make each item, e.g. $q^{<1>}$ etc., a row vector, which stacks up to give the final rows $A^{<1>}, A^{<2>}, ...$
![[CleanShot 2024-07-10 at [email protected]|300]]
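- a minimal numpy sketch of the vectorized equation (my own illustration, not from the course; assumes $Q$, $K$, $V$ are already computed, one row per word, shape `(seq_len, d_k)`):

```python
import numpy as np

def softmax(z, axis=-1):
    # subtract the max for numerical stability before exponentiating
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V: (seq_len, d_k) -- one row per word, as in the note above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len): q^<i> . k^<j>, scaled to prevent explosion
    weights = softmax(scores, axis=-1)  # each row sums to 1 (the Softmax step)
    return weights @ V                  # row t is the encoding A^<t>
```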
- self attention has a key difference from regular [[Attention Model|Attention]]
- regular [[Attention Model|Attention]] is used between *different* sequences (i.e input sequence to output sequence)
- it enables the decoder to attend to the encoded input when generating the output
- but self attention is used within the *same* sequence
- it creates an encoded representation of a sequence where each word's representation already has all the joint sequence relationships baked in
- so you can use self attention on an input sequence alone, or on an output sequence alone (sketched below)
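- to make the same-sequence vs different-sequences distinction concrete, a hypothetical sketch reusing `attention` from above (the $W$ matrices are made-up learned projections, not from the source articles):

```python
# self-attention: Q, K, V all come from the SAME sequence X
def self_attend(X, Wq, Wk, Wv):
    return attention(X @ Wq, X @ Wk, X @ Wv)

# cross-attention (regular encoder-decoder attention): queries come from the
# decoder's sequence, while keys & values come from the encoder's output
def cross_attend(X_dec, X_enc, Wq, Wk, Wv):
    return attention(X_dec @ Wq, X_enc @ Wk, X_enc @ Wv)
```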
- key intuition is that [[Word Embeddings]] start off with generic values for words/tokens like "mole" below
- but with context, their embeddings change dramatically & can end up meaning something very different
- "mole" can refer to the chemistry unit, a skin mole, or the animal
- & the "King" word embedding keeps getting updated with context in second example
- becomes more & more **nuanced**, encodes contexts of i.e sentiment, tone etc. not just grammatical syntax & word position
![[CleanShot 2024-07-18 at [email protected]|350]]
![[CleanShot 2024-07-18 at [email protected]|350]]
https://medium.com/@wwydmanski/whats-the-difference-between-self-attention-and-attention-in-transformer-architecture-3780404382f3
https://www.youtube.com/watch?v=eMlx5fFNoYc&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=6&ab_channel=3Blue1Brown