## Intuition
---
- most important idea for understanding [[Transformer|Transformers]]
- in the example below, "Africa" could be thought of as either a historical place or a happy vacation destination
    - depending on its surrounding words, its word encoding/meaning might be very different

![[CleanShot 2024-07-10 at [email protected]|350]]

## General
---
- each input word has a $q$ (query), $k$ (key), & $v$ (value), used to calculate the attention value for each word
    - the terms are analogous to their database counterparts
- the example below computes a single encoding $A^{<3>}$ for a single input word $x^{<3>}$

![[CleanShot 2024-07-10 at [email protected]]]

- our goal is to convert an input word, e.g. $x^{<3>}$, into an encoded representation, e.g. $A^{<3>}$, that contains all the joint input-sequence context that word needs
- intuitively:
    - $Q$ = interesting questions about the word ("what's happening in Africa?")
    - $K$ = qualities of the word (action, person)
    - $V$ = specific representation of the word (its literal meaning)
- the query $q^{<3>}$ contains information like a question about the current word (e.g. "what's happening in Africa?")
- the key $k$ could say the word is an action or a person
    - we multiply the query with other words' keys $k$, which tells us how well each other word answers the query's question ("what's happening in Africa?")
    - in the Africa example above, "visit" could have a very good key for Africa's query
- we give more weight to words that answer the query's question well
- we combine all input values, weighted by the above scores, to get the encoded representation $A^{<3>}$ from $x^{<3>}$
    - the weights sum to 1 because we pass the scores through a [[Softmax Activation Function|Softmax]]
- the final encoding of Africa would become something like: "there is a person visiting Africa"

![[CleanShot 2024-07-10 at [email protected]|350]]

- below is just the [[Vectorization|Vectorized]] version of the above equation, $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$ (a NumPy sketch follows the figure):
    - the denominator $\sqrt{d_k}$ just scales the dot products to prevent them from exploding, don't need to worry about it
    - the right-side columns in the holistic graphic above are $Q$, $K$, & $V$
    - I already mentally verified it works: make each item, e.g. $q^{<1>}$, a row of $Q$ (likewise for $K$ & $V$), and the result is the encodings $(A^{<1>}, A^{<2>}, \dots)$ stacked as rows

![[CleanShot 2024-07-10 at [email protected]|300]]
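- a minimal NumPy sketch of the vectorized equation above; the sequence length, embedding dimension, and random projection matrices `W_q`, `W_k`, `W_v` are illustrative assumptions, not values from the figures:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q = X @ W_q   # row t is the query q^<t>
    K = X @ W_k   # row t is the key   k^<t>
    V = X @ W_v   # row t is the value v^<t>
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how well each word's key answers each query
    weights = softmax(scores, axis=-1)  # each row of weights sums to 1
    return weights @ V                  # row t is the encoding A^<t>

# illustrative assumption: 4 "word" embeddings of dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
A = self_attention(X, W_q, W_k, W_v)
print(A.shape)  # (4, 8): one encoding A^<t> per input word
```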
- self-attention has a key difference from regular [[Attention Model|Attention]]
    - regular [[Attention Model|Attention]] is used between *different* sequences (i.e. from an input sequence to an output sequence)
        - it enables the decoder to attend to the encoded input when generating the output
    - but self-attention is used within the *same* sequence
        - it creates an encoded representation of a sequence in which each word's encoding already has all the joint sequence relationships encoded into it
    - so you can use self-attention on an input sequence alone, or on an output sequence alone (see the contrast sketch at the end of this note)
- the key intuition is that [[Word Embeddings]] start off with generic values for words/tokens like "mole" below
    - but with context, their embeddings change dramatically & come to mean a "word"/thing that is very different
        - "mole" could be the chemistry unit, the skin spot, or the animal
    - & the "King" word embedding keeps getting updated with context in the second example
        - it becomes more & more **nuanced**, encoding context such as sentiment & tone, not just grammatical syntax & word position

![[CleanShot 2024-07-18 at [email protected]|350]]
![[CleanShot 2024-07-18 at [email protected]|350]]

https://medium.com/@wwydmanski/whats-the-difference-between-self-attention-and-attention-in-transformer-architecture-3780404382f3
https://www.youtube.com/watch?v=eMlx5fFNoYc&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=6&ab_channel=3Blue1Brown
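- a hedged sketch of the self-attention vs. regular (cross-)attention contrast above; the simplified `attention()` helper (projection matrices omitted) and the random encoder/decoder arrays are illustrative assumptions, not an actual encoder-decoder implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention on already-projected queries, keys, & values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

rng = np.random.default_rng(1)
encoder_seq = rng.normal(size=(6, 8))  # encoded input sequence: 6 tokens, dim 8
decoder_seq = rng.normal(size=(3, 8))  # output sequence generated so far: 3 tokens

# self-attention: queries, keys, & values all come from the SAME sequence
self_out = attention(decoder_seq, decoder_seq, decoder_seq)   # shape (3, 8)

# regular (cross-)attention: queries come from the decoder, but keys & values
# come from a DIFFERENT sequence, i.e. the encoded input
cross_out = attention(decoder_seq, encoder_seq, encoder_seq)  # shape (3, 8)

print(self_out.shape, cross_out.shape)
```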