- multi head attention outputs multiple *relational context encodings* of the input sequence, where each encoding has a **unique context** (what vs where vs who)
- each *relational context encoding* is same length as input
- conceptually just a big for loop over [[Self Attention]] (but in practice we do this [[Vectorization|Vectorized]])
- "head" refers to each time you calculate [[Self Attention]] for a sequence
- multi head means doing this multiple times
![[CleanShot 2024-07-10 at [email protected]]]
- multi head attention enables us to gain relational context encodings, for ***different types of context***
- we can ask multiple types of questions
- $W_1$ head can just ask/answer "what's happening?" like in classic [[Self Attention]]
- $W_2$ could ask/answer "when is it happening?"
	- $W_3$ could ask/answer "who has something to do with Africa?", given "Africa" is the current target input word
- final output is concatenation of each self-attention output vector for each unique context type head
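The loop-over-heads view above can be sketched in numpy. This is a minimal illustration, not a production implementation: per-head weight matrices are random stand-ins for learned parameters, and the dimensions (`seq_len`, `d_model`, `d_head`) are made up for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, h):
    """x: (seq_len, d_model); Wq/Wk/Wv: lists of h per-head weight matrices."""
    heads = []
    for i in range(h):                           # conceptually a big for loop over self-attention
        q = x @ Wq[i]                            # (seq_len, d_head)
        k = x @ Wk[i]
        v = x @ Wv[i]
        scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product
        a = softmax(scores, axis=-1)             # each row sums to 1
        heads.append(a @ v)                      # one relational context encoding per head
    return np.concatenate(heads, axis=-1)        # concat outputs of all context-type heads

rng = np.random.default_rng(0)
seq_len, d_model, h, d_head = 4, 8, 2, 3
Wq = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
x = rng.normal(size=(seq_len, d_model))
out = multi_head_attention(x, Wq, Wk, Wv, h)
print(out.shape)  # (4, 6): same length as input, h * d_head features
```

Note the output keeps one row per input token (same length as input), with the feature dimension growing to `h * d_head` from the concatenation.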
- one crude example is where we update the [[Word Embeddings]] of nouns with adjectives in the sentence
- the query would be "Any adjectives in front of me?"
- key would be whether or not the word is an adjective
- if key answers query well, then they will geometrically align well
![[CleanShot 2024-07-18 at [email protected]|350]]
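The adjective/noun example can be made concrete with toy 2-D vectors. These vectors are invented for illustration: the point is only that a key that "answers the query well" is geometrically aligned with it, so its dot product (and hence its softmax attention weight) is large.

```python
import numpy as np

# Hypothetical 2-D query/key vectors for the adjective-noun example.
q_noun = np.array([1.0, 0.0])   # noun's query: "any adjectives in front of me?"
k_adj  = np.array([0.9, 0.1])   # adjective's key: nearly parallel -> large dot product
k_verb = np.array([0.0, 1.0])   # verb's key: orthogonal -> small dot product

scores  = np.array([q_noun @ k_adj, q_noun @ k_verb])  # [0.9, 0.0]
weights = np.exp(scores) / np.exp(scores).sum()        # softmax over the two keys
print(weights)  # adjective gets most of the attention weight
```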
- gives a very rich representation of the sequence
- $h$ = # of heads
- note that unlike regular [[Self Attention]], in the holistic graph image above you don't first multiply $x^{<1>}$ by 3 unique weight matrices to get $q^{<1>}$, $k^{<1>}$, $v^{<1>}$
- in simplest case we use $q = k = v = x$
	- we just apply the head weights directly; multiplying twice would be redundant (otherwise you'd compound two weight matrices, $W$ $W$)
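The "no redundant double multiply" point follows from matrix-multiplication associativity: projecting with one matrix and then another is the same as projecting once with their product, so a single learned head weight can absorb both. A quick numerical check (shapes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)
x  = rng.normal(size=(4, 8))   # sequence of 4 token embeddings
W1 = rng.normal(size=(8, 8))   # hypothetical first projection (e.g. self-attention query weights)
W2 = rng.normal(size=(8, 3))   # head's own query weights

# (x W1) W2 == x (W1 W2): two back-to-back projections collapse into one
# learned matrix, so with q = k = v = x the head weights alone suffice.
double = (x @ W1) @ W2
single = x @ (W1 @ W2)
print(np.allclose(double, single))  # True
```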