- multi-head attention outputs multiple *relational context encodings* of the input sequence, where each encoding has a **unique context** (what vs where vs who)
    - each *relational context encoding* is the same length as the input
- conceptually just a big for loop over [[Self Attention]] (but in practice we do this [[Vectorization|Vectorized]])
    - "head" refers to each time you calculate [[Self Attention]] for a sequence
    - multi-head means doing this multiple times
- ![[CleanShot 2024-07-10 at [email protected]]]
- multi-head attention lets us gain relational context encodings for ***different types of context***
    - we can ask multiple types of questions
        - the $W_1$ head can ask/answer "what's happening?" like in classic [[Self Attention]]
        - $W_2$ could ask/answer "when is it happening?"
        - $W_3$ could ask/answer "who has something to do with Africa?", given Africa is the current target input word
    - final output is the concatenation of the self-attention output vectors, one per unique context-type head
- one crude example: updating the [[Word Embeddings]] of nouns using the adjectives in the sentence
    - the query would be "any adjectives in front of me?"
    - the key would be whether or not the word is an adjective
    - if the key answers the query well, then they will align well geometrically
- ![[CleanShot 2024-07-18 at [email protected]|350]]
- gives a very rich representation of the sequence
- $h$ = number of heads
- note that unlike regular [[Self Attention]], you don't multiply $x^{<1>}$ by 3 unique weights to get $q^{<1>}$, $k^{<1>}$, $v^{<1>}$ in the holistic graph image above
    - in the simplest case we use $q = k = v = x$
    - we just use the head weights directly; no need to multiply twice redundantly (otherwise you'd get $W$ $W$, two weight multiplications in a row)
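- the ideas above (a for loop over [[Self Attention]], one weight triple per head, concatenating the head outputs) can be sketched in NumPy; this is a minimal illustration, not a real implementation — the dimensions, the `multi_head_attention` name, and the random weights are all assumptions for the example, and real layers are vectorized rather than looped:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, heads):
    """x: (seq_len, d_model); heads: list of (Wq, Wk, Wv) weight triples.
    Simplest case q = k = v = x: each head projects x with its own
    weights, runs scaled dot-product self-attention, and the per-head
    outputs are concatenated along the feature axis."""
    outputs = []
    for Wq, Wk, Wv in heads:          # conceptually a big for loop over self-attention
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        d_k = q.shape[-1]
        scores = softmax(q @ k.T / np.sqrt(d_k))  # (seq_len, seq_len) attention weights
        outputs.append(scores @ v)    # one relational context encoding, same length as input
    return np.concatenate(outputs, axis=-1)       # concat across the h heads

# toy sizes (assumed for the example): h = 3 heads, each with its own Wq/Wk/Wv
rng = np.random.default_rng(0)
d_model, d_head, h, seq_len = 8, 4, 3, 5
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
x = rng.normal(size=(seq_len, d_model))

out = multi_head_attention(x, heads)
print(out.shape)  # (5, 12): seq_len rows, h * d_head features after concatenation
```

- note the output keeps one row per input position (same length as the input), while the feature dimension grows to $h \cdot d_{head}$ from concatenating the heads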