- since in [[Transformer|Transformers]], the output of [[Multi Head Attention]] is just a weighted sum of value vectors, and a sum doesn't care about the order of the words it sums over, we lose all positional information about each word unless we inject it explicitly
- $pos$ is the position of the word in the input sequence
- $k$ is the dimensional index in $x^{<pos>}$
- we alternate $\sin$ and $\cos$ across the dimensions: $\sin$ for even $k$, $\cos$ for odd $k$ (and the pattern repeats)
- although we index $PE$ with the dimension $k$, the formula actually needs the pair index $i = \lfloor k/2 \rfloor$ inside the frequency term; note the attached excerpts below, and the formula sketch after them
![[CleanShot 2024-07-10 at [email protected]|300]]
![[CleanShot 2024-07-10 at [email protected]|500]]
- plugging a position into this formula gives us a positional encoding vector, e.g. $p^{<1>}$ for position 1
- my thought: each positional encoding vector (e.g. the green vs the purple one in the plot) is made up of a unique combination of sin/cos values
- that combination acts as a fingerprint that marks a unique position
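- a minimal sketch of that idea (my own illustration, not from the course, assuming a tiny embedding dimension $d = 4$): compute $p^{<1>}$ and $p^{<2>}$ with the formula above and see that each position gets its own sin/cos fingerprint

```python
import numpy as np

def positional_encoding(pos: int, d: int) -> np.ndarray:
    """Sinusoidal positional encoding vector p^{<pos>} of length d."""
    pe = np.zeros(d)
    for k in range(d):
        i = k // 2                              # pair index: i = floor(k / 2)
        angle = pos / (10000 ** (2 * i / d))
        pe[k] = np.sin(angle) if k % 2 == 0 else np.cos(angle)  # sin on even k, cos on odd k
    return pe

p1 = positional_encoding(1, d=4)   # p^{<1>}
p2 = positional_encoding(2, d=4)   # p^{<2>}
print(p1)   # approx [0.841, 0.540, 0.010, 1.000]
print(p2)   # a different combination of sin/cos values for position 2
```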
- **we add each positional encoding vector element-wise to its input word embedding** (the encoding has the same dimension as the embedding; it is a sum, not a concatenation)
![[CleanShot 2024-07-10 at [email protected]|400]]
- note in the plot above: the top two curves have the same frequency and are paired, one being $\sin$ and the other $\cos$
- the bottom two are the same kind of pair, just at a lower frequency
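- to tie it together, a vectorized sketch (again my own, with made-up shapes: sequence length 50, $d = 16$) that builds the whole PE matrix, adds it to a batch of word embeddings, and exposes the paired frequencies described above

```python
import numpy as np

def positional_encoding_matrix(seq_len: int, d: int) -> np.ndarray:
    """PE matrix of shape (seq_len, d); row pos is p^{<pos>}."""
    pos = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    i = np.arange(d)[np.newaxis, :] // 2           # pair index for each dimension
    angles = pos / np.power(10000.0, 2 * i / d)    # (seq_len, d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions:  cos
    return pe

seq_len, d = 50, 16
pe = positional_encoding_matrix(seq_len, d)

# positional info is injected by element-wise addition, so shapes must match
word_embeddings = np.random.randn(seq_len, d)      # stand-in for real embeddings
x_with_position = word_embeddings + pe             # still (seq_len, d)

# dimensions 0 and 1 share the highest frequency (sin vs cos);
# dimensions 2 and 3 share the next, lower frequency, and so on
print(pe[:5, 0])   # fast-oscillating sin
print(pe[:5, 1])   # matching cos at the same frequency
print(pe[:5, 2])   # slower sin at a lower frequency
```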