- since in [[Transformer|Transformers]], the output of [[Multi Head Attention]] is just a weighted sum of value vectors, and a sum doesn't care about the order of the words it sums over, we lose all positional information about each word unless we inject it explicitly
- $pos$ is the position of the word in the input sequence
- $k$ is the dimensional index in $x^{<pos>}$
- we alternate $\sin$ and $\cos$ across the dimensions: $\sin$ for even $k$, $\cos$ for odd $k$ (and the pattern repeats)
- although we index $PE$ with the dimension $k$, the formula actually needs the pair index $i = \lfloor k/2 \rfloor$ inside the frequency term; note the attached excerpts below, and the formula sketch after them
![[CleanShot 2024-07-10 at [email protected]|300]]
![[CleanShot 2024-07-10 at [email protected]|500]]
- plugging a position into this formula gives us a positional encoding vector, e.g. $p^{<1>}$ for position 1
- my thought: each positional encoding vector (e.g. the green vs the purple one in the plot) is made up of a unique combination of sin/cos values
- that combination acts as a fingerprint that marks a unique position
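- a minimal sketch of that idea (my own illustration, not from the course, assuming a tiny embedding dimension $d = 4$): compute $p^{<1>}$ and $p^{<2>}$ with the formula above and see that each position gets its own sin/cos fingerprint

```python
import numpy as np

def positional_encoding(pos: int, d: int) -> np.ndarray:
    """Sinusoidal positional encoding vector p^{<pos>} of length d."""
    pe = np.zeros(d)
    for k in range(d):
        i = k // 2                              # pair index: i = floor(k / 2)
        angle = pos / (10000 ** (2 * i / d))
        pe[k] = np.sin(angle) if k % 2 == 0 else np.cos(angle)  # sin on even k, cos on odd k
    return pe

p1 = positional_encoding(1, d=4)   # p^{<1>}
p2 = positional_encoding(2, d=4)   # p^{<2>}
print(p1)   # approx [0.841, 0.540, 0.010, 1.000]
print(p2)   # a different combination of sin/cos values for position 2
```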
- **we add each positional encoding vector element-wise to its input word embedding** (the encoding has the same dimension as the embedding; it is a sum, not a concatenation)
![[CleanShot 2024-07-10 at [email protected]|400]]
- note in the plot above: the top two curves have the same frequency and are paired, one being $\sin$ and the other $\cos$
- the bottom two are the same kind of pair, just at a lower frequency
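- to tie it together, a vectorized sketch (again my own, with made-up shapes: sequence length 50, $d = 16$) that builds the whole PE matrix, adds it to a batch of word embeddings, and exposes the paired frequencies described above

```python
import numpy as np

def positional_encoding_matrix(seq_len: int, d: int) -> np.ndarray:
    """PE matrix of shape (seq_len, d); row pos is p^{<pos>}."""
    pos = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    i = np.arange(d)[np.newaxis, :] // 2           # pair index for each dimension
    angles = pos / np.power(10000.0, 2 * i / d)    # (seq_len, d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions:  cos
    return pe

seq_len, d = 50, 16
pe = positional_encoding_matrix(seq_len, d)

# positional info is injected by element-wise addition, so shapes must match
word_embeddings = np.random.randn(seq_len, d)      # stand-in for real embeddings
x_with_position = word_embeddings + pe             # still (seq_len, d)

# dimensions 0 and 1 share the highest frequency (sin vs cos);
# dimensions 2 and 3 share the next, lower frequency, and so on
print(pe[:5, 0])   # fast-oscillating sin
print(pe[:5, 1])   # matching cos at the same frequency
print(pe[:5, 2])   # slower sin at a lower frequency
```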