For each token, the model:
That weighted sum becomes the token's new representation — enriched with information from whichever tokens it "attended to.”
Initially each tokens will be in the form of embedding vector (EV) (one fixed vector representing the token’s meaning in isolation).
Then from that model produces 3 more vectors Q,K,V by multiplying EV by 3 diff learned w8 matrices (W_Q,W_K,W_V learned during training) nd we get Q,K,V from that.
Then dot product of Q with every K → score → softmax → weighted sum of V with softmax scores as weight
multiple parallel attention mechanisms running on the same input.
A single attention head learns one pattern of how tokens should relate. But language has many simultaneous relationship types: