Attention

For each token, the model:

  1. Computes its Query vector.
  2. Compares that Query against every token's Key (its own included) → similarity scores.
  3. Applies softmax to those scores → attention weights (sum to 1).
  4. Takes a weighted sum of all the Value vectors using those weights.

That weighted sum becomes the token's new representation, enriched with information from whichever tokens it "attended to."
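The four steps above can be traced by hand for a single token. This is a minimal sketch using NumPy; the 3-token, 2-dimensional numbers are made up purely for illustration, and the names `query`, `keys`, `values` and `softmax` are just labels chosen here.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy values (assumed for illustration): 3 tokens, 2-dimensional vectors.
query = np.array([1.0, 0.0])        # step 1: Query of the token being updated
keys = np.array([[1.0, 0.0],        # Key of token 0
                 [0.0, 1.0],        # Key of token 1
                 [1.0, 1.0]])       # Key of token 2
values = np.array([[10.0, 0.0],     # Value of token 0
                   [0.0, 10.0],     # Value of token 1
                   [5.0, 5.0]])     # Value of token 2

scores = keys @ query               # step 2: one dot-product score per token
weights = softmax(scores)           # step 3: attention weights, sum to 1
new_repr = weights @ values         # step 4: weighted sum of the Values

print("scores:  ", scores)
print("weights: ", weights)
print("new repr:", new_repr)
```

Tokens 0 and 2 have Keys that align with the Query, so they receive equal and larger weights than token 1, and the new representation leans toward their Values.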

How it works

Initially, each token is represented as an embedding vector (EV): one fixed vector representing the token's meaning in isolation.

From that, the model produces three more vectors, Q, K and V, by multiplying EV by three different weight matrices (W_Q, W_K, W_V), all learned during training.

Then the dot product of Q with every K → scores → softmax → weighted sum of the V vectors, with the softmax outputs as the weights.
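The same pipeline can be written for all tokens at once with matrices. This is a sketch under assumptions, not a reference implementation: the embeddings and weight matrices are random stand-ins for learned values, the sizes are arbitrary, and the function names are chosen here. It also divides the scores by sqrt(d_k), which the notes above do not mention but which standard scaled dot-product attention does before the softmax.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax; subtracting the row max keeps exp() stable.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(E, W_Q, W_K, W_V):
    # E holds one embedding vector (EV) per row.
    Q = E @ W_Q                           # Queries, one row per token
    K = E @ W_K                           # Keys
    V = E @ W_V                           # Values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # every Q dotted with every K, scaled
    weights = softmax_rows(scores)        # each row sums to 1
    return weights @ V, weights           # weighted sum of V per token

# Random stand-ins for embeddings and learned matrices (illustration only).
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
E = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

out, weights = self_attention(E, W_Q, W_K, W_V)
print(out.shape, weights.shape)
```

Row i of `weights` holds how much token i attends to each token, and row i of `out` is token i's new representation.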

Multi-head attention

Multi-head attention runs multiple attention mechanisms (heads) in parallel on the same input, each with its own learned W_Q, W_K and W_V.
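A rough sketch of that idea, assuming NumPy and random stand-in weights: each head gets its own (W_Q, W_K, W_V) triple and sees the same input `E`. Concatenating the head outputs and mixing them with an output matrix `W_O` is the usual transformer arrangement but is an assumption here, as are the function names and sizes.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax with max subtraction for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(E, heads, W_O):
    # heads: list of (W_Q, W_K, W_V) tuples, one tuple per head.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = E @ W_Q, E @ W_K, E @ W_V
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        outputs.append(softmax_rows(scores) @ V)   # this head's output
    concat = np.concatenate(outputs, axis=-1)      # heads placed side by side
    return concat @ W_O                            # assumed output projection

# Random stand-ins for learned parameters (illustration only).
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
E = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_attention(E, heads, W_O)
print(out.shape)
```

Because every head has different weights, each can produce a different attention pattern over the same tokens.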

A single attention head learns one pattern of how tokens should relate. But language has many simultaneous relationship types: