Transformers

Attention

The intuition — Q, K, V
Library analogy

For each token, the model:

Computes its Query vector.
Compares that Query against every token's Key (its own included) → similarity scores.
Softmax those scores → attention weights (sum to 1).
Take a weighted sum of all the Value vectors using those weights.

That weighted sum becomes the token's new representation, enriched with information from whichever tokens it "attended to."
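The four steps above can be sketched for a single token in plain Python. This is only a minimal illustration: the helper names (`softmax`, `dot`, `attend`) and the toy numbers are made up for the example, and the Query is assumed to be already computed.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """New representation for one token, given its Query and all Keys/Values."""
    scores = [dot(query, k) for k in keys]   # step 2: similarity scores
    weights = softmax(scores)                # step 3: attention weights
    dim = len(values[0])
    # step 4: weighted sum of the Value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy example: 3 tokens, 2-dimensional vectors (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]  # step 1: assume this Query was already computed

output, weights = attend(query, keys, values)
print(weights)  # tokens 0 and 2 get equal weight, token 1 gets less
print(output)
```

Tokens whose Key points in the same direction as the Query get the larger weights, so their Values dominate the output.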

Self-Attention
Cross-Attention

How it works

Initially, each token is an embedding vector (EV): one fixed vector representing the token’s meaning in isolation.

From the EV, the model produces three more vectors, Q, K, and V, by multiplying the EV by three different weight matrices (W_Q, W_K, W_V), all learned during training.

Then take the dot product of Q with every K → scores → softmax → weighted sum of the V vectors, with the softmax outputs as the weights.
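As a rough sketch, the whole pipeline from embeddings to new representations can be written for all tokens at once. The function names and the weight matrices below are hypothetical, hand-picked values (a real model learns W_Q, W_K, W_V during training). The division by sqrt(d_k) is the scaling used in the standard Transformer; these notes do not mention it, so treat it as an addition.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax(row):
    """Turn one row of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings, w_q, w_k, w_v):
    queries = matmul(embeddings, w_q)  # one Query row per token
    keys = matmul(embeddings, w_k)     # one Key row per token
    values = matmul(embeddings, w_v)   # one Value row per token
    d_k = len(keys[0])
    keys_t = [list(col) for col in zip(*keys)]  # transpose of the Key matrix
    scores = matmul(queries, keys_t)   # scores[i][j] = Q_i . K_j
    # Assumption: scale by sqrt(d_k) as in the standard Transformer.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, values), weights

# Three tokens with 2-dimensional embeddings (made-up numbers).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical weight matrices, chosen by hand for readability.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.5, 0.0], [0.0, 0.5]]
w_v = [[2.0, 0.0], [0.0, 2.0]]

output, weights = self_attention(embeddings, w_q, w_k, w_v)
for row in weights:
    print(row)  # each row sums to 1
```

Row i of `weights` shows how much token i attends to each token, and row i of `output` is that token's new representation.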

Summary

Multi-head attention

Multiple attention mechanisms (heads) run in parallel on the same input.

A single attention head learns one pattern of how tokens should relate. But language has many simultaneous relationship types:

Syntactic: which verb does this subject belong to?