cost mechanism:
self-attention: every token computes a relevance score against every other token in context, so N*N comparisons
O(N²) scaling. Double the context, 4x the compute.
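In practice, optimizations like
- FlashAttention (same exact attention, far less memory traffic; compute is still O(N²))
- KV caching (no recomputing keys/values for past tokens during decode)
- sparse attention (each token attends to only a subset, trading exactness for cost)
soften this, but for full (dense) attention the fundamental scaling is still quadratic.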
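Quick sanity check of the "double the context, 4x the compute" claim. This is only a count of query-key scores for dense attention, not a benchmark; `attention_score_count` is a made-up helper name.

```python
def attention_score_count(n_tokens: int) -> int:
    """Number of query-key scores in dense (non-causal) self-attention."""
    return n_tokens * n_tokens


base = attention_score_count(1_000)
for n in (1_000, 2_000, 4_000):
    count = attention_score_count(n)
    # Ratio relative to the 1,000-token case: 1x, 4x, 16x.
    print(f"N={n}: {count:,} scores ({count // base}x)")
```

Causal masking roughly halves the count, but the growth stays quadratic.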
two inference phases:
- Prefill: process the entire input prompt at once → produces the first output token. This is the O(N²) heavy lift. Compute-bound.
- Decode: generate output tokens one at a time, each attending to everything before it. Memory-bandwidth-bound (every step rereads the KV cache).
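A toy sketch of the two phases, assuming a single attention head where each token vector acts as its own query, key, and value. That is a simplification: real models use learned projections and run prefill as one batched matrix multiply, not a Python loop. `attend`, `prefill`, and `decode_step` are illustrative names, not any library's API.

```python
import math


def attend(query, keys, values):
    """Softmax-weighted average of `values` for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def prefill(tokens):
    """Process the whole prompt; token i attends to tokens 0..i (causal)."""
    cache = {"k": [], "v": []}
    outputs, comparisons = [], 0
    for vec in tokens:
        cache["k"].append(vec)  # toy assumption: key = value = token vector
        cache["v"].append(vec)
        outputs.append(attend(vec, cache["k"], cache["v"]))
        comparisons += len(cache["k"])  # sums to N(N+1)/2, i.e. O(N^2)
    return cache, outputs, comparisons


def decode_step(cache, new_vec):
    """Generate-time step: one query against every cached key/value."""
    cache["k"].append(new_vec)
    cache["v"].append(new_vec)
    # Only len(cache) comparisons, but the whole cache is read each step.
    return attend(new_vec, cache["k"], cache["v"]), len(cache["k"])


prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
cache, _, prefill_comparisons = prefill(prompt)
_, step_reads = decode_step(cache, [0.0, 0.0])
print(prefill_comparisons, step_reads)
```

Prefill does all the quadratic work up front; each decode step does little arithmetic but has to read every cached entry, which is why it tends to be limited by memory bandwidth.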
TTFT (time to first token) is dominated by prefill = scales with input length. TPOT (time per output token) is set by decode: each step costs more as the KV cache grows, and total decode time ≈ TPOT × number of output tokens.
Input length → TTFT. Output length → total latency. Two different optimizations.
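A minimal sketch of how the two metrics fall out of token arrival timestamps on a streamed response. `latency_metrics` is a hypothetical helper and the timestamps in the example are made up.

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT, TPOT, and total latency (seconds) from arrival times."""
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_time
    n = len(token_times)
    # TPOT averages the gaps between consecutive tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    total = token_times[-1] - request_time
    return {"ttft": ttft, "tpot": tpot, "total": total}


# Made-up example: request sent at t=0.0 s, four tokens streamed back.
metrics = latency_metrics(0.0, [0.80, 0.82, 0.84, 0.86])
print(metrics)
```

By construction, total = TTFT + (tokens − 1) × TPOT, so a shorter prompt cuts the first term and a shorter answer cuts the second.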
Lost in the middle
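Models use information at the edges of a long context more reliably than information buried in the middle.
Put the critical fact at the end. Beginning is second-best. Middle is worst.
Possible causes: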
- Training distribution: most training data has important info at the start (intros, headlines) or end (conclusions, recent context). Models learn to attend there.
- Positional encoding artifacts: how positions are encoded (RoPE, ALiBi) may give edges of the context more reliable signal.
- Recency bias in attention: in autoregressive decoding, recent tokens get strong attention by default.
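One way to act on the end > beginning > middle ordering when assembling a prompt from several chunks: a sketch that assumes the chunks are already ranked most-important-first. `arrange_by_position` is a hypothetical helper, and whether the reordering helps depends on the model.

```python
def arrange_by_position(ranked_chunks):
    """Place the top-ranked chunk last, the runner-up first, and the
    least important chunks in the middle of the context."""
    front, back = [], []
    for rank, chunk in enumerate(ranked_chunks):
        if rank % 2 == 0:
            back.append(chunk)   # ranks 0, 2, 4, ... fill in from the end
        else:
            front.append(chunk)  # ranks 1, 3, 5, ... fill in from the start
    return front + back[::-1]


# "A" is the most important chunk, "E" the least.
print(arrange_by_position(["A", "B", "C", "D", "E"]))
# → ['B', 'D', 'E', 'C', 'A']
```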