Context Window

cost mechanism:

self-attention: every token computes a relevance score against every other token in the context, so N tokens need N × N comparisons

O(N²) scaling. Double the context, 4x the compute.
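
To make the N × N cost concrete, here is a minimal NumPy sketch of single-head self-attention weights. It is an illustration only: it assumes no learned query/key/value projections and no causal mask, and the names attention_weights and pairwise_comparisons are made up for this note.

```python
import numpy as np

def attention_weights(x: np.ndarray) -> np.ndarray:
    """Softmax attention weights for token vectors x of shape (N, d).

    Illustration only: uses x directly as queries and keys.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                  # (N, N): one score per token pair
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def pairwise_comparisons(n: int) -> int:
    """Number of token-to-token scores for a context of n tokens."""
    return n * n

rng = np.random.default_rng(0)
weights = attention_weights(rng.standard_normal((8, 16)))
print(weights.shape)                                   # one row and one column per token
print(pairwise_comparisons(2000) // pairwise_comparisons(1000))  # doubling N gives 4x
```

The score matrix has one entry per token pair, which is where the quadratic term comes from.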

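In practice, optimizations like

FlashAttention (exact attention, tiled so the full N × N score matrix is never written out to GPU memory)
KV caching (reuses the keys and values of earlier tokens during decode instead of recomputing them)
sparse attention (each token attends to only a subset of positions)

soften this, but the fundamental scaling of full attention is quadratic.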

Prefill: process the entire input prompt at once → produces the first output token. This is the O(N²) heavy lift. Compute-bound.
Decode: generate output tokens one at a time, each attending to everything before it. Memory-bound (KV cache reads).
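
The two phases can be sketched in NumPy as below. This is a toy single-head version under stated assumptions: q, k, v are already-projected arrays, and the function names prefill and decode_step are hypothetical, not any library's API. Prefill builds an (N, N) score matrix in one shot; each decode step builds only a (1, N+1) row but has to read the whole KV cache.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(q, k, v):
    """Causal attention over the whole prompt at once. q, k, v: (N, d)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (N, N): the quadratic part
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions after each token
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v, (k, v)                 # outputs plus the KV cache

def decode_step(q_new, k_new, v_new, cache):
    """Attention for one new token. q_new, k_new, v_new: (1, d)."""
    k_cache, v_cache = cache
    k_all = np.concatenate([k_cache, k_new])           # cache grows by one row
    v_all = np.concatenate([v_cache, v_new])
    scores = q_new @ k_all.T / np.sqrt(q_new.shape[-1])  # (1, N+1): linear in context
    return softmax(scores) @ v_all, (k_all, v_all)

# Decoding token 5 from a 4-token cache matches row 5 of a full 5-token prefill.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
full, _ = prefill(q, k, v)
_, cache = prefill(q[:4], k[:4], v[:4])
step, cache = decode_step(q[4:], k[4:], v[4:], cache)
print(np.allclose(step[0], full[4]))
```

The decode step does little arithmetic per byte of cache it reads, which is why it tends to be limited by memory bandwidth rather than compute.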

TTFT (time to first token) is dominated by prefill, so it scales with input length. TPOT (time per output token) is dominated by decode; it is a per-token cost that grows only slowly as the KV cache grows, so total generation time scales with output length.


Input length → TTFT. Output length → total latency. Two different optimization targets.
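
A rough first-order latency estimate, as a sketch. It assumes a constant prefill throughput and a constant TPOT, both of which you would have to measure yourself; the class name and the numbers are hypothetical. It deliberately ignores the quadratic attention term and the slow growth of TPOT with context length.

```python
from dataclasses import dataclass

@dataclass
class LatencyModel:
    prefill_tokens_per_s: float  # assumed: measured prefill throughput
    tpot_s: float                # assumed: measured time per output token

    def ttft(self, input_tokens: int) -> float:
        """Time to first token: driven by input length."""
        return input_tokens / self.prefill_tokens_per_s

    def total(self, input_tokens: int, output_tokens: int) -> float:
        """End-to-end latency: prefill plus one decode step per remaining token."""
        # The first output token comes out of prefill, so only the rest are decode steps.
        decode_steps = max(output_tokens - 1, 0)
        return self.ttft(input_tokens) + decode_steps * self.tpot_s

model = LatencyModel(prefill_tokens_per_s=5000.0, tpot_s=0.02)  # made-up numbers
print(model.ttft(10_000))
print(model.total(10_000, 501))
```

Under this model, trimming the prompt only moves TTFT, while capping output length only moves the decode term.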

Lost in the middle

Recall is U-shaped across the context: put the critical fact at the end. The beginning is second-best. The middle is worst.
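
One way to apply this when assembling a prompt, as a sketch. It assumes the chunks arrive already ranked, most important first; the ranking and the function name order_for_recall are hypothetical.

```python
def order_for_recall(chunks_by_importance: list[str]) -> list[str]:
    """Reorder ranked chunks: most important last, runner-up first, rest in the middle."""
    if len(chunks_by_importance) < 2:
        return list(chunks_by_importance)
    best, second, *rest = chunks_by_importance
    return [second, *rest, best]

ranked = ["critical fact", "useful background", "minor detail", "filler"]
prompt = "\n\n".join(order_for_recall(ranked))
print(prompt)
```

The least important chunks end up in the middle, where a miss costs the least.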

possible causes:

Training distribution: much training data puts important info at the start (intros, headlines) or end (conclusions, recent context). Models learn to attend there.
Positional encoding artifacts: how positions are encoded (RoPE, ALiBi) may give edges of the context more reliable signal.
Recency bias in attention: in autoregressive decoding, recent tokens get strong attention by default.