cost mechanism:

self attention: every token computes a relevancy score against every other token in context, so N*N comparison

O(N²) scaling. Double the context, 4x the compute.

In practice, optimizations like

soften this, but the fundamental scaling is quadratic.

TTFT (time to first token) is dominated by prefill = scales with input length. TPOT (time per output token) is dominated by decode = scales with output length.

input length → TTFT. Output length → total latency. Two different optimizations.

Lost in the middle

put the critical fact at the end. Beginning is second-best. Middle is worst.