4. Drift Detection — The Quiet Killer

Your eval scores in CI are great. Your latency dashboard is flat. Your error rate is 0.1%. Customers are still complaining. What happened?

Four kinds of drift to enumerate:

Input drift

the distribution of user queries shifted. Maybe a new product launch sent a flood of pricing questions you weren't optimized for. Detected by: monitoring input embedding distribution over time (KL divergence between recent window and reference window), or by classifying inputs into categories and watching the histogram.

Retrieval drift

your index grew, or got stale, or rerankers changed. Same query now retrieves different chunks. Detected by: hashing the top-k chunk IDs for a stable set of canary queries and watching for changes.

Output drift

the model's outputs to similar queries have shifted (new model version, vendor update, temperature drift). Detected by: running a fixed canary set of queries through prod every hour and comparing outputs to a baseline (semantic similarity, judge score).

Eval drift

the judge changed. Vendor updated the judge model. Suddenly all your faithfulness scores look 10 points higher and you can't tell if quality went up or the judge got more lenient. Mitigation: pin the judge model version, recalibrate against humans periodically, run a "judge regression test" — does the judge still score known-good and known-bad examples correctly?

"The system can degrade without breaking. Drift detection is how you catch degradation in time. The order of likelihood is: input drift first, output drift second, retrieval drift third, eval drift least common but most insidious because it lies about everything else."