Your eval scores in CI are great. Your latency dashboard is flat. Your error rate is 0.1%. Customers are still complaining. What happened?
Usually the answer is drift, and there are four kinds to watch for:
Input drift
the distribution of user queries shifted. Maybe a new product launch sent a flood of pricing questions that you never optimized for. Detected by: monitoring the distribution of input embeddings over time (for example, the KL divergence between a recent window and a reference window), or by classifying inputs into categories and watching the histogram.
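For input drift, the histogram variant is the easier one to sketch. The example below assumes each query has already been assigned a category label (the classifier itself is out of scope), and it applies KL divergence to the category histograms rather than to embeddings. The category names, sample counts, and `DRIFT_THRESHOLD` are illustrative assumptions, not recommended values.

```python
import math
from collections import Counter

# Hypothetical categories, e.g. produced by an intent classifier.
CATEGORIES = ["billing", "pricing", "support", "other"]

def category_distribution(labels, categories, smoothing=1.0):
    """Smoothed category probabilities; Laplace smoothing avoids log(0)."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(categories)
    return [(counts[c] + smoothing) / total for c in categories]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def input_drift_score(reference_labels, recent_labels, categories=CATEGORIES):
    """How far the recent window's category mix has moved from the reference."""
    ref = category_distribution(reference_labels, categories)
    rec = category_distribution(recent_labels, categories)
    return kl_divergence(rec, ref)

# Made-up windows: the recent one is dominated by pricing questions.
reference = ["billing"] * 40 + ["pricing"] * 10 + ["support"] * 40 + ["other"] * 10
recent = ["billing"] * 15 + ["pricing"] * 60 + ["support"] * 20 + ["other"] * 5

DRIFT_THRESHOLD = 0.1  # assumed value; tune it against your own traffic history
score = input_drift_score(reference, recent)
drifted = score > DRIFT_THRESHOLD
print(f"KL(recent || reference) = {score:.3f}, drifted = {drifted}")
```

A sensible threshold depends on window size and how noisy normal traffic is, so it is worth computing the score between two ordinary weeks first to see what "no drift" looks like.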
Retrieval drift
your index grew or went stale, or the reranker changed, so the same query now retrieves different chunks. Detected by: hashing the top-k chunk IDs for a stable set of canary queries and watching for changes.
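A minimal sketch of the retrieval canary check, assuming the retriever can be called as `retrieve(query, k)` and returns an ordered list of chunk IDs. That signature, the `make_retriever` helper, and the two toy indexes are hypothetical stand-ins for a real retrieval stack.

```python
import hashlib

def topk_fingerprint(chunk_ids):
    """Hash the ordered top-k chunk IDs into a short, comparable fingerprint."""
    joined = "\x1f".join(chunk_ids)  # unit separator keeps ID boundaries unambiguous
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]

def changed_canaries(baseline, retrieve, k=5):
    """Return the canary queries whose top-k fingerprint no longer matches."""
    return [
        query
        for query, expected in baseline.items()
        if topk_fingerprint(retrieve(query, k)) != expected
    ]

def make_retriever(index):
    """Hypothetical retriever backed by a dict of query -> ranked chunk IDs."""
    return lambda query, k: index[query][:k]

# Toy data: the second index returns a new chunk for one canary query.
INDEX_V1 = {
    "refund policy": ["doc-12", "doc-7", "doc-3"],
    "pricing tiers": ["doc-4", "doc-9", "doc-1"],
}
INDEX_V2 = {
    "refund policy": ["doc-12", "doc-7", "doc-3"],
    "pricing tiers": ["doc-4", "doc-22", "doc-9"],
}

retrieve_v1 = make_retriever(INDEX_V1)
baseline = {q: topk_fingerprint(retrieve_v1(q, 3)) for q in INDEX_V1}
changed = changed_canaries(baseline, make_retriever(INDEX_V2), k=3)
print(changed)
```

The fingerprint here is order-sensitive, so a reranker change that only reshuffles the same chunks is also flagged. If only membership matters, sorting the IDs before hashing would ignore reordering.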
Output drift
the model's outputs for similar queries have shifted (new model version, vendor update, changed sampling settings such as temperature). Detected by: running a fixed canary set of queries through prod every hour and comparing the outputs to a stored baseline (semantic similarity, judge score).
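The output canary comparison could look roughly like the sketch below. To stay self-contained it uses token-level Jaccard overlap as a crude placeholder for semantic similarity; in practice you would pass an embedding-based or judge-based scorer as `similarity`. The `generate` callable, the canary queries, and the 0.6 threshold are all assumptions made for illustration.

```python
def token_jaccard(a, b):
    """Crude lexical similarity in [0, 1]; a stand-in for a semantic scorer."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def drifted_canaries(baseline_outputs, generate, similarity=token_jaccard, threshold=0.6):
    """Return {query: score} for canaries whose output fell below the threshold."""
    flagged = {}
    for query, baseline_answer in baseline_outputs.items():
        score = similarity(baseline_answer, generate(query))
        if score < threshold:
            flagged[query] = score
    return flagged

# Stored baseline answers for a fixed canary set (made-up examples).
BASELINE = {
    "How do I reset my password?": "Open Settings, choose Security, then click Reset Password.",
    "What is the refund window?": "Refunds are available within 30 days of purchase.",
}

# Hypothetical current production answers; the second one has changed.
CURRENT = {
    "How do I reset my password?": "Open Settings, choose Security, then click Reset Password.",
    "What is the refund window?": "Please contact support for help with billing questions.",
}

flagged = drifted_canaries(BASELINE, generate=CURRENT.get)
print(flagged)
```

If production sampling is not deterministic, a single low score is weak evidence. Averaging over several samples per canary, or alerting only when many canaries drop together, makes the signal less noisy.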
Eval drift
the judge changed. The vendor updated the judge model, and suddenly all your faithfulness scores look 10 points higher. You can't tell whether quality went up or the judge got more lenient. Mitigation: pin the judge model version, recalibrate against human labels periodically, and run a "judge regression test": does the judge still score known-good and known-bad examples correctly?
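A judge regression test can be as small as the sketch below. It assumes the judge is wrapped in a callable that takes one example and returns a score in [0, 1]; the two stub judges, the example records, and the 0.5 pass threshold are hypothetical and exist only to show the shape of the check.

```python
def judge_regression(judge, labeled_examples, pass_threshold=0.5):
    """Return (id, score) for examples where the judge disagrees with the human label."""
    failures = []
    for example in labeled_examples:
        score = judge(example)
        judged_good = score >= pass_threshold
        if judged_good != example["is_good"]:
            failures.append((example["id"], score))
    return failures

# Human-labeled examples that the judge must keep getting right (made up).
LABELED = [
    {
        "id": "good-1",
        "context": "Refunds are available within 30 days of purchase.",
        "answer": "within 30 days of purchase",
        "is_good": True,
    },
    {
        "id": "bad-1",
        "context": "Refunds are available within 30 days of purchase.",
        "answer": "within 90 days of purchase",
        "is_good": False,
    },
]

def strict_judge(example):
    """Stub judge: high score only when the answer appears in the context."""
    return 0.9 if example["answer"] in example["context"] else 0.1

def lenient_judge(example):
    """Stub judge that has drifted and now approves everything."""
    return 0.9

strict_failures = judge_regression(strict_judge, LABELED)
lenient_failures = judge_regression(lenient_judge, LABELED)
print(strict_failures, lenient_failures)
```

Running this on a schedule, and on every judge configuration change, turns "did the judge get more lenient?" into a concrete failing check instead of a guess.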
"The system can degrade without breaking. Drift detection is how you catch degradation in time. The order of likelihood is: input drift first, output drift second, retrieval drift third, eval drift least common but most insidious because it lies about everything else."