6. Anti-Patterns FDEs Find in Customer Systems

These are gold for verbal answers — drop them when discussing "what I'd assess in the first week."

No eval set at all — quality improvements judged by "I tried 5 examples in the playground." First thing to fix.
Eval set leaked into prompt examples — the few-shot examples in the prompt are also the eval cases. Score is meaningless.
Judge weaker than the system — using Gemini Flash to judge Gemini Pro output. The judge can't reliably grade above its own ceiling.
No version pinning on the judge — silently changes scores quarter-to-quarter.
Tracing only the top-level call — single span for the whole RAG pipeline. Can't diagnose where latency lives.
Logging raw PII to Cloud Logging — common, dangerous, easy to fix with Sensitive Data Protection.
Cost dashboards only at the project level — can't attribute cost to tenant, feature, or query type. Run cost out of control.
No canary deploys — full traffic shifts to a new prompt at once. Blast radius is the whole user base when something regresses.
Alerts on rate/error but not on quality — system is healthy by classical metrics, hallucinating in production for 12 hours before anyone notices.
Eval scores tracked but nobody knows what's in the set — a "92% faithfulness" with no documented composition is a vibe with a number on it.