These are gold for verbal answers: bring them up when discussing "what I'd assess in the first week."
- No eval set at all — quality improvements judged by "I tried 5 examples in the playground." First thing to fix.
- Eval set leaked into prompt examples — the few-shot examples in the prompt are also the eval cases. Score is meaningless.
- Judge weaker than the system — using Gemini Flash to judge Gemini Pro output. The judge can't reliably grade above its own ceiling.
- No version pinning on the judge: the judge model gets updated underneath you, so scores shift quarter-to-quarter with no change to the system being graded.
- Tracing only the top-level call — single span for the whole RAG pipeline. Can't diagnose where latency lives.
- Logging raw PII to Cloud Logging — common, dangerous, easy to fix with Sensitive Data Protection.
- Cost dashboards only at the project level: can't attribute cost to tenant, feature, or query type, so costs run out of control.
- No canary deploys — full traffic shifts to a new prompt at once. Blast radius is the whole user base when something regresses.
- Alerts on request rate and errors but not on quality: the system looks healthy by classical metrics while it hallucinates in production for 12 hours before anyone notices.
- Eval scores tracked but nobody knows what's in the set — a "92% faithfulness" with no documented composition is a vibe with a number on it.
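
For the leaked-eval-set item in the list above, a quick overlap check is usually enough to confirm the problem. This is a minimal sketch, assuming both the few-shot examples and the eval cases are available as plain input strings; `normalize`, `find_leaked_cases`, and the sample data are hypothetical names for illustration, and exact-match after normalization will miss paraphrased leaks.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial edits don't hide a match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_leaked_cases(few_shot_inputs: list[str], eval_inputs: list[str]) -> list[str]:
    """Return eval inputs whose normalized text also appears among the few-shot examples."""
    few_shot_keys = {normalize(t) for t in few_shot_inputs}
    return [t for t in eval_inputs if normalize(t) in few_shot_keys]


# Hypothetical data, for illustration only.
few_shot = ["What is the refund window?", "How do I reset my password?"]
eval_set = ["how do I reset my password", "Which plans include SSO?"]

leaked = find_leaked_cases(few_shot, eval_set)
print(leaked)  # any non-empty result means the score on those cases is contaminated
```

Running a check like this in CI whenever the prompt or the eval set changes is one way to keep the two from drifting back together.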
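
For the no-canary item in the list above, one common approach is deterministic, hash-based bucketing, so each user consistently sees the same prompt version while only a small slice gets the new one. A rough sketch, assuming a stable `user_id` string is available per request; `prompt_version_for` and the `"canary"`/`"stable"` labels are hypothetical.

```python
import hashlib


def prompt_version_for(user_id: str, canary_percent: int) -> str:
    """Deterministically route a stable slice of users to the canary prompt."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Simulated users, for illustration only: roughly 5% should land on the canary.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(prompt_version_for(u, 5) == "canary" for u in users) / len(users)
print(f"canary share: {canary_share:.3f}")
```

Raising `canary_percent` in steps, with a quality check between steps, limits the blast radius of a regressing prompt to the canary slice.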
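
For the cost-attribution item in the list above, the underlying fix is to tag every model call with its tenant and feature, then aggregate on those tags. A minimal sketch under stated assumptions: the `LlmCall` record, the `cost_by` helper, and the per-token prices are all hypothetical, and real prices depend on the model in use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmCall:
    tenant: str
    feature: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per 1M tokens; substitute the real rates for your model.
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 4.00


def cost_usd(call: LlmCall) -> float:
    """Cost of one call from its token counts."""
    weighted = call.input_tokens * INPUT_PRICE_PER_M + call.output_tokens * OUTPUT_PRICE_PER_M
    return weighted / 1_000_000


def cost_by(calls, key):
    """Sum call costs grouped by an arbitrary key (tenant, feature, query type, ...)."""
    totals = defaultdict(float)
    for call in calls:
        totals[key(call)] += cost_usd(call)
    return dict(totals)


# Hypothetical call log, for illustration only.
calls = [
    LlmCall("acme", "search", 1_000_000, 0),
    LlmCall("acme", "summarize", 0, 500_000),
    LlmCall("globex", "search", 2_000_000, 250_000),
]
by_tenant = cost_by(calls, key=lambda c: c.tenant)
by_feature = cost_by(calls, key=lambda c: c.feature)
print(by_tenant, by_feature)
```

The same grouping works for query type or any other label, as long as the label is recorded at call time rather than reconstructed later.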