7. The Verbal Cheat Sheet

For any architecture question, drop these unprompted:

"My eval suite would cover retrieval metrics — MRR and Recall@k — and generation metrics — Faithfulness and Answer Relevancy via a calibrated LLM-as-judge."
"I'd run the eval suite on every PR and block regressions; in prod I'd run the same judge against 1-5% of sampled traffic to catch drift."
"Per-stage tracing is non-negotiable — embed, retrieve, rerank, prompt-build, generate, each as its own span with token counts and cost annotations."
"I track tokens/sec, TTFT, TPOT, input/output token distributions, and cost-per-request at p50/p95/p99, segmented by tenant and use case."
"For drift I'd watch input-distribution KL divergence, output similarity on canary queries, and judge calibration drift quarterly."
"PII never goes to shared observability — payloads stay tenant-scoped under CMEK, only metadata flows up."
"v1 ships with a 30-example golden set and per-stage Cloud Trace; v2 expands the set from prod logs and adds sampled online evals."

8. The Three Things to Internalize

If everything else gets compressed under interview pressure, hold these:

Offline eval is your can-it-be-good; online observability is your is-it-good-now. Both are required and they feed each other; the golden set is the spine.
For a RAG/agent system, you need per-stage tracing or you can't diagnose anything. For LLM debugging the priority is tracing > logging > metrics, the inverse of the classical web-service ordering.
Quality drift is invisible to classical observability. Sampled online evals are the only way to catch it. Customers typically don't have them in place, and that's exactly the gap an FDE closes.
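
The sampled online eval in the last point can be sketched as a deterministic, hash-based sampler that decides which production requests get sent to the judge. This is one possible approach under stated assumptions: the function name, the use of a request ID as the sampling key, and the 2% rate in the comment are all hypothetical.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for online evaluation.

    Hashing the request ID (instead of calling random()) means the same request
    always gets the same decision, so retries and replays stay consistent.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    threshold = int(rate * 2**64)               # integer compare avoids float rounding
    return bucket < threshold


# Hypothetical call site: only sampled requests pay the cost of a judge call.
# if should_sample(request_id, rate=0.02):
#     enqueue_for_judge(request_id)  # enqueue_for_judge is an assumed helper
```

Keying on the request ID keeps the sample stable across services that see the same request, which makes it easier to join judge scores back to the per-stage trace for that request.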