For any architecture question, drop these unprompted:
- "My eval suite would cover retrieval metrics — MRR and Recall@k — and generation metrics — Faithfulness and Answer Relevancy via a calibrated LLM-as-judge."
- "I'd run the eval suite on every PR and block regressions; in prod I'd run the same judge against 1-5% of sampled traffic to catch drift."
- "Per-stage tracing is non-negotiable — embed, retrieve, rerank, prompt-build, generate, each as its own span with token counts and cost annotations."
- "I track tokens/sec, TTFT, TPOT, input/output token distributions, and cost-per-request at p50/p95/p99, segmented by tenant and use case."
- "For drift I'd watch input-distribution KL divergence and output similarity on canary queries, and I'd re-check judge calibration quarterly."
- "PII never goes to shared observability — payloads stay tenant-scoped under CMEK, only metadata flows up."
- "v1 ships with a 30-example golden set and per-stage Cloud Trace; v2 expands the set from prod logs and adds sampled online evals."
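
If you are asked to back up the retrieval-metrics line, it helps to have the definitions at your fingertips. Below is a minimal sketch of MRR and Recall@k over a golden set, assuming each golden example is a pair of (ranked document IDs returned by the retriever, set of IDs labeled relevant). The function names and the golden-set shape are illustrative assumptions, not a specific library's API.

```python
from statistics import mean
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical golden-set shape: (ranked doc IDs from the retriever, relevant doc IDs).
GoldenExample = Tuple[Sequence[str], Set[str]]


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return the fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0  # assumption: examples with no labeled relevant docs score 0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def evaluate_retrieval(golden: List[GoldenExample], k: int = 5) -> Dict[str, float]:
    """Average both metrics over the golden set."""
    return {
        "mrr": mean(reciprocal_rank(r, rel) for r, rel in golden),
        "recall_at_k": mean(recall_at_k(r, rel, k) for r, rel in golden),
    }


if __name__ == "__main__":
    golden_set: List[GoldenExample] = [
        (["a", "b", "c"], {"b"}),       # first relevant hit at rank 2
        (["x", "y", "z"], {"z", "q"}),  # first relevant hit at rank 3, "q" never retrieved
    ]
    print(evaluate_retrieval(golden_set, k=2))
```

In a CI gate, you would compare these averages against the values stored for the main branch and fail the PR when either drops by more than an agreed margin.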
8. The Three Things to Internalize
If everything else gets compressed under interview pressure, hold these:
- Offline eval is your can-it-be-good; online observability is your is-it-good-now. Both are required and feed each other, with the golden set as the spine.
- For a RAG/agent system, you need per-stage tracing or you can't diagnose anything. Tracing > logging > metrics for LLM debugging — the inverse of classical web service priority.
- Quality drift is invisible to classical observability. Sampled online evals are the only way to catch it. Customers typically don't have this in place, and that's exactly the gap an FDE closes.
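
To make the sampled-online-eval point concrete, here is a minimal sketch of how the sampling and the drift check could fit together. It assumes each logged record is a dict with `request_id`, `question`, and `answer` keys, and that `judge` is some callable returning a score in [0, 1]; those names, the 2% default rate, and the tolerance are all hypothetical choices for illustration. Hashing the request ID keeps the sampling decision deterministic, so a request is either always or never in the eval sample.

```python
import hashlib
from statistics import mean
from typing import Callable, Dict, Iterable, List


def should_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def judge_sampled(
    records: Iterable[Dict[str, str]],
    judge: Callable[[str, str], float],  # hypothetical LLM-as-judge wrapper
    rate: float = 0.02,
) -> List[float]:
    """Run the judge only on the sampled slice of production traffic."""
    return [
        judge(r["question"], r["answer"])
        for r in records
        if should_sample(r["request_id"], rate)
    ]


def has_drifted(scores: List[float], baseline: float, tolerance: float = 0.05) -> bool:
    """Flag drift when the mean sampled score falls below baseline minus tolerance."""
    if not scores:
        return False  # assumption: no sampled traffic means no signal, not an alert
    return mean(scores) < baseline - tolerance
```

The baseline here would come from the offline golden-set run, which is one way the two halves feed each other: the offline score sets the bar that the online sample is checked against.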