(what to measure, with what tools, how to use the results)
The single most-cited piece of customer advice in this entire field: you cannot improve what you don't measure. Every senior FDE has war stories about customers who "added more retrieval techniques" for six months and made things worse because they had no eval.
Eval splits into three orthogonal dimensions:
| Dimension | Question | Metric family |
|---|---|---|
| Retrieval quality | Did we get the right chunks? | Hit@k, MRR, NDCG, Context Precision/Recall |
| Generation quality | Did we answer correctly given those chunks? | Faithfulness, Answer Relevancy, Correctness |
| End-to-end quality | Is the user happy? | Task success, user feedback, business metric |
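You measure all three because a fix in one dimension is invisible if you only watch another: better retrieval will not show up in user satisfaction if generation ignores the retrieved chunks, and a faithful answer built on the wrong chunks still fails the user.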
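The retrieval metric family named in the table (Hit@k, MRR, NDCG) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes binary relevance (a chunk is either relevant or not), unique chunk IDs in each result list, and a golden set shaped as `(retrieved_ids, relevant_ids)` pairs. The function names and that data shape are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant chunk ID appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains (assumption: no graded relevance labels)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    # Ideal ranking puts every relevant chunk first, capped at k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(
    golden: List[Tuple[Sequence[str], Set[str]]], k: int = 5
) -> Dict[str, float]:
    """Average each metric over the golden set; MRR is the mean reciprocal rank."""
    if not golden:
        raise ValueError("golden set is empty")
    n = len(golden)
    return {
        f"hit@{k}": sum(hit_at_k(r, rel, k) for r, rel in golden) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in golden) / n,
        f"ndcg@{k}": sum(ndcg_at_k(r, rel, k) for r, rel in golden) / n,
    }
```

Hit@k only asks whether anything relevant made the cut, MRR rewards putting the first relevant chunk near the top, and NDCG accounts for the position of every relevant chunk, so the three can move independently after a retrieval change.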
Retrieval metrics: know these cold
"Every change to chunking, embedding, retrieval, or prompt triggers an automated eval run against the golden set. PRs that drop Hit@5 or Faithfulness by more than 2% fail CI. Same way you'd test code. The eval set becomes the contract."
Eval is regression testing for RAG. You build it once per customer and it keeps paying back on every later change. It is also a clean answer to "how do we trust this in production": the same way we trust any other production system, because it is tested, versioned, and regression-gated.
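A regression gate of this kind could look roughly like the sketch below. It assumes the eval run produces a flat dictionary of metric scores on a 0 to 1 scale and that "2%" means an absolute drop of 0.02. The notes do not say whether the threshold is absolute or relative, so treat that as an assumption. The function names and metric keys are hypothetical.

```python
from typing import Dict, Iterable, List


def regression_failures(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    gated: Iterable[str] = ("hit@5", "faithfulness"),
    max_drop: float = 0.02,  # assumption: absolute drop on a 0-1 scale
) -> List[str]:
    """Return one message per gated metric that dropped by more than max_drop."""
    failures = []
    for name in gated:
        drop = baseline[name] - candidate[name]
        if drop > max_drop:
            failures.append(
                f"{name}: {baseline[name]:.3f} -> {candidate[name]:.3f} "
                f"(drop {drop:.3f} exceeds {max_drop:.3f})"
            )
    return failures


def exit_code(baseline: Dict[str, float], candidate: Dict[str, float]) -> int:
    """0 if the candidate passes the gate, 1 otherwise (for use as a CI status)."""
    failures = regression_failures(baseline, candidate)
    for message in failures:
        print(f"EVAL REGRESSION: {message}")
    return 1 if failures else 0
```

In a CI job, the baseline scores would come from the main branch's last eval run and the candidate scores from the PR's run, and the job would end with `sys.exit(exit_code(baseline, candidate))`. Improvements never fail the gate, only drops beyond the threshold.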
Eval = offline, on a fixed set. Observability = online, on real traffic. You need both.
GCP services to name: Cloud Trace for span-level tracing, Cloud Logging for structured logs, Cloud Monitoring for dashboards/alerts, Looker or BigQuery for analysis. Vertex AI has built-in agent tracing.
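On the online side, a simple starting point is one structured log line per request. The sketch below uses only the Python standard library and writes JSON lines. The assumption is that the service runs somewhere (for example Cloud Run or GKE) where JSON written to stdout is ingested by Cloud Logging as a structured payload, with `severity` and `message` treated as special fields. The other field names (`query_id`, `retrieved_ids`, and so on) are hypothetical and should match whatever you later query in BigQuery or Looker.

```python
import json
import logging
import sys
from typing import IO, List, Optional


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            # Extra per-request fields attached via `extra={"fields": ...}`.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)


def build_logger(name: str = "rag", stream: IO[str] = sys.stdout) -> logging.Logger:
    """Create a logger that emits JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate handlers on repeated setup
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_rag_request(
    logger: logging.Logger,
    query_id: str,
    retrieved_ids: List[str],
    retrieval_ms: float,
    generation_ms: float,
    user_feedback: Optional[str] = None,
) -> None:
    """Log one RAG request with the fields needed to debug it later."""
    logger.info(
        "rag_request",
        extra={
            "fields": {
                "query_id": query_id,
                "retrieved_ids": retrieved_ids,
                "retrieval_ms": retrieval_ms,
                "generation_ms": generation_ms,
                "user_feedback": user_feedback,
            }
        },
    )
```

Logging the retrieved chunk IDs alongside latency and feedback is what lets you replay a bad production answer against the offline golden set, which is where the two halves (eval and observability) connect.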