the eval loop

(what to measure, with what tools, how to use the results)

The single most-cited piece of customer advice in this entire field: you cannot improve what you don't measure. Every senior FDE has war stories about customers who "added more retrieval techniques" for six months and made things worse because they had no eval.

The mental model

Eval splits into three orthogonal dimensions:

Dimension	Question	Metric family
Retrieval quality	Did we get the right chunks?	Hit@k, MRR, NDCG, Context Precision/Recall
Generation quality	Did we answer correctly given those chunks?	Faithfulness, Answer Relevancy, Correctness
End-to-end quality	Is the user happy?	Task success, user feedback, business metric

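You measure all three because a fix in one layer is invisible without the others: better retrieval can raise Hit@k without changing a single answer, and a change in answer quality cannot be attributed to a cause unless retrieval was measured too.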

Retrieval metrics — know these cold
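As a concrete reference for the metrics named in the table above, here is a minimal sketch of Hit@k, MRR, and NDCG@k in plain Python. It assumes binary relevance (a chunk is either in the golden set for a query or not) and that chunks are identified by string IDs; the function names and the sample chunk IDs are illustrative, not taken from any particular framework.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant chunk appears in the top k results, else 0.0."""
    return 1.0 if any(chunk in relevant for chunk in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over all (ranked, relevant) query pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, chunk in enumerate(ranked[:k], start=1)
        if chunk in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: the retriever returned four chunks, two of which are golden.
ranked = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}
print(hit_at_k(ranked, relevant, 1))     # 0.0 (top result is not relevant)
print(reciprocal_rank(ranked, relevant)) # 0.5 (first relevant chunk is at rank 2)
print(round(ndcg_at_k(ranked, relevant, 4), 3))
```

Hit@k only asks whether anything useful made it into the context window, MRR rewards putting the first useful chunk near the top, and NDCG accounts for the position of every relevant chunk.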

LLM-as-judge
Building a golden eval set
The eval frameworks to name

"Every change to chunking, embedding, retrieval, or prompt triggers an automated eval run against the golden set. PRs that drop Hit@5 or Faithfulness by more than 2% fail CI. Same way you'd test code. The eval set becomes the contract."

Eval is regression testing for RAG. You build it once per customer and it pays back forever. It is also a clean answer to "how do we trust this in production": the same way we trust any other production system, by testing, versioning, and regression-gating it.
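The regression gate described above can be sketched as a small comparison step that CI runs after the eval job. This is a minimal sketch under stated assumptions: metrics are on a 0 to 1 scale, "more than 2%" is read as an absolute drop of 0.02, and the metric keys (`hit@5`, `faithfulness`) and the function name are hypothetical rather than tied to any specific eval framework.

```python
from typing import Dict, List, Sequence

# Assumption: the two metrics the gate protects, keyed as the eval job reports them.
GATED_METRICS = ("hit@5", "faithfulness")
# Assumption: "2%" means an absolute drop of 0.02 on a 0-1 scale.
MAX_DROP = 0.02


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    gated: Sequence[str] = GATED_METRICS,
    max_drop: float = MAX_DROP,
) -> List[str]:
    """Return one message per gated metric that dropped by more than max_drop."""
    failures = []
    for name in gated:
        drop = baseline[name] - candidate[name]
        if drop > max_drop:
            failures.append(
                f"{name}: {baseline[name]:.3f} -> {candidate[name]:.3f} (drop {drop:.3f})"
            )
    return failures


# Hypothetical scores: main branch vs. the PR under review.
baseline = {"hit@5": 0.90, "faithfulness": 0.88}
candidate = {"hit@5": 0.84, "faithfulness": 0.89}

for message in find_regressions(baseline, candidate):
    print("REGRESSION", message)
```

In a CI job, the script would exit with a non-zero status whenever `find_regressions` returns a non-empty list, which is what makes the PR fail. Storing the baseline scores alongside the golden set keeps the "contract" versioned with the code.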

Observability — the runtime complement

Eval = offline, on a fixed set. Observability = online, on real traffic. You need both.

What to log per query (this is a checklist worth memorizing):

GCP services to name: Cloud Trace for span-level tracing, Cloud Logging for structured logs, Cloud Monitoring for dashboards/alerts, Looker or BigQuery for analysis. Vertex AI has built-in agent tracing.
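One way to make per-query logging concrete is to emit a single JSON line per request. The sketch below uses only the Python standard library. The field names are illustrative assumptions, not a definitive checklist, and the helper names are hypothetical. On GCP runtimes that ingest stdout (Cloud Run, for example), JSON lines like this are typically parsed into structured log entries, with `severity` and `message` treated as special fields.

```python
import json
import time
import uuid
from typing import Any, Dict, List


def build_query_record(
    query: str,
    retrieved: List[Dict[str, Any]],
    answer: str,
    latency_ms: float,
    prompt_version: str,
    model: str,
) -> Dict[str, Any]:
    """Assemble one structured log record for a single RAG query.

    Every field name here is an illustrative assumption.
    """
    return {
        "severity": "INFO",
        "message": "rag_query",
        "query_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "retrieved_chunk_ids": [chunk["id"] for chunk in retrieved],
        "retrieval_scores": [chunk["score"] for chunk in retrieved],
        "answer": answer,
        "latency_ms": latency_ms,
        "prompt_version": prompt_version,
        "model": model,
    }


def log_query(record: Dict[str, Any]) -> None:
    """Write the record as one JSON line so a log agent can parse it."""
    print(json.dumps(record), flush=True)


# Hypothetical request, for illustration only.
record = build_query_record(
    query="What is the refund window?",
    retrieved=[{"id": "c2", "score": 0.81}, {"id": "c4", "score": 0.63}],
    answer="Refunds are accepted within 30 days.",
    latency_ms=412.0,
    prompt_version="v3",
    model="example-model",
)
log_query(record)
```

Logging chunk IDs and scores next to the answer is what lets you replay a bad production answer against the offline eval set later and tell a retrieval failure from a generation failure.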

Drift detection — the thing customers don't see coming