- 2.1 The Golden Dataset
- 2.2 Retrieval Metrics (for RAG systems)
- 2.3 Generation Metrics
- 2.4 Agent-Specific Metrics
- 2.5 LLM-as-Judge (the technique that powers most of the above)
2.6 Frameworks to Name
- RAGAS — the most-cited OSS RAG eval library. Implements Faithfulness, Answer Relevancy, Context Precision/Recall out of the box.
- DeepEval / Promptfoo — general-purpose LLM eval frameworks. DeepEval supports pytest-style assertions; Promptfoo declares its assertions in a YAML config and runs them from a CLI.
- Vertex AI Gen AI Evaluation Service — Google's managed offering. Supports both computation-based and model-based metrics and integrates with Vertex AI Pipelines. Name this in GCP-flavored answers.
- LangSmith / Langfuse — used for both evals and tracing; the dataset + experiments features are eval-flavored.
- TruLens — eval + observability oriented around the RAG triad (groundedness, context relevance, answer relevance).
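
To make "pytest-style assertions" concrete, here is a minimal, framework-free sketch of the pattern: compute a metric score for a test case and fail the test if it falls below a threshold. Everything in it is an assumption made for illustration. `EvalCase`, `toy_groundedness`, and `assert_metric` are hypothetical names, not the API of DeepEval, RAGAS, or any other library listed above, and the token-overlap score is only a crude stand-in for the LLM-as-judge scoring those frameworks typically use.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical container for one eval example (not a real library type)."""
    question: str
    answer: str
    contexts: list[str]


def _tokens(text: str) -> set[str]:
    # Lowercase, strip basic punctuation, drop empty tokens.
    stripped = (t.strip(".,;:!?").lower() for t in text.split())
    return {t for t in stripped if t}


def toy_groundedness(case: EvalCase) -> float:
    """Fraction of answer tokens that also appear in the retrieved contexts.

    A toy proxy for groundedness/faithfulness; real frameworks usually
    ask a judge model instead of counting token overlap.
    """
    answer_tokens = _tokens(case.answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(_tokens(c) for c in case.contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def assert_metric(case: EvalCase, metric, threshold: float) -> float:
    """Fail with a readable message if the metric score is below threshold."""
    score = metric(case)
    assert score >= threshold, f"{metric.__name__}={score:.2f} < {threshold}"
    return score


def test_grounded_answer_passes():
    # pytest collects any function named test_*; a plain assert is the check.
    case = EvalCase(
        question="What is the capital of France?",
        answer="Paris is the capital of France.",
        contexts=["Paris is the capital and largest city of France."],
    )
    assert_metric(case, toy_groundedness, threshold=0.8)
```

Saved in a file such as `test_evals.py` (a hypothetical name), this runs under `pytest` like any unit test, which is the main appeal of the pattern: eval regressions show up in the same CI run as ordinary test failures.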