Eval & Observability

A GenAI system without evals and observability is a demo, not a product.

1. The Core Conceptual Split: Offline Eval vs Online Observability

                    ┌──────────────────────┐
                    │  GenAI System Health │
                    └──────────┬───────────┘
                               │
              ┌────────────────┴─────────────────┐
              │                                  │
       OFFLINE EVAL                      ONLINE OBSERVABILITY
       "Is the system good?"             "What is the system doing right now?"
       Runs in CI / pre-deploy           Runs continuously in prod
       Fixed golden dataset              Live production traffic
       Optimizes for: quality            Optimizes for: latency, cost, errors
       Tools: pytest + judge LLM         Tools: traces, metrics, logs
       Cadence: every PR, nightly        Cadence: real-time + alerts
              │                                  │
              └────────────────┬─────────────────┘
                               │
                       FEEDBACK LOOP
              (prod failures become eval cases)

Offline eval answers "if I change the prompt / model / retriever, did quality go up or down?" You run it against a fixed golden dataset, so the comparison between runs is apples-to-apples.
Online observability answers "is the live system currently healthy?" You watch live traffic as it happens: there is no fixed input set to rerun, so you cannot compare runs like-for-like.
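
To make the offline half concrete, here is a minimal sketch of a regression gate over a fixed golden set. Everything named here (GOLDEN_SET, the two stubbed systems, exact_match, regression_gate) is a hypothetical stand-in for illustration; in a real setup the scorer would more likely be a judge LLM and the systems would be actual pipeline calls.

```python
from typing import Callable

# Hypothetical golden dataset: a fixed list of questions with expected answers.
GOLDEN_SET = [
    {"question": "capital of france", "expected": "paris"},
    {"question": "2 + 2", "expected": "4"},
    {"question": "largest planet", "expected": "jupiter"},
]


def exact_match(answer: str, expected: str) -> bool:
    """Stand-in scorer; a judge LLM call could replace this comparison."""
    return answer.strip().lower() == expected.strip().lower()


def pass_rate(system: Callable[[str], str], dataset: list[dict]) -> float:
    """Fraction of golden cases the system answers correctly."""
    passed = sum(
        exact_match(system(case["question"]), case["expected"]) for case in dataset
    )
    return passed / len(dataset)


# Two stubbed system variants (assumptions for illustration only).
_BASELINE_ANSWERS = {
    "capital of france": "Paris",
    "2 + 2": "4",
    "largest planet": "Saturn",  # the baseline gets this one wrong
}
_CANDIDATE_ANSWERS = {**_BASELINE_ANSWERS, "largest planet": "Jupiter"}


def baseline_system(question: str) -> str:
    return _BASELINE_ANSWERS.get(question, "")


def candidate_system(question: str) -> str:
    return _CANDIDATE_ANSWERS.get(question, "")


def regression_gate(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Pass only if the candidate is no worse than the baseline minus tolerance."""
    return candidate >= baseline - tolerance


baseline_score = pass_rate(baseline_system, GOLDEN_SET)
candidate_score = pass_rate(candidate_system, GOLDEN_SET)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
print("gate passed:", regression_gate(baseline_score, candidate_score))
```

Because both variants are scored on the same cases, any difference in pass rate comes from the change under test, not from a shift in the inputs.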

They feed each other: production failures become new test cases for offline eval; offline eval improvements become production deploys you watch via observability.
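
The "production failures become eval cases" half of the loop could look roughly like the sketch below. The trace schema (trace_id, question, answer, feedback) and the promote_failures helper are assumptions made for illustration, not the format of any particular tracing tool.

```python
def promote_failures(traces: list[dict], golden_set: list[dict]) -> list[dict]:
    """Turn thumbs-down production traces into candidate eval cases.

    Questions already in the golden set are skipped. The expected answer
    is left as None so a human reviewer can fill it in before the case
    joins the golden set.
    """
    known_questions = {case["question"] for case in golden_set}
    new_cases = []
    for trace in traces:
        if trace.get("feedback") != "thumbs_down":
            continue  # only failures are promoted
        question = trace["question"]
        if question in known_questions:
            continue  # already covered, or already promoted in this batch
        new_cases.append(
            {
                "question": question,
                "expected": None,  # to be written during human review
                "bad_answer": trace["answer"],
                "source_trace_id": trace["trace_id"],
            }
        )
        known_questions.add(question)
    return new_cases


# Hypothetical data for a quick run.
golden = [{"question": "refund policy?", "expected": "30 days"}]
prod_traces = [
    {"trace_id": "t1", "question": "refund policy?", "answer": "90 days", "feedback": "thumbs_down"},
    {"trace_id": "t2", "question": "shipping time?", "answer": "unknown", "feedback": "thumbs_down"},
    {"trace_id": "t3", "question": "store hours?", "answer": "9 to 5", "feedback": "thumbs_up"},
    {"trace_id": "t4", "question": "shipping time?", "answer": "n/a", "feedback": "thumbs_down"},
]
candidates = promote_failures(prod_traces, golden)
print(candidates)
```

Leaving expected empty is a deliberate choice in this sketch: the production answer was wrong, so it cannot serve as the reference.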

2. Offline Eval — The Structured Part

3. Online Observability — The Live Part

4. Drift Detection — The Quiet Killer

5. The Lifecycle — Putting It All Together

6. Anti-Patterns FDEs Find in Customer Systems

7. The Verbal Cheat Sheet

RETRIEVAL METRICS              GENERATION METRICS
(query ↔ context)              (involve the answer)
─────────────────              ──────────────────
Hit@k                          Faithfulness  (context ↔ answer)
MRR                            Answer Relevancy  (query ↔ answer)
NDCG@k                         Answer Correctness  (vs reference)
Recall@k                       Citation Accuracy
Context Precision              Toxicity / Safety
Context Recall
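
The rank-based retrieval metrics in the left column have standard closed-form definitions, sketched below for a single query with binary relevance. The function names and document IDs are illustrative, and the sketch assumes the retrieved list contains no duplicates. MRR is the mean of the per-query reciprocal rank. Context Precision and Context Recall are left out because their definitions vary by eval framework.

```python
import math


def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc (1-indexed), or 0.0 if none is found."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of this ranking / DCG of the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical single query: two relevant docs, one of them ranked second.
retrieved_docs = ["d3", "d1", "d7", "d2"]
relevant_docs = {"d1", "d2"}

print("Hit@3   ", hit_at_k(retrieved_docs, relevant_docs, 3))
print("RR      ", reciprocal_rank(retrieved_docs, relevant_docs))
print("Recall@3", recall_at_k(retrieved_docs, relevant_docs, 3))
print("NDCG@3  ", round(ndcg_at_k(retrieved_docs, relevant_docs, 3), 4))
```

Hit@k and Recall@k ignore position within the top k, while reciprocal rank and NDCG reward placing relevant documents higher.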