A GenAI system without evals and observability is a demo, not a product.
                     ┌──────────────────────┐
                     │  GenAI System Health │
                     └──────────┬───────────┘
                                │
            ┌───────────────────┴───────────────────┐
            │                                       │
      OFFLINE EVAL                        ONLINE OBSERVABILITY
  "Is the system good?"             "What is the system doing right now?"
  Runs in CI / pre-deploy           Runs continuously in prod
  Fixed golden dataset              Live production traffic
  Optimizes for: quality            Optimizes for: latency, cost, errors
  Tools: pytest + judge LLM         Tools: traces, metrics, logs
  Cadence: every PR, nightly        Cadence: real-time + alerts
            │                                       │
            └───────────────────┬───────────────────┘
                                │
                          FEEDBACK LOOP
                (prod failures become eval cases)
The two halves feed each other: production failures become new test cases for offline eval, and improvements validated offline become production deploys that you monitor through observability.
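As a rough illustration of the feedback loop, the sketch below promotes a failed production trace into a golden-dataset row. Everything named here is an assumption made for the example: the `EvalCase` schema, the `promote_failure` helper, the JSONL file layout, and the `trace_id` / `query` keys on the trace. None of it comes from a specific tool.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class EvalCase:
    """One row of the golden dataset (hypothetical schema)."""
    case_id: str
    query: str
    expected: str  # reference answer supplied by a human reviewer
    source: str    # "curated" or "prod_failure"


def promote_failure(trace: dict, expected: str, dataset: Path) -> EvalCase:
    """Turn a failed production trace into a golden-dataset case.

    The append is skipped when a case with the same id already exists,
    so re-processing the same trace does not duplicate rows.
    """
    case = EvalCase(
        case_id=f"prod-{trace['trace_id']}",
        query=trace["query"],
        expected=expected,
        source="prod_failure",
    )
    existing = set()
    if dataset.exists():
        with dataset.open(encoding="utf-8") as f:
            existing = {json.loads(line)["case_id"] for line in f if line.strip()}
    if case.case_id not in existing:
        with dataset.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(case)) + "\n")
    return case
```

A pytest suite could then load the same JSONL file and parametrize over its rows, so every promoted failure is re-checked on each PR and nightly run.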
2. Offline Eval — The Structured Part
3. Online Observability — The Live Part
4. Drift Detection — The Quiet Killer
5. The Lifecycle — Putting It All Together
6. Anti-Patterns FDEs Find in Customer Systems
RETRIEVAL METRICS        GENERATION METRICS
(query ↔ context)        (involve the answer)
─────────────────        ──────────────────
Hit@k                    Faithfulness (context ↔ answer)
MRR                      Answer Relevancy (query ↔ answer)
NDCG@k                   Answer Correctness (vs reference)
Recall@k                 Citation Accuracy
Context Precision        Toxicity / Safety
Context Recall
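To make the retrieval column concrete, here is a minimal sketch of four of those metrics for a single query. It assumes binary relevance (a document is either relevant or not) and that `ranked` is the retriever's output in rank order. The function names are illustrative, not taken from any particular library.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(hit_at_k(ranked, relevant, 3))      # 1.0
print(recall_at_k(ranked, relevant, 3))   # 0.5
print(reciprocal_rank(ranked, relevant))  # 0.5
```

MRR is the mean of `reciprocal_rank` over all queries in the golden dataset; the other three are usually averaged across queries in the same way.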