If asked "how would you operationalize an agentic system at scale," this is the picture you draw, in this order.
┌──────────────────────────────────────────────────────────────┐
│  1. DEVELOPMENT                                              │
│  - Write golden dataset (start small, 30 examples)           │
│  - Run eval suite locally before commit                      │
│  - Pytest assertions on quality thresholds                   │
└────────────────┬─────────────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────────────┐
│  2. CI/CD                                                    │
│  - On every PR: run full eval suite                          │
│  - Block merge if quality regresses > X%                     │
│  - Track scores over time (LangSmith / Vertex AI Pipelines)  │
└────────────────┬─────────────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────────────┐
│  3. STAGED ROLLOUT                                           │
│  - Deploy to 1% of traffic first (canary)                    │
│  - Compare online metrics: latency, cost, eval-on-sample     │
│  - Promote if green for N hours, rollback if red             │
└────────────────┬─────────────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────────────┐
│  4. PRODUCTION                                               │
│  - Live tracing, metrics, sampled online evals               │
│  - Alerts on: error rate, p95 latency, cost, quality drop    │
│  - Dashboards per tenant / per use case                      │
└────────────────┬─────────────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────────────┐
│  5. FEEDBACK LOOP                                            │
│  - Sampled failures become new golden-set entries            │
│  - User thumbs-down → flagged for review → added to set      │
│  - Weekly: review drift dashboards, retrain rerankers        │
│  - Quarterly: recalibrate judge against humans               │
└──────────────────────────────────────────────────────────────┘
(and loop back to step 1)
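
If the interviewer pushes on what the gates in steps 2 and 3 actually compute, a small sketch helps. The Python below is one possible shape, not a reference implementation: every name in it (`regression_gate`, `WindowMetrics`, `canary_is_green`, `canary_decision`) is hypothetical, and `max_drop_pct`, `tolerance`, and `required_green` stand in for the X% and N hours that the diagram leaves unspecified.

```python
from dataclasses import dataclass


def regression_gate(baseline: float, candidate: float, max_drop_pct: float) -> bool:
    """Step 2: return True if the PR may merge.

    Scores are aggregate eval scores where higher is better (assumed).
    max_drop_pct plays the role of "X%" in the diagram.
    """
    if baseline <= 0:
        return True  # no meaningful baseline yet, so nothing to regress against
    drop_pct = (baseline - candidate) / baseline * 100
    return drop_pct <= max_drop_pct


@dataclass
class WindowMetrics:
    """Metrics for one observation window (for example, one hour) of traffic."""
    error_rate: float
    p95_latency_ms: float
    cost_per_request: float
    eval_score: float  # score from evals run on a sample of live traffic


def canary_is_green(control: WindowMetrics, canary: WindowMetrics,
                    tolerance: float = 0.10) -> bool:
    """Step 3: canary is green if no metric is worse than control by more
    than `tolerance` (relative). The 10% default is an arbitrary placeholder."""
    return (
        canary.error_rate <= control.error_rate * (1 + tolerance)
        and canary.p95_latency_ms <= control.p95_latency_ms * (1 + tolerance)
        and canary.cost_per_request <= control.cost_per_request * (1 + tolerance)
        and canary.eval_score >= control.eval_score * (1 - tolerance)
    )


def canary_decision(windows: list[bool], required_green: int) -> str:
    """Promote after `required_green` consecutive green windows ("N hours"),
    roll back on any red window, otherwise keep waiting."""
    if not all(windows):
        return "rollback"
    if len(windows) >= required_green:
        return "promote"
    return "hold"
```

For example, `regression_gate(0.90, 0.85, 2.0)` blocks the merge because the score dropped by about 5.6%, and `canary_decision([True, True], 3)` returns `"hold"` until a third green window arrives. A purely relative tolerance is the simplest choice to explain; with a near-zero control error rate you would likely add an absolute floor as well.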