If asked "How would you operationalize an agentic system at scale?", this is the picture to draw, in this order.

┌───────────────────────────────────────────────────────────────┐
│  1. DEVELOPMENT                                               │
│     - Write golden dataset (start small, 30 examples)         │
│     - Run eval suite locally before commit                    │
│     - Pytest assertions on quality thresholds                 │
└────────────────┬──────────────────────────────────────────────┘
                 │
┌────────────────▼──────────────────────────────────────────────┐
│  2. CI/CD                                                     │
│     - On every PR: run full eval suite                        │
│     - Block merge if quality regresses > X%                   │
│     - Track scores over time (LangSmith / Vertex AI Pipelines)│
└────────────────┬──────────────────────────────────────────────┘
                 │
┌────────────────▼──────────────────────────────────────────────┐
│  3. STAGED ROLLOUT                                            │
│     - Deploy to 1% of traffic first (canary)                  │
│     - Compare online metrics: latency, cost, eval-on-sample   │
│     - Promote if green for N hours, roll back if red          │
└────────────────┬──────────────────────────────────────────────┘
                 │
┌────────────────▼──────────────────────────────────────────────┐
│  4. PRODUCTION                                                │
│     - Live tracing, metrics, sampled online evals             │
│     - Alerts on: error rate, p95 latency, cost, quality drop  │
│     - Dashboards per tenant / per use case                    │
└────────────────┬──────────────────────────────────────────────┘
                 │
┌────────────────▼──────────────────────────────────────────────┐
│  5. FEEDBACK LOOP                                             │
│     - Sampled failures become new golden-set entries          │
│     - User thumbs-down → flagged for review → added to set    │
│     - Weekly: review drift dashboards, retrain rerankers      │
│     - Quarterly: recalibrate LLM judge against human labels   │
└───────────────────────────────────────────────────────────────┘
              (and loop back to step 1)
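
The two decision points in the picture, the merge gate in step 2 and the promote/rollback call in step 3, can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the names `merge_allowed`, `CanaryWindow`, and `canary_decision` are hypothetical, and `max_regression_pct` and `required_green_hours` stand in for the X% and N hours that the diagram leaves unspecified.

```python
from dataclasses import dataclass
from statistics import mean


def regression_pct(baseline: float, candidate: float) -> float:
    """Relative drop of candidate vs. baseline, in percent (positive = worse)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - candidate) / baseline * 100.0


def merge_allowed(baseline_scores, candidate_scores, max_regression_pct):
    """Step 2 gate: allow the merge only if mean quality drops by at most the threshold.

    max_regression_pct is the "X%" from the diagram (an assumed parameter).
    """
    drop = regression_pct(mean(baseline_scores), mean(candidate_scores))
    return drop <= max_regression_pct


@dataclass
class CanaryWindow:
    """Hypothetical summary of the canary's online metrics."""
    hours_green: float  # consecutive hours with every metric within limits
    any_red: bool       # True if latency, cost, or sampled eval breached a limit


def canary_decision(window: CanaryWindow, required_green_hours: float) -> str:
    """Step 3 rule: roll back on any red, promote after N green hours, else keep waiting."""
    if window.any_red:
        return "rollback"
    if window.hours_green >= required_green_hours:
        return "promote"
    return "hold"


if __name__ == "__main__":
    # Golden-set scores from the main branch vs. the PR branch (made-up numbers).
    print(merge_allowed([0.9, 0.8], [0.85, 0.8], max_regression_pct=5.0))
    print(canary_decision(CanaryWindow(hours_green=2.0, any_red=False), 6.0))
```

In a real pipeline the same `merge_allowed` check could sit inside a pytest assertion locally (step 1) and in the PR job (step 2), so both stages enforce one threshold.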