The governing idea is simple: production systems will fail; your job is to design for graceful failure, fast recovery, and continuous learning.
Three pillars: Reliability, Resilience, Performance.
- Pillar 1 — Reliability
- Pillar 2 — Resilience
- Pillar 3 — Performance
How to Use This in an Interview
When you sketch any architecture and the interviewer asks "how do you keep this reliable in production?", your answer should hit:
- SLO for each critical path (latency, error rate)
- Timeouts on every external call
- Circuit breakers around your LLM and vector DB
- Retry with exponential backoff + jitter
- Graceful degradation path (cached fallback, simplified response)
- Bulkheads if multi-tenant
- Blameless postmortem process when things fail
- Distributed tracing so you can diagnose fast
You don't need all eight in every answer, but touching on four or five of them unprompted signals strong operational maturity.
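If the interviewer asks how a few of these fit together in code, a minimal sketch along the following lines can help. It combines three items from the list: retry with exponential backoff and jitter, a circuit breaker, and a graceful degradation path. All names here (`CircuitBreaker`, `call_with_retry`, `answer`, `llm_call`, the thresholds and the fallback message) are hypothetical and chosen for illustration; the sketch also assumes `llm_call` enforces its own timeout and raises an exception when that timeout expires.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then lets a trial
    call through once `reset_after` seconds have passed (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow a trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(fn, attempts=3, base=0.2, cap=5.0,
                    sleep=time.sleep, rng=random.random):
    """Call fn(); on failure, wait a random delay in
    [0, min(cap, base * 2**n)] ("full jitter") and try again."""
    for n in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying against an open breaker only adds load
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(rng() * min(cap, base * 2 ** n))


def answer(query, llm_call, breaker, cache, sleep=time.sleep):
    """Try the LLM through breaker + retry; degrade to a cached answer,
    or to a simplified canned response if nothing is cached."""
    try:
        return call_with_retry(lambda: breaker.call(llm_call, query), sleep=sleep)
    except Exception:
        return cache.get(query, "Service is busy; please try again shortly.")
```

The retry loop sits outside the breaker, so every failed attempt counts toward opening it, and once it opens the caller stops retrying and goes straight to the fallback. In a real system the breaker state would usually be shared per dependency (one for the LLM, one for the vector DB) and would need to be thread-safe, which this sketch is not.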