Fundamentals

Every System Has 4 Layers to Reason Through

LOAD →  [Ingress / Load Balancer]
           ↓
        [Application Layer]  ← stateless, horizontally scalable
           ↓
        [Data Layer]         ← stateful, the hard part
           ↓
        [External Deps]      ← auth, LLM APIs, payment, etc.

The Core Vocabulary

Scalability patterns:

Problem           | Solution                            | Trade-off
------------------+-------------------------------------+-------------------------------------
Too many requests | Horizontal scaling (more instances) | Stateless requirement
Slow DB reads     | Read replicas                       | Eventual consistency on reads
Hot DB rows       | Caching (Redis, Memcached)          | Cache invalidation complexity
Large datasets    | Sharding / partitioning             | Cross-shard queries become expensive
Bursty traffic    | Queue-based async processing        | Adds latency, complexity
Global users      | CDN + multi-region                  | Data residency, sync complexity
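
To make the caching row concrete, here is a minimal cache-aside sketch in Python. It is an illustration under stated assumptions, not a production design: the class name `CacheAside`, its `load` and `store` callbacks, and the `ttl_seconds` parameter are all hypothetical, and a plain in-memory dict stands in for Redis or Memcached. It shows where the "cache invalidation complexity" trade-off comes from: every write path has to remember to drop the cached copy.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside reads with TTL expiry and explicit invalidation on write.

    Hypothetical sketch: a dict stands in for an external cache such as Redis.
    """

    def __init__(
        self,
        load: Callable[[str], str],          # reads a value from the database
        store: Callable[[str, str], None],   # writes a value to the database
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._store = store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: fall through to the database and repopulate.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def put(self, key: str, value: str) -> None:
        # Write to the source of truth first, then invalidate the cached copy
        # so the next read cannot return the stale value.
        self._store(key, value)
        self._entries.pop(key, None)


# Usage: a dict plays the role of the database.
db = {"user:1": "Ada"}
cache = CacheAside(load=lambda k: db[k], store=db.__setitem__)
cache.get("user:1")           # miss: loaded from db
cache.get("user:1")           # hit: served from cache
cache.put("user:1", "Grace")  # write invalidates the entry
cache.get("user:1")           # miss again: reloads the new value
```

Invalidating on write (rather than updating the cached value in place) is one common choice; it keeps the write path simple at the cost of one extra miss after each update.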

Caching tiers (from closest to user to furthest):

Async vs Sync:

Consistency trade-offs (CAP theorem, one-sentence version):

The GenAI-Specific Layer

For FDE, system design questions will often have an AI component. Most GenAI patterns map directly onto standard distributed-systems patterns:

Standard Pattern       | GenAI Equivalent
-----------------------+--------------------------------------------------------------------
Caching hot DB queries | Semantic caching (cache similar prompts, not just exact matches)
Model routing          | Flash for triage → Pro for hard queries (cost + latency optimization)
Async queue            | Batch embedding jobs, offline document ingestion
Read replicas          | Serving a pre-built vector index vs re-indexing live
Rate limiting          | Per-tenant token budget, per-user request throttling
CDN                    | Prompt template caching, pre-computed FAQ answers
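
The semantic caching row can be sketched in a few lines of Python. This is a toy illustration, not a real implementation: `SemanticCache`, `toy_embed`, and the 0.9 similarity threshold are all assumptions made for the example, and the bag-of-words "embedding" stands in for a call to an actual embedding model. The point is the lookup rule: a prompt hits the cache when it is close enough to a stored prompt, not only when it matches exactly.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored answer when a new prompt is similar to a cached one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9) -> None:
        self._embed = embed
        self._threshold = threshold  # assumed cut-off; tune per use case
        self._entries: List[Tuple[Vector, str]] = []

    def lookup(self, prompt: str) -> Optional[str]:
        query = self._embed(prompt)
        best_score, best_answer = 0.0, None
        # Linear scan for clarity; a real system would use a vector index.
        for vector, answer in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None

    def store(self, prompt: str, answer: str) -> None:
        self._entries.append((self._embed(prompt), answer))


def toy_embed(text: str) -> List[float]:
    """Stand-in embedding: word counts over a tiny fixed vocabulary."""
    vocab = ["reset", "password", "refund", "order", "cancel"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(term)) for term in vocab]


semantic_cache = SemanticCache(embed=toy_embed, threshold=0.9)
semantic_cache.store("How do I reset my password?", "Use the 'Forgot password' link.")

hit = semantic_cache.lookup("reset password please")  # similar wording: cache hit
miss = semantic_cache.lookup("cancel my order")       # unrelated: falls through to the model
```

The threshold is the main design lever: set it too low and users get answers to questions they did not ask, set it too high and the cache degrades to exact matching.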