Scalability: 10K → Millions

The Core Mental Model: Three Axes

The Scaling Story: v1 → v10

v1 — 100 requests/day (you just launched)

What you have: One Cloud Run instance, one Gemini API call per query, one vector index (Vertex AI Vector Search or ChromaDB), one region.

What's fine: Everything. 100 requests/day is ~4/hour. Your system is over-provisioned. Single instance is fine.

What will break: The moment traffic spikes, that single instance becomes the bottleneck. No caching means every query hits the LLM — cost scales linearly. No observability means you're flying blind.

Nothing to fix yet, but plant these seeds: Turn on Cloud Logging and Cloud Trace from day one. They're cheap and you'll need the baseline.

v3 — 10,000 requests/day (~7 req/min)

What breaks at this scale: The single instance saturates. Cold starts become noticeable (Cloud Run spins down to zero between bursts). Repeated queries (same FAQ, same lookup) are burning LLM tokens you don't need to burn.

What you add:

Autoscaling: Cloud Run handles this natively — set min-instances=1 to kill cold starts, max-instances=N to cap cost. Now traffic bursts don't cause timeouts.

Exact-match cache: A Redis or Memorystore layer in front of the LLM. Hash the query string → look up cached response. Cache hit → skip vector search, skip LLM call entirely. Hit rate for FAQ-heavy workloads can be 30–40%.

Observability turned on: p50/p95/p99 latency dashboards in Cloud Monitoring. Token usage per request in Cloud Logging. You need this now because the next scale jump requires data to diagnose.

v5 — 100,000 requests/day (~70 req/min)

What breaks at this scale: Exact-match cache has limited coverage — users phrase the same question differently. Model costs are now meaningful (at $0.002/query × 100K = $200/day, it's a budget line item). Embedding precompute on ingest is slow because it's inline with the write path.

What you add:

Semantic cache: Instead of hashing the query string, embed the query and check cosine similarity against cached embeddings. If similarity > 0.95, return the cached response. This catches paraphrases that exact-match misses. Libraries like GPTCache implement this; you can build it yourself with a small vector store.

Model routing: Not every query needs Gemini Pro. Implement a triage layer — a fast, cheap model (Flash) classifies the query complexity. Simple factual lookups → Flash handles it. Ambiguous or multi-step reasoning → escalate to Pro. Rough cost reduction: 60–70% of queries go to the cheap model.