7. Production concerns — the FDE lens

This is the section you should articulate fluently in interviews.

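Ingestion pipeline: how do new documents get embedded and indexed? Batch (a nightly job) is cheaper, but new documents are not searchable until the next run. Streaming (Pub/Sub → Cloud Run → vector DB) keeps the index fresh at the cost of more ops work. Most enterprises start with batch and move to streaming once freshness matters. Either way the stages are the same: document → chunk → embed → write to vector DB → write metadata to a relational DB for filtering. Idempotency matters because retries happen: derive chunk IDs deterministically and use upserts, so a retried write overwrites the same record instead of creating a duplicate.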
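A minimal sketch of the idempotency point, assuming in-memory dicts in place of a real vector DB and metadata table. The names `vector_store`, `metadata_store`, `fake_embed`, and `ingest` are hypothetical, and the embedding is a hash-based placeholder rather than a real model:

```python
import hashlib

# Hypothetical in-memory stand-ins for a vector DB and a relational metadata table.
vector_store: dict[str, list[float]] = {}
metadata_store: dict[str, dict] = {}


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def fake_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder embedding derived from a hash. NOT a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def ingest(doc_id: str, text: str) -> int:
    """Chunk, embed, and upsert one document. Returns how many chunks were embedded."""
    embedded = 0
    for index, chunk in enumerate(chunk_text(text)):
        chunk_id = f"{doc_id}#{index}"  # deterministic: same input, same ID
        content_hash = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        existing = metadata_store.get(chunk_id)
        if existing is not None and existing["content_hash"] == content_hash:
            continue  # already written with identical content: the retry is a no-op
        vector_store[chunk_id] = fake_embed(chunk)  # upsert overwrites, never duplicates
        metadata_store[chunk_id] = {
            "doc_id": doc_id,
            "chunk_index": index,
            "content_hash": content_hash,
        }
        embedded += 1
    return embedded


first = ingest("doc-1", "x" * 500)  # embeds every chunk
retry = ingest("doc-1", "x" * 500)  # simulated redelivery of the same message
```

Here `first` counts every chunk and `retry` does no work, which is the property you want when a queue redelivers a message. The sketch does not remove leftover chunks when a document shrinks; a real pipeline would also delete chunk IDs that no longer exist.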

Re-embedding: when you change embedding models (which does happen, as better models are released), you must re-embed the entire corpus, because vectors produced by different models are not comparable. This is expensive and disruptive. Strategy: build a second index with the new model, dual-write new documents to both indexes during the migration, A/B test, then cut over. Vector DBs that support multiple "namespaces" or "collections" make this easier.
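The migration strategy can be sketched as follows, assuming a toy in-memory index with two collections. `DualWriteIndex`, the collection names `v1`/`v2`, and the lambda embedders are all hypothetical stand-ins, not any vendor's API:

```python
from typing import Callable

Embedder = Callable[[str], list[float]]


class DualWriteIndex:
    """Toy index with two collections, one per embedding model version."""

    def __init__(self, old_embed: Embedder, new_embed: Embedder) -> None:
        self.embedders: dict[str, Embedder] = {"v1": old_embed, "v2": new_embed}
        self.collections: dict[str, dict[str, list[float]]] = {"v1": {}, "v2": {}}
        self.read_from = "v1"  # queries keep hitting the old index until cut-over

    def upsert(self, doc_id: str, text: str) -> None:
        """Dual-write: every new or updated document lands in both collections."""
        for name, embed in self.embedders.items():
            self.collections[name][doc_id] = embed(text)

    def backfill(self, corpus: dict[str, str]) -> None:
        """Re-embed documents that predate the migration into the new collection."""
        for doc_id, text in corpus.items():
            if doc_id not in self.collections["v2"]:
                self.collections["v2"][doc_id] = self.embedders["v2"](text)

    def cut_over(self) -> None:
        """Switch reads to v2 only once it covers everything v1 holds."""
        missing = set(self.collections["v1"]) - set(self.collections["v2"])
        if missing:
            raise RuntimeError(f"v2 is missing {len(missing)} documents")
        self.read_from = "v2"

    def query(self, text: str, k: int = 3) -> list[str]:
        """Embed the query with the model that built the collection being read."""
        q = self.embedders[self.read_from](text)
        items = self.collections[self.read_from].items()
        ranked = sorted(items, key=lambda kv: sum(a * b for a, b in zip(q, kv[1])), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]


corpus = {"d1": "short", "d2": "a much longer document"}
index = DualWriteIndex(lambda t: [float(len(t)), 1.0], lambda t: [1.0, float(len(t)), 0.5])
for doc_id, text in corpus.items():
    index.collections["v1"][doc_id] = index.embedders["v1"](text)  # pre-migration state
index.upsert("d3", "written during migration")
index.backfill(corpus)
index.cut_over()
```

The detail worth calling out in an interview is in `query`: the query must be embedded with the same model as the collection it searches, so the read path and the embedder switch together at cut-over.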

Sharding: at scale, partition by tenant (natural for multi-tenant SaaS), by date, or by random hash. Cross-shard queries are expensive because they fan out to every shard and the results have to be merged. Most managed services handle this transparently, but it still affects cost models.
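A small sketch of tenant-based routing, assuming a fixed shard count. `shard_for` and `shards_to_query` are hypothetical helper names. It uses `hashlib` rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default and would route the same tenant differently after a restart:

```python
import hashlib
from typing import Optional


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_to_query(tenant_id: Optional[str], num_shards: int) -> list[int]:
    """A tenant-scoped query touches one shard; an unscoped query fans out to all."""
    if tenant_id is None:
        return list(range(num_shards))
    return [shard_for(tenant_id, num_shards)]


scoped = shards_to_query("tenant-a", 8)  # one shard
fanout = shards_to_query(None, 8)        # all eight shards
```

Plain modulo routing reshuffles most keys when the shard count changes, which is one reason production systems often use consistent hashing or a lookup table instead.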

Replication: for read scaling and HA. Most managed services do this for you. Self-hosted: you're on the hook.

Cost model: vector DBs typically charge by stored vectors (dimensionality × count), query volume, and sometimes by index type (HNSW costs more than flat because the graph lives in RAM alongside the vectors). For Pinecone serverless, you pay per read/write unit. For Vertex AI Vector Search, you pay for the deployed endpoint (machine type per hour) plus storage. As a rough rule of thumb, cost decomposition for a RAG system looks like: 60-70% LLM inference, 10-20% embedding generation, 10-20% vector DB, and the rest is plumbing. Vector DB cost dominates only at extreme scale or with cheap LLMs.
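To make "dimensionality × count" concrete, here is a back-of-the-envelope estimate, assuming float32 values (4 bytes each). The HNSW figure is a rough approximation that assumes about 2×M neighbor IDs of 4 bytes per vector on the base layer; actual overhead varies by implementation. The corpus size and M = 16 are illustrative values, not numbers from any pricing page:

```python
def raw_vector_bytes(count: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw vectors (float32 by default)."""
    return count * dim * bytes_per_value


def hnsw_link_bytes(count: int, m: int = 16, bytes_per_link: int = 4) -> int:
    """Rough graph overhead: ~2*M neighbor IDs per vector (an approximation)."""
    return count * 2 * m * bytes_per_link


count, dim = 10_000_000, 768  # illustrative corpus: 10M vectors, 768 dimensions
flat = raw_vector_bytes(count, dim)
hnsw = flat + hnsw_link_bytes(count)
print(f"flat: {flat / 1e9:.2f} GB, hnsw: {hnsw / 1e9:.2f} GB")
# prints "flat: 30.72 GB, hnsw: 32.00 GB"
```

The raw vectors dominate the footprint, so halving dimensionality or quantizing usually saves more than tuning graph parameters.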

Failure modes you should know cold:

Index out of sync with source data (deletes in source not propagated)
Stale embeddings (model upgraded but old vectors not re-embedded)
Filter-induced empty results (post-filtering discards every top-k hit when the filter is selective)
Cold-start latency on serverless DBs (first query after idle is slow)
Memory pressure causing OOM on HNSW indexes
Quantization loss on PQ-based indexes causing recall drops on rare queries
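The filter-induced failure in the list above is easy to reproduce. This sketch assumes a brute-force dot-product search over a handful of toy records; `top_k`, `post_filter_search`, and `pre_filter_search` are hypothetical names, and a real vector DB would apply the filter inside its index rather than in Python:

```python
from typing import Callable


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], items: list[dict], k: int) -> list[dict]:
    """Exact nearest neighbors by dot product (brute force)."""
    return sorted(items, key=lambda it: dot(query, it["vector"]), reverse=True)[:k]


def post_filter_search(query: list[float], items: list[dict], k: int,
                       keep: Callable[[dict], bool]) -> list[dict]:
    """Search first, then filter: may return fewer than k results, or none."""
    return [it for it in top_k(query, items, k) if keep(it)]


def pre_filter_search(query: list[float], items: list[dict], k: int,
                      keep: Callable[[dict], bool]) -> list[dict]:
    """Filter first, then search within the surviving candidates."""
    return top_k(query, [it for it in items if keep(it)], k)


# Five close matches belong to tenant "a"; tenant "b" has one weaker match.
items = [{"id": f"a{i}", "tenant": "a", "vector": [1.0, 0.0]} for i in range(5)]
items.append({"id": "b0", "tenant": "b", "vector": [0.1, 0.9]})


def is_tenant_b(item: dict) -> bool:
    return item["tenant"] == "b"


post = post_filter_search([1.0, 0.0], items, 3, is_tenant_b)  # empty list
pre = pre_filter_search([1.0, 0.0], items, 3, is_tenant_b)    # finds b0
```

Post-filtering returns nothing because tenant "a" fills all three top slots before the filter runs, while pre-filtering still finds the tenant "b" record.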

Observability: log query latency (p50/p95/p99), error rates, index size growth, and recall against an offline golden set (sample 100 queries, compute recall@10 weekly). Vertex AI Vector Search exposes operational metrics in Cloud Monitoring, and Pinecone has built-in metrics, but recall against a golden set is something you compute yourself. Recall is the metric most teams forget to track; they only notice when users complain.
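A minimal sketch of the two measurements, assuming the golden set maps each query to the set of IDs it should return (at most k per query) and that latencies are collected as raw millisecond samples. The function names and toy data are hypothetical:

```python
import statistics


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the golden IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("golden set entry has no relevant IDs")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(results: dict[str, list[str]], golden: dict[str, set[str]], k: int = 10) -> float:
    """Average recall@k over every query in the golden set."""
    return statistics.mean(recall_at_k(results.get(q, []), ids, k) for q, ids in golden.items())


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from raw samples (needs at least two samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


golden = {"q1": {"a", "b"}, "q2": {"c"}}
results = {"q1": ["a", "x", "b"], "q2": ["y", "z"]}
weekly_recall = mean_recall(results, golden, k=10)  # q1 scores 1.0, q2 scores 0.0
percentiles = latency_percentiles([float(ms) for ms in range(1, 101)])
```

Tracking `weekly_recall` over time is what catches silent regressions such as a quantization change or a stale index, which latency and error-rate dashboards will not show.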