The end-to-end Vertex AI architecture
┌─────────────────────────────────────────────────────────────────┐
│ DATA INGESTION │
│ ────────────────── │
│ • Cloud Storage (GCS) — source documents (PDFs, docs, HTML) │
│ • Document AI — parse PDFs, extract tables, OCR scans │
│ • Pub/Sub — async ingestion triggers │
│ • Cloud Run — orchestration jobs │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ EMBEDDING & INDEXING │
│ ─────────────────────── │
│ • Chunking — done in your Cloud Run job (LangChain/LlamaIndex) │
│ • Embedding — Vertex AI text-embedding endpoint │
│ POST /v1/projects/.../publishers/google/models/ │
│ gemini-embedding-001:predict │
│ • Vector storage — Vertex AI Vector Search │
│ (managed ScaNN index, billions of vectors) │
│ Alternatives: BigQuery vector search, AlloyDB vector, │
│ third-party (Pinecone, Weaviate, Chroma) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVAL │
│ ────────── │
│ • Query embedding — same endpoint as above │
│ • Vector search — Vertex AI Vector Search nearest neighbors │
│ • Optional: hybrid search (BM25 keyword + vector) │
│ • Optional: reranker (Vertex AI Ranking API or cross-encoder) │
│ • Metadata filters — applied at query time │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ GENERATION │
│ ────────── │
│ • Vertex AI Generative AI Studio — for prototyping │
│ • Vertex AI API — for production │
│ POST /v1/projects/.../publishers/google/models/ │
│ gemini-2.5-pro:generateContent │
│ • Configurable: temperature, top-p, top-k, max_tokens, │
│ safety filters, system instructions, tool definitions, │
│ response schema (for structured output) │
│ • Streaming: streamGenerateContent for token-by-token output │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ ORCHESTRATION │
│ ────────────── │
│ • Cloud Run — stateless API hosting your RAG/agent service │
│ • GKE — for more complex deployments │
│ • Agent Development Kit (ADK) — Google's agent framework │
│ • LangGraph / LlamaIndex — third-party orchestration │
│ • Model Context Protocol (MCP) — tool/data integration │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ EVALUATION & OBSERVABILITY │
│ ────────────────────────── │
│ • Vertex AI Evaluation — golden set runs, faithfulness eval │
│ • Cloud Logging — request/response logs │
│ • Cloud Trace — distributed tracing across retrieval/gen │
│ • Cloud Monitoring — latency, error rate, cost dashboards │
│ • BigQuery — long-term log analytics │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SECURITY & GOVERNANCE │
│ ────────────────────── │
│ • VPC Service Controls — network perimeter for data exfil │
│ • CMEK — customer-managed encryption keys │
│ • Sensitive Data Protection (DLP) — PII detection/redaction │
│ • IAM — fine-grained access to models and data │
│ • Model Armor — prompt injection / output filtering │
└─────────────────────────────────────────────────────────────────┘