Implementing LLM on Vertex AI

The end-to-end Vertex AI architecture
┌─────────────────────────────────────────────────────────────────┐
│ DATA INGESTION                                                  │
│ ──────────────────                                              │
│ • Cloud Storage (GCS) — source documents (PDFs, docs, HTML)     │
│ • Document AI — parse PDFs, extract tables, OCR scans           │
│ • Pub/Sub — async ingestion triggers                            │
│ • Cloud Run — orchestration jobs                                │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ EMBEDDING & INDEXING                                            │
│ ───────────────────────                                         │
│ • Chunking — done in your Cloud Run job (LangChain/LlamaIndex)  │
│ • Embedding — Vertex AI text-embedding endpoint                 │
│     POST /v1/projects/.../publishers/google/models/             │
│          gemini-embedding-001:predict                           │
│ • Vector storage — Vertex AI Vector Search                      │
│     (managed ScaNN index, billions of vectors)                  │
│   Alternatives: BigQuery vector search, AlloyDB vector,         │
│   third-party (Pinecone, Weaviate, Chroma)                      │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVAL                                                       │
│ ──────────                                                      │
│ • Query embedding — same endpoint as above                      │
│ • Vector search — Vertex AI Vector Search nearest neighbors     │
│ • Optional: hybrid search (BM25 keyword + vector)               │
│ • Optional: reranker (Vertex AI Ranking API or cross-encoder)   │
│ • Metadata filters — applied at query time                      │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ GENERATION                                                      │
│ ──────────                                                      │
│ • Vertex AI Generative AI Studio — for prototyping              │
│ • Vertex AI API — for production                                │
│     POST /v1/projects/.../publishers/google/models/             │
│          gemini-2.5-pro:generateContent                         │
│ • Configurable: temperature, top-p, top-k, max_tokens,          │
│   safety filters, system instructions, tool definitions,        │
│   response schema (for structured output)                       │
│ • Streaming: streamGenerateContent for token-by-token output    │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ ORCHESTRATION                                                   │
│ ──────────────                                                  │
│ • Cloud Run — stateless API hosting your RAG/agent service      │
│ • GKE — for more complex deployments                            │
│ • Agent Development Kit (ADK) — Google's agent framework        │
│ • LangGraph / LlamaIndex — third-party orchestration            │
│ • Model Context Protocol (MCP) — tool/data integration          │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ EVALUATION & OBSERVABILITY                                      │
│ ──────────────────────────                                      │
│ • Vertex AI Evaluation — golden set runs, faithfulness eval     │
│ • Cloud Logging — request/response logs                         │
│ • Cloud Trace — distributed tracing across retrieval/gen        │
│ • Cloud Monitoring — latency, error rate, cost dashboards       │
│ • BigQuery — long-term log analytics                            │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ SECURITY & GOVERNANCE                                           │
│ ──────────────────────                                          │
│ • VPC Service Controls — network perimeter for data exfil       │
│ • CMEK — customer-managed encryption keys                       │
│ • Sensitive Data Protection (DLP) — PII detection/redaction     │
│ • IAM — fine-grained access to models and data                  │
│ • Model Armor — prompt injection / output filtering             │
└─────────────────────────────────────────────────────────────────┘
A complete request flow on Vertex AI