Stage 1 — Parse

Get raw text out of whatever format the customer has: PDFs, Word docs, HTML, Confluence, Slack exports, scanned images, tables.

GCP service to name-drop: Document AI (especially the Layout Parser for complex documents, Form Parser for structured forms). Alternatives: unstructured.io, LlamaParse, AWS Textract.

What you extract: text, but also structure — headings, tables, page numbers, section hierarchy. You want this preserved because it informs chunking and citations.
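To make "text plus structure" concrete, here is a minimal sketch that parses HTML with Python's standard library and tags each paragraph with the heading path it sits under. The `Block` and `StructureParser` names are hypothetical, made up for this illustration. A managed parser such as Document AI returns its own richer layout objects, and for PDFs you would carry a page number alongside the section path in the same way.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional, Tuple


@dataclass
class Block:
    """One unit of extracted text plus the structure needed later."""
    text: str
    section_path: Tuple[str, ...]  # e.g. ("Guide", "Install")


class StructureParser(HTMLParser):
    """Collects <p> text and records the heading hierarchy above it."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Block] = []
        self._path: List[str] = []        # current heading stack
        self._tag: Optional[str] = None   # heading or <p> we are inside
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <b> do not close the block
        text = " ".join("".join(self._buf).split())
        if tag in self.HEADINGS:
            level = self.HEADINGS[tag]
            # Drop headings at the same or deeper level, then push this one.
            del self._path[level - 1:]
            self._path.append(text)
        elif text:
            self.blocks.append(Block(text, tuple(self._path)))
        self._tag = None


def parse_html(html: str) -> List[Block]:
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    sample = (
        "<h1>Guide</h1><p>Intro.</p>"
        "<h2>Install</h2><p>Run the <b>installer</b>.</p>"
        "<h2>Usage</h2><p>Call it.</p>"
    )
    for block in parse_html(sample):
        print(" > ".join(block.section_path), "|", block.text)
```

The section path is what later lets a chunk be cited as "Guide > Install" instead of as an anonymous span of text.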

Stage 2 — Chunk

Split documents into pieces small enough to be useful retrieval units, large enough to carry meaning.

A typical chunk is 200–800 tokens, with ~10–20% overlap between adjacent chunks so that a sentence cut at a boundary still appears whole in one of them.
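A fixed-size sliding window is the simplest way to get those numbers. The sketch below is illustrative only: `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer, so the counts will not match a model's token counts.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=60):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace "tokens" as a stand-in for a real tokenizer.
    text = "word " * 1000
    chunks = chunk_tokens(text.split(), chunk_size=400, overlap=60)
    print(len(chunks), [len(c) for c in chunks])
```

With the defaults, 60 of 400 tokens is a 15% overlap, in the middle of the range above. Structure-aware splitting (on headings or paragraphs) usually replaces the fixed window in practice, but the overlap idea carries over.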

3. Chunking strategies

Stage 3 — Embed

Run each chunk through an embedding model. Output: a vector (e.g. 768 or 3072 floats) that represents the chunk's meaning in a high-dimensional space. Semantically similar chunks land near each other.
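"Near each other" is usually measured with cosine similarity. The sketch below uses hand-written 4-dimensional toy vectors, not output from any real embedding model, purely to show the comparison that a vector search performs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: made-up toy vectors standing in for real chunk embeddings.
refund_policy = [0.9, 0.1, 0.0, 0.2]
return_window = [0.8, 0.2, 0.1, 0.3]
gpu_drivers = [0.0, 0.1, 0.9, 0.1]

if __name__ == "__main__":
    print("refund vs return:", cosine_similarity(refund_policy, return_window))
    print("refund vs gpu:   ", cosine_similarity(refund_policy, gpu_drivers))
```

The two chunks about returns score close to 1, while the unrelated chunk scores close to 0. Retrieval amounts to ranking stored chunks by this score against the query's vector.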

Models you should name:

Two embedding details that come up in interviews:

  1. Dimensionality vs cost trade-off. Higher dimensionality usually improves recall, but it costs more storage and slows search. Models trained with Matryoshka representation learning let you truncate vectors to fewer dimensions, trading some quality for lower cost.
  2. Embedding model lock-in. If you switch embedding models later, you must re-embed your entire corpus. Pick deliberately.
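The cost side of the first point is simple arithmetic, sketched below. It assumes float32 storage (4 bytes per value) and counts raw vectors only, ignoring index overhead. Both helper names are hypothetical. Truncation preserves quality only for models trained to support it (Matryoshka-style); for other models the leading dimensions carry no special meaning.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as float32


def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage only; a real index adds its own overhead."""
    return num_vectors * dim * bytes_per_value


def truncate_and_renormalize(vector, dim):
    """Keep the first `dim` values and rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


if __name__ == "__main__":
    for dim in (768, 3072):
        gb = index_size_bytes(1_000_000, dim) / 1e9
        print(f"{dim} dims, 1M chunks: {gb:.2f} GB")
    print(truncate_and_renormalize([3.0, 4.0, 12.0], 2))
```

For one million chunks, 3072 dimensions needs about 12.3 GB of raw vector storage against about 3.1 GB at 768, a 4x difference before any index overhead. Truncated vectors and full-length vectors are not comparable with each other, so truncation has to be applied consistently to both the corpus and the queries.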

Embedding Model

Stage 4 — Store