Get raw text out of whatever format the customer has. PDFs, Word docs, HTML, Confluence, Slack exports, scanned images, tables.
GCP service to name-drop: Document AI (especially the Layout Parser for complex documents, Form Parser for structured forms). Alternatives: unstructured.io, LlamaParse, AWS Textract.
What you extract: text, but also structure — headings, tables, page numbers, section hierarchy. You want this preserved because it informs chunking and citations.
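To make "preserve the structure" concrete, here is a minimal sketch of what one extracted unit could look like once it leaves the parser. The `ExtractedBlock` class and its field names are hypothetical, made up for illustration; they are not the output schema of Document AI or any other parser mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedBlock:
    """One unit of parsed content plus the structure needed downstream (hypothetical schema)."""
    text: str
    kind: str   # e.g. "paragraph", "table", "heading"
    page: int   # 1-based page number in the source file
    section_path: list[str] = field(default_factory=list)  # heading hierarchy, outermost first

    def citation(self, source: str) -> str:
        """Build a human-readable citation from the preserved structure."""
        section = " > ".join(self.section_path) or "(no section)"
        return f"{source}, p. {self.page}, {section}"


# Invented sample content, only to show the shape of the record.
block = ExtractedBlock(
    text="Refunds are processed within 14 days.",
    kind="paragraph",
    page=7,
    section_path=["Billing", "Refunds"],
)
print(block.citation("policy.pdf"))
```

Keeping the page number and heading path next to the text is what later lets a chunker split on section boundaries and lets the answer cite a specific page.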
Split documents into pieces small enough to be useful retrieval units, large enough to carry meaning.
A typical chunk is 200–800 tokens, with ~10–20% overlap between neighboring chunks so that a sentence cut at a boundary still appears whole in one of them.
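A fixed-size sliding window is the simplest way to picture chunking with overlap. The sketch below assumes the document is already tokenized into a list; the function name `chunk_tokens` and the defaults (400 tokens, 60 overlap, i.e. 15%) are illustrative choices, and a real pipeline would use the embedding model's own tokenizer and usually respect heading or paragraph boundaries.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=60):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in "tokens": integers make the overlap easy to inspect.
tokens = list(range(1000))
chunks = chunk_tokens(tokens, chunk_size=400, overlap=60)
print([(c[0], c[-1], len(c)) for c in chunks])
```

With these numbers the window advances 340 tokens at a time, so the last 60 tokens of each chunk are repeated at the start of the next one.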
Run each chunk through an embedding model. Output: a vector (e.g. 768 or 3072 floats) that represents the chunk's meaning in a high-dimensional space. Semantically similar chunks land near each other.
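"Land near each other" is usually measured with cosine similarity. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real embeddings (which would have 768 or 3072 dimensions and come from an embedding model); the vector values and variable names are invented for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": two chunks about refunds, one about GPU drivers.
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
gpu_drivers = [0.0, 0.2, 0.9]

related = cosine_similarity(refund_policy, money_back)
unrelated = cosine_similarity(refund_policy, gpu_drivers)
print(f"related={related:.3f} unrelated={unrelated:.3f}")
```

The two refund vectors point in nearly the same direction and score close to 1, while the unrelated pair scores close to 0; retrieval ranks chunks by this kind of score against the query's embedding.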
Models you should name:
text-embedding-004 / gemini-embedding-001: Google's current production embedding models
text-embedding-3-large / small: common in non-GCP shops
Two embedding details that come up in interviews: