Patterns that work

. OpenAI: response_format={"type":"json_schema", ...}. Under the hood these use constrained decoding (the sampler is masked to only allow tokens that keep the output valid against the grammar). Result: zero markdown contamination, deterministic parsing, no try/except json.JSONDecodeError retry loops. Always use this when the downstream is code, not a human."> . OpenAI: response_format={"type":"json_schema", ...}. Under the hood these use constrained decoding (the sampler is masked to only allow tokens that keep the output valid against the grammar). Result: zero markdown contamination, deterministic parsing, no try/except json.JSONDecodeError retry loops. Always use this when the downstream is code, not a human."> . OpenAI: response_format={"type":"json_schema", ...}. Under the hood these use constrained decoding (the sampler is masked to only allow tokens that keep the output valid against the grammar). Result: zero markdown contamination, deterministic parsing, no try/except json.JSONDecodeError retry loops. Always use this when the downstream is code, not a human.">

Strict output schema (structured outputs). Don't ask for JSON in prose — use the model's structured output mode. Gemini: response_mime_type="application/json" plus response_schema=<Pydantic model or JSON Schema>. OpenAI: response_format={"type":"json_schema", ...}. Under the hood these use constrained decoding (the sampler is masked to only allow tokens that keep the output valid against the grammar). Result: zero markdown contamination, deterministic parsing, no try/except json.JSONDecodeError retry loops. Always use this when the downstream is code, not a human.

Chain-of-verification (CoVe). Pattern: (1) generate initial draft answer, (2) ask the model to produce verification questions about its own claims, (3) answer each verification question independently (separate call, no draft in context — this is the key), (4) revise the draft using verification answers. Reduces hallucination meaningfully on factual tasks. The "independently" step matters because if the original draft is in context during verification, the model is biased to confirm it. Worth its cost on high-stakes outputs (legal Q&A, medical summaries), overkill for low-stakes (casual chat).

Constitutional / self-critique. Give the model an explicit set of principles, then a critique pass: "Review your draft against these rules: [1] no claims outside the provided context, [2] no advice that requires a licensed professional, [3] escalate if the user mentions self-harm. List violations, then rewrite." Often catches the things a static system prompt fails to enforce. Costs an extra call; you pick where it's worth it.

Prompt compression. Long context is expensive ($/token), slow (TTFT scales with input), and quality-degrading (lost-in-the-middle). Three approaches stack:

Retrieval compression — retrieve fewer, better chunks. Rerank top-50 down to top-5.
Summarization — for conversation history, maintain a rolling summary instead of full transcript. For agent observation logs, compress old steps into a digest.
Token-level compression — tools like LLMLingua learn to drop low-information tokens. Niche, but real when you're hitting context limits at scale.

The FDE framing: prompt compression is a cost and latency lever, not a correctness lever. Reach for it when SLOs slip, not before.