Hit@k / Recall@k

"Of all the queries in my eval set, what fraction had the correct chunk in the top-k retrieved?"

If Hit@5 is 60%, your retrieval is failing on 40% of queries — before the LLM even gets a chance. Hit@5 is the most quoted RAG metric for a reason.

MRR — Mean Reciprocal Rank

"On average, at what rank did the correct chunk appear? Lower rank is better."

MRR captures "we found it, but it was buried." Hit@5 says yes/no; MRR says where. Critical for tuning rerankers.

NDCG — Normalized Discounted Cumulative Gain

"Like MRR but for queries that have multiple correct chunks, weighting earlier positions more."

You'll mention this in interviews; you'll rarely tune to it directly unless you have graded relevance labels.

Context Precision

"Of the top-k chunks I retrieved, what fraction were actually relevant?"

If you retrieve k=10 and only 2 were useful, your precision is 20% — you're paying for 8 chunks of context bloat in every LLM call. This is the cost-quality lever.

Context Recall

"Of all the chunks that should have been retrieved to answer this query, what fraction did I get?"

For multi-hop or comprehensive queries where the answer spans multiple chunks.

Generation metrics — know these too

Faithfulness / Groundedness

"Are the claims in the answer supported by the retrieved chunks?"

This is the anti-hallucination metric. Implemented as: extract atomic claims from the answer; for each claim, ask a judge model whether it's supported by the context. Score = supported / total.

Answer Relevancy