Hit@k / Recall@k
"Of all the queries in my eval set, what fraction had the correct chunk in the top-k retrieved?"
If Hit@5 is 60%, your retrieval is failing on 40% of queries — before the LLM even gets a chance. Hit@5 is the most quoted RAG metric for a reason.
MRR — Mean Reciprocal Rank
"On average, at what rank did the correct chunk appear? Lower rank is better."
MRR captures "we found it, but it was buried." Hit@5 says yes/no; MRR says where. Critical for tuning rerankers.
NDCG — Normalized Discounted Cumulative Gain
"Like MRR but for queries that have multiple correct chunks, weighting earlier positions more."
You'll mention this in interviews; you'll rarely tune to it directly unless you have graded relevance labels.
Context Precision
"Of the top-k chunks I retrieved, what fraction were actually relevant?"
If you retrieve k=10 and only 2 were useful, your precision is 20% — you're paying for 8 chunks of context bloat in every LLM call. This is the cost-quality lever.
Context Recall
"Of all the chunks that should have been retrieved to answer this query, what fraction did I get?"
For multi-hop or comprehensive queries where the answer spans multiple chunks.
Faithfulness / Groundedness
"Are the claims in the answer supported by the retrieved chunks?"
This is the anti-hallucination metric. Implemented as: extract atomic claims from the answer; for each claim, ask a judge model whether it's supported by the context. Score = supported / total.
Answer Relevancy