LLM Application Evaluation Language
5 exercises — Use the precise vocabulary for faithfulness, context recall, hallucination, LLM-as-judge, and eval dataset curation.
Quick reference: LLM Evaluation Metrics
- faithfulness — every claim in the answer is supported by retrieved context (no hallucination)
- answer relevance — the answer addresses the user's actual question
- context recall — fraction of all relevant corpus chunks that were retrieved
- LLM-as-judge — using a language model to score another model's outputs at scale
- groundedness score — proportion of response claims traceable to real source evidence
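Two of the definitions above are plain ratios, so a small worked example may help fix the vocabulary. The sketch below is illustrative only: the function names, chunk IDs, and per-claim verdicts are hypothetical, and in a real pipeline the verdicts would come from human labels or an LLM judge rather than a hard-coded list.

```python
def context_recall(relevant_ids: set[str], retrieved_ids: set[str]) -> float:
    """Fraction of the relevant chunks that the retriever actually returned."""
    if not relevant_ids:
        return 0.0  # assumption: treat "nothing relevant exists" as 0.0
    return len(relevant_ids & retrieved_ids) / len(relevant_ids)


def groundedness(claim_supported: list[bool]) -> float:
    """Proportion of response claims that trace back to source evidence."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no claims scores 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical data: four chunks are relevant, the retriever found two of them
# plus one irrelevant chunk.
relevant = {"c1", "c2", "c3", "c4"}
retrieved = {"c2", "c4", "c9"}
recall = context_recall(relevant, retrieved)

# Hypothetical verdicts: two of three claims in the answer are supported.
score = groundedness([True, True, False])

print(f"context recall: {recall:.2f}")
print(f"groundedness:   {score:.2f}")
```

Note that retrieving the irrelevant chunk `c9` does not lower recall; that kind of error is a precision problem, which is a separate measurement.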
1 / 5
A colleague explains their RAG evaluation setup: "We run two RAGAS metrics — faithfulness and answer relevance. They keep getting confused in our team docs." What is the precise distinction between these two metrics?
Faithfulness and answer relevance measure orthogonal failure modes.
Faithfulness asks: "Is every claim in the generated answer supported by the retrieved context?" A faithful answer contains no invented facts — it sticks to what the retriever found. A low faithfulness score means the model is hallucinating content not present in the context chunks.
Answer relevance asks: "Does the answer actually address the user's question?" A perfectly faithful answer (grounded in context) can still be irrelevant: if the retriever returned off-topic chunks, the model may report their content accurately while never answering the real question.
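In practice, a production RAG system needs both high faithfulness (no hallucination) and high answer relevance (on-topic responses). RAGAS computes both with the help of an LLM: faithfulness by checking each claim in the answer against the retrieved context, and answer relevance by comparing the original question with questions the LLM generates from the answer.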
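To see why the two metrics are independent, it can help to cross them in a small table of outcomes. The following is a minimal sketch, not the RAGAS API: `JudgedAnswer`, `diagnose`, and the 0.8 threshold are hypothetical, and the verdicts are hand-written stand-ins for what an LLM judge would return.

```python
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    # Per-claim verdicts: is each claim supported by the retrieved context?
    claims_supported: list[bool]
    # Single verdict: does the answer address the user's actual question?
    addresses_question: bool


def faithfulness(answer: JudgedAnswer) -> float:
    """Share of claims in the answer that the retrieved context supports."""
    if not answer.claims_supported:
        return 0.0  # assumption: no claims means nothing to credit
    return sum(answer.claims_supported) / len(answer.claims_supported)


def diagnose(answer: JudgedAnswer, threshold: float = 0.8) -> str:
    """Name the failure mode; the threshold is an arbitrary illustrative cutoff."""
    faithful = faithfulness(answer) >= threshold
    if faithful and answer.addresses_question:
        return "ok"
    if faithful:
        return "faithful but irrelevant"
    if answer.addresses_question:
        return "relevant but hallucinated"
    return "hallucinated and irrelevant"


# Off-topic retrieval: every claim is grounded, but the question goes unanswered.
off_topic = JudgedAnswer(claims_supported=[True, True, True], addresses_question=False)

# On-topic answer that invents two of its three claims.
invented = JudgedAnswer(claims_supported=[True, False, False], addresses_question=True)

print(diagnose(off_topic))
print(diagnose(invented))
```

Each answer fails exactly one metric while passing the other, which is why a team needs separate names and separate scores for the two.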
Key vocabulary:
• faithfulness — the degree to which every claim in the answer is supported by the retrieved context
• answer relevance — the degree to which the answer addresses the user's actual question
• RAGAS — an open-source framework for evaluating RAG pipelines with LLM-assisted metrics
• hallucination — generating content not supported by the provided context or factual evidence