Advanced LLM App Development #llm-evaluation #faithfulness #hallucination #ragas

LLM Application Evaluation Language

5 exercises — Use the precise vocabulary for faithfulness, context recall, hallucination, LLM-as-judge, and eval dataset curation.

1 / 5

A colleague explains their RAG evaluation setup: "We run two RAGAS metrics — faithfulness and answer relevance. They keep getting confused in our team docs." What is the precise distinction between these two metrics?

2 / 5

In a post-mortem, an engineer reports: "Context recall was 0.6 — we had the answer in the corpus but only retrieved 60% of the relevant chunks. Context precision was 0.9 — almost everything we retrieved was relevant." Which retrieval problem does the low context recall indicate?
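
The arithmetic behind the two numbers in the quote can be sketched as follows. This is a set-based simplification that follows the engineer's wording (chunks retrieved versus chunks relevant); it is not the RAGAS implementation, and the chunk IDs and function names are hypothetical.

```python
def context_recall(retrieved: set, relevant: set) -> float:
    """Share of the relevant chunks that the retriever actually returned."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def context_precision(retrieved: set, relevant: set) -> float:
    """Share of the returned chunks that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical example: 15 relevant chunks exist in the corpus (r0..r14).
relevant = {f"r{i}" for i in range(15)}
# The retriever returns 10 chunks: 9 relevant ones plus 1 irrelevant one.
retrieved = {f"r{i}" for i in range(9)} | {"x0"}

recall = context_recall(retrieved, relevant)        # 9 / 15
precision = context_precision(retrieved, relevant)  # 9 / 10
print(f"recall={recall:.2f} precision={precision:.2f}")
```

With these made-up sets the sketch reproduces the figures from the post-mortem: recall is computed against everything that should have been found, precision against everything that was returned.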

3 / 5

A team lead writes in the incident report: "The model hallucinated a citation: it fabricated a paper, 'Smith et al. 2019', that does not exist. The groundedness score for this response was 0.2." What does the groundedness score measure?

4 / 5

An engineer proposes scaling evaluation: "Manual review is a bottleneck — we have 10,000 test cases. Let's use LLM-as-judge: we pass the question, context, and answer to GPT-4o and ask it to score faithfulness." What is the primary advantage of LLM-as-judge over human evaluation?
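
The setup the engineer describes (question, context, and answer passed to a judge model that returns a faithfulness score) might look roughly like the sketch below. The prompt wording, the function names, and the `call_model` callable are all assumptions made for illustration; `call_model` stands in for whatever client actually sends the prompt to the judge model, and a stub is used here so the example runs without any API.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the wording is an assumption, not a standard.
JUDGE_TEMPLATE = (
    "You are grading an answer for faithfulness to the given context.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    "Reply with one number between 0 and 1, where 1 means every claim "
    "in the answer is supported by the context."
)


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Fill the template with one test case."""
    return JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None  # the judge did not return a usable score
    return min(1.0, max(0.0, float(match.group())))


def judge_faithfulness(
    question: str, context: str, answer: str, call_model: Callable[[str], str]
) -> Optional[float]:
    """Score one test case using the supplied judge model callable."""
    return parse_score(call_model(build_judge_prompt(question, context, answer)))


def stub_model(prompt: str) -> str:
    """Stand-in for a real judge model; always returns the same reply."""
    return "Score: 0.2"


score = judge_faithfulness(
    "Who wrote the study?", "The study has no listed author.", "Smith et al. 2019", stub_model
)
print(score)
```

Keeping the model call behind a plain callable means the same loop can be run over all test cases with any judge model, and the parsing step can be unit-tested on its own.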

5 / 5

A tech lead describes their quality framework: "Our eval suite checks 5 dimensions on every release: faithfulness, answer relevance, context recall, latency, and toxicity. We curate the eval dataset from production queries monthly." Why is eval dataset curation described as a continuous process rather than a one-time task?