BLEU, ROUGE, perplexity, win rate, pass@k, hallucination rate — the metrics used to measure LLM quality.
Key vocabulary
BLEU score — measures overlap between generated and reference text using n-gram precision; common in translation.
ROUGE — recall-oriented metric comparing generated summaries to reference summaries.
Perplexity — measures how well a language model predicts a text sample; lower is better.
Pass@k — fraction of problems where at least 1 of k generated solutions is correct; used for code generation.
Hallucination rate — proportion of model outputs containing factually incorrect or fabricated information.
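Of the terms above, perplexity is the one with a compact closed form: it is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch, assuming you already have per-token natural-log probabilities from a model (the function name `perplexity` and the input list are illustrative, not taken from any particular library):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    # Exponentiate to get back to an "effective number of choices" scale.
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options.
uniform_quarter = [math.log(0.25)] * 8
print(perplexity(uniform_quarter))
```

Read this way, a perplexity of 4 means the model is on average as uncertain as if it were picking uniformly among four tokens, which is why lower is better.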
1 / 5
Your team reports a BLEU score of 42 for a translation model. What does the BLEU score measure?
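BLEU (Bilingual Evaluation Understudy) compares n-gram overlap between a generated translation and one or more reference translations. It is precision-based: it measures how many of the n-grams in the generated text also appear in the reference, and applies a brevity penalty so that very short outputs cannot score well. A score of 42 is on the 0 to 100 scale (0.42 on the underlying 0 to 1 scale). Despite its wide use, BLEU is criticized for capturing meaning and fluency poorly.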
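To make the n-gram precision idea concrete, here is a simplified sentence-level sketch with a single reference, whitespace tokenization, and no smoothing. It combines clipped n-gram precisions with a brevity penalty that discounts candidates shorter than the reference. Real evaluations normally use corpus-level BLEU from an established toolkit, and the function names here are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU on a 0-1 scale: one reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count at its count in the reference,
        # so repeating a matching word does not inflate precision.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if the candidate is at least as long as the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))
print(bleu("the the the", "the cat", max_n=1))
```

The second call shows clipping at work: only one of the three occurrences of "the" is credited, because the reference contains it once.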
2 / 5
A colleague says “our summarization model has a high ROUGE-L score.” ROUGE-L measures:
ROUGE-L is based on the Longest Common Subsequence (LCS) between the generated and reference text. Unlike ROUGE-1 and ROUGE-2, which count unigram and bigram overlaps, ROUGE-L is sensitive to word order: matched words must appear in the same order in both texts, though not necessarily contiguously. It is a recall-oriented metric, rewarding summaries that cover important content from the reference.
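The LCS idea can be sketched in a few lines. This is a minimal illustration assuming whitespace tokenization and a single reference; the function names are illustrative, and production implementations add stemming and other preprocessing.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and F1 from the LCS length."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(cand)  # share of the candidate that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is covered
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the cat sat on the mat"
print(rouge_l("the cat sat", reference))  # in-order match
print(rouge_l("mat the on", reference))   # same vocabulary, scrambled order
```

Both candidates consist entirely of words from the reference, but the scrambled one gets a shorter LCS and therefore lower recall, which is the word-order sensitivity described above.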
3 / 5
An eval report shows “pass@10 = 0.78” for a code generation model. This means:
pass@k measures whether at least one of k generated samples solves a problem, where "solves" usually means passing the problem's unit tests. pass@10 = 0.78 means that for 78% of problems, if you generate 10 solutions, at least one is expected to be functionally correct. The score can only rise as k grows, so results should be compared at the same k. A higher-k figure is most relevant when the model is used in a best-of-N sampling pipeline.
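In practice pass@k is usually estimated rather than measured by drawing exactly k samples. A widely used estimator draws n ≥ k samples per problem, counts the c that pass, and computes 1 − C(n−c, k) / C(n, k), then averages over problems. The sketch below assumes per-problem (n, c) counts are already available; the function names and the example counts are illustrative, not real results.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples with c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical counts for three problems, 20 samples each.
results = [(20, 0), (20, 2), (20, 15)]
print(mean_pass_at_k(results, k=1))
print(mean_pass_at_k(results, k=10))
```

Running both calls on the same counts shows why k matters when comparing reports: the k=10 figure is substantially higher than the k=1 figure for identical model outputs.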
4 / 5
In the context of RAG (Retrieval-Augmented Generation) evaluation, “faithfulness” refers to:
Faithfulness in RAG evaluation asks: does the generated answer contain only claims that are supported by the retrieved documents? It is distinct from answer relevance (does the answer address the question?) and context recall (were the right documents retrieved?). RAGAS is a popular framework that measures all three.
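One common way to turn this into a number is to split the answer into individual claims, check each against the retrieved context, and report the supported fraction. The sketch below assumes the claims are already extracted and uses a deliberately naive substring check as the judge. Real frameworks typically use an LLM for both steps, and every name here (`faithfulness`, `naive_judge`) is hypothetical rather than any library's API.

```python
from typing import Callable, Sequence

def faithfulness(
    claims: Sequence[str],
    contexts: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of answer claims that the judge finds supported by the contexts."""
    if not claims:
        raise ValueError("need at least one claim")
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)

def naive_judge(claim: str, contexts: Sequence[str]) -> bool:
    # Toy stand-in for an LLM or NLI judge: case-insensitive substring match.
    return any(claim.lower() in ctx.lower() for ctx in contexts)

contexts = ["Paris is the capital of France. It lies on the Seine."]
claims = [
    "Paris is the capital of France",  # stated in the context
    "Paris has 10 million residents",  # not in the context
]
print(faithfulness(claims, contexts, naive_judge))
```

Note that the second claim lowers the score whether or not it is true in the real world: faithfulness only asks whether the retrieved context supports it.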
5 / 5
Why might a team choose human evaluation over automatic metrics for evaluating a creative writing model?
Human evaluation is preferred when automatic metrics cannot capture what matters. For creative writing, qualities like engagement, originality, and appropriate tone require human judgment. The tradeoff is cost and scalability — hence the rise of LLM-as-judge as a middle ground between cheap-but-flawed automatic metrics and expensive-but-accurate human evaluation.