Evaluation Metrics Vocabulary

BLEU, ROUGE, perplexity, win rate, pass@k, hallucination rate — the metrics used to measure LLM quality.

Key vocabulary

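BLEU score: measures overlap between generated and reference text using clipped n-gram precision, combined with a brevity penalty for outputs that are too short; common in translation.
ROUGE: family of recall-oriented metrics that score a generated summary by how much of the reference summary it covers (shared n-grams for ROUGE-N, longest common subsequence for ROUGE-L).
Perplexity: measures how well a language model predicts a text sample, computed as the exponential of the average negative log-likelihood per token; lower is better.
Pass@k: fraction of problems where at least 1 of k generated solutions is correct (typically, passes the unit tests); used for code generation.
Hallucination rate: proportion of model outputs containing factually incorrect or fabricated information.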
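
To make three of the definitions above concrete, here is a minimal Python sketch. It is illustrative only and makes several assumptions: tokens are split on whitespace, `ngram_precision` computes just the clipped n-gram precision that BLEU builds on (not full BLEU, which also combines several n-gram orders and applies a brevity penalty), `perplexity` expects natural-log token probabilities, and `pass_at_k` is the direct empirical reading of the definition rather than a statistical estimator. All function names are hypothetical, not taken from any library.

```python
import math
from collections import Counter


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the building block of BLEU (not full BLEU)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


def perplexity(token_log_probs):
    """Exponential of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def pass_at_k(results_per_problem):
    """Fraction of problems with at least one correct solution among the k samples.

    results_per_problem: one list of booleans per problem, one boolean per sample.
    """
    solved = sum(1 for results in results_per_problem if any(results))
    return solved / len(results_per_problem)


# Example usage with made-up inputs
print(ngram_precision("the cat sat", "the cat sat on the mat"))
print(perplexity([math.log(0.25)] * 4))
print(pass_at_k([[False, True], [False, False]]))
```

In the last call, one of the two problems has a correct sample, so the result is 0.5. A model that assigns probability 0.25 to every token has a perplexity of about 4, which is why perplexity is often read as the effective number of equally likely choices per token.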


1 / 5

Your team reports a BLEU score of 42 for a translation model. What does the BLEU score measure?

2 / 5

A colleague says “our summarization model has a high ROUGE-L score.” ROUGE-L measures:

3 / 5

An eval report shows “pass@10 = 0.78” for a code generation model. This means:

4 / 5

In the context of RAG (Retrieval-Augmented Generation) evaluation, “faithfulness” refers to:

5 / 5

Why might a team choose human evaluation over automatic metrics for evaluating a creative writing model?