Evaluation Metrics Vocabulary

BLEU, ROUGE, perplexity, win rate, pass@k, hallucination rate — the metrics used to measure LLM quality.

Key vocabulary

  • BLEU score — measures overlap between generated and reference text using clipped n-gram precision, with a brevity penalty for outputs shorter than the reference; common in translation.
  • ROUGE — recall-oriented family of metrics that compares generated summaries to reference summaries by n-gram or longest-common-subsequence overlap.
  • Perplexity — the exponential of the average negative log-likelihood a language model assigns to each token of a text sample; lower is better.
  • Pass@k — fraction of problems where at least 1 of k generated solutions is correct; used for code generation.
  • Hallucination rate — proportion of model outputs containing factually incorrect or fabricated information.
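
The overlap and counting metrics in the list above can be illustrated with a minimal Python sketch. It is not a reference implementation: the function names (clipped_precision, rouge_n_recall, perplexity, pass_at_k) are hypothetical, tokenization is assumed to be plain whitespace splitting, and full BLEU additionally combines several n-gram orders and applies a brevity penalty. The pass@k function follows the definition as worded above; published benchmarks often use an unbiased estimator computed from a larger sample instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the building block of BLEU.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference.
    """
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: share of reference n-grams found in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


def perplexity(token_log_probs):
    """Exponential of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def pass_at_k(passed_per_problem):
    """Fraction of problems where at least one of the k samples passed.

    passed_per_problem holds one list of k booleans per problem.
    """
    solved = sum(any(results) for results in passed_per_problem)
    return solved / len(passed_per_problem)


# Illustrative inputs only; these are made-up sentences, not benchmark data.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

unigram_precision = clipped_precision(candidate, reference, n=1)  # 5 of 6 match
bigram_precision = clipped_precision(candidate, reference, n=2)   # 3 of 5 match
unigram_recall = rouge_n_recall(candidate, reference, n=1)        # 5 of 6 match

# A model that gives every token probability 0.25 has perplexity of about 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)

# Two of the four problems have at least one passing sample.
solved_fraction = pass_at_k([[False, True], [False, False],
                             [True, True], [False, False]])
```

Precision divides by the number of candidate n-grams and recall by the number of reference n-grams, which is why BLEU penalizes extra words while ROUGE penalizes missing ones.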
Your team reports a BLEU score of 42 for a translation model. What does the BLEU score measure?