AI Leaderboard & Ranking Vocabulary

This lesson covers LMSYS Chatbot Arena, Elo ratings, HELM, the Open LLM Leaderboard, contamination, and benchmark gaming.

Key vocabulary

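LMSYS Chatbot Arena — a crowdsourced leaderboard where users compare two anonymized model responses side by side and vote for the better one; model names are revealed only after the vote.
Elo rating — a score derived from pairwise win/loss results; the rating gap between two models predicts the expected win rate, so beating a higher-rated opponent raises a rating more than beating a lower-rated one.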
Contamination — when benchmark test data appears in a model’s training set, inflating its score unfairly.
Benchmark gaming — optimizing specifically for leaderboard metrics without improving real-world capability.
HELM (Holistic Evaluation of Language Models) — a benchmark suite measuring models across many scenarios and metrics simultaneously.
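
To make the Elo entry above concrete, here is a minimal sketch of the classic Elo formulas. It assumes the conventional chess-style scale factor of 400 and an illustrative K-factor of 32; the function names are hypothetical, and a real leaderboard may fit its ratings with a different procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k (assumed here to be 32) controls how far a single result moves a rating.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # A 50-point gap corresponds to roughly a 57% expected win rate.
    print(round(expected_score(1050, 1000), 3))
    # An upset win moves ratings more than a win between equals.
    print(update(1000, 1000, 1.0))
    print(update(1000, 1200, 1.0))
```

Under these assumptions a 50-point lead is a modest edge rather than a decisive one: the higher-rated model is expected to win a little more than half of its head-to-head comparisons.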

1 / 5

LMSYS Chatbot Arena rankings are based on:

2 / 5

A model climbs 50 Elo points on the leaderboard after a fine-tuning run. What does this indicate?

3 / 5

A researcher says “we suspect the model was trained on the eval set.” This concern is called:

4 / 5

HELM (Holistic Evaluation of Language Models) differs from single-task benchmarks because it:

5 / 5

A company releases a model that tops the Open LLM Leaderboard on every task but performs poorly for users in production. This gap is best described as:
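
As a closing illustration of the contamination entry in the vocabulary, one common heuristic is to look for long word n-grams that a benchmark item shares with training text. The sketch below is a simplified version under stated assumptions: whitespace tokenization, lowercase matching, and an illustrative default window of 8 words. The function names are hypothetical, and a high overlap is a signal to investigate, not proof of contamination.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training text.

    Returns 0.0 when the item is shorter than n words and so has no n-grams.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


if __name__ == "__main__":
    item = "the quick brown fox jumps over the lazy dog"
    leaked = "some preamble the quick brown fox jumps over the lazy dog and more"
    clean = "an entirely unrelated passage about leaderboard methodology and ratings"
    print(overlap_fraction(item, leaked, n=4))
    print(overlap_fraction(item, clean, n=4))
```

Exact n-gram matching misses paraphrased or translated copies of test items, so a low overlap does not rule contamination out.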