Explaining Inter-Annotator Agreement to Non-Statistical Stakeholders
How to explain inter-annotator agreement, kappa scores, and annotation quality to product managers and business stakeholders who do not have a statistics background.
The Communication Gap Around IAA
Inter-annotator agreement (IAA) is one of the most important quality signals in supervised machine learning, and one of the hardest to communicate to non-technical stakeholders. When you say “our Cohen’s kappa is 0.67,” a product manager hears a number without context. When you translate it as “our annotators agree 67% of the time,” you misstate what kappa measures and lead them to underestimate the quality, because 67% sounds low.
Neither framing is right. Kappa measures agreement beyond chance, and communicating that distinction is what this guide is about.
What IAA Actually Measures (Plain English Version)
When multiple annotators independently label the same data, they will agree on some items simply by chance — even if they are guessing randomly. Cohen’s kappa corrects for this chance agreement.
The analogy that works best with most non-technical audiences:
“If two people flip a coin to decide whether to label something POSITIVE or NEGATIVE, they will agree half the time by pure chance. Cohen’s kappa asks: how much better than coin-flipping are our annotators? A kappa of 0.7 means our annotators agree substantially — not because they are guessing the same way, but because they have genuinely learned the same task.”
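For readers who want to check the number behind the analogy, here is a minimal sketch of Cohen's kappa for two annotators, written from the standard definition (observed agreement minus chance agreement, divided by one minus chance agreement). The function name, the label values, and the ten example items are hypothetical, and returning NaN for the degenerate single-label case is a choice made for this sketch.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # labelled at random according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, so this sketch returns NaN (an assumption).
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten items; the annotators differ on the last two.
annotator_1 = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "POS", "NEG", "POS", "NEG"]
annotator_2 = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "POS", "NEG", "NEG", "POS"]

print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.6
```

Here the annotators agree on 8 of 10 items (80% raw agreement), but with a 50/50 label split half of that would be expected by chance, so kappa comes out at 0.6 rather than 0.8.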
Kappa Score Interpretation
| Kappa range | Interpretation |
|---|---|
| < 0.00 | Poor agreement |
| 0.00 – 0.20 | Slight agreement |
| 0.21 – 0.40 | Fair agreement |
| 0.41 – 0.60 | Moderate agreement |
| 0.61 – 0.80 | Substantial agreement |
| 0.81 – 1.00 | Almost perfect agreement |
These ranges (from Landis and Koch, 1977) are the most widely cited reference, although the cut-offs are conventions rather than statistical thresholds. When presenting a kappa score, always contextualise it within this scale: “Our kappa of 0.71 falls in the substantial agreement range, which is strong for a subjective task like sentiment analysis.”
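If you report kappa regularly, a small helper can attach the band name for you. The sketch below is an illustration only: the function name is hypothetical, the band names follow Landis and Koch's wording (including "poor" for values below zero and "almost perfect" for the top band), and treating each upper bound as inclusive is an assumption made to cover values that fall between the printed ranges.

```python
def landis_koch_band(kappa):
    """Map a kappa value to its Landis and Koch (1977) descriptive band."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"
    # Upper bounds are treated as inclusive, so 0.205 is reported as "fair".
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "almost perfect"


print(f"Our kappa of 0.71 falls in the {landis_koch_band(0.71)} agreement range.")
```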
Common Misconceptions and How to Correct Them
“67% agreement sounds bad”
What stakeholders think: Most decisions require higher than 67% consensus.
What to say: “The raw agreement percentage is not the right number to focus on. Our annotators agree 87% of the time, but some of that is just chance, because many items are obviously one label. The kappa of 0.71 tells us how much of that agreement is genuine, over and above what chance would produce. A kappa around 0.7 is generally considered strong for subjective tasks.”
“Can’t we just pick the most popular label?”
What stakeholders think: Majority vote eliminates disagreement.
What to say: “Majority vote gives us a label, but it does not tell us whether the task is well-defined. If annotators disagree on 30% of items, we have a problem with our guidelines or our task design, and taking a majority vote just hides the problem without solving it. Low kappa is a signal to fix the task definition, not to take a vote.”
“Our model needs 95%+ accuracy, so 70% annotator agreement is too low”
What stakeholders think: Model accuracy requires higher human agreement.
What to say: “The model learns from the consensus labels, not from each individual annotator. Kappa between individual annotators reflects task difficulty and guideline quality. A kappa of 0.7 on a subjective task typically produces high-quality training data when consensus labels are used. The 95% accuracy target for the model is a separate question, one about how the model performs on the evaluation dataset.”
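The points above about majority vote and consensus labels can be made concrete by reporting, next to each consensus label, the share of annotators who backed it. The sketch below assumes three annotators per item; the function name, item IDs, and votes are hypothetical.

```python
from collections import Counter


def majority_with_support(votes):
    """Return (winning label, share of annotators who chose it) for one item."""
    counts = Counter(votes)
    # Ties are broken in favour of the label that appears first in `votes`.
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)


# Hypothetical votes from three annotators per item.
items = {
    "item-1": ["POS", "POS", "POS"],
    "item-2": ["POS", "NEG", "POS"],
    "item-3": ["NEG", "POS", "NEG"],
}

for item_id, votes in items.items():
    label, support = majority_with_support(votes)
    print(item_id, label, round(support, 2))
```

Every item gets a consensus label, but only the first has unanimous support. A dataset where most items look like the second and third has usable labels and an unresolved guideline problem at the same time.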
What to Say When IAA Is Low
When you report a kappa below 0.6, stakeholders may conclude that the data is unusable or that the annotators are incompetent. Neither is necessarily true.
Constructive framing:
“Our kappa of 0.48 indicates that annotators are applying the guidelines inconsistently on a significant subset of items. This is a signal to improve our guidelines rather than our annotators. We have identified three label categories where most disagreements occur, and we are running a targeted guideline revision and recalibration this week.”
Key points to convey:
- Low IAA is usually a task design problem, not an annotator failure
- It is diagnosable: you can identify which categories are problematic
- It is fixable: guideline revision plus recalibration typically increases kappa substantially
- Low IAA discovered before training is valuable; it would be much worse to discover it through model failure
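The "diagnosable" point can be backed with a simple count of which label pairs the annotators confuse most often. This is a minimal sketch for two annotators; the function name and the example labels (including the NEU category) are hypothetical.

```python
from collections import Counter


def disagreement_pairs(labels_a, labels_b):
    """Count how often each unordered pair of labels was confused."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            # Sort so that (NEU, POS) and (POS, NEU) count as the same pair.
            pairs[tuple(sorted((a, b)))] += 1
    return pairs.most_common()


# Hypothetical labels from two annotators on eight items.
annotator_1 = ["POS", "NEU", "NEG", "NEU", "POS", "NEU", "NEG", "POS"]
annotator_2 = ["POS", "POS", "NEG", "NEG", "POS", "POS", "NEG", "NEU"]

for pair, count in disagreement_pairs(annotator_1, annotator_2):
    print(pair, count)
```

In this made-up example, three of the four disagreements involve NEU versus POS, which is the kind of finding that points to a specific section of the guidelines to revise.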
Example Non-Technical Explanations
- “Think of kappa as measuring how well our annotators are reading from the same rulebook. A kappa of 0.75 means they are reading from essentially the same rulebook; a kappa of 0.4 means they are improvising too much.”
- “Our raw agreement rate is 83%, but on a simple two-class task with balanced labels, random chance alone would give us about 50% agreement. Kappa adjusts for that, which is why our kappa of 0.66 is the more meaningful number.”
- “We are comfortable proceeding to model training with a kappa of 0.72 — this is within the range that peer-reviewed ML papers report as acceptable for tasks of this type.”
- “The drop in kappa from 0.78 to 0.61 after we added the new label category tells us that annotators are not yet confident about when to use it — we need two or three more worked examples in the guidelines.”
- “Kappa of 1.0 would mean our annotators always agree, which sounds ideal but can also indicate the task is trivially easy and not capturing the nuance we need for a useful model.”
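The arithmetic behind the raw-agreement example in the list above is short enough to check by hand or in a few lines of code. The sketch below uses the standard kappa formula; the function names are hypothetical, and the 66/34 label split in the second case is an assumed figure chosen only to show how an uneven split raises chance agreement.

```python
def chance_agreement(label_shares):
    """Expected agreement if both annotators label at random with these shares."""
    return sum(share * share for share in label_shares)


def kappa_from_rates(observed, chance):
    """Kappa from an observed agreement rate and a chance agreement rate."""
    return (observed - chance) / (1 - chance)


# Balanced two-class task: chance agreement is 50%.
balanced = chance_agreement([0.5, 0.5])
print(round(kappa_from_rates(0.83, balanced), 2))  # 0.66

# Assumed 66/34 label split: chance agreement rises to about 55%.
skewed = chance_agreement([0.66, 0.34])
print(round(skewed, 2))                            # 0.55
print(round(kappa_from_rates(0.87, skewed), 2))    # 0.71
```

The second case shows why a higher raw agreement rate does not automatically mean a much higher kappa: the more lopsided the labels, the more agreement chance alone produces.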
Choosing the Right Statistic
Different IAA statistics are appropriate for different task types.
| Task type | Recommended statistic |
|---|---|
| Two annotators, categorical labels | Cohen’s kappa |
| Three or more annotators, categorical labels | Fleiss’ kappa |
| Ordinal ratings | Weighted kappa (two annotators) or Intraclass Correlation (ICC) |
| Span-level annotation (NER, etc.) | F1-based IAA or span-level kappa |
When presenting to stakeholders, you rarely need to explain which statistic you used unless they ask. What matters is the interpretation: “Our agreement score is X, which is considered Y according to standard benchmarks.”
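For teams with more than two annotators, here is a minimal sketch of Fleiss' kappa written from its standard definition. It assumes every item was labelled by the same number of annotators; the function name and example ratings are hypothetical, and returning NaN when only one label is ever used is a choice made for this sketch.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of labels per item."""
    if not ratings:
        raise ValueError("need at least one item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    label_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items  # mean observed agreement
    total = n_items * n_raters
    p_e = sum((t / total) ** 2 for t in label_totals.values())  # chance agreement

    if p_e == 1.0:
        return float("nan")  # only one label used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical ratings: four items, three annotators each.
ratings = [
    ["A", "A", "A"],
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "B", "B"],
]
print(round(fleiss_kappa(ratings), 2))  # 0.33
```

For production reporting, an established statistics library is usually a safer choice than hand-rolled code; this sketch is mainly useful for seeing what the number is made of.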