Writing Annotation Guidelines That Annotators Actually Follow

Learn how to write clear, effective annotation guidelines for machine learning datasets — structure, plain language, decision trees, worked examples, and edge case documentation.

Why Most Annotation Guidelines Fail

Annotation guidelines fail for one of three reasons: they are too vague to guide consistent decisions, too long to be read carefully, or too abstract to apply to real examples. The result is high inter-annotator disagreement, expensive adjudication, and noisy training data.

Writing guidelines that annotators actually follow is a writing skill — specifically, the skill of writing instructions that are unambiguous, proportionate, and illustrated with the right examples.


Guideline Structure

A well-structured annotation guideline document follows a predictable pattern. Annotators who read many guidelines learn to navigate this structure; departing from it adds cognitive load.

Section | Purpose
Overview | The task in two to three sentences: what is being labelled, and why
Label schema | The complete list of possible labels with definitions
Decision procedure | Step-by-step instructions for applying each label
Worked examples | Fully annotated examples for each label, including borderline cases
Edge cases | Specific scenarios that require special handling, with guidance
Frequently asked questions (FAQ) | Answers to questions from annotator calibration sessions

Writing Definitions That Work

Label definitions must be necessary and sufficient — they must include every case the label applies to, and exclude every case it does not.

Weak definition (too broad):

“Label as POSITIVE any text that expresses a good feeling.”

This fails because “good feeling” is subjective, includes irony, and does not distinguish sentiment toward the product from sentiment toward unrelated topics.

Strong definition:

“Label as POSITIVE when the author explicitly expresses satisfaction, approval, or praise for the product or service described in the review, using language that indicates a genuine positive evaluation.”

Key vocabulary for definition writing:

Phrase | Usage
“applies when” | Introduces the condition for using a label
“does not apply when” | Introduces exclusion criteria
“regardless of” | Signals that a factor should be ignored
“even if” | Introduces a condition that does not change the label
“in cases where X and Y both apply, prefer X” | Resolves label conflicts

Decision Trees

For tasks with more than two labels, or labels with overlapping conditions, a decision tree usually reduces errors more than prose instructions alone, because it fixes the order in which conditions are checked.

A well-formed decision tree for annotation:

  1. Uses yes/no questions, not open-ended ones
  2. Is deterministic — every path leads to exactly one label
  3. Handles tie-breaking explicitly
  4. Matches the worked examples exactly

Example structure:

Does the text contain a direct comparison to a competitor?
  → YES: Does it claim superiority?
         → YES: label as COMPARATIVE_POSITIVE
         → NO:  label as COMPARATIVE_NEUTRAL
  → NO:  Does it describe a product feature?
         → YES: label as FEATURE_DESCRIPTION
         → NO:  label as OTHER
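
As a rough sketch, the example tree above can be written as a small function, which makes the "every path leads to exactly one label" property easy to check. The Answers class and its field names are hypothetical, chosen only for this illustration; the label names come from the tree itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answers:
    """Yes/no answers for one text (hypothetical field names)."""
    has_competitor_comparison: bool
    claims_superiority: bool
    describes_feature: bool


def assign_label(a: Answers) -> str:
    """Walk the tree top-down; every path returns exactly one label."""
    if a.has_competitor_comparison:
        # Left branch: only the superiority question matters here.
        return "COMPARATIVE_POSITIVE" if a.claims_superiority else "COMPARATIVE_NEUTRAL"
    # Right branch: no comparison, so ask about product features.
    return "FEATURE_DESCRIPTION" if a.describes_feature else "OTHER"
```

Enumerating all eight combinations of answers and confirming that each returns one of the four labels is a cheap way to verify that the written tree has no dead ends.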

Worked Examples: The Most Underused Element

Most guidelines include too few examples, and too few of those are borderline. Annotators agree on clear-cut cases even without examples. They need guidance on the edges: cases where two labels seem equally applicable.

For each label, include:

  • One clear-cut positive example: “This is the label, and here is why.”
  • One clear-cut negative example: “This is NOT the label, even though it might look like it.”
  • One borderline example: “This is close to both X and Y. We label it X because…”

The borderline example is the most valuable. It records how the guideline authors resolved a hard case and prevents annotators from splitting on the same cases repeatedly.
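
One way to enforce the three-examples-per-label rule is a small coverage check run before the guideline is released. The sketch below assumes examples are stored as (label, kind) pairs, with kind being "positive", "negative", or "borderline"; that storage format and the function name are assumptions made for illustration.

```python
from collections import defaultdict

REQUIRED_KINDS = {"positive", "negative", "borderline"}


def missing_example_kinds(labels, examples):
    """Return {label: sorted list of example kinds still missing}.

    `examples` is an iterable of (label, kind) pairs (assumed format).
    Labels with full coverage are left out of the result.
    """
    seen = defaultdict(set)
    for label, kind in examples:
        seen[label].add(kind)
    return {
        label: sorted(REQUIRED_KINDS - seen[label])
        for label in labels
        if REQUIRED_KINDS - seen[label]
    }


gaps = missing_example_kinds(
    ["POSITIVE", "NEGATIVE"],
    [
        ("POSITIVE", "positive"),
        ("POSITIVE", "negative"),
        ("POSITIVE", "borderline"),
        ("NEGATIVE", "positive"),
    ],
)
print(gaps)  # {'NEGATIVE': ['borderline', 'negative']}
```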


Plain Language Principles for Instructions

Instructions must be understood by annotators with varying levels of domain expertise. Apply these principles:

  1. Use active voice for instructions. “Label the sentence” not “The sentence should be labelled.”
  2. Use short sentences. Aim for under 25 words per instruction sentence.
  3. Define terms on first use. Do not assume annotators share your vocabulary.
  4. Avoid double negatives. “Do not label as negative unless…” is harder to process than “Label as negative only when…”
  5. Use consistent terminology. If you call something a “span,” call it a “span” throughout — do not alternate with “segment” or “text region.”
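
Two of these principles, sentence length and consistent terminology, are mechanical enough to check automatically. The following is a minimal sketch, not a full style checker: the sentence splitter is deliberately naive, and the TERM_VARIANTS mapping is a hypothetical configuration based on the “span” example above.

```python
import re

MAX_WORDS = 25  # sentences should stay under this count
# Hypothetical mapping: preferred term -> variants that should not appear.
TERM_VARIANTS = {"span": ["segment", "text region"]}


def lint_instructions(text):
    """Return a list of human-readable warnings for a block of guideline text."""
    warnings = []
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words >= MAX_WORDS:
            warnings.append(f"{n_words} words: {sentence[:40]}...")
    lowered = text.lower()
    for preferred, variants in TERM_VARIANTS.items():
        for variant in variants:
            # Match the variant as a whole word, allowing a plural "s".
            if re.search(rf"\b{re.escape(variant)}s?\b", lowered):
                warnings.append(f"use '{preferred}' instead of '{variant}'")
    return warnings
```

Active voice, undefined terms, and double negatives are harder to detect reliably with patterns like these and are better left to a human reviewer.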

Edge Cases Section

The edge case section pre-empts the most common sources of disagreement. It is built from the questions that arise in annotator calibration sessions.

Format each edge case as:

Edge case: [Description of the scenario]
Guidance: [Explicit instruction]
Example: [Concrete example with the correct label]
Rationale: [Brief explanation of why this guidance was chosen]
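
If edge cases are kept in a structured form, every entry is forced to fill all four fields. The sketch below shows one possible representation; the EdgeCase class is hypothetical, and the review text in the example field is invented for illustration. The satire scenario itself is the one described in the example guideline sentences in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeCase:
    """One edge-case entry; all four fields are required."""
    scenario: str
    guidance: str
    example: str
    rationale: str

    def render(self) -> str:
        """Format the entry in the four-field layout shown above."""
        return "\n".join([
            f"Edge case: {self.scenario}",
            f"Guidance: {self.guidance}",
            f"Example: {self.example}",
            f"Rationale: {self.rationale}",
        ])


satire = EdgeCase(
    scenario="The review praises the product satirically.",
    guidance="Label as IRONIC, not POSITIVE.",
    # Invented review text, for illustration only.
    example='"Great, it broke on day one. Five stars." -> IRONIC',
    rationale="Satire does not reflect genuine sentiment.",
)
print(satire.render())
```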


Example Annotation Guideline Sentences

  1. “Label as ENTAILMENT when the hypothesis can be logically inferred from the premise — that is, if the premise is true, the hypothesis must also be true.”
  2. “Do not apply the HARMFUL label to hypothetical scenarios described in a clearly fictional context; reserve it for content that could directly enable harm in the real world.”
  3. “If the annotator is uncertain between NEUTRAL and NEGATIVE, prefer NEUTRAL: in this task, a missed NEGATIVE (false negative) costs less than a wrongly applied one (false positive).”
  4. “This example was discussed in calibration and classified as BORDERLINE_POSITIVE; annotators should not spend more than 30 seconds on cases that seem similar — apply BORDERLINE_POSITIVE and move on.”
  5. “The following example illustrates a common error: annotators frequently label satirical praise as POSITIVE, but satire does not reflect genuine sentiment and must be labelled IRONIC.”

Testing Your Guidelines

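Before deploying guidelines to a full annotator pool, run a calibration session with three to five annotators on a shared sample of 50–100 items. Measure agreement for each label dimension. Cohen’s kappa is defined for two annotators only, so compute it for each pair and average the results, or use Fleiss’ kappa across the whole group. Any dimension with kappa below 0.6 signals a definition or instruction problem, not an annotator problem.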
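
The agreement check can be sketched in a few lines of standard-library Python. Cohen’s kappa compares exactly two annotators, so with three to five annotators this sketch computes it for every pair and reports the mean. Averaging pairwise values is one common choice, not the only one. The function names and the input format (annotator id mapped to a list of labels in item order) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators.

    `annotations` maps an annotator id to that annotator's labels,
    listed in the same item order for everyone (assumed format).
    """
    pairs = list(combinations(sorted(annotations), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(annotations[p], annotations[q]) for p, q in pairs) / len(pairs)
```

A result below 0.6 from mean_pairwise_kappa on a given dimension would then point to the definition or instruction for that dimension as the thing to revise.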

Revise the guideline — not the annotators’ training — and run the calibration again.