Writing ML Model Cards in English: Structure, Vocabulary, and Examples

Learn how to write professional ML model cards in English — covering intended use, performance metrics, limitations, ethical considerations, and the vocabulary you need.

A model card is a short document that accompanies a machine learning model to communicate what it does, how it performs, and what its limitations are. First proposed by researchers at Google in 2018 (Mitchell et al., “Model Cards for Model Reporting”), model cards have become a widely adopted way to document ML models for both internal teams and public releases. Writing a good model card in English requires understanding both the structure and the vocabulary of responsible AI communication.

What a Model Card Contains

A complete model card typically includes the following sections:

Section | Contents
Model overview | Brief description of the model, its type, and its purpose
Intended use | The tasks and contexts the model is designed for
Out-of-scope use | Tasks or contexts the model is NOT designed for
Training data | A description of the data used to train the model
Evaluation data | The datasets used to measure model performance
Performance | Quantitative results broken down by relevant subgroups
Limitations | Known weaknesses, failure modes, and edge cases
Ethical considerations | Potential harms, bias findings, and mitigation measures
Caveats and recommendations | Additional guidance for users
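
For teams that keep model cards alongside code, the sections listed above can be sketched as a simple data structure. This is only an illustration: the class name, field names, and helper functions are hypothetical and do not come from any model card library.

```python
from dataclasses import dataclass, fields


@dataclass
class ModelCard:
    """One text field per section; all names here are illustrative."""
    model_overview: str = ""
    intended_use: str = ""
    out_of_scope_use: str = ""
    training_data: str = ""
    evaluation_data: str = ""
    performance: str = ""
    limitations: str = ""
    ethical_considerations: str = ""
    caveats_and_recommendations: str = ""


def missing_sections(card: ModelCard) -> list[str]:
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]


def render(card: ModelCard) -> str:
    """Render the card as plain text, one heading per non-empty section."""
    parts = []
    for f in fields(card):
        text = getattr(card, f.name).strip()
        if text:
            heading = f.name.replace("_", " ").capitalize()
            parts.append(f"{heading}\n\n{text}")
    return "\n\n".join(parts)


# Example content invented for illustration.
card = ModelCard(
    model_overview="Text classifier that assigns a priority category to support tickets.",
    intended_use="Ticket triage in a human-in-the-loop workflow.",
)
print(render(card))
print("Still to write:", ", ".join(missing_sections(card)))
```

A check like missing_sections can act as a reminder that sections such as limitations and ethical considerations are not left blank before release.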

Writing the Intended Use Section

The intended use section tells users what the model is designed to do. Be specific — vague intended use statements invite misuse.

Template: “This model is designed for [primary task] using [input type] as input. It is intended for use by [target users] in [context].”

Example: “This model is designed to classify customer support tickets into one of 12 priority categories. It is intended for use by support operations teams in a human-in-the-loop workflow, where a human agent reviews and confirms the classification before it is acted upon.”
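
The template above can also be filled in programmatically, which makes it harder to skip a slot by accident. The sketch below is a minimal illustration: the function and parameter names are hypothetical, and the input type used in the usage example ("the ticket subject and body text") is an assumption, since the example statement above does not name one.

```python
INTENDED_USE_TEMPLATE = (
    "This model is designed for {primary_task} using {input_type} as input. "
    "It is intended for use by {target_users} in {context}."
)


def intended_use_statement(primary_task: str, input_type: str,
                           target_users: str, context: str) -> str:
    """Fill the template, rejecting empty slots so no field is silently skipped."""
    slots = {
        "primary_task": primary_task,
        "input_type": input_type,
        "target_users": target_users,
        "context": context,
    }
    empty = [name for name, value in slots.items() if not value.strip()]
    if empty:
        raise ValueError(f"Missing intended-use fields: {', '.join(empty)}")
    return INTENDED_USE_TEMPLATE.format(**slots)


print(intended_use_statement(
    primary_task="classifying customer support tickets into one of 12 priority categories",
    input_type="the ticket subject and body text",  # assumed, not stated in the article
    target_users="support operations teams",
    context="a human-in-the-loop workflow",
))
```

The check only catches empty slots. Whether a filled slot is specific enough still needs a human reviewer.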

Writing the Limitations Section

The limitations section is one of the most important parts of a model card — and often the most neglected. Good limitations sections are specific and honest.

Key vocabulary:

Term | Meaning
Bias | Systematic errors in model behaviour that unfairly affect certain groups
Fairness | The degree to which a model performs equitably across different demographic groups
Evaluation metric | A quantitative measure of model performance
Training data distribution | The statistical properties of the data used to train the model
Out-of-distribution | Inputs that differ significantly from the training data distribution
Subgroup performance | Model performance measured separately for different demographic or categorical groups
Calibration | The alignment between a model’s predicted probabilities and actual outcomes
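
Calibration is easier to describe in a model card once it has been measured. The sketch below shows one common binned estimate for a binary classifier, often called expected calibration error. The function name, the choice of ten equal-width bins, and the toy inputs are assumptions made for illustration; other binning schemes and definitions exist.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between the mean predicted probability and the
    observed positive rate, computed over equal-width probability bins.

    probs:  predicted probabilities of the positive class, each in [0, 1]
    labels: true labels, 1 for positive and 0 for negative
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_prob - positive_rate)
    return ece


# Toy data: the model says 0.9 every time, but only half the cases are positive.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(f"Calibration gap: {overconfident:.2f}")
```

A value near zero means predicted probabilities track observed outcomes closely; a large value, as in the toy data, is worth reporting as a limitation.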

Limitations language patterns:

  • “This model performs significantly worse on inputs from speakers with non-native English accents, as reflected in a 15-point accuracy drop in subgroup evaluation.”
  • “The model was trained on data from 2019–2022 and may not accurately reflect current language patterns or cultural references.”
  • “Performance on images captured in low-light conditions was not evaluated and should be treated as out of scope for the initial release.”
  • “The model has not been evaluated for use outside of English-language text. Applying it to other languages is not supported and may produce unreliable results.”

Writing the Ethical Considerations Section

This section addresses potential harms and what has been done to mitigate them.

Structure:

  1. Identify the potential harm
  2. Describe the evidence or assessment that identified it
  3. State the mitigation in place
  4. Note any residual risk

Example: “The model may reinforce socioeconomic biases present in the training data. An internal bias audit found that the model assigned lower creditworthiness scores to applicants from certain postcode areas correlated with lower average income. To mitigate this, postcode was removed as a feature and replaced with more granular economic indicators. Residual geographic bias at a more local level has not been fully assessed.”

Writing the Performance Section

When reporting performance, always specify:

  • The metric (accuracy, F1, AUC, etc.)
  • The dataset it was measured on
  • The evaluation conditions (evaluation date, data slice, sample size, etc.)

Example: “Overall accuracy on the test set is 91.3% (evaluated on 10,000 held-out examples from the 2023 evaluation dataset). Subgroup analysis by language variety revealed accuracy of 93.1% on American English and 87.4% on Indian English.”
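
Subgroup figures like the ones in this example come from computing the same metric separately for each group. The following is a minimal sketch of that calculation, assuming each evaluation record carries a subgroup label; the function name and the toy records are invented for illustration and are not the data behind the numbers above.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """records: iterable of (subgroup, true_label, predicted_label) tuples.

    Returns {subgroup: (accuracy, n_examples)} plus an "overall" entry.
    Assumes no subgroup is itself named "overall".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, truth, prediction in records:
        for key in (subgroup, "overall"):
            total[key] += 1
            correct[key] += int(truth == prediction)
    return {key: (correct[key] / total[key], total[key]) for key in total}


# Toy data, invented for illustration only.
records = [
    ("American English", "urgent", "urgent"),
    ("American English", "low", "low"),
    ("American English", "low", "urgent"),
    ("Indian English", "urgent", "low"),
    ("Indian English", "low", "low"),
]
for group, (acc, n) in accuracy_by_subgroup(records).items():
    print(f"{group}: accuracy {acc:.1%} on {n} examples")
```

Reporting the number of examples next to each figure helps readers judge how much weight a subgroup result can bear.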

Performance vocabulary:

  • “disaggregated evaluation”: performance broken down by subgroup
  • “held-out test set”: data that was not used during training or validation
  • “confidence interval”: a range of plausible values for a metric at a stated confidence level (for example, 95%)
  • “statistical significance”: whether an observed difference in performance is unlikely to be explained by chance alone
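
To make “confidence interval” concrete, the sketch below computes an interval for the 91.3% accuracy on 10,000 held-out examples quoted in the performance example. It assumes a 95% confidence level and uses the Wilson score interval, one standard method for proportions; the article does not say which method or level a model card should use, so treat both as illustrative choices.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95%."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 9,130 correct out of 10,000 corresponds to the 91.3% accuracy quoted above.
low, high = wilson_interval(successes=9130, n=10_000)
print(f"Accuracy 91.3% (95% CI {low:.1%} to {high:.1%}, n = 10,000)")
```

The same function applied to a small subgroup produces a much wider interval, which is a useful caution before describing a subgroup gap as statistically significant.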

Example Sentences

  1. “This model is intended for use as a drafting assistant for legal professionals — all output must be reviewed by a qualified solicitor before use in any legal context.”
  2. “Subgroup evaluation revealed a 12-point F1 score gap between male-coded and female-coded names in the resume screening dataset; this finding has been escalated to the responsible AI review board.”
  3. “The model’s training data does not include examples from after March 2023; queries about recent events will produce unreliable or hallucinated responses.”
  4. “Out-of-scope use includes any application in a medical diagnosis context — the model has not been validated for clinical use and does not meet the requirements of a medical device.”
  5. “The model card documents all known limitations transparently so that downstream users can make informed decisions about whether this model is appropriate for their specific use case.”