Writing ML Model Cards in English: Structure, Vocabulary, and Examples
Learn how to write professional ML model cards in English — covering intended use, performance metrics, limitations, ethical considerations, and the vocabulary you need.
A model card is a short document that accompanies a machine learning model to communicate what it does, how it performs, and what its limitations are. Originally proposed by researchers at Google in 2018 (Mitchell et al., “Model Cards for Model Reporting”), model cards have become a widely adopted way to document ML models for both internal teams and public releases. Writing a good model card in English requires understanding both the structure and the vocabulary of responsible AI communication.
What a Model Card Contains
A complete model card typically includes the following sections:
| Section | Contents |
|---|---|
| Model overview | Brief description of the model, its type, and its purpose |
| Intended use | The tasks and contexts the model is designed for |
| Out-of-scope use | Tasks or contexts the model is NOT designed for |
| Training data | A description of the data used to train the model |
| Evaluation data | The datasets used to measure model performance |
| Performance | Quantitative results broken down by relevant subgroups |
| Limitations | Known weaknesses, failure modes, and edge cases |
| Ethical considerations | Potential harms, bias findings, and mitigation measures |
| Caveats and recommendations | Additional guidance for users |
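One way to keep a draft honest about coverage is to check it against the section list above before publishing. The following is a minimal sketch, assuming the card is held as a plain dictionary of section name to text; the function name `missing_sections` and the sample draft are illustrative, not part of any standard tooling.

```python
# Section names are taken from the table above; the checker is illustrative.
REQUIRED_SECTIONS = [
    "Model overview",
    "Intended use",
    "Out-of-scope use",
    "Training data",
    "Evaluation data",
    "Performance",
    "Limitations",
    "Ethical considerations",
    "Caveats and recommendations",
]


def missing_sections(card: dict[str, str]) -> list[str]:
    """Return the required sections that are absent or empty in `card`."""
    return [
        name
        for name in REQUIRED_SECTIONS
        if not card.get(name, "").strip()
    ]


# Hypothetical draft with only two sections actually written.
draft = {
    "Model overview": "Text classifier for support ticket triage.",
    "Intended use": "Prioritising customer support tickets with human review.",
    "Limitations": "   ",  # whitespace only, so it is reported as missing
}

for name in missing_sections(draft):
    print(f"Section still to write: {name}")
```

A check like this only confirms that each section exists; it says nothing about whether the content is specific or accurate.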
Writing the Intended Use Section
The intended use section tells users what the model is designed to do. Be specific — vague intended use statements invite misuse.
Template: “This model is designed for [primary task] using [input type] as input. It is intended for use by [target users] in [context].”
Example: “This model is designed to classify customer support tickets into one of 12 priority categories. It is intended for use by support operations teams in a human-in-the-loop workflow, where a human agent reviews and confirms the classification before it is acted upon.”
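If a team writes many cards, the template can be filled programmatically so that no slot is silently left blank. This is a small sketch under that assumption; the function name `intended_use` is hypothetical, and the input type "ticket text" is an assumed value, since the example above does not state one.

```python
INTENDED_USE_TEMPLATE = (
    "This model is designed for {primary_task} using {input_type} as input. "
    "It is intended for use by {target_users} in {context}."
)


def intended_use(primary_task: str, input_type: str,
                 target_users: str, context: str) -> str:
    """Fill the intended-use template, rejecting empty slots."""
    fields = {
        "primary_task": primary_task,
        "input_type": input_type,
        "target_users": target_users,
        "context": context,
    }
    empty = [name for name, value in fields.items() if not value.strip()]
    if empty:
        raise ValueError(f"Missing template fields: {', '.join(empty)}")
    return INTENDED_USE_TEMPLATE.format(**fields)


statement = intended_use(
    primary_task="classifying customer support tickets into one of 12 priority categories",
    input_type="ticket text",  # assumed; not stated in the example above
    target_users="support operations teams",
    context="a human-in-the-loop workflow",
)
print(statement)
```

Raising an error on an empty slot mirrors the advice above: a vague or partial intended-use statement is worse than a visible gap.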
Writing the Limitations Section
The limitations section is one of the most important parts of a model card — and often the most neglected. Good limitations sections are specific and honest.
Key vocabulary:
| Term | Meaning |
|---|---|
| Bias | Systematic errors in model behaviour that unfairly affect certain groups |
| Fairness | The degree to which a model performs equitably across different demographic groups |
| Evaluation metric | A quantitative measure of model performance |
| Training data distribution | The statistical properties of the data used to train the model |
| Out-of-distribution | Inputs that differ significantly from the training data distribution |
| Subgroup performance | Model performance measured separately for different demographic or categorical groups |
| Calibration | The alignment between a model’s predicted probabilities and actual outcomes |
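Calibration is easier to write about once it has been measured. One common summary is expected calibration error (ECE): predictions are grouped into probability bins, and the gap between the average predicted probability and the observed positive rate is averaged across bins, weighted by bin size. The sketch below is a simple binary-classification variant with equal-width bins; the function name and the toy data are illustrative assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: size-weighted mean gap between predicted probability
    and observed positive rate, over equal-width probability bins."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - positive_rate)
    return ece


# Toy data: the model says 0.9 for ten inputs, but only six are positive.
overconfident_probs = [0.9] * 10
overconfident_labels = [1] * 6 + [0] * 4
print(round(expected_calibration_error(overconfident_probs, overconfident_labels), 3))
```

A result like this could be reported in a limitations section as, for example, “predicted probabilities near 0.9 overstate the observed positive rate by roughly 30 percentage points on the evaluation set.”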
Limitations language patterns:
- “This model performs significantly worse on inputs from speakers with non-native English accents, as reflected in a 15-percentage-point accuracy drop in subgroup evaluation.”
- “The model was trained on data from 2019–2022 and may not accurately reflect current language patterns or cultural references.”
- “Performance on images captured in low-light conditions was not evaluated; such images should be treated as out of scope for the initial release.”
- “The model has not been evaluated for use outside of English-language text. Applying it to other languages is not supported and may produce unreliable results.”
Writing the Ethical Considerations Section
This section addresses potential harms and what has been done to mitigate them.
Structure:
- Identify the potential harm
- Describe the evidence or assessment that identified it
- State the mitigation in place
- Note any residual risk
Example: “The model may reinforce socioeconomic biases present in the training data. An internal bias audit found that the model assigned lower creditworthiness scores to applicants from certain postcode areas correlated with lower average income. To mitigate this, postcode was removed as a feature and replaced with more granular economic indicators. Residual geographic bias at a more local level has not been fully assessed.”
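The four-part structure can also be enforced in tooling, so that an entry cannot be published without a stated mitigation and residual risk. The sketch below is one possible representation, assuming each entry is stored as four short strings; the class name `EthicalConsideration` is hypothetical, and the text is taken from the example above.

```python
from dataclasses import dataclass, fields


@dataclass
class EthicalConsideration:
    harm: str
    evidence: str
    mitigation: str
    residual_risk: str

    def to_paragraph(self) -> str:
        """Join the four parts in order, refusing to skip any of them."""
        missing = [
            f.name for f in fields(self) if not getattr(self, f.name).strip()
        ]
        if missing:
            raise ValueError(f"Missing parts: {', '.join(missing)}")
        return " ".join(getattr(self, f.name).strip() for f in fields(self))


entry = EthicalConsideration(
    harm="The model may reinforce socioeconomic biases present in the training data.",
    evidence=(
        "An internal bias audit found that the model assigned lower "
        "creditworthiness scores to applicants from certain postcode areas "
        "correlated with lower average income."
    ),
    mitigation=(
        "To mitigate this, postcode was removed as a feature and replaced "
        "with more granular economic indicators."
    ),
    residual_risk=(
        "Residual geographic bias at a more local level has not been fully assessed."
    ),
)
print(entry.to_paragraph())
```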
Performance Section Vocabulary
When reporting performance, always specify:
- The metric (accuracy, F1, AUC, etc.)
- The dataset it was measured on
- The conditions (evaluation date, data slice, sample size, etc.)
Example: “Overall accuracy on the test set is 91.3% (evaluated on 10,000 held-out examples from the 2023 evaluation dataset). Subgroup analysis by language variety revealed accuracy of 93.1% on American English and 87.4% on Indian English.”
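Subgroup figures like the ones in this example come from computing the same metric separately for each group. The following is a minimal sketch, assuming each evaluation record carries a subgroup label alongside the true and predicted class; the function name and the eight toy records are invented for illustration and are not real results.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """Map each subgroup to (accuracy, sample size).

    `records` is an iterable of (subgroup, true_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, y_true, y_pred in records:
        total[subgroup] += 1
        correct[subgroup] += int(y_true == y_pred)
    return {group: (correct[group] / total[group], total[group]) for group in total}


# Toy records only; a real evaluation would use the held-out test set.
records = [
    ("American English", "high", "high"),
    ("American English", "low", "low"),
    ("American English", "low", "low"),
    ("American English", "high", "low"),
    ("Indian English", "high", "high"),
    ("Indian English", "low", "high"),
    ("Indian English", "high", "low"),
    ("Indian English", "low", "low"),
]

results = accuracy_by_subgroup(records)
for group, (accuracy, n) in results.items():
    print(f"Accuracy on {group}: {accuracy:.1%} (n={n})")
```

Reporting the sample size next to each figure matters: a gap measured on a handful of examples is much weaker evidence than the same gap measured on thousands.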
Performance vocabulary:
- “disaggregated evaluation”: performance reported separately for each subgroup rather than as a single aggregate figure
- “held-out test set”: data that was not used during training or validation
- “confidence interval”: a range of values expected to contain the true value of a metric at a stated confidence level (for example, 95%)
- “statistical significance”: whether an observed difference in performance is unlikely to be explained by chance alone
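As a worked example of a confidence interval, the overall accuracy figure quoted earlier (91.3% on 10,000 held-out examples) can be given a rough 95% interval with the normal approximation for a proportion. This is a sketch that assumes the test examples are independent; the function name is illustrative, and for small samples or accuracies near 0% or 100% a Wilson or bootstrap interval is usually a better choice.

```python
import math


def accuracy_confidence_interval(accuracy: float, n: int, z: float = 1.96):
    """Normal-approximation interval for an accuracy measured on n examples.

    z = 1.96 corresponds to a 95% confidence level.
    """
    if n <= 0 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1] and n must be positive")
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    margin = z * standard_error
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


low, high = accuracy_confidence_interval(0.913, 10_000)
print(f"95% CI: {low:.1%} to {high:.1%}")  # 95% CI: 90.7% to 91.9%
```

In a model card this could be written as “Overall accuracy on the test set is 91.3% (95% CI: 90.7% to 91.9%, n = 10,000).”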
Example Sentences
- “This model is intended for use as a drafting assistant for legal professionals — all output must be reviewed by a qualified solicitor before use in any legal context.”
- “Subgroup evaluation revealed a 12-point F1 gap between male-coded and female-coded names in the resume screening dataset; this finding has been escalated to the responsible AI review board.”
- “The model’s training data does not include examples from after March 2023; queries about more recent events may produce unreliable or hallucinated responses.”
- “Out-of-scope use includes any application in a medical diagnosis context — the model has not been validated for clinical use and does not meet the requirements of a medical device.”
- “The model card documents all known limitations transparently so that downstream users can make informed decisions about whether this model is appropriate for their specific use case.”