Model Card Writing Guide for ML Engineers

Learn how to write a professional model card in English — structure, required sections, evaluation reporting, and ready-to-use phrases for documenting AI models.

A model card is a short document that describes an ML model — its purpose, training data, evaluation results, limitations, and ethical considerations. Originally proposed by Google researchers in the 2018 paper “Model Cards for Model Reporting” (Mitchell et al.), model cards have become a de facto standard for responsible model documentation. If you work with ML models professionally, you will write and maintain model cards. This guide shows you how to do it in clear, professional English.


What Is a Model Card?

A model card is a standardised documentation format for ML models, analogous to a drug package insert or nutritional label. It gives users, auditors, and downstream teams the information they need to decide whether to use a model, understand its constraints, and deploy it responsibly.

“Before deploying the vendor’s model, our compliance team asked for the model card — specifically the bias evaluation section and the intended use cases.”
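
Model cards are typically written in Markdown and shipped alongside the model. If you publish to the Hugging Face Hub, the `huggingface_hub` library can parse, validate, and save a card programmatically; here is a minimal sketch (the card content below is illustrative, not a fixed format):

from huggingface_hub import ModelCard

content = """---
language: en
license: apache-2.0
---

# SupportClassifier v2.1

Fine-tuned DistilBERT for routing customer support tickets.
"""

card = ModelCard(content)  # parses the YAML front matter and Markdown body
print(card.data.language)  # "en"
card.save("README.md")     # on the Hub, the card lives in the repo README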


Standard Model Card Structure

Section 1: Model Details

Introduce the model with metadata anyone can quickly scan.

Include:

  • Model name and version
  • Model type (e.g., text classification, image generation, embedding model)
  • Organisation / team that developed it
  • Date of training / release
  • Training framework and key hyperparameters (optional)
  • Contact for questions

Template:

## Model Details

- **Model name**: SupportClassifier v2.1
- **Model type**: Multi-class text classifier (14 categories)
- **Architecture**: Fine-tuned DistilBERT
- **Developed by**: ML Platform Team, Acme Corp
- **Date**: March 2026
- **Framework**: PyTorch 2.2, Hugging Face Transformers 4.38
- **Contact**: mlteam@acme.com

Section 2: Intended Use

Be explicit about what the model is designed for — and what it is not.

Intended use:

“This model is designed to classify customer support tickets into one of 14 predefined categories to enable automated routing to the appropriate support queue.”

Out-of-scope use:

“This model is not designed for legal document classification, medical triage, or any use case where errors could cause material harm. It should not be used without human oversight in high-stakes automated workflows.”

Critical English patterns:

  • “is designed to…” — states purpose
  • “is intended for…” — recommended audience
  • “should not be used for…” — explicit exclusions
  • “this model is not appropriate for…” — softer exclusion language
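
Putting these phrases together, a template for this section in the same style as the others (field names and wording are illustrative):

## Intended Use

**Primary use**: Classify customer support tickets into one of 14 predefined
categories to enable automated routing to the appropriate support queue.

**Out-of-scope uses**: Legal document classification, medical triage, or any
use case where errors could cause material harm. The model should not be used
without human oversight in high-stakes automated workflows.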

Section 3: Training Data

Document what data was used, where it came from, and any important caveats.

Include:

  • Dataset name and version
  • Size (number of examples)
  • Date range (when was the data collected?)
  • Source (internal, public, licensed third-party)
  • Preprocessing steps
  • Any known biases in the data

Template:

## Training Data

The model was trained on the internal Customer Support Corpus v3 (Q1 2023 – Q4 2025):

- **Size**: 120,000 labelled support tickets
- **Languages**: English only
- **Label distribution**: Balanced across 14 categories (≈8,500 examples per class)
- **Preprocessing**: PII removed via named-entity redaction; tickets shorter than 10 tokens excluded
- **Known limitations**: Underrepresents tickets from APAC region (≈8% of training data). Performance on APAC tickets may be lower.
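
The size, label-distribution, and representation figures in this section should come straight from the data pipeline rather than from memory. A minimal sketch of how they might be computed, assuming a hypothetical `tickets.csv` export with `label` and `region` columns:

import pandas as pd

# Hypothetical export of the training corpus.
df = pd.read_csv("tickets.csv")

print(f"Size: {len(df):,} labelled tickets")
print(df["label"].value_counts())  # per-class counts for the card

# Regional representation, to surface gaps like the APAC share above.
print((df["region"].value_counts(normalize=True) * 100).round(1))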

Section 4: Evaluation Results

This is the most important section for stakeholders deciding whether to trust and deploy the model.

Include:

  • Evaluation dataset (name, size, how it was constructed)
  • Primary metrics (with numbers)
  • Breakdown by subgroup if applicable
  • Comparison to baseline

Template:

## Evaluation Results

Evaluated on the held-out Support Test Set (15,000 tickets, Q1 2026):

| Metric | Score |
|--------|-------|
| Accuracy | 94.2% |
| Macro F1 | 0.91 |
| Macro Precision | 0.92 |
| Macro Recall | 0.90 |

**Baseline comparison**: The previous rule-based routing system achieved 71% accuracy
on the same test set.

**Subgroup performance** (by region):
| Region | Accuracy |
|--------|----------|
| North America | 95.8% |
| EMEA | 94.1% |
| APAC | 88.3% |

*Note: APAC performance is lower, consistent with underrepresentation in training data.*
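
Numbers like these are usually produced with a standard metrics library; here is a minimal scikit-learn sketch, assuming `y_true`, `y_pred`, and a parallel `region` array already exist as NumPy arrays:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")
print(f"Macro F1: {f1_score(y_true, y_pred, average='macro'):.2f}")
print(f"Macro Precision: {precision_score(y_true, y_pred, average='macro'):.2f}")
print(f"Macro Recall: {recall_score(y_true, y_pred, average='macro'):.2f}")

# Subgroup accuracy: one row per region for the breakdown table.
for r in np.unique(region):
    mask = region == r
    print(f"{r}: {accuracy_score(y_true[mask], y_pred[mask]):.1%}")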

Section 5: Ethical Considerations

Discuss bias, fairness, and potential harms. This section is required for any model deployed to end users.

Template:

## Ethical Considerations

**Bias**: The model may perform less accurately on tickets written in non-native English
due to lower representation of such language patterns in the training corpus.

**Fairness**: Routing errors may disproportionately affect APAC customers. We recommend
monitoring error rates by region in production and rebalancing training data before v3.

**Privacy**: Training data has undergone PII redaction. No customer names, emails,
or account numbers are present in stored artefacts.

**Misuse risk**: Misclassification could delay critical support requests. Human
review is required for any ticket classified under 'Critical Outage' or 'Legal'.

Useful phrases:

  • “may disproportionately affect…”
  • “we recommend monitoring…”
  • “human review is required for…”
  • “the model is not designed for autonomous decision-making in…”

Section 6: Limitations and Known Issues

The model card should be honest about failure modes.

Template:

## Limitations

- The model was trained on English-only data and will perform poorly on
  multilingual tickets or heavily code-switched text.
- Tickets shorter than 15 tokens may produce unreliable classifications.
- The model does not distinguish between 'Billing – refund' and 'Billing – dispute'
  in ambiguous cases; see error analysis for examples.
- Performance degrades on ticket categories introduced after the training cutoff
  (December 2025).

Section 7: Deployment and Infrastructure Notes

Operational details for the teams deploying the model.

Include:

  • Inference hardware requirements
  • Expected latency
  • Serving framework
  • Monitoring recommendations

Template:

## Deployment Notes

- **Hardware**: CPU inference sufficient; single-instance p99 latency ~45ms
- **Memory**: 512MB RAM minimum; 1GB recommended
- **Serving**: TorchServe 0.9; model is packaged as a `.mar` artefact
- **Monitoring**: Track accuracy, F1 per category, and inference latency.
  Alert if accuracy drops below 90% over a rolling 7-day window.
- **Retraining trigger**: Quarterly or when accuracy drops below 88%.
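
The rolling-accuracy alert above translates directly into a monitoring check; here is a sketch assuming a hypothetical `daily_accuracy.csv` log with `date` and `accuracy` columns:

import pandas as pd

# Hypothetical daily accuracy log exported from production monitoring.
log = pd.read_csv("daily_accuracy.csv", parse_dates=["date"]).set_index("date").sort_index()

# Mean accuracy over a rolling 7-day window, as specified in the card.
rolling = log["accuracy"].rolling("7D").mean()

if rolling.iloc[-1] < 0.90:
    print("ALERT: rolling 7-day accuracy below 90%")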

Language Tips for Writing Model Cards

Be specific, not vague

  • Vague: “The model performs well on most inputs.”
  • Specific: “The model achieves 94.2% accuracy on the held-out test set of 15,000 tickets.”

Be honest about limitations

  • Overclaiming: “The model handles all ticket types.”
  • Honest: “The model may produce unreliable outputs for ticket types not present in the training data.”

Use passive voice for standard documentation statements

  • “The model was trained on…”
  • “Evaluation was performed on…”
  • “PII was removed before training.”

Use active voice for recommendations

  • “We recommend monitoring…”
  • “Teams should avoid using this model for…”

Practice

Deepen your ML documentation vocabulary with the Applied AI & LLMs exercise set and the AI/ML Engineer learning path.