English for ML Security Engineers: Adversarial Attacks, Poisoning, and Model Integrity
Learn the English vocabulary and natural discussion phrases that ML security engineers use to talk about adversarial examples, data poisoning, and model red-teaming.
ML security is one of the fastest-growing specialisms in the industry — and it has a vocabulary that blends classical security terminology with machine learning concepts. If you’re working on model integrity, threat modeling for AI systems, or red-teaming language models, you need precise English to communicate findings clearly. Ambiguity in a security report can mean a critical vulnerability gets ignored. This post teaches you the vocabulary and the natural register engineers use.
Core Vocabulary
Adversarial Example
An adversarial example is an input specifically crafted to fool a machine learning model. To a human it is often indistinguishable from a benign input, yet it causes the model to produce an incorrect or harmful output.
“The red team generated adversarial examples that caused the image classifier to misidentify stop signs as speed limit signs with 97% confidence.”
Key phrases: craft an adversarial example, the model is fooled by, adversarially perturbed input, imperceptible to humans.
Note the register: in security writing, engineers use craft (deliberate, skillful construction), not “make” or “create.” This sounds more precise and is expected in reports.
Adversarial Robustness
Adversarial robustness describes how well a model maintains correct behavior when subjected to adversarial inputs. A robust model degrades gracefully; a brittle model fails catastrophically.
“We measured adversarial robustness using PGD attacks at epsilon=8/255. The baseline model dropped to 12% accuracy; the adversarially trained model held at 61%.”
Useful phrases: evaluate robustness against, the model is brittle to, robustness-accuracy tradeoff, certifiable robustness.
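To make "epsilon" concrete, here is a minimal sketch of a single signed-gradient step (FGSM-style) against a toy logistic-regression model. PGD, mentioned in the quote above, repeats a step like this several times and projects back into the epsilon ball after each one. The weights, bias, and input values are invented for illustration, and the function names are hypothetical.

```python
import math

def predict_proba(weights, bias, x):
    """Probability of the positive class under a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def fgsm_perturb(weights, x, true_label, epsilon):
    """Take one signed-gradient step that increases the loss for true_label.

    For logistic regression, the gradient of the cross-entropy loss with
    respect to the input is (p - y) * w, so its sign is -sign(w) when y = 1
    and sign(w) when y = 0.
    """
    direction = -1 if true_label == 1 else 1
    x_adv = [xi + epsilon * direction * sign(w) for xi, w in zip(x, weights)]
    # Keep every feature in [0, 1], as for normalized pixel values.
    return [min(1.0, max(0.0, v)) for v in x_adv]

# Toy model and input (assumed values, not from any real system).
weights = [2.0, -3.0, 1.5]
bias = -0.4
x_clean = [0.5, 0.4, 0.5]
epsilon = 8 / 255  # maximum change allowed per feature

x_adv = fgsm_perturb(weights, x_clean, true_label=1, epsilon=epsilon)
p_clean = predict_proba(weights, bias, x_clean)
p_adv = predict_proba(weights, bias, x_adv)
```

In this toy setup no feature moves by more than epsilon, yet the probability of the true class falls from above 0.5 to below it. In a report you would write: "the adversarially perturbed input flipped the prediction at epsilon=8/255."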
Data Poisoning
Data poisoning is an attack where an adversary corrupts the training dataset to influence model behavior — either degrading overall performance or introducing hidden behaviors.
“The third-party dataset we ingested contained poisoned samples that suppressed the model’s ability to detect one specific malware family. We only caught it in the differential evaluation.”
Verbs: poison the training data, inject malicious samples, the dataset was compromised, detect poisoning via data auditing.
Backdoor Attack
A backdoor attack (also called a trojan attack) embeds a hidden trigger into a model during training. The model behaves normally on clean inputs but produces attacker-controlled outputs when it sees the trigger.
“The backdoored model classified every image normally — until the attacker placed a small yellow sticker in the corner. With the trigger present, it always predicted the target class.”
Phrases: plant a backdoor, activate the trigger, the model exhibits backdoor behavior, trigger pattern, clean-label backdoor (a more subtle variant).
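If "trigger pattern" feels abstract, the sketch below shows the simplest version: a small patch stamped into the corner of an image, paired with the attacker's target label. It assumes a grayscale image stored as a list of rows; the function names, patch size, and labels are hypothetical. This is the dirty-label variant, because the label is changed. A clean-label backdoor leaves the label untouched.

```python
def stamp_trigger(image, patch_value=1.0, size=2):
    """Return a copy of a 2D image with a square patch in the bottom-right corner."""
    stamped = [row[:] for row in image]  # copy so the clean image is unchanged
    height = len(stamped)
    for r in range(height - size, height):
        width = len(stamped[r])
        for c in range(width - size, width):
            stamped[r][c] = patch_value
    return stamped

def poison_sample(image, target_label):
    """Build a poisoned training pair: triggered image plus attacker-chosen label."""
    return stamp_trigger(image), target_label

# A 4x4 all-black toy image (assumed data).
clean_image = [[0.0] * 4 for _ in range(4)]
poisoned_image, poisoned_label = poison_sample(clean_image, target_label=7)
```

A model trained on enough pairs like this can learn to associate the patch with the target class, which is what "the model exhibits backdoor behavior" describes.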
Model Inversion
Model inversion is an attack where an adversary queries a model repeatedly to reconstruct sensitive training data — for example, recovering facial images from a face recognition API.
“We demonstrated a model inversion attack against the production API. Given only the top-5 confidence scores, we reconstructed recognizable images of people in the training set.”
Key phrases: reconstruct training data, the API leaks information about, inversion via repeated queries, membership exposure.
Membership Inference
A membership inference attack determines whether a specific data point was in a model’s training set. This has serious privacy implications for models trained on sensitive data.
“We ran a membership inference attack against the health risk model. The model’s overconfidence on training samples made inference trivial — we got 89% accuracy distinguishing members from non-members.”
Phrases: infer membership, the model overfits (which makes membership inference easier), member vs. non-member, shadow model attack.
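The link between overconfidence and membership inference is easier to discuss with a concrete baseline in mind. The sketch below is the simplest possible attack: guess "member" whenever the model's top confidence exceeds a threshold. The confidence values and the threshold are invented for illustration; real attacks, such as shadow model attacks, are more involved.

```python
def infer_membership(confidences, threshold):
    """Guess 'member' when the model's top confidence exceeds the threshold."""
    return [c > threshold for c in confidences]

def attack_accuracy(guesses, is_member):
    """Fraction of samples whose membership was guessed correctly."""
    correct = sum(g == m for g, m in zip(guesses, is_member))
    return correct / len(is_member)

# Hypothetical top-class confidences (assumed data).
member_conf = [0.99, 0.97, 0.98, 0.85]      # samples seen during training
non_member_conf = [0.71, 0.88, 0.64, 0.93]  # held-out samples

confidences = member_conf + non_member_conf
is_member = [True] * len(member_conf) + [False] * len(non_member_conf)

guesses = infer_membership(confidences, threshold=0.9)
accuracy = attack_accuracy(guesses, is_member)
```

The wider the confidence gap between training and held-out samples, the better this threshold works, which is why engineers say that overfitting "makes membership inference easier."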
Evasion Attack
An evasion attack causes a model to misclassify inputs at inference time — without modifying the model itself. Adversarial examples are the most common form of evasion attack.
“The spam filter was evaded by inserting zero-width Unicode characters between words. The model never saw this pattern during training.”
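The word evade is important here: you evade a model, a filter, a detector. Avoid informal verbs such as “skip” or “get around” in formal security writing. You will see bypass used for access controls and filters, but evade is the standard term for attacks on a model at inference time.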
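The zero-width character trick in the quote above can be illustrated, together with one possible mitigation, in a few lines. This is a minimal sketch of input normalization before classification; the character set and the function names are assumptions, not a complete defense.

```python
import unicodedata

# Invisible characters commonly inserted to split words (assumed, not exhaustive).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def strip_zero_width(text):
    """Remove zero-width characters that are invisible to a human reader."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def normalize_for_filter(text):
    """Strip invisible characters, then apply NFKC normalization and lowercasing."""
    return unicodedata.normalize("NFKC", strip_zero_width(text)).lower()

# The evasive text renders as "Free money" but contains hidden characters.
evasive = "Fr\u200bee mo\u200dney"
cleaned = normalize_for_filter(evasive)
```

Note that stripping the zero-width joiner can alter legitimate text in some scripts and emoji sequences, so a finding might read: "the filter was evaded via zero-width characters; normalization is recommended, subject to review for non-Latin input."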
Model Stealing
Model stealing (also model extraction) is an attack where an adversary queries a model API extensively to train a surrogate model that approximates the victim model — stealing its functionality without access to weights.
“The attacker queried our API 2 million times over three weeks and trained a local surrogate. The surrogate achieved 94% fidelity on our benchmark — effectively stealing the model.”
Phrases: extract a model, train a surrogate, API-based model stealing, fidelity to the victim model.
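"Fidelity" here is usually reported as the agreement rate between the surrogate and the victim on the same inputs. The sketch below computes it under that assumption; the labels are invented and the function name is hypothetical.

```python
def fidelity(victim_labels, surrogate_labels):
    """Fraction of inputs on which the surrogate predicts the victim's label."""
    if len(victim_labels) != len(surrogate_labels):
        raise ValueError("label lists must have the same length")
    if not victim_labels:
        raise ValueError("at least one prediction is required")
    agreements = sum(v == s for v, s in zip(victim_labels, surrogate_labels))
    return agreements / len(victim_labels)

# Predictions from both models on the same five inputs (assumed data).
victim = ["spam", "ham", "spam", "ham", "spam"]
surrogate = ["spam", "ham", "ham", "ham", "spam"]

score = fidelity(victim, surrogate)
```

Fidelity compares the surrogate with the victim, not with the ground truth, so a surrogate can have high fidelity while copying the victim's mistakes.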
Red-Teaming Models
Red-teaming (verb: to red-team) means adversarially probing a model to find failures, harmful outputs, or exploitable behaviors — before attackers do.
“We red-teamed the content moderation model for two weeks before launch. The team found it could be jailbroken with role-play prompts that shifted the conversational context.”
Phrases: red-team a model, the red team found, jailbreak (for language models specifically), adversarial probing, failure mode discovery.
Threat Modeling for ML Systems
Threat modeling in the ML context means systematically identifying what can go wrong — at training time, inference time, and deployment — and which adversaries are realistic.
“Before we start hardening the pipeline, we need to do a threat model. Who is the adversary? Do they have access to the training data, the model weights, or only the API?”
The questions “Who is the adversary?” and “What is their capability?” are standard openers in threat modeling sessions. You’ll also hear: threat actor, attack surface, trust boundary, adversary capability.
Real IT Context: Phrases Engineers Actually Use
In security review meetings:
- “What’s our attack surface at inference time? Are we exposing confidence scores or just the top class?”
- “We should assume the adversary has white-box access — if we’re only robust to black-box attacks, that’s not a strong guarantee.”
In postmortems:
- “The poisoning went undetected because we had no data provenance tracking. We couldn’t tell which samples came from the compromised source.”
In code reviews for ML pipelines:
- “We’re logging full prediction probabilities here. That could enable a membership inference attack. Can we add noise or return only the top-1 label?”
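The fix suggested in that code review comment could look something like the sketch below, which reduces a full probability vector to a top-1 label before it leaves the service. The function name and the response shape are hypothetical.

```python
def to_public_response(probabilities):
    """Return only the top-1 label, dropping the confidence scores."""
    label = max(probabilities, key=probabilities.get)
    return {"label": label}

# Full model output for one request (assumed data).
raw_output = {"fraud": 0.92, "legit": 0.08}
response = to_public_response(raw_output)
```

Returning only the label narrows the attack surface but does not eliminate it, since label-only attacks also exist. In review language: "this reduces information leakage; it is not a complete mitigation."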
Key Collocations
| Collocation | Meaning |
|---|---|
| craft an adversarial example | deliberately engineer a misleading input |
| plant a backdoor | embed a hidden trigger during training |
| poison the training data | corrupt samples to influence model behavior |
| evade a classifier | cause a misclassification at inference time without modifying the model |
| red-team a model | adversarially probe for failure modes |
| infer membership | determine if a sample was in the training set |
| steal a model | extract functionality via API queries |
| activate a trigger | cause a backdoored model to misbehave |
Practice
Write a two-paragraph threat model summary for a hypothetical ML system — for example, a fraud detection model exposed via an API. In the first paragraph, identify two realistic attack vectors using vocabulary from this post. In the second paragraph, describe what mitigations you would recommend. Use the verb forms and collocations from this post. Writing threat model summaries is a real deliverable in this field, and the English register matters: be specific, use passive constructions where appropriate (“the model was found to be vulnerable to…”), and avoid vague language.