Data Science & ML Engineering

Complete English Guide for Data Scientists & ML Engineers

Communicate experiment results, defend model choices, tell data stories to stakeholders, read ML papers, and discuss feature engineering — the full English of modern data science and machine learning.

8 sections · 25+ internal practice links · Intermediate – Advanced

Why English Matters for Data Scientists & ML Engineers

Data science and machine learning occupy a unique position in modern tech organisations: the work is deeply technical, but the value it creates is only realised when it is communicated effectively to decision-makers who are often not technical. A model that achieves excellent benchmark performance but whose results cannot be explained to a product manager will not be deployed. An experiment that reveals an important insight about user behaviour will not change product strategy if the data story is buried in a Jupyter notebook that no one reads.

The English of data science and ML spans a wide range of registers and audiences. Writing a Jupyter notebook, you are writing for technical peers who want to follow your analytical reasoning. Presenting experiment results in a weekly data review, you are speaking to a mixed audience that includes engineers, product managers, and sometimes executives. Writing a model card for a deployed model, you are writing formal documentation for a broad range of stakeholders. Each context demands different vocabulary, different levels of technical detail, and different structures.

ML engineering adds the operational dimension: communicating about model training pipelines, feature stores, data contracts, model serving infrastructure, and production monitoring. These are topics that require vocabulary from both ML research and software engineering, and the ability to discuss them with both data scientists and platform engineers.

There is also the dimension of reading and engaging with the ML research community. ML papers on arXiv, NeurIPS proceedings, and blog posts from major research labs are written in a technical English with specific conventions — the abstract, the related work section, the contributions statement, the ablation study. Being able to read this literature efficiently and then synthesise it for your team is a significant competitive advantage.

This guide covers all of these registers. Work through each section to build the vocabulary and communication skills that will make your data science and ML work more impactful.

Section 1: Experiment Result Communication

Communicating experiment results is perhaps the most frequent and consequential writing a data scientist does. Whether in a Slack message, a Google Doc, a Jupyter notebook, or a presentation, the structure and vocabulary of experiment write-ups follow conventions that are worth learning explicitly.

The Standard Experiment Write-Up Structure

A clear experiment write-up covers: Motivation (why did we run this experiment?), Hypothesis (what did we expect to find?), Method (what did we do?), Results (what did we observe?), Interpretation (what does this mean?), and Conclusion/Recommendation (what should we do next?). Each section has its own language patterns.

Motivation: "We observed a significant drop in conversion rate on the checkout page following the UI redesign in March. This experiment was designed to investigate whether the position of the CTA button is contributing to the drop." Hypothesis: "We hypothesised that moving the CTA button above the fold would increase the checkout completion rate." Method: "We ran a 50/50 A/B test over a two-week period from April 3 to April 17, with n=48,000 users per variant."

Results and Statistical Language

Results sections require precise statistical language. Key formulations: "The treatment group showed a 3.7% relative increase in conversion rate compared to the control group (8.4% vs. 8.1%, p=0.03)." The elements: the direction (increase/decrease), the magnitude (3.7%), the comparison (treatment vs. control), the absolute values (8.4% vs. 8.1%), and the p-value or confidence interval. Always report both relative and absolute changes: "a 40% improvement" is meaningless without the baseline (a 40% improvement from 1% to 1.4% is very different from a 40% improvement from 10% to 14%).
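As a rough illustration of the relative-versus-absolute distinction, the Python sketch below computes both figures from two conversion rates. The function name `describe_lift` and its dictionary keys are illustrative choices, not part of any library; the first pair of rates is the one quoted in the example sentence.

```python
def describe_lift(control_rate, treatment_rate):
    """Return the absolute change (percentage points) and relative change (%).

    Both inputs are rates expressed as fractions, e.g. 0.081 for 8.1%.
    """
    if control_rate <= 0:
        raise ValueError("control_rate must be positive to compute a relative change")
    absolute = treatment_rate - control_rate   # difference in rate units
    relative = absolute / control_rate         # difference as a share of the baseline
    return {"absolute_pp": absolute * 100, "relative_pct": relative * 100}


# Rates from the example sentence: 8.1% control, 8.4% treatment.
lift = describe_lift(0.081, 0.084)
print(f"{lift['absolute_pp']:.1f} pp absolute, {lift['relative_pct']:.1f}% relative")
# prints "0.3 pp absolute, 3.7% relative"

# The same relative change can hide very different absolute changes.
print(describe_lift(0.01, 0.014))   # about 40% relative, 0.4 pp absolute
print(describe_lift(0.10, 0.14))    # about 40% relative, 4 pp absolute
```

Reporting "pp" (percentage points) for the absolute change and "%" for the relative change keeps the two from being confused in a write-up.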

Statistical significance language: "The result is statistically significant at the 5% level (p=0.03)." / "We did not observe a statistically significant effect (p=0.31); the result is consistent with random variation." / "The 95% confidence interval for the lift is [1.1%, 5.3%]: we are 95% confident the true effect is within this range." Practical significance: "While the result is statistically significant, the absolute lift of 0.3 percentage points may not be practically significant given the implementation cost."
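For readers who want to see where a p-value and a confidence interval like these can come from, here is a minimal sketch of a two-sided two-proportion z-test using only the standard library. It assumes large samples and a normal approximation; the function name `two_proportion_test` and the counts in the usage lines are hypothetical and are not taken from the experiment described in this guide.

```python
import math


def two_proportion_test(conv_c, n_c, conv_t, n_t, z_crit=1.96):
    """Two-sided z-test for a difference in conversion rates, plus a Wald interval.

    Returns (absolute_diff, p_value, (ci_low, ci_high)), all as fractions.
    z_crit=1.96 corresponds to a 95% confidence interval.
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error under the null hypothesis of no difference.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_null if se_null > 0 else 0.0
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Unpooled standard error for the interval around the observed difference.
    se_diff = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return diff, p_value, (diff - z_crit * se_diff, diff + z_crit * se_diff)


# Hypothetical counts, chosen only for illustration.
diff, p_value, (low, high) = two_proportion_test(1000, 10_000, 1100, 10_000)
print(f"Absolute lift: {diff * 100:.2f} pp, p={p_value:.3f}, "
      f"95% CI [{low * 100:.2f}, {high * 100:.2f}] pp")
```

Note that the interval here is for the absolute difference in percentage points; an interval for the relative lift needs a different calculation, so say which one you are reporting.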

Negative Results and Null Findings

Communicating null or negative results requires careful language: "The experiment did not support our hypothesis — we observed no significant difference between variants (p=0.42). This suggests that button position is not the primary driver of the conversion drop. We recommend investigating [alternative hypothesis] next." Null results have value — state clearly what was learned and what should be tried next.

Section 2: Model Evaluation Vocabulary

Model evaluation generates a dense vocabulary of metrics, benchmarks, and evaluation frameworks. Being able to discuss these fluently — explaining them to non-technical stakeholders, comparing model versions in team reviews, and writing model cards for deployed models — is a core data science communication skill.

Classification Metrics Language

The precision-recall trade-off is one of the most commonly discussed topics in ML evaluation: "Precision measures how many of our positive predictions were correct. Recall measures how many of the actual positives we correctly identified. In this use case, detecting fraudulent transactions, we prioritise recall over precision because a missed fraud is more costly than a false alarm." The F1 score "balances precision and recall" (it is their harmonic mean). AUC-ROC "measures discriminative ability across all classification thresholds." Always pair a metric with a plain-language interpretation and tie it to the business objective: "An AUC of 0.87 means the model correctly ranks a random positive example above a random negative example 87% of the time."
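The definitions above can be made concrete with a small sketch. The helper names `precision_recall_f1` and `auc_by_pairs` and the toy labels and scores are assumptions made for illustration; in practice a library such as scikit-learn would normally be used. The AUC helper follows the pairwise-ranking interpretation quoted above, counting ties as half a win.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs the scores rank correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 3 fraud cases, 5 legitimate transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # classify at a 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"AUC={auc_by_pairs(y_true, scores):.2f}")
```

Lowering the threshold in the `y_pred` line trades precision for recall, which is the trade-off the fraud example describes.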

The standard benchmark reporting pattern: "This model achieves an F1 score of 0.84 on the test set, a 6-point improvement over the baseline (0.78). On the held-out evaluation set matching production distribution, F1 is 0.81." Key conventions: always specify which set (train/validation/test/held-out); always compare to a baseline; note distribution shifts between evaluation sets and production.

Regression and Ranking Metrics

Regression: "The model achieves an RMSE of 2.3 on the test set, representing a 15% improvement over the previous model (RMSE 2.7). The MAE is 1.8, indicating that on average the model's predictions are off by 1.8 units." For ranking: "The model achieves an NDCG@10 of 0.73, compared to 0.68 for the baseline — a relative improvement of 7.4%."
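For reference, the three metrics named above can be computed as in the following sketch. The function names are illustrative, the sample values are hypothetical, and the NDCG shown is the linear-gain variant (relevance divided by a log2 rank discount); other variants exist, so state which one you report.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more than small ones."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of an error, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k items, in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical values for illustration.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(f"RMSE={rmse(y_true, y_pred):.2f}  MAE={mae(y_true, y_pred):.2f}")

ranked_relevance = [3, 2, 0, 1]  # relevance of results in the order the model ranked them
print(f"NDCG@3={ndcg_at_k(ranked_relevance, 3):.2f}")
```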

Model Comparison and Ablation Language

When comparing models: "Model A outperforms Model B on accuracy (93% vs. 89%) but has 3x higher inference latency (85ms vs. 28ms). Given our real-time serving requirements, Model B is the better production choice." Ablation study language: "We conducted an ablation study to understand the contribution of each feature group. Removing the temporal features degraded performance by 4.2 F1 points, confirming their importance. Removing the demographic features had a negligible effect (0.3 F1 points), suggesting we can simplify the feature set without significant performance loss."
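The ablation procedure described above can be sketched as a simple loop: score the full feature set, then re-score with each group removed and report the drop. Everything here is hypothetical, including `ablation_report`, the feature names, and `toy_evaluate`, which stands in for "train a model on these features and return its F1" so that the example runs without any ML library.

```python
def ablation_report(evaluate, feature_groups):
    """Score the full feature set, then re-score with each feature group removed.

    `evaluate` takes a list of feature names and returns a score (higher is better).
    Returns (full_score, {group_name: score_drop_when_removed}).
    """
    all_features = [f for group in feature_groups.values() for f in group]
    full_score = evaluate(all_features)
    drops = {}
    for name, group in feature_groups.items():
        kept = [f for f in all_features if f not in group]
        drops[name] = full_score - evaluate(kept)  # positive: removing the group hurts
    return full_score, drops


# Hypothetical stand-in for a real train-and-evaluate step:
# each feature simply adds a fixed amount to a base score.
TOY_CONTRIBUTION = {"hour_of_day": 0.03, "days_since_last_order": 0.02, "age_band": 0.003}


def toy_evaluate(features):
    return 0.75 + sum(TOY_CONTRIBUTION[f] for f in features)


groups = {
    "temporal": ["hour_of_day", "days_since_last_order"],
    "demographic": ["age_band"],
}
full, drops = ablation_report(toy_evaluate, groups)
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Removing {name} features costs {drop * 100:.1f} F1 points")
```

With a real model, each call to `evaluate` means a full retrain, so ablations are usually run per feature group rather than per individual feature.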

Section 3: Data Storytelling & Presentation

Data storytelling is the ability to take numbers and findings and present them as a coherent narrative that drives understanding and action. This is perhaps the most valued communication skill in data science, and it is almost entirely a language and structure challenge: the data is often not the problem; the framing is.

The Data Story Structure

A data story should follow the narrative arc of: Situation (what is the current state?), Complication (what is the problem or question?), Resolution (what does the data reveal?), and Recommendation (what should we do?). This structure is familiar to business audiences and ensures that the data serves the decision, rather than being presented as an end in itself. "Our user retention has been declining for three quarters. We hypothesised that onboarding friction was a key driver. Analysis of the first-week activity data supports this hypothesis — users who complete three or more core actions in week one have 4x higher 90-day retention. My recommendation is to redesign the onboarding flow to guide users to these three actions."

Chart and Visualisation Description Language

When presenting data visualisations, provide verbal narration that highlights the key insight, not just the title: "This chart shows monthly active users over the past 18 months. The key takeaway is the sharp decline in September that coincides with the pricing change. What's interesting here is that the absolute number of users fell, but the revenue per user actually increased — suggesting we retained higher-value users." Guide the audience to the insight rather than expecting them to find it themselves.

Section 4: Stakeholder Reporting Language

Reporting data science and ML work to non-technical stakeholders — product managers, business leaders, and executives — requires translating technical concepts into business language without oversimplifying to the point of inaccuracy. This translation is a skill that separates effective data scientists from brilliant-but-isolated ones.

Translating Technical Metrics to Business Impact

The key technique is to always anchor technical metrics to business outcomes. Instead of "the model achieves 91% recall," say "the model correctly identifies 91% of fraudulent transactions before they are processed, reducing fraud losses by an estimated £2.3M annually." Instead of "we reduced latency from 120ms to 45ms," say "the recommendation response time is now more than 2.5x faster, and user sessions where recommendations appear within 50ms show 18% higher click-through rates."

Uncertainty language for stakeholders: "Our model predicts Q3 churn rate will be approximately 8.4%, with a confidence interval of ±1.2 percentage points. This means we're quite confident the actual churn will fall between 7.2% and 9.6%." Avoid implying false precision while still communicating a point estimate. "Based on current data, the expected value of deploying this model is approximately £400k in retained revenue. This estimate has significant uncertainty and should be treated as directionally correct rather than precise."

Explaining Model Limitations

Proactively communicating model limitations builds trust and prevents misuse: "This model performs well for users in the UK and EU markets, but we have insufficient training data for the APAC market — predictions for those users are less reliable." / "The model was trained on data from 2022-2024 — if market conditions change significantly, we'll need to retrain." / "This model is designed for [use case X] — using it for [use case Y] would be outside its validated scope."

Section 5: Reading & Discussing ML Papers

Reading ML research papers efficiently is a competitive skill for data scientists and ML engineers. The papers are written in a highly specialised academic English with specific structural and linguistic conventions. Once you understand these conventions, papers become much faster to parse.

ML Paper Structure and Language

ML papers follow a standard structure: Abstract (summary of contribution, method, and results — typically one paragraph), Introduction (motivation, problem statement, summary of approach, and contributions), Related Work (contextualising the work against prior art), Method/Approach (the technical description of what was done), Experiments (evaluation setup and results), and Conclusion. The "contributions" section — often bulleted in the introduction — lists what is new: "The main contributions of this work are: (1) we propose a novel attention mechanism that..., (2) we demonstrate state-of-the-art results on X benchmark..., (3) we provide an ablation study showing..."

Common ML paper phrases: "We propose..." / "We introduce..." / "We demonstrate that..." / "Concurrent work by [authors] independently proposes..." / "Our method outperforms [baseline] by X% on [benchmark]." / "To the best of our knowledge, this is the first work to..." / "We leave [limitation] for future work."

Discussing Papers with Your Team

When presenting a paper to your team, structure it as: what problem it solves, how it solves it, what results it achieves, and what is relevant for your work. "This paper from Google Brain proposes a new approach to [problem]. The key insight is [idea]. They show on [benchmark] that this outperforms the previous state of the art by 8%. The part most relevant for us is the [specific technique] — I think we could adapt it for [our use case]."

Section 6: Feature Engineering Vocabulary

Feature engineering — the process of creating informative input variables for ML models from raw data — has a specific vocabulary for describing features, discussing their properties, and communicating about the feature development process.

Feature Description Language

When documenting or discussing features, describe: what the feature measures, how it is computed, what time window it covers, and what signal you expect it to capture. "user_7day_purchase_count — the count of completed purchases by the user in the 7 days preceding the prediction date. This feature is designed to capture recent purchase frequency as a signal for propensity to buy." Good feature documentation is precise about the time window (7 days preceding what?), the aggregation (count, sum, mean, max), and the hypothesis (why do you expect this to be predictive?).
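To show how much precision the documentation has to carry, here is one possible reading of `user_7day_purchase_count` as code. The window boundaries are an assumption: this sketch counts purchases from exactly 7 days before the prediction time up to, but not including, the prediction time itself. The sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def user_7day_purchase_count(purchase_times, prediction_time, window_days=7):
    """Count completed purchases in the `window_days` before `prediction_time`.

    The window is [prediction_time - window_days, prediction_time): purchases at
    or after the prediction time are excluded so no future information leaks in.
    """
    window_start = prediction_time - timedelta(days=window_days)
    return sum(1 for ts in purchase_times if window_start <= ts < prediction_time)


# Hypothetical purchase history for one user.
purchases = [
    datetime(2024, 4, 1, 9),    # before the window
    datetime(2024, 4, 5, 18),   # inside the window
    datetime(2024, 4, 9, 12),   # inside the window
    datetime(2024, 4, 10, 8),   # after the prediction time
]
print(user_7day_purchase_count(purchases, datetime(2024, 4, 10)))  # prints 2
```

Writing the boundaries into the docstring answers the "7 days preceding what?" question before a reviewer has to ask it.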

Feature Store and Data Contract Language

Feature stores are infrastructure for storing and serving pre-computed features. Vocabulary: "feature definition" (the specification of how a feature is computed), "feature freshness" (how up-to-date the feature values are), "point-in-time correctness" (ensuring that features used for training reflect only data available at the time of the prediction, to avoid target leakage), "backfilling features" (computing historical feature values for retraining). "We need to add this feature to the feature store — it has a 1-hour freshness requirement and needs to be backfilled for the past 12 months for retraining."
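Point-in-time correctness is easier to discuss with a concrete lookup in mind. The sketch below is not how any particular feature store works; it is a minimal, assumed illustration in which `feature_as_of` returns the most recent feature value recorded at or before a given time, so that a training row never sees a later value.

```python
from bisect import bisect_right


def feature_as_of(history, as_of):
    """Return the latest feature value recorded at or before `as_of`, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Timestamps can be any comparable type (integers, datetimes, ...).
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # number of entries with timestamp <= as_of
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical history of a purchase-count feature, keyed by day number.
history = [(1, 0), (5, 2), (9, 3)]

print(feature_as_of(history, 4))   # 0: the update on day 5 is not yet visible
print(feature_as_of(history, 5))   # 2
print(feature_as_of(history, 0))   # None: no value existed yet
```

A backfill is then a matter of running the same lookup for every historical prediction time rather than joining on the latest value.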

Feature Selection and Importance

"Feature importance" describes how much each feature contributes to model predictions. "This feature has high importance in the model — removing it degrades performance by 4.2 F1 points." "The user age feature shows high correlation with the target, but it also correlates strongly with user_signup_date — these features are collinear, which may cause instability in the model." "We need to remove the [feature] from the training data — it contains future information that would not be available at inference time, causing target leakage."

Section 7: ML in Production Language

Deploying and maintaining ML models in production requires vocabulary that bridges data science and platform engineering. ML engineers need to communicate about model serving, monitoring, retraining pipelines, and production incidents with both data scientists and software engineers.

Model Serving and Deployment Language

"Model serving" refers to the infrastructure that makes a trained model available for predictions. Key concepts: "inference latency" (time to return a prediction — "our model has a p99 inference latency of 120ms"), "throughput" (predictions per second), "batch inference" (running predictions on many inputs at once, typically offline), "real-time inference" (predictions computed on-demand for individual requests), "model versioning" (tracking which model version is deployed), and "A/B testing models in production" (serving two model versions to different user segments to compare performance).

Model Monitoring and Drift Language

Production ML models require monitoring for: "data drift" (the input data distribution has changed from what the model was trained on), "concept drift" (the relationship between features and the target has changed), "model degradation" (performance on the target metric is declining), and "prediction drift" (the distribution of model outputs has shifted). "The model's precision has degraded by 8 percentage points over the past month. Analysis of the input data shows significant data drift in the user_location feature — this is likely related to the geographic expansion to new markets." "We're seeing prediction drift — the model is now predicting 'high risk' for 40% of users, compared to 18% when it was deployed. This may indicate that the model is no longer calibrated to current user behaviour."
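One common way to put a number on a shift like this is the population stability index (PSI), which compares the share of each category at deployment with the share today. PSI is brought in here as an illustration and is not named in this guide; the function name, the small `eps` floor for empty categories, and the sample lists built from the 18% and 40% figures in the example are assumptions of this sketch.

```python
import math
from collections import Counter


def population_stability_index(baseline, current, eps=1e-6):
    """PSI between two samples of a categorical value (0 means identical shares).

    For each category: (current_share - baseline_share) * ln(current_share / baseline_share).
    Shares are floored at `eps` so a category missing from one sample does not divide by zero.
    """
    base_counts, cur_counts = Counter(baseline), Counter(current)
    psi = 0.0
    for category in set(baseline) | set(current):
        b = max(base_counts[category] / len(baseline), eps)
        c = max(cur_counts[category] / len(current), eps)
        psi += (c - b) * math.log(c / b)
    return psi


# Share of 'high risk' predictions: 18% at deployment vs. 40% now (per 100 users).
at_deployment = ["high risk"] * 18 + ["other"] * 82
today = ["high risk"] * 40 + ["other"] * 60

print(f"PSI = {population_stability_index(at_deployment, today):.2f}")
```

The same function can be applied to a binned input feature such as user_location to quantify data drift; teams typically agree their own alert threshold for the score.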

Most Useful Vocabulary & Phrases for Data Scientists & ML Engineers

statistically significant
'The result is statistically significant (p=0.03): we can reject the null hypothesis at the 5% level.'
confidence interval
'The 95% CI for the lift is [1.1%, 5.3%] — the true effect likely falls in this range.'
baseline
'The baseline model achieves 78% accuracy — our new model achieves 86%, a relative improvement of 10%.'
ablation study
'We conducted an ablation study removing each feature group — temporal features contributed the most to performance.'
data drift
'We detected data drift in the location feature — the model should be retrained on more recent data.'
target leakage
'This feature contains future information — using it in training would cause target leakage.'
precision-recall trade-off
'We tuned the classification threshold to prioritise recall — we accept more false positives to reduce missed fraud.'
point-in-time correctness
'The feature store enforces point-in-time correctness — training features cannot contain data from after the label date.'
feature importance
'SHAP values show that purchase_frequency is the top feature by importance.'
inference latency
'The model's p99 inference latency is 80ms — within our 100ms SLO for the recommendation service.'
model card
'Before deployment, I'll write a model card documenting the intended use, limitations, and evaluation results.'
holdout set
'We reserved 10% of data as a holdout set — it has not been used for any model selection decisions.'
overfitting
'The model is overfitting — training accuracy is 97% but validation accuracy is 72%.'
SOTA (state of the art)
'This approach achieves SOTA on the [benchmark] at the time of publication.'
this model achieves X on benchmark Y
'This model achieves an F1 of 0.91 on the SQuAD benchmark, surpassing the previous best by 2.3 points.'
directionally correct
'The estimate is directionally correct — treat it as an order-of-magnitude indicator, not a precise forecast.'
retraining pipeline
'The retraining pipeline runs weekly and automatically promotes the new model if validation metrics improve.'
feature freshness
'This feature has a 1-hour freshness requirement — the feature store updates it every 30 minutes.'
calibration
'The model is well-calibrated — a predicted probability of 70% corresponds to actual outcomes 70% of the time.'
transfer learning
'We used transfer learning from a pretrained BERT model — fine-tuning on our domain data improved F1 by 12 points.'
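
The "calibration" entry above can be checked with a simple table that groups predictions into probability bins and compares the average predicted probability with the observed outcome rate in each bin. This is a minimal sketch with equal-width bins; the function name and the sample data are hypothetical.

```python
def calibration_table(y_true, probs, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns a list of (mean_predicted_probability, observed_positive_rate, count)
    for each non-empty bin. A well-calibrated model has the first two values close.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Hypothetical predictions: four users scored 0.75, four users scored 0.25.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]

for mean_pred, observed, count in calibration_table(y_true, probs):
    print(f"predicted {mean_pred:.2f}, observed {observed:.2f} (n={count})")
```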

Recommended Learning Path for Data Scientists & ML Engineers

  1. Data Science & ML Vocabulary set

    Build your foundational vocabulary for ML concepts, statistical methods, and data science tooling.

  2. ML Language exercises

    Practise using ML vocabulary in context: experiment descriptions, model comparisons, and training pipeline communication.

  3. AI Model Evaluation Language

    Precision-recall language, benchmark reporting, the "this model achieves X on benchmark Y" pattern, and ablation study vocabulary.

  4. Numbers & Data Language exercises

    Presenting statistics, percentages, confidence intervals, and comparative figures clearly and precisely.

  5. Data Visualisation Language exercises

    Describing charts, narrating insights, and guiding audiences through data-driven presentations.

  6. Tech-to-Business Communication exercises

    Translating technical ML results into business language for product managers and executives.

  7. Data Engineering Language exercises

    Feature stores, data pipelines, data contracts, and the operational vocabulary of ML infrastructure.

  8. Research English exercises

    Reading ML papers efficiently, understanding academic paper structure, and synthesising research findings for your team.

  9. AI & ML Interview Questions

    Practice for data science and ML engineering interviews: model evaluation discussions, experiment design, and ML system design.

Exercise Sets for Data Scientists & ML Engineers

Practise the vocabulary and communication patterns covered in this guide with these focused exercise sets:

Vocabulary exercises

Collocations & interview preparation