5 collocation exercises on preparing data for machine learning.
1 / 5
Before training, you ___ the data.
You clean the data — removing duplicates, fixing errors and handling missing values so the model trains on quality input. Clean is the standard collocation (data cleaning, clean dataset). Wash off, scrub up and polish out are not idiomatic in ML. Data cleaning is often the most time-consuming part of a project, because models are only as good as the data they learn from.
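To make the three cleaning steps concrete, here is a minimal sketch in plain Python. The records, the field names and the choice of median imputation are assumptions made for illustration; real projects pick rules that suit their own data.

```python
from statistics import median

# Hypothetical toy records; the field names are illustrative only.
raw_rows = [
    {"id": 1, "age": 34, "city": "Leeds"},
    {"id": 1, "age": 34, "city": "Leeds"},    # exact duplicate
    {"id": 2, "age": None, "city": "york "},  # missing age, untidy text
    {"id": 3, "age": -5, "city": "Bath"},     # impossible value
    {"id": 4, "age": 52, "city": "Bath"},
]

def clean(rows):
    """Remove duplicates, fix obvious errors and fill missing values."""
    # 1. Remove exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not mutated

    # 2. Fix errors: tidy text and treat impossible ages as missing.
    for row in unique:
        row["city"] = row["city"].strip().title()
        if row["age"] is not None and not 0 <= row["age"] <= 120:
            row["age"] = None

    # 3. Handle missing values: fill with the median of the valid ages
    #    (an assumed strategy; dropping the row is another option).
    fill = median(r["age"] for r in unique if r["age"] is not None)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill
    return unique

cleaned = clean(raw_rows)
```

Here the five raw rows become four clean ones, and the two unusable ages are replaced by the median of the valid ones.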
2 / 5
To create useful inputs, you ___ features.
You engineer features — transforming raw data into informative inputs (ratios, aggregates, encodings) that help the model learn. Feature engineering is core ML vocabulary, and the verb is engineer. Craft off, forge up and build out are not the collocation. Well-engineered features often improve performance more than swapping algorithms, especially with classical models on tabular data.
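The three kinds of feature named above (a ratio, an aggregate and an encoding) can be sketched in a few lines. The order records and feature names below are hypothetical, chosen only to show the idea.

```python
from collections import defaultdict

# Hypothetical raw records; the field names are illustrative only.
orders = [
    {"customer": "a", "price": 20.0, "quantity": 4, "channel": "web"},
    {"customer": "a", "price": 30.0, "quantity": 3, "channel": "shop"},
    {"customer": "b", "price": 10.0, "quantity": 5, "channel": "web"},
]

def engineer_features(rows):
    """Turn raw order rows into numeric model inputs."""
    # Aggregate: total spend per customer across all of their rows.
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["price"]

    # Encoding: one 0/1 column per category seen in the data.
    channels = sorted({row["channel"] for row in rows})

    features = []
    for row in rows:
        feats = {
            "unit_price": row["price"] / row["quantity"],  # ratio
            "customer_total": totals[row["customer"]],     # aggregate
        }
        for channel in channels:
            feats[f"channel_{channel}"] = int(row["channel"] == channel)
        features.append(feats)
    return features

features = engineer_features(orders)
```

Each output row holds only numbers, which is the form most classical models on tabular data expect.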
3 / 5
To evaluate fairly, you ___ a dataset.
You split a dataset, dividing it into training, validation and test sets so you can evaluate on data the model has not seen. Split is the fixed term (train/test split). Cleave off, part up and cut out are not idiomatic. Without a proper split, overfitting goes undetected and performance estimates are misleadingly optimistic; the test set must stay untouched until final evaluation to give an honest measure of generalisation.
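A three-way split can be sketched as follows. The 70/15/15 proportions, the function name and the fixed seed are assumptions for illustration, not a prescribed recipe.

```python
import random

def split_dataset(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into train, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test_set = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test_set

# 100 stand-in examples, represented here by their indices.
train, val, test_set = split_dataset(range(100))
```

Every example lands in exactly one of the three sets, so nothing the model trains on can reappear at evaluation time.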
4 / 5
For skewed targets, you ___ the classes.
You balance the classes — addressing class imbalance through resampling or weighting so a rare class is not ignored. Balance collocates with classes and dataset (class balancing). Level off, even up and equal out are not standard. Without balancing, a model can score high accuracy by always predicting the majority class while completely failing to detect the minority cases that matter most.
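Both routes mentioned above, weighting and resampling, can be sketched in plain Python. The weighting formula is the common inverse-frequency heuristic; the 95/5 toy labels and the helper names are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical skewed target: 95 negatives, 5 positives.
rows = [{"label": 0} for _ in range(95)] + [{"label": 1} for _ in range(5)]

def class_weights(rows, label_key="label"):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(row[label_key] for row in rows)
    return {c: len(rows) / (len(counts) * n) for c, n in counts.items()}

def oversample(rows, label_key="label", seed=0):
    """Randomly repeat rows of smaller classes until all classes match."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

weights = class_weights(rows)
balanced = oversample(rows)

# Always predicting the majority class is 95% accurate on the raw data,
# yet it finds none of the positives.
majority_accuracy = sum(row["label"] == 0 for row in rows) / len(rows)
```

Resampling is normally applied to the training set only, so that the validation and test sets keep the real class proportions.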
5 / 5
To expand training data artificially, you ___ it.
You augment data — generating additional training examples by transforming existing ones (rotating images, paraphrasing text). Data augmentation is the standard term, and the verb is augment. Pad off, swell up and bulk out are not idiomatic. Augmentation increases effective dataset size and diversity, improving generalisation and reducing overfitting, which is especially valuable when labelled data is scarce.
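For images, augmentation can be sketched with two simple transformations. The tiny nested-list "image", the helper names and the assumption that rotating or flipping does not change the label are all illustrative; in some tasks (a 6 versus a 9, for example) such transformations would change the label and should not be used.

```python
def rotate_90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2D grid of pixels left to right."""
    return [row[::-1] for row in image]

def augment(dataset):
    """Return the originals plus a rotated and a flipped copy of each.

    Assumes both transformations preserve the label.
    """
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((rotate_90(image), label))
        augmented.append((flip_horizontal(image), label))
    return augmented

# Hypothetical 2x2 "image" with a made-up label.
dataset = [([[1, 2], [3, 4]], "cat")]
augmented = augment(dataset)
```

One labelled example becomes three, without any new labelling work. Like resampling, augmentation is applied to the training set only.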