ML evaluation has its own precise vocabulary. This quiz covers the standard collocations for evaluating performance, benchmarking models, running ablations, and reporting metrics.
0 / 5 completed
1 / 5
Fill in: 'Before deploying to production, the team needed to ___ the model's performance on the held-out test set.'
We 'evaluate performance' — 'evaluate' is the ML-standard term for rigorously assessing a model against defined metrics on a test dataset. 'Test' is close but more informal; 'measure performance' focuses on the act of recording numbers; 'assess' is broader and less specific to the structured evaluation protocols used in ML.
2 / 5
Fill in: 'They decided to ___ the new transformer model against the previous LSTM baseline on all five tasks.'
We 'benchmark a model' — 'benchmark' specifically means measuring performance against a defined standard or set of tasks for systematic comparison. 'Compare' describes the result of benchmarking but is less precise; 'test' is informal; 'evaluate' is close but does not carry the standardised comparison implied by 'benchmark'.
3 / 5
Fill in: 'To understand which features mattered, the researchers decided to ___ ablations on each input component.'
We 'run ablations' — 'run ablations' is the ML research standard for systematically removing or disabling components to measure their contribution. 'Conduct ablations' is more formal; 'perform ablations' is also used; 'do ablations' is informal and less common in academic writing.
4 / 5
Fill in: 'The paper includes a table that allows readers to ___ baselines across five public datasets.'
We 'compare baselines' — 'compare' is the natural collocation for juxtaposing multiple models or methods to identify which performs best. 'Benchmark baselines' is redundant since a baseline is itself a reference point; 'contrast' implies highlighting differences rather than measuring performance; 'evaluate baselines' focuses on individual assessment rather than relative comparison.
5 / 5
Fill in: 'At the end of each experiment, the team must ___ metrics using the standardised reporting template.'
We 'report metrics' — 'report' is the standard collocation for formally communicating model performance results to peers or stakeholders. 'Record metrics' emphasises storage; 'log metrics' is used in experiment tracking systems; 'share metrics' is informal and implies optional, ad-hoc communication rather than a required deliverable.