GPU & Inference Scaling Vocabulary

Practise vocabulary for scaling ML inference: GPU utilisation, dynamic batching, autoscaling, cold starts, and throughput vs latency trade-offs.

1 / 5

Grouping several incoming requests so they run together on the GPU in a single forward pass is called ___.

2 / 5

The fraction of time the GPU is actively computing rather than idle is its ___.

3 / 5

The delay before a freshly scaled-up replica can serve traffic because it must load the model into GPU memory is the ___.

4 / 5

Automatically adding or removing inference replicas based on request load is ___.

5 / 5

Increasing batch size usually improves ___ but can worsen per-request ___.
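
A rough way to see the trade-off in the last item is a toy cost model. The sketch below assumes one forward pass costs a fixed overhead plus a small per-request cost. The function names and the millisecond figures are hypothetical, chosen only for illustration, and time spent waiting for a batch to fill is ignored.

```python
def batch_time_ms(batch_size, fixed_ms=8.0, per_item_ms=2.0):
    """Time for one forward pass under an assumed linear cost model.

    fixed_ms and per_item_ms are made-up illustrative values,
    not measurements from any real GPU or model.
    """
    return fixed_ms + per_item_ms * batch_size


def requests_per_second(batch_size, fixed_ms=8.0, per_item_ms=2.0):
    """Requests completed per second when batches run back to back."""
    seconds = batch_time_ms(batch_size, fixed_ms, per_item_ms) / 1000.0
    return batch_size / seconds


if __name__ == "__main__":
    # Every request in a batch waits for the whole pass to finish,
    # so batch_time_ms is also how long each request takes.
    for b in (1, 4, 16, 64):
        print(f"batch={b:3d}  time per request={batch_time_ms(b):6.1f} ms  "
              f"requests/s={requests_per_second(b):6.1f}")
```

Under these assumed numbers, larger batches spread the fixed overhead across more requests, so more requests finish per second, while each individual request waits longer for its pass to complete.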