1 / 5
Grouping several incoming requests so they run together on the GPU in a single forward pass is called ___.
Dynamic batching waits a few milliseconds to collect multiple requests and process them as one batch, raising GPU throughput at a small latency cost.
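The waiting behaviour described above can be sketched with a plain queue. This is a minimal illustration, not any particular serving framework's batcher; the function name `collect_batch` and the parameters `max_batch_size` and `max_wait_s` are assumptions made for the example.

```python
import queue
import time


def collect_batch(requests, max_batch_size=8, max_wait_s=0.005):
    """Block for the first request, then gather more until the batch
    is full or the wait window closes (hypothetical helper)."""
    batch = [requests.get()]  # always wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived within the window
    return batch


# Ten requests are already waiting, so each call fills up to max_batch_size.
pending = queue.Queue()
for request_id in range(10):
    pending.put(request_id)

first = collect_batch(pending, max_batch_size=4, max_wait_s=0.5)
second = collect_batch(pending, max_batch_size=4, max_wait_s=0.5)
```

The wait window is the latency cost: a request that arrives alone can be held for up to `max_wait_s` before it is processed.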
2 / 5
The fraction of time the GPU is actively computing rather than idle is its ___.
GPU utilisation measures how busy the device is; low utilisation under load usually signals a CPU/IO bottleneck or poor batching.
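As a rough worked example, utilisation over a sampling window is busy time divided by wall-clock time. The function name and the numbers below are illustrative assumptions, not measurements from a real device.

```python
def gpu_utilisation(busy_seconds, window_seconds):
    """Fraction of the sampling window the GPU spent computing."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return busy_seconds / window_seconds


# Hypothetical sample: 4.5 s of kernel execution in a 10 s window.
util = gpu_utilisation(busy_seconds=4.5, window_seconds=10.0)
print(f"GPU utilisation: {util:.0%}")
```

A value this low while requests are queuing would point to the bottlenecks named above rather than to a lack of GPU capacity.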
3 / 5
The delay before a freshly scaled-up replica can serve traffic because it must load the model into GPU memory is the ___.
Cold-start latency comes from starting the container and loading large model weights into GPU memory; keeping warm replicas avoids it.
4 / 5
Automatically adding or removing inference replicas based on request load is ___.
Autoscaling adjusts the replica count (often on GPU utilisation or queue depth) so you pay for capacity roughly proportional to demand.
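One common way to turn a metric into a replica count is a proportional rule: scale the current count by the ratio of the observed metric to its target, then round up and clamp. The Kubernetes Horizontal Pod Autoscaler documents a rule of this shape; the sketch below is a simplified stand-in, and the function name, bounds, and sample values are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling: replicas grow with metric / target.

    The metric could be average GPU utilisation (in percent) or
    queue depth per replica; both are illustrative choices here.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Four replicas at 90% utilisation against a 60% target: scale up.
scale_up = desired_replicas(4, current_metric=90, target_metric=60)
# The same four replicas at 15% utilisation: scale down.
scale_down = desired_replicas(4, current_metric=15, target_metric=60)
```

Clamping to `min_replicas` keeps at least one replica warm, and `max_replicas` caps spend during a traffic spike.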
5 / 5
Increasing batch size usually improves ___ but can worsen per-request ___.
Larger batches amortise overhead and lift throughput, but individual requests wait longer to be batched and processed, raising latency.
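The trade-off can be made concrete with a toy cost model in which each forward pass pays a fixed overhead plus a per-request cost. The 10 ms overhead and 2 ms per-request figures are made-up assumptions chosen only to show the trend, and the model ignores the time requests spend waiting for the batch to fill.

```python
def batch_stats(batch_size, overhead_ms=10.0, per_item_ms=2.0):
    """Return (throughput in requests/s, latency in ms) for one batch
    under a hypothetical linear cost model."""
    batch_ms = overhead_ms + per_item_ms * batch_size
    throughput = batch_size / (batch_ms / 1000.0)
    return throughput, batch_ms


stats = {size: batch_stats(size) for size in (1, 8, 32)}
for size, (rps, latency_ms) in stats.items():
    print(f"batch={size:>2}  throughput={rps:6.1f} req/s  "
          f"latency={latency_ms:5.1f} ms")
```

Under this model both numbers rise with batch size: the fixed overhead is shared by more requests, but every request in the batch waits for the whole batch to finish.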