ML Model Serving Vocabulary

Build fluency in the vocabulary of exposing a trained model behind a low-latency inference API.

1 / 5

At standup, a dev mentions a dedicated piece of infrastructure that exposes a trained model behind a low-latency inference API, distinct from the pipeline that trained the model in the first place. What is this infrastructure called?

2 / 5

During a design review, the team wants several incoming inference requests grouped together and sent to the model as one batch, rather than processed one at a time, to make better use of the underlying GPU. Which capability supports this?
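For reference, the grouping described in this question can be sketched as a small asyncio micro-batcher. This is a minimal illustration under stated assumptions, not any particular serving framework's implementation: `MicroBatcher`, `predict_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names, and the "model" is a stand-in function that doubles its inputs.

```python
import asyncio
from typing import Any, Callable, List


class MicroBatcher:
    """Hypothetical helper: collects single requests and runs them as one batch."""

    def __init__(self, predict_batch: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 8, max_wait_s: float = 0.01) -> None:
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def infer(self, x: Any) -> Any:
        """Called once per request; resolves when that request's batch has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self) -> None:
        """Worker loop: wait for a first request, then fill the batch until it is
        full or the wait budget is spent, and call the model once for the batch."""
        loop = asyncio.get_running_loop()
        while True:
            x, fut = await self.queue.get()
            inputs, futures = [x], [fut]
            deadline = loop.time() + self.max_wait_s
            while len(inputs) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    x, fut = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                inputs.append(x)
                futures.append(fut)
            try:
                outputs = self.predict_batch(inputs)
            except Exception as exc:
                # Propagate the failure to every caller in this batch.
                for f in futures:
                    f.set_exception(exc)
                continue
            for f, y in zip(futures, outputs):
                f.set_result(y)


async def demo():
    batch_sizes = []  # records how many requests each model call received

    def fake_predict_batch(batch):
        # Stand-in for a real batched forward pass (assumption for illustration).
        batch_sizes.append(len(batch))
        return [v * 2 for v in batch]

    batcher = MicroBatcher(fake_predict_batch, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.infer(i) for i in range(8)))
    worker.cancel()
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The two knobs trade latency for throughput: a larger `max_wait_s` gives batches more time to fill but delays the first request in each batch, while `max_batch_size` caps how much work a single model call takes on.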

3 / 5

In a code review, a dev notices the model is loaded once and kept resident in memory across many requests, rather than being reloaded from disk for every inference call. What serving pattern does this represent?

4 / 5

An incident report shows that inference latency spiked under production load because the model was reloaded from disk on every request instead of being kept warm in memory. What practice would prevent this?

5 / 5

During a PR review, a teammate asks why the team deploys a dedicated model-serving layer with request batching instead of just invoking the model directly inside the application server for each incoming request. What is the reasoning?