Practice English vocabulary for ML model serving: ONNX serialization, serving at scale, versioned endpoints, blue-green deployment, and gradual traffic shifting.
1 / 5
What does 'the model is serialized to ONNX format' mean?
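Serializing a model to ONNX means exporting its computation graph and learned weights to the Open Neural Network Exchange (ONNX) format, which is framework-agnostic and makes the model portable. A model trained in PyTorch can be exported to ONNX and then run with ONNX Runtime, which is often faster at inference than the training framework. The format also makes it easier to deploy to edge devices and specialized hardware accelerators.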
2 / 5
What does 'the serving infrastructure handles 10K QPS' mean?
QPS (queries per second) is a measure of throughput, so the phrase means the serving infrastructure can handle 10,000 inference requests per second. Meeting 10K QPS might require multiple model server replicas, GPU acceleration, request batching, or caching of frequent predictions. Load testing validates the target throughput before production launch.
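The arithmetic behind "multiple replicas" and "batching" can be sketched as a rough capacity estimate. The function names, the 30% default headroom, and the example batch numbers are illustrative assumptions, not measurements from any real system.

```python
import math


def batched_replica_qps(batch_size: int, batch_latency_s: float) -> float:
    """Throughput of one replica that finishes one batch every batch_latency_s seconds."""
    if batch_size <= 0 or batch_latency_s <= 0:
        raise ValueError("batch_size and batch_latency_s must be positive")
    return batch_size / batch_latency_s


def replicas_needed(target_qps: float, per_replica_qps: float, headroom: float = 0.3) -> int:
    """Replicas required to serve target_qps while keeping `headroom` of each replica spare."""
    if per_replica_qps <= 0:
        raise ValueError("per_replica_qps must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = per_replica_qps * (1 - headroom)
    return math.ceil(target_qps / usable_qps)


if __name__ == "__main__":
    # Hypothetical replica: batches of 32 requests, 40 ms per batch.
    per_replica = batched_replica_qps(batch_size=32, batch_latency_s=0.04)
    print(replicas_needed(10_000, per_replica, headroom=0.2))
```

An estimate like this only gives a starting point; a load test against real replicas is still what confirms the target.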
3 / 5
What does 'the model endpoint is versioned' mean?
A versioned endpoint exposes each model version at its own address, such as /v1 and /v2. This enables controlled migration: clients can test the new version at /v2 while remaining on /v1 in production. It prevents silent breaking changes when model behavior changes and allows rollback by switching the client back to the previous endpoint.
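A minimal sketch of versioned routing, assuming two toy models and the hypothetical paths /v1/predict and /v2/predict (none of these names come from a real serving framework):

```python
from typing import Callable, Dict, List

Model = Callable[[List[float]], float]


def model_v1(features: List[float]) -> float:
    """Stand-in for the current production model: sums the features."""
    return sum(features)


def model_v2(features: List[float]) -> float:
    """Stand-in for the new model: averages the features."""
    return sum(features) / len(features)


# Each version keeps its own path, so existing clients are never moved silently.
ROUTES: Dict[str, Model] = {
    "/v1/predict": model_v1,
    "/v2/predict": model_v2,
}


def handle(path: str, features: List[float]) -> dict:
    """Dispatch a request to the model registered for the requested version."""
    model = ROUTES.get(path)
    if model is None:
        return {"status": 404, "error": f"unknown endpoint: {path}"}
    version = path.split("/")[1]  # "/v1/predict" -> "v1"
    return {"status": 200, "version": version, "prediction": model(features)}
```

A client rolls back simply by calling /v1/predict again; nothing on the server has to change.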
4 / 5
What is 'blue-green deployment for ML models'?
In a blue-green deployment, two serving environments run side by side: blue serves the current model and green hosts the new one. The green model is validated with shadow traffic or a small percentage of real traffic. Once confidence is established, all traffic switches from blue to green at once through a load balancer configuration change, which avoids deployment downtime. Because blue stays available, switching back gives a near-instant rollback.
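The switch itself can be sketched as a single pointer flip. The class below is a toy in-process stand-in for a load balancer, and its name and methods are hypothetical:

```python
from typing import Callable, Tuple

Model = Callable[[float], float]


class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""

    def __init__(self, blue: Model, green: Model) -> None:
        self._models = {"blue": blue, "green": green}
        self.active = "blue"  # blue serves production until the switch

    def _idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def predict(self, x: float) -> float:
        """Serve the request from the active environment only."""
        return self._models[self.active](x)

    def shadow(self, x: float) -> Tuple[float, float]:
        """Return (served, shadow): the idle model runs on the same input for comparison."""
        return self._models[self.active](x), self._models[self._idle()](x)

    def switch(self) -> str:
        """Flip all traffic to the idle environment; calling it again rolls back."""
        self.active = self._idle()
        return self.active
```

In use, `router.shadow(x)` lets the two outputs be compared before `router.switch()` moves production traffic, and a second `router.switch()` is the rollback.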
5 / 5
What does 'traffic is gradually shifted to the new model' mean?
Gradual traffic shifting (often called a canary deployment) is one of the safer model rollout strategies. It exposes a small fraction of real users to the new model while monitoring key metrics (accuracy, latency, error rate), then raises that fraction step by step as long as the metrics hold. Problems at 1% traffic affect far fewer users than a full rollout.
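One common way to split traffic is to hash a stable user identifier into a bucket and compare it with the current canary percentage. The sketch below assumes that approach; the function names and the 10,000-bucket granularity are illustrative choices.

```python
import hashlib

BUCKETS = 10_000  # 0.01% granularity


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def choose_model(user_id: str, canary_percent: float) -> str:
    """Return "new" for users inside the canary share, otherwise "current"."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    threshold = round(canary_percent * BUCKETS / 100)
    return "new" if bucket(user_id) < threshold else "current"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 10, 50, 100):
        share = sum(choose_model(u, percent) == "new" for u in users) / len(users)
        print(f"{percent}% target, {share:.1%} observed")
```

Because each user's bucket is fixed, a user who sees the new model at 1% keeps seeing it at 10% and 50%, and lowering the percentage back to 0 returns everyone to the current model.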