ML Model Serving & Inference
Model serving vocabulary: inference endpoints, quantisation, Triton Inference Server, shadow deployment, champion/challenger patterns, prediction and data drift, and ONNX Runtime.
- Inference Endpoint /ˈɪnfərəns ˈɛndpɔɪnt/
An HTTP/gRPC endpoint that hosts a trained ML model and accepts prediction requests. Returns predictions synchronously (online inference) or processes batches asynchronously (batch inference). Deployment options: dedicated server, serverless function, model-as-a-service platform.
"Our fraud detection model is deployed as an inference endpoint behind our payment API. The endpoint accepts a transaction payload, runs the model, and returns a fraud probability score within 20ms. SLA: p99 latency under 50ms. The endpoint is horizontally scaled during peak hours; each replica loads the model artefact once at startup and reuses it across requests."
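A minimal sketch of an online inference endpoint, using only the Python standard library. The logistic-regression weights, the feature names `amount` and `is_foreign`, and the handler name are hypothetical stand-ins, not part of the example above; a real service would load a trained model artefact once at startup instead.

```python
import json
import math
from http.server import BaseHTTPRequestHandler

# Hypothetical stand-in for a trained model: fixed logistic-regression weights.
# Loaded once at import time and reused for every request.
MODEL = {"bias": -4.0, "weights": {"amount": 0.002, "is_foreign": 1.5}}


def predict_fraud(payload: dict) -> float:
    """Return a fraud probability in (0, 1) for one transaction payload."""
    z = MODEL["bias"]
    for feature, weight in MODEL["weights"].items():
        z += weight * float(payload.get(feature, 0.0))
    return 1.0 / (1.0 + math.exp(-z))


class PredictHandler(BaseHTTPRequestHandler):
    """Synchronous (online) inference: one JSON request in, one score out."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"fraud_probability": predict_fraud(payload)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8080), PredictHandler).serve_forever()` (from `http.server`) would start a single replica; horizontal scaling means running several such replicas behind a load balancer.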
- Model Quantisation /ˈmɒdəl kwɒntɪˈzeɪʃən/
Reducing the numerical precision of model weights (from FP32 to FP16 or INT8) to decrease memory usage and accelerate inference. FP16 halves weight memory relative to FP32; INT8 cuts it to roughly a quarter. Throughput gains depend on hardware and kernel support, and accuracy loss is small for many tasks.
"Our LLM was too large for GPU memory at FP32 (28GB). After INT8 quantisation with bitsandbytes, the model fits in a single 16GB GPU with 0.5% accuracy drop on our evaluation suite. Throughput improved 2.3x. For latency-sensitive paths, we use FP16 (better accuracy/speed balance than INT8 on modern GPUs)."
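The memory arithmetic can be illustrated with a small sketch of symmetric per-tensor INT8 quantisation. This is a deliberately simple scheme chosen for illustration; libraries such as bitsandbytes use more elaborate methods, and the weight values below are made up.

```python
from array import array


def quantise_int8(weights):
    """Symmetric per-tensor quantisation: floats -> int8 values plus one scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantise(q, scale):
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]


# Hypothetical weights stored as 4-byte floats (FP32).
weights_fp32 = array("f", [0.81, -0.40, 0.05, -1.27, 0.33, 0.99])
q, scale = quantise_int8(weights_fp32)
restored = dequantise(q, scale)

fp32_bytes = len(weights_fp32) * weights_fp32.itemsize  # 4 bytes per weight
int8_bytes = len(q) * q.itemsize                        # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights_fp32, restored))
```

Each weight shrinks from 4 bytes to 1, so by the same arithmetic a 28GB FP32 model needs about 7GB for its INT8 weights. The rounding error per weight is bounded by half the scale, which is why accuracy often degrades only slightly.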
- Triton Inference Server /ˈtraɪtən ˈɪnfərəns ˈsɜːvər/
NVIDIA's open-source model serving framework. Supports multiple backends (TensorRT, ONNX Runtime, PyTorch, TensorFlow). Provides dynamic batching, model ensembles, GPU/CPU scheduling, and concurrent model execution. Production-grade performance for GPU inference at scale.
"We migrated from Flask serving to Triton Inference Server. Benefits: dynamic batching groups requests arriving within a 5ms window into a single GPU batch, increasing GPU utilisation from 30% to 85%. Model ensembles run preprocessing and the main model as a pipeline within Triton — no network round trip between stages."
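The idea behind dynamic batching can be sketched as a small simulation. This is not Triton's scheduler or API: in Triton the behaviour is configured per model (for example via the `max_queue_delay_microseconds` setting under `dynamic_batching`). The function name, the arrival times, and the batch-size cap of 8 are assumptions for illustration.

```python
def dynamic_batches(arrivals_ms, window_ms=5.0, max_batch_size=8):
    """Group request arrival times (in ms) into batches.

    A batch opens when its first request arrives and closes when either
    the window has elapsed or the batch is full.
    """
    batches, current, opened_at = [], [], None
    for t in sorted(arrivals_ms):
        expired = opened_at is not None and t - opened_at > window_ms
        if current and (expired or len(current) >= max_batch_size):
            batches.append(current)
            current, opened_at = [], None
        if opened_at is None:
            opened_at = t
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times for seven requests.
arrivals = [0.0, 1.2, 3.9, 4.8, 9.5, 10.1, 30.0]
batches = dynamic_batches(arrivals)
```

Here seven requests become three GPU calls instead of seven. The trade-off is that the first request in each batch may wait up to the window length before it is processed.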
- Shadow Deployment (Model) /ˈʃædəʊ dɪˈplɔɪmənt/
Running a new model version in parallel with the production model: the new model receives the same requests and generates predictions, but the production model's predictions are used. The shadow model's predictions are logged for offline evaluation without affecting users.
"Before promoting the new recommendation model to production, we shadow-deployed it for two weeks. Production traffic was mirrored to both models; users saw the old model's recommendations. We compared offline: new model NDCG@10 = 0.54 vs. old = 0.47. Zero user impact, two weeks of real-traffic evaluation. Promoted after validation."
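A minimal sketch of the request path in a shadow deployment. The model functions, the in-memory log, and the function name `serve_with_shadow` are hypothetical; a production setup would usually run the shadow call off the request path (asynchronously or via mirrored traffic) so it adds no latency.

```python
import logging

logger = logging.getLogger("shadow")


def serve_with_shadow(request, production_model, shadow_model, shadow_log):
    """Return the production prediction; record the shadow one for offline use."""
    served = production_model(request)
    try:
        shadow_log.append(
            {"request": request, "production": served, "shadow": shadow_model(request)}
        )
    except Exception:
        # A failing shadow model must never affect the user-facing response.
        logger.exception("shadow model failed")
    return served


# Hypothetical stand-in models: each maps a user id to a ranked item list.
def old_model(user_id):
    return ["a", "b", "c"]


def new_model(user_id):
    return ["b", "a", "d"]


def broken_model(user_id):
    raise RuntimeError("bad artefact")


log = []
result = serve_with_shadow("user-1", old_model, new_model, log)
result_with_broken_shadow = serve_with_shadow("user-2", old_model, broken_model, log)
```

Users only ever receive the production model's output, even when the shadow model raises. The logged pairs are what an offline comparison (such as the NDCG@10 evaluation in the example) would be computed from.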
- Champion/Challenger Pattern /ˈtʃæmpiən ˈtʃælɪndʒər ˈpætən/
Running two model versions in production simultaneously: the champion serves most traffic, the challenger serves a small percentage. Both are monitored; if the challenger outperforms, it becomes the new champion. A formalised A/B test for model versions.
"We have champion/challenger running continuously for our ranking model. The champion serves 95% of requests; the challenger (new model version) serves 5%. After 7 days, if the challenger's click-through rate is statistically significantly higher, it automatically becomes the new champion. This creates a safe, continuous improvement loop."
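One possible sketch of the two moving parts: a deterministic 95/5 traffic split and a significance check on click-through rate. The hashing scheme, the one-sided two-proportion z-test, and the 0.05 significance level are assumptions chosen for illustration, not the method used in the example above.

```python
import hashlib
import math


def assign_model(user_id: str, challenger_share: float = 0.05) -> str:
    """Deterministically route a user so they always see the same model."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"


def challenger_wins(champ_clicks, champ_views, chall_clicks, chall_views, alpha=0.05):
    """One-sided two-proportion z-test: is the challenger's CTR higher?"""
    p_champ = champ_clicks / champ_views
    p_chall = chall_clicks / chall_views
    pooled = (champ_clicks + chall_clicks) / (champ_views + chall_views)
    se = math.sqrt(pooled * (1 - pooled) * (1 / champ_views + 1 / chall_views))
    if se == 0:
        return False
    z = (p_chall - p_champ) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under the null
    return p_value < alpha
```

Hashing the user id keeps each user on one model for the whole test, which avoids mixing experiences within a session. With made-up counts, `challenger_wins(9500, 190000, 600, 10000)` compares a 5% champion CTR against a 6% challenger CTR.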
- Prediction Drift / Data Drift /prɪˈdɪkʃən drɪft/
Prediction drift: the distribution of model outputs changes over time (e.g., fraud scores shifting higher). Data drift: the distribution of input features changes (e.g., user demographics shifting). Both signal that the model may be going stale and may need retraining.
"Three months after launch, our churn prediction model's average confidence score dropped from 0.72 to 0.61 (prediction drift). Investigation revealed data drift: a new mobile app changed user engagement patterns significantly. The model was trained on pre-app data. We triggered retraining with the last 90 days of data — scores normalised within a week."
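Drift is often quantified by comparing a reference sample with a recent one. The sketch below uses the Population Stability Index (PSI), one common choice among several; the bin count, the epsilon for empty bins, and the synthetic score samples are assumptions for illustration.

```python
import math


def psi(reference, current, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Population Stability Index between two samples of values in [lo, hi]."""

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical score samples: training-time reference vs. a recent window.
reference_scores = [0.1 * (i % 10) + 0.05 for i in range(1000)]  # spread evenly
shifted_scores = [s * 0.5 + 0.5 for s in reference_scores]       # pushed upwards

psi_same = psi(reference_scores, reference_scores)
psi_shifted = psi(reference_scores, shifted_scores)
```

The same function works for model outputs (prediction drift) and for a single numeric input feature (data drift). A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but any alerting threshold should be tuned to the model in question.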
- ONNX Runtime /ˈɒnɪks ˈrʌntaɪm/
A cross-platform, high-performance inference engine for ONNX format models. Supports hardware acceleration (CUDA, TensorRT, DirectML, CoreML). Models exported from frameworks such as PyTorch or TensorFlow to the interoperable ONNX format can be served by the same runtime.
"We train in PyTorch, export to ONNX, and serve with ONNX Runtime. Benefits: typically faster inference than eager-mode PyTorch (graph-level optimisations and less Python overhead), cross-platform deployment (same model runs on Linux with CUDA and Windows with DirectML), and simpler production containers (no PyTorch dependency in the serving image, just onnxruntime-gpu)."
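A hedged sketch of the serving side, assuming the `onnxruntime` package is installed and a model has already been exported to a file. The helper names, the provider preference order, and the model path are hypothetical; `onnxruntime` is imported lazily so the provider-selection logic can be read and run on its own.

```python
# Preferred execution providers, most specific first (an assumed ordering).
PREFERRED = ["CUDAExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available):
    """Keep the preferred providers that this machine actually offers."""
    return [p for p in PREFERRED if p in available]


def load_session(model_path):
    """Create an inference session for an exported ONNX model file."""
    import onnxruntime as ort  # requires the onnxruntime package

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


def predict(session, batch):
    """Run one batch through the model and return its first output."""
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: batch})[0]
```

The same code path then picks CUDA on a Linux GPU host, DirectML on Windows, and falls back to CPU elsewhere, which is the cross-platform property described above. `batch` is expected to be a NumPy array matching the model's input shape and dtype.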