Practice edge inference vocabulary: TensorFlow Lite, ONNX Runtime, on-device inference, edge latency, model quantization, and edge vs cloud inference trade-offs.
1 / 5
What does 'the model runs on-device without a cloud call' mean?
On-device (edge) inference runs the ML model directly on the device's CPU, GPU, or neural processing unit. This eliminates cloud round-trip latency, reduces bandwidth costs, enables offline operation, and addresses privacy concerns by keeping data local.
2 / 5
What is TensorFlow Lite?
TensorFlow Lite (TFLite) converts TensorFlow models to a compact .tflite format and provides an optimised inference runtime for edge devices. It runs efficiently on ARM CPUs, supports hardware acceleration through delegates such as the GPU delegate and the Android Neural Networks API (NNAPI), and targets microcontrollers via TF Lite Micro.
3 / 5
What does 'model quantization' do in edge inference?
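Quantization reduces weight precision from float32 to int8 (or even int4). Going from float32 to int8 shrinks the model by about 4x (int4 by about 8x) and usually speeds up inference on edge hardware, where integer arithmetic is faster and more power-efficient than floating point. With post-training quantization, accuracy loss is often around 1% or less, but it varies by model and task and should be measured.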
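To make the size and precision trade-off concrete, here is a minimal sketch of affine (asymmetric) int8 quantization in pure Python. The weight values and the helper names `quantize_int8` and `dequantize` are made up for illustration; real toolchains quantize whole tensors, often per channel, and may use a different scheme.

```python
# Minimal sketch of affine (asymmetric) int8 quantization in pure Python.
# The weight values below are hypothetical, chosen only for illustration.

def quantize_int8(values):
    """Map floats to int8 codes using one scale and zero point per tensor."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # include 0 so that 0.0 stays exact
    scale = (hi - lo) / 255.0 or 1.0      # 256 int8 levels cover the range
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(c - zero_point) * scale for c in codes]

weights = [-0.82, -0.31, 0.0, 0.27, 0.64, 1.10]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)

# Worst-case rounding error is about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# float32 uses 4 bytes per weight, int8 uses 1 byte per weight.
float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)

print("codes:", codes)
print("scale:", scale, "zero_point:", zero_point)
print("max error:", max_error)
print("size ratio:", float32_bytes / int8_bytes)
```

The 4x figure falls straight out of the byte counts. The accuracy impact depends on how much the rounding error matters to a given model.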
4 / 5
What is ONNX Runtime used for on edge devices?
ONNX Runtime is a high-performance inference engine for ONNX-format models. It supports execution providers for various hardware targets (CUDA and TensorRT for NVIDIA GPUs, including Jetson modules; ARM Compute Library for ARM CPUs; DirectML for Windows GPUs), making it a popular choice for cross-framework, cross-hardware edge deployment.
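One way to picture execution providers is as a priority list with a CPU fallback. The sketch below is a hypothetical helper, `pick_providers`, that filters a preference order against whatever a given build reports. The ordering in `PREFERRED` is an assumption for illustration, not a recommendation from this deck.

```python
# Sketch: choose ONNX Runtime execution providers in priority order.
# PREFERRED is an illustrative ordering (an assumption, not a recommendation).

PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "ACLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]

def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers this build offers, always ending with CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# Example with a hypothetical list, as a CUDA-enabled build might report it.
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))

# With onnxruntime installed, the helper could be used like this
# ("model.onnx" is a placeholder path):
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Keeping the CPU provider last means the same deployment code can still run on a device that lacks the accelerator, only more slowly.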
5 / 5
A team says 'we moved inference to the edge to reduce ___.' What word fits?
Latency fits: it is usually the primary driver for moving inference to the edge. Cloud inference adds a network round trip (potentially 50-500 ms) on top of model execution. Edge inference avoids that trip, so a suitably small model can respond within milliseconds, which is critical for real-time use cases like manufacturing defect detection, autonomous vehicles, or voice interfaces.
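The latency argument is simple arithmetic, sketched below. Only the 50-500 ms round-trip range comes from the card above. The deadline and both inference times are hypothetical numbers chosen for illustration.

```python
# Sketch: compare end-to-end latency for cloud vs edge inference.
# Only the 50-500 ms round-trip range comes from the text; every other
# number here is a hypothetical value for illustration.

def cloud_latency_ms(network_round_trip_ms, server_inference_ms):
    """Cloud path: the network round trip plus model execution on the server."""
    return network_round_trip_ms + server_inference_ms

def edge_latency_ms(device_inference_ms):
    """Edge path: model execution on the device, with no network hop."""
    return device_inference_ms

budget_ms = 100            # hypothetical real-time deadline
server_inference_ms = 5    # hypothetical fast datacenter accelerator
device_inference_ms = 30   # hypothetical slower on-device model

for rtt in (50, 200, 500):
    cloud = cloud_latency_ms(rtt, server_inference_ms)
    edge = edge_latency_ms(device_inference_ms)
    print(f"rtt={rtt} ms  cloud={cloud} ms  edge={edge} ms  "
          f"cloud_ok={cloud <= budget_ms}  edge_ok={edge <= budget_ms}")
```

Under these assumptions, the slower on-device model still meets the deadline at every round-trip time. The cloud path meets it only when the network is at its best.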