5 exercises — practice structuring strong English answers for Edge AI and TinyML engineering interviews: quantisation, pruning, knowledge distillation, ONNX Runtime, and mobile deployment.
How to structure Edge AI interview answers
Quantisation questions: name format → memory reduction factor → calibration requirement → hardware support → accuracy loss range
Pruning questions: unstructured vs. structured → hardware implication (latency benefit or not) → practical recommendation
Knowledge distillation questions: three KD variants → when KD beats quantisation → combined distil-then-quantise strategy
Deployment questions: name both paths → trace vs. script distinction → mobile optimisation → benchmarking on real hardware
1 / 5
The interviewer asks: "Explain the trade-offs between INT8 quantisation, INT4 quantisation, and FP16 for deploying a model on edge hardware." Which answer is most precise?
Option B is strongest. It structures the answer around three dimensions (memory footprint, compute cost, and accuracy degradation) with hardware dependency context for each. The FP16 section adds the critical microcontroller caveat (no FP16 hardware → software emulation → slower than FP32), which shows real edge deployment experience. The INT8 section explains the calibration pipeline (100-500 sample calibration dataset, scale/zero-point parameters) and distinguishes PTQ vs. QAT with specific accuracy loss ranges by architecture type (CNNs vs. transformers). The INT4 section correctly identifies that production INT4 is weight-only quantisation, not full INT4 (a common misconception), and names GPTQ, AWQ, and NF4 as the dominant formats. The GGUF closing ties everything to practical llama.cpp deployment on CPU-only edge hardware.
Edge quantisation vocabulary:
Post-Training Quantisation (PTQ): quantising a trained model using a calibration dataset to compute scale parameters.
Quantisation-Aware Training (QAT): inserting fake quantisation nodes during training to recover accuracy.
Scale and zero-point: the two parameters mapping floating-point values to integer representations.
Weight-only quantisation: quantising only weights (not activations) to reduce model size while preserving runtime accuracy.
GGUF: the file format for llama.cpp quantised model deployment.
Options C and D are accurate but lack the hardware dependency context and the PTQ pipeline detail.
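The scale and zero-point idea can be made concrete with a few lines of arithmetic. The following is a minimal sketch of asymmetric (unsigned) affine quantisation in plain Python; the function names `calibrate`, `quantise`, `dequantise`, and `memory_reduction` are illustrative and do not come from any particular framework, and real toolchains typically add per-channel scales and more robust range estimation.

```python
def calibrate(samples, num_bits=8):
    """Derive scale and zero-point from calibration values (asymmetric, unsigned)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(samples), 0.0)  # force the range to include 0.0 so it maps exactly
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant input
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantise(x, scale, zero_point, num_bits=8):
    """Map a float to an integer code, clipping to the representable range."""
    qmax = 2 ** num_bits - 1
    q = round(x / scale) + zero_point
    return max(0, min(qmax, q))


def dequantise(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


def memory_reduction(bits_from, bits_to):
    """Memory reduction factor for weights, ignoring scale/zero-point overhead."""
    return bits_from / bits_to


if __name__ == "__main__":
    calibration_values = [-1.0, 0.0, 0.5, 2.0]  # stand-in for real activations
    scale, zp = calibrate(calibration_values)
    for x in calibration_values:
        q = quantise(x, scale, zp)
        print(x, q, dequantise(q, scale, zp))
    print("FP32 -> INT8 factor:", memory_reduction(32, 8))
    print("FP32 -> INT4 factor:", memory_reduction(32, 4))
```

Values inside the calibrated range round-trip with an error of at most half a quantisation step; values outside it are clipped, which is why the choice of calibration data matters.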
2 / 5
The interviewer asks: "What is knowledge distillation and when is it more effective than quantisation alone for edge deployment?" Which answer is most complete?
Option B is strongest. It introduces the three KD variants (response-based, feature-based, and relation-based) with their use cases and differences, where most candidates know only response-based KD. The temperature scaling explanation is precise: dividing the logits by T = 5-10 softens the teacher's distribution, so the small probabilities it assigns to non-target classes become large enough to give the student a richer training signal (the intuition behind why soft labels work). The loss function formula, α·CE + (1-α)·KL divergence, is the form candidates should know for senior roles. The feature-based section names TinyBERT and DistilBERT as production examples and identifies the layer-mapping requirement as the practical constraint. The "when KD beats quantisation" section is structured around three specific scenarios with reasoning, not just "when accuracy matters." The combined approach closes correctly: distil first to get a hardware-friendly architecture, then quantise.
Knowledge distillation vocabulary:
Temperature scaling: dividing logits by T > 1 to produce softer probability distributions for student training.
Soft labels: probability distributions over classes produced by the teacher, containing more information than one-hot hard labels.
Feature-based KD: training the student to match intermediate feature maps or attention patterns of the teacher.
DistilBERT / TinyBERT: production NLP models produced by knowledge distillation from BERT.
Layer mapping: the strategy for matching student layers to teacher layers when they have different depths.
Options C and D list the variants but lack the loss function formulation and the specific accuracy-loss-budget framing.
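The response-based loss mentioned above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the helper names are hypothetical, the logits are made up, and the sketch follows the α·CE + (1-α)·KL form as written, without the T² factor that many implementations apply to the soft term.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 gives a softer distribution)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Hard-label loss: negative log-probability of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=5.0):
    """alpha * CE(student, hard label) + (1 - alpha) * KL(teacher_T || student_T)."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * hard + (1 - alpha) * soft


if __name__ == "__main__":
    teacher = [5.0, 1.0, 0.0]  # made-up logits; class 0 is the true class
    student = [2.0, 1.5, 0.5]
    print("teacher, T=1:", softmax(teacher, 1.0))
    print("teacher, T=5:", softmax(teacher, 5.0))
    print("loss:", distillation_loss(student, teacher, label=0))
```

Comparing the two printed teacher distributions shows the effect of temperature: at T=1 almost all probability sits on the top class, while at T=5 the other classes carry visible mass that the student can learn from.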
3 / 5
The interviewer asks: "How does structured pruning differ from unstructured pruning, and what are the hardware implications of each?" Which answer is most precise?
Option B is strongest. The key insight, that unstructured pruning does NOT yield latency benefits on standard hardware, is explained with its mechanism: the matrix dimensions are unchanged, so standard hardware still executes a dense matmul. Most candidates say "pruning makes the model faster" without this hardware nuance. The NVIDIA 2:4 Sparse Tensor Core context (the specific exception, where fine-grained sparsity becomes hardware-efficient because the zeros follow a fixed 2-in-4 pattern) shows genuine depth. The structured pruning explanation is concrete: 30% filter pruning → 30% fewer output channels → roughly 30% compute reduction in that layer, which is the physical intuition. The accuracy trade-off section correctly identifies that structured pruning is coarser, not better or worse overall. The Lottery Ticket Hypothesis reference is a recognisable research landmark for senior roles.
Pruning vocabulary:
Unstructured pruning: zeroing individual weights, creating sparse matrices without changing dimensions.
Structured pruning: removing entire filters, heads, or layers, producing smaller dense matrices.
NVIDIA 2:4 structured sparsity: a hardware-level sparsity pattern (at most 2 non-zero values in every group of 4 consecutive weights) supported by Sparse Tensor Cores on A100/H100.
Gradual Magnitude Pruning: iteratively pruning and retraining to recover accuracy.
Lottery Ticket Hypothesis: the observation that dense networks contain sparse subnetworks that can match full network accuracy when trained in isolation from their original initialisation.
Options C and D are accurate but lack the matrix dimension explanation and the 2:4 sparsity hardware exception.
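The matrix-dimension argument is easy to check on a toy weight matrix. The sketch below uses plain Python lists and hypothetical helper names; it treats each row as one output filter and counts the multiply-accumulates a dense kernel would perform, which is an assumption about how standard hardware executes the layer.

```python
def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude individual weights; the shape stays the same."""
    cells = sorted((abs(w), r, c) for r, row in enumerate(matrix) for c, w in enumerate(row))
    pruned = [row[:] for row in matrix]
    for _, r, c in cells[: int(len(cells) * sparsity)]:
        pruned[r][c] = 0.0
    return pruned


def structured_prune(matrix, sparsity):
    """Drop whole rows (output filters) with the smallest L1 norm; the result is smaller and dense."""
    n_drop = int(len(matrix) * sparsity)
    ranked = sorted(range(len(matrix)), key=lambda r: sum(abs(w) for w in matrix[r]))
    keep = sorted(ranked[n_drop:])  # preserve the original filter order
    return [matrix[r][:] for r in keep]


def prune_2_of_4(row):
    """Keep the 2 largest-magnitude weights in every group of 4 consecutive weights."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out


def dense_macs(matrix):
    """Multiply-accumulates per input vector for a dense kernel, which ignores zeros."""
    return len(matrix) * len(matrix[0])


if __name__ == "__main__":
    weights = [
        [0.9, -0.1, 0.4, 0.05],
        [0.02, 0.03, -0.01, 0.04],
        [-0.7, 0.6, 0.2, -0.3],
        [0.1, -0.8, 0.5, 0.25],
    ]
    print("dense MACs:", dense_macs(weights))
    print("unstructured 50%:", dense_macs(unstructured_prune(weights, 0.5)))
    print("structured 25%:", dense_macs(structured_prune(weights, 0.25)))
    print("2:4 pattern on row 0:", prune_2_of_4(weights[0]))
```

The unstructured result has half its weights zeroed yet the same dense cost, while the structured result is a genuinely smaller matrix. The 2:4 helper only produces the pattern; the speed-up depends on hardware that can exploit it.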
4 / 5
The interviewer asks: "You need to deploy a PyTorch model to an Android device running an ARM Cortex-A processor. Walk me through the deployment pipeline." Which answer is most complete?
Option B is strongest. It presents two viable paths rather than one, which is the correct answer because neither path is universally superior: the choice depends on the model's control flow and operator support. The TorchScript section explains the trace vs. script distinction (trace fails for control-flow-dependent models), which is a common production gotcha. The mobile_optimizer call (Conv+BN+ReLU fusion for ARM) shows knowledge of hardware-specific graph optimisation. The ONNX Runtime Mobile section introduces the custom build tool (reducing binary size from 1.5 MB to 300 KB by including only required operators), which is an important deployment consideration for mobile apps where APK size matters. The third option (TFLite path) is correctly dismissed because multi-step conversion introduces operator gaps. The benchmarking section adds thermal throttling as a production concern: a mobile-specific failure mode where the device throttles CPU frequency under sustained inference load.
Edge deployment vocabulary:
TorchScript: PyTorch's compilation format for production deployment (trace or script).
ONNX opset version: the operator set version for ONNX compatibility.
ORT custom build: building ONNX Runtime with only the required operators to minimise binary size.
Thermal throttling: CPU/GPU frequency reduction under sustained load to prevent overheating.
Android Lite Interpreter: the lightweight PyTorch Mobile runtime for Android.
Options C and D are accurate but lack the trace vs. script failure mode explanation and the ORT binary size reduction detail.
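The benchmarking step can be sketched as a small harness. This is a minimal, framework-agnostic illustration: `run_inference` is a hypothetical zero-argument callable standing in for whatever runtime call is being measured, and the "throttle ratio" (median latency of the last window divided by the median of the first) is an assumed heuristic for spotting sustained-load slowdown, not a standard metric. On a real device the loop would need to run long enough for the SoC to heat up.

```python
import statistics
import time


def benchmark(run_inference, warmup=10, iterations=200):
    """Time repeated calls after a warm-up phase; returns per-call latencies in ms."""
    for _ in range(warmup):
        run_inference()  # warm caches and lazy initialisation before measuring
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarise(latencies_ms, window=0.2):
    """Report p50/p95 and compare the last window of the run against the first."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    w = max(1, int(n * window))
    head = statistics.median(latencies_ms[:w])   # start of the run, device still cool
    tail = statistics.median(latencies_ms[-w:])  # end of the run, possibly throttled
    return {
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[min(n - 1, int(n * 0.95))],
        "throttle_ratio": tail / head if head > 0 else float("nan"),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the real inference call on the target device.
    def fake_inference():
        sum(i * i for i in range(20000))

    print(summarise(benchmark(fake_inference)))
```

A throttle ratio well above 1.0 on a sustained run suggests that early measurements overstate steady-state performance, which is the reason to benchmark on the real device rather than on a development machine.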
5 / 5
The interviewer asks: "How does ONNX Runtime achieve cross-platform performance, and what are its execution providers?" Which answer is most complete?
Option B is strongest. The two-layer architecture framing (graph optimisation layer + EP layer) is the correct mental model for how ORT achieves cross-platform performance; most candidates answer this question by listing EPs without explaining the architecture. The three graph optimisation levels (basic / extended / layout) with specific examples (attention fusion merging QKV+attention+softmax into one kernel) show knowledge of the most impactful performance gain. The MLAS fallback detail (Microsoft Linear Algebra Subprograms for ARM/x86) distinguishes candidates who have studied ORT internals. The EP fallback mechanism (graph partitioning: supported nodes go to the EP, unsupported nodes fall back to CPU) is an important nuance, because partial acceleration is possible and ORT handles it automatically. The ORT-Micro detail for microcontrollers shows awareness of the embedded/MCU end of the edge spectrum.
ONNX Runtime vocabulary:
Execution Provider (EP): a hardware-specific backend that executes ONNX graph nodes.
Attention fusion: merging separate attention operator nodes into a single fused kernel for efficiency.
MLAS: Microsoft Linear Algebra Subprograms, the CPU backend for ORT.
NNAPI: Android Neural Networks API for NPU/DSP acceleration.
ORT-Micro: ONNX Runtime for Microcontrollers, generating C-code for embedded targets.
Options C and D are accurate but lack the two-layer architecture explanation and the graph partitioning fallback mechanism.
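The fallback mechanism can be illustrated with a toy model. The sketch below is not ONNX Runtime's API or its actual partitioning algorithm: it assumes a linear chain of operators, a hypothetical `partition` function, and made-up supported-operator sets, purely to show how nodes an EP cannot handle end up on the CPU.

```python
def partition(nodes, providers, fallback="CPU"):
    """Assign each op to the first provider that supports it, else to the fallback.

    nodes:     op-type names in execution order (a linear chain, for simplicity)
    providers: list of (provider_name, set_of_supported_ops) in priority order
    Returns a list of (provider_name, [ops]) segments of consecutive ops.
    """
    segments = []
    for op in nodes:
        owner = next((name for name, ops in providers if op in ops), fallback)
        if segments and segments[-1][0] == owner:
            segments[-1][1].append(op)  # extend the current segment
        else:
            segments.append((owner, [op]))  # provider changed: start a new segment
    return segments


if __name__ == "__main__":
    graph = ["Conv", "Relu", "Conv", "NonMaxSuppression", "Softmax"]
    # Illustrative capability set only; real EP operator coverage differs.
    providers = [("NNAPI", {"Conv", "Relu", "Softmax"})]
    for owner, ops in partition(graph, providers):
        print(owner, ops)
```

Each change of provider between segments implies a hand-off between backends, so a model that alternates between supported and unsupported operators may see less benefit from an accelerator than the raw operator coverage suggests.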