5 exercises — practice structuring strong English answers for ML Compiler and Systems Engineer interviews: XLA compilation, TVM relay IR, MLIR dialects, operator fusion, and kernel auto-tuning.
Operator fusion questions: memory bandwidth bottleneck → which operators fuse → when fusion hurts (register spilling)
Auto-tuning questions: search space size → cost model (XGBoost) → evolutionary search → hardware transfer limitation
1 / 5
The interviewer asks: "How does XLA compile a computation graph to hardware? Walk me through the pipeline." Which answer is most precise?
Option B is strongest. It presents XLA as a five-stage pipeline and explains not just what each stage does but why it exists: HLO is hardware-agnostic, which enables portability; layout assignment minimises memory copies, which drives performance on different hardware; and buffer assignment uses live range analysis, a classic compiler technique applied to ML. The fusion section explains the key performance benefit: eliminating intermediate DRAM writes between producer-consumer elementwise operations. The caching section explains the first-run compilation latency that XLA users routinely observe but few understand.
XLA vocabulary:
HLO (High-Level Operations): XLA's hardware-agnostic IR with primitive operations.
LHLO: "late" HLO, the form of HLO after buffer assignment, in which operations act on concrete buffers.
Layout assignment: choosing a tensor memory layout (for example NCHW vs. NHWC) to minimise copies.
Live range analysis: determining when each buffer is last used to enable memory reuse.
PTX: NVIDIA's parallel thread execution assembly language.
Options C and D name the stages but lack the reasoning behind each stage and the DRAM write elimination insight for fusion.
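The live range idea behind buffer assignment can be illustrated with a minimal sketch. This is not XLA's buffer assignment algorithm: the `assign_buffers` helper, the value names, and the step numbers are all hypothetical, and the sketch ignores buffer sizes and alignment. It only shows how two values can share memory when their live ranges do not overlap.

```python
def assign_buffers(live_ranges):
    """Greedy slot assignment from live ranges (illustrative only).

    live_ranges maps a value name to (definition_step, last_use_step),
    both inclusive. Two values share a slot only if their ranges do not
    overlap. Buffer sizes are ignored in this sketch.
    """
    slot_busy_until = []  # slot_busy_until[i] = last step at which slot i is live
    assignment = {}
    # Visit values in order of definition.
    for name, (start, end) in sorted(live_ranges.items(), key=lambda kv: kv[1][0]):
        for slot, busy_until in enumerate(slot_busy_until):
            if busy_until < start:  # previous occupant is dead, so reuse the slot
                slot_busy_until[slot] = end
                assignment[name] = slot
                break
        else:
            # No free slot: allocate a new one.
            slot_busy_until.append(end)
            assignment[name] = len(slot_busy_until) - 1
    return assignment


# Hypothetical chain t0 -> t1 -> t2 -> t3, where each value is last read
# at the step that defines its consumer.
ranges = {"t0": (0, 1), "t1": (1, 2), "t2": (2, 3), "t3": (3, 4)}
slots = assign_buffers(ranges)
print(slots)
```

In this chain, four values fit in two slots, because a value's buffer becomes reusable as soon as its only consumer has finished reading it.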
2 / 5
The interviewer asks: "How does TVM's schedule search differ from XLA's approach to kernel optimisation?" Which answer is most complete?
Option B is strongest. The opening framing (XLA relies on curated kernels while TVM searches a larger space) is the correct mental model. The algorithm/schedule separation is TVM's architectural insight: the same computation, such as a matmul, can be executed with different loop structures optimised for different hardware. The AutoTVM section quantifies the search reduction (from 10^6 candidate configurations to 1000-2000 measurements), which makes the value of the cost model concrete. The Ansor/MetaSchedule evolution shows awareness of the field's trajectory. The trade-off section gives a concrete recommendation: XLA for rapid deployment on NVIDIA GPUs and TPUs, TVM for heterogeneous and edge hardware.
TVM vocabulary:
Relay IR: TVM's typed functional IR for whole-model computation.
Tensor Expression (TE): TVM's language for declaring a computation and scheduling how it executes.
AutoTVM: template-based auto-tuning with an XGBoost cost model.
Ansor: template-free auto-scheduler using sketch-based generation.
microTVM: TVM's microcontroller deployment target.
Options C and D name the components correctly but lack the algorithm/schedule separation concept and the measurement reduction quantification.
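The algorithm/schedule separation can be made concrete with a small sketch in plain Python. This is not TVM code and uses none of its APIs. The two hypothetical functions below compute the same matmul (the algorithm) with two different loop structures (the schedules); `tile` is the kind of parameter an auto-tuner would search over.

```python
def matmul_naive(a, b, n):
    """Schedule 1: plain i, j, k loop nest."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Schedule 2: the same computation with every loop split into tiles."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Inner loops cover one tile; min() handles ragged edges
                # when tile does not divide n.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


n = 4
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]

reference = matmul_naive(a, b, n)
tiled_2 = matmul_tiled(a, b, n, tile=2)
tiled_3 = matmul_tiled(a, b, n, tile=3)
print(reference == tiled_2 == tiled_3)
```

The results are identical for every tile size; only the memory access order changes, which is why the schedule can be tuned per target without touching the algorithm.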
3 / 5
The interviewer asks: "What is MLIR and why did the compiler community converge on it?" Which answer is most architectural?
Option B is strongest. It opens with two named problems before introducing MLIR, which is the correct problem-first framing. The N×M problem is quantified (F frontends × H hardware targets means F×H implementations), making it concrete. The abstraction mismatch section explains why a single IR is insufficient: each level needs different optimisations. The dialect section identifies linalg as the bridge dialect, a specific design insight that experienced ML compiler engineers recognise as key. Progressive lowering is explained precisely: a sequence of passes, each transforming one dialect to a lower-level one, with optimisations applied at the appropriate level. The key benefit, that reusable passes break the N×M problem, closes the argument cleanly.
MLIR vocabulary:
Dialect: a self-contained set of MLIR operations and types.
Progressive lowering: transforming IR through a sequence of dialect-lowering passes.
Linalg dialect: named tensor contractions serving as the bridge between high-level ops and loop nests.
Affine dialect: polyhedral loop nest representations for loop optimisation.
mhlo dialect: HLO ops as an MLIR dialect, used by JAX/XLA.
Options C and D name dialects correctly but lack the problem framing and the reusable-passes payoff.
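The N×M argument and the shape of progressive lowering can be sketched as a toy model. Nothing here is MLIR's API: the level labels (`"frontend"`, `"linalg"`, `"loops"`, `"llvm"`), the `PIPELINE` list, and both helper functions are hypothetical stand-ins chosen only to mirror the explanation above.

```python
def implementations_needed(frontends, targets, shared_ir):
    """Count frontend-to-target paths that must be written by hand."""
    if shared_ir:
        # Each frontend lowers into the shared IR once; each target lowers out once.
        return frontends + targets
    # Without a shared IR, every (frontend, target) pair needs its own path.
    return frontends * targets


# Toy pipeline: each pass lowers ops from one level to the next-lower level.
PIPELINE = [("frontend", "linalg"), ("linalg", "loops"), ("loops", "llvm")]


def lower(ops, pipeline=PIPELINE):
    """Apply each lowering pass in order and record the IR after every pass."""
    trace = [list(ops)]
    for src, dst in pipeline:
        ops = [(dst, name) if level == src else (level, name) for level, name in ops]
        trace.append(list(ops))
    return trace


trace = lower([("frontend", "matmul"), ("frontend", "relu")])
for step in trace:
    print(step)

print(implementations_needed(3, 4, shared_ir=False))
print(implementations_needed(3, 4, shared_ir=True))
```

A real pass would rewrite one operation into several lower-level ones and run optimisations at each level; the sketch keeps a one-to-one mapping so the staged structure stays visible.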
4 / 5
The interviewer asks: "Explain operator fusion in ML compilers. Which operators fuse well and when can fusion hurt performance?" Which answer is most precise?
Option B is strongest. The DRAM access count (10 for five unfused operations vs. 2 for the fused kernel) makes the benefit quantitative and concrete. The three fusion cases include FlashAttention as a non-trivial example: it reduces attention memory from O(n²) to O(n), which demonstrates awareness of current production techniques. The three cases where fusion hurts are genuine production pitfalls: shape mismatch causing register spilling (explained mechanistically: incompatible loop structures force extra values to be held in registers), the compute-bound dominance case (explaining why fusing into a compute-bound kernel causes issues), and the CUDA limit of 255 registers per thread (a specific hardware constraint that shows CUDA programming experience).
Operator fusion vocabulary:
Memory-bandwidth-bound: performance limited by DRAM bandwidth, not compute throughput.
Register spilling: values spilled to local memory when the available registers are exhausted.
FlashAttention: fused attention with online softmax, O(n) memory.
Accumulator register: register where matmul partial sums accumulate.
Nsight Systems: NVIDIA GPU profiling tool.
Options C and D are accurate but lack the DRAM access count and the register limit hardware detail.
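The 10 vs. 2 access count can be reproduced with a small counting sketch. It assumes a chain of five unary elementwise operations, counts one "DRAM access" per whole-tensor read or write, and treats Python locals as registers; the `run_unfused` and `run_fused` helpers and the particular operations are hypothetical.

```python
def run_unfused(x, ops):
    """One kernel per op: each reads its input tensor and writes its output."""
    accesses = 0
    for op in ops:
        accesses += 1                # read the whole input tensor
        x = [op(v) for v in x]
        accesses += 1                # write the whole output tensor
    return x, accesses


def run_fused(x, ops):
    """One kernel for the whole chain: intermediates never leave 'registers'."""
    accesses = 1                     # read the input tensor once
    out = []
    for v in x:
        for op in ops:
            v = op(v)                # intermediate value stays local
        out.append(v)
    accesses += 1                    # write the final tensor once
    return out, accesses


# Five hypothetical elementwise ops: add, scale, ReLU, subtract, square.
ops = [
    lambda v: v + 1,
    lambda v: v * 2,
    lambda v: max(v, 0),
    lambda v: v - 3,
    lambda v: v * v,
]
data = [-2, -1, 0, 1, 2, 3]

unfused_out, unfused_accesses = run_unfused(data, ops)
fused_out, fused_accesses = run_fused(data, ops)
print(unfused_accesses, fused_accesses, unfused_out == fused_out)
```

The fused loop holds one live intermediate per element here; in a real kernel, fusing operations with more live state is what pushes register pressure toward spilling.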
5 / 5
The interviewer asks: "How does kernel auto-tuning work and what are the limitations of cost-model-based search?" Which answer is most complete?
Option B is strongest. The search space quantification (more than 10^6 configurations for a single matmul) anchors the motivation for cost-model search. The three strategies are presented in order with the key metric: AutoTVM reduces measurements from 10^6 to 1000-2000, roughly a 500-1000× reduction. The four cost model limitations are all genuine production concerns: hardware transfer (re-tuning is needed for every GPU generation, a major operational cost), sparse early data (which explains why the first tuning iterations are slow), interaction effects (which explain why simple feature representations underperform), and measurement noise (±5-10% is a specific figure). The production strategy (pre-tune and cache common shapes, lazily tune uncommon ones) shows operational maturity.
Auto-tuning vocabulary:
Search space: all valid configuration vectors for a kernel.
Cost model: predicts latency from configuration features, trained on hardware measurements.
Evolutionary search: mutation and crossover of candidate configurations.
Shared memory bank conflict: serialisation penalty from multiple threads accessing the same memory bank.
Software pipeline: overlapping memory transfers with compute across iterations.
Options C and D are accurate but lack the search reduction quantification and the interaction effects explanation.
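The measure, predict, and mutate loop behind cost-model-guided tuning can be sketched as follows. This is a toy under stated assumptions, not AutoTVM: `measure` is a synthetic stand-in for timing a kernel on hardware, the cost model is a 1-nearest-neighbour lookup rather than XGBoost, the search uses mutation only (no crossover), and the tile choices and all names are hypothetical.

```python
import itertools
import random

TILE_CHOICES = [1, 2, 4, 8, 16, 32, 64]
SPACE = list(itertools.product(TILE_CHOICES, repeat=3))  # 343 configurations


def measure(config):
    """Synthetic latency; a real tuner would run the kernel on the device."""
    tx, ty, tk = config
    return abs(tx - 16) + abs(ty - 8) + abs(tk - 32) + 1.0


def predict(config, history):
    """Toy cost model: latency of the nearest already-measured config."""
    nearest = min(history, key=lambda c: sum(abs(p - q) for p, q in zip(c, config)))
    return history[nearest]


def mutate(config, rng):
    """Resample one randomly chosen knob."""
    knobs = list(config)
    knobs[rng.randrange(len(knobs))] = rng.choice(TILE_CHOICES)
    return tuple(knobs)


def tune(rounds=10, batch=4, candidates=32, seed=0):
    rng = random.Random(seed)
    # Seed the model with a few random measurements.
    history = {c: measure(c) for c in rng.sample(SPACE, batch)}
    for _ in range(rounds):
        parents = sorted(history, key=history.get)[:batch]
        pool = {mutate(rng.choice(parents), rng) for _ in range(candidates)}
        pool -= set(history)  # never re-measure a known config
        # Rank candidates with the cheap model; measure only the top few.
        ranked = sorted(pool, key=lambda c: predict(c, history))
        for c in ranked[:batch]:
            history[c] = measure(c)
    best = min(history, key=history.get)
    return best, history


best, history = tune()
print(best, history[best], len(history), "of", len(SPACE), "configs measured")
```

The budget is the point of the sketch: at most 44 of 343 configurations are ever measured, and the model decides which ones. The same structure also shows the hardware transfer limitation, since `history` holds measurements from a single device and has to be rebuilt for a new one.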