5 exercises — choose the best-structured answer to common AI infrastructure interview questions. Focus on GPU cluster design, distributed training sharding, checkpointing, KV cache management, and LLM serving architecture.
Structure for AI infrastructure interview answers
Give bandwidth numbers: NVLink 900 GB/s, NDR IB 400 Gbps — concrete specs show hands-on experience
Separate intra-node from inter-node: NVLink/NVSwitch within a node, InfiniBand between nodes
Quantify memory: calculate actual GB for the model size — interviewers want to see you can size a system
Cover failure modes: atomic rename for checkpoints, a rolling window in case the newest checkpoint is corrupt; failure awareness signals seniority
1 / 5
The interviewer asks: "Design the network topology for a GPU cluster training a 100B parameter model — what interconnect technologies do you use and why?" Which answer best covers GPU cluster networking?
Option B provides the complete topology picture: NVLink bandwidth numbers (H100 NVLink 4.0, 900 GB/s bidirectional), NVSwitch crossbar architecture, IB generation speeds (HDR 200 Gbps, NDR 400 Gbps), HCA-per-GPU to avoid CPU bottleneck, fat-tree and rail-optimised topologies at scale, GPUDirect RDMA for CPU-bypass, NCCL algorithm selection (tree vs ring based on message size), and RoCEv2 as the Ethernet alternative with its requirements. Options A, C, D name the correct components but provide no bandwidth numbers, topology design choices, or RDMA mechanics.
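To make the intra-node versus inter-node gap concrete, here is a minimal sketch of a bandwidth-only estimate for a ring all-reduce. It assumes the textbook ring cost of 2(N-1)/N times the message size per GPU, ignores latency and protocol overhead, and treats the 900 GB/s bidirectional NVLink figure as 450 GB/s per direction. The function and constant names are illustrative and are not NCCL APIs.

```python
GB = 1e9

# Assumption: 900 GB/s bidirectional NVLink is 450 GB/s in each direction.
NVLINK_BYTES_PER_S = 450 * GB
# NDR InfiniBand: 400 Gbps per port = 50 GB/s.
NDR_IB_BYTES_PER_S = 400e9 / 8


def ring_allreduce_seconds(message_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only lower bound for a ring all-reduce (latency ignored)."""
    if num_gpus < 2:
        return 0.0
    # Each GPU sends (and receives) 2 * (N - 1) / N of the message in total.
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / link_bytes_per_s


# Hypothetical workload: all-reduce 1 GB of gradients across 8 GPUs.
intra_node = ring_allreduce_seconds(1 * GB, 8, NVLINK_BYTES_PER_S)
inter_node = ring_allreduce_seconds(1 * GB, 8, NDR_IB_BYTES_PER_S)

print(f"NVLink ring: {intra_node * 1e3:.2f} ms")
print(f"NDR IB ring: {inter_node * 1e3:.2f} ms")
```

Under these assumptions the same collective takes roughly nine times longer over a single NDR link than over NVLink, which is the usual argument for keeping the most communication-heavy parallelism inside a node.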
2 / 5
The interviewer asks: "Compare PyTorch FSDP and DeepSpeed ZeRO (Stage 1, 2, 3) for training a 70B parameter model — what do they shard and what are the trade-offs?" Which answer best covers distributed training sharding?
Option B provides quantified analysis: memory savings per ZeRO stage (up to 4× for Stage 1, up to 8× for Stage 2, linear in the number of GPUs N for Stage 3), a concrete memory calculation for the 70B model (140 GB of FP16 weights, about 2.2 GB of weights per GPU with ZeRO-3 on 64 GPUs), FSDP all-gather/reduce-scatter mechanics with the discard pattern, an FSDP vs ZeRO-3 comparison on torch.compile integration and CPU/NVMe offloading, ZeRO-Infinity for trillion-parameter models, and activation checkpointing as an orthogonal technique. Options A and C state the facts without the memory calculations, the offload mechanics, or the "when to use which" guidance.
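The per-stage savings can be checked with a short calculation. This sketch assumes the accounting commonly used for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state (master weights, momentum, variance). Activations and temporary buffers are left out, and the function name is illustrative.

```python
def zero_model_state_gb(params_billion, num_gpus, stage):
    """Per-GPU model-state memory in GB for ZeRO stage 0 (none) through 3."""
    # 1 billion parameters at 1 byte each is 1 GB, so bytes/param maps to GB.
    weights = 2.0 * params_billion      # FP16 parameters
    grads = 2.0 * params_billion        # FP16 gradients
    optimizer = 12.0 * params_billion   # FP32 master weights + Adam moments
    if stage >= 1:
        optimizer /= num_gpus           # Stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus               # Stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus             # Stage 3 also shards parameters
    return weights + grads + optimizer


for stage in range(4):
    per_gpu = zero_model_state_gb(70, 64, stage)
    print(f"ZeRO stage {stage}: {per_gpu:.1f} GB per GPU")

# Weights alone under Stage 3: 140 GB / 64 GPUs.
print(f"FP16 weights only: {2.0 * 70 / 64:.2f} GB per GPU")
```

Under this accounting the 2.2 GB per GPU figure covers the FP16 weights only; with gradients and optimizer state included, Stage 3 on 64 GPUs comes to about 17.5 GB per GPU before activations.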
3 / 5
The interviewer asks: "What is your checkpointing strategy for training a 100B+ parameter model on a 1000-GPU cluster — what can go wrong and how do you mitigate it?" Which answer best covers checkpoint engineering?
Option B covers all seven dimensions: checkpoint size calculation (200 GB for the FP16 weights alone), asynchronous checkpointing mechanics (CPU RAM copy + background thread), distributed checkpoint storage (DCP API + Lustre/GPFS/S3 with per-rank shards), rolling window policy with corruption rationale, failure mode mitigation (atomic rename, Lustre stripe count), elastic training for fault tolerance (TorchElastic + NVIDIA Resiliency Lib for in-GPU recovery), and checkpoint validation. Options C and D each mention 2-3 correct ideas, but neither covers the atomic rename pattern, Lustre stripe tuning, TorchElastic fault tolerance, or checkpoint validation.
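The atomic rename pattern and checkpoint validation can be sketched with the standard library alone. This is a simplified illustration for a single shard on a POSIX-style filesystem, not the DCP API: the helper names are hypothetical, the payload is raw bytes, and object stores such as S3 have no rename, so they would need a different commit marker.

```python
import hashlib
import os
import tempfile


def atomic_write_checkpoint(path, payload):
    """Write payload so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push data to disk before publishing the name
        os.replace(tmp_path, path)  # atomic swap into the final name
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a partial file behind
        raise
    return hashlib.sha256(payload).hexdigest()


def verify_checkpoint(path, expected_sha256):
    """Re-read the file in chunks and compare its digest to the recorded one."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

A job killed mid-write leaves at most a stray `.tmp` file, so the previous checkpoint stays loadable. Storing the returned digest next to the shard lets a restart reject a corrupt file and fall back to an older checkpoint in the rolling window.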
4 / 5
The interviewer asks: "Explain KV cache management in LLM inference — what is the KV cache, why does it grow, and what eviction strategies exist?" Which answer best covers LLM inference cache engineering?
Option B provides the complete engineering picture: KV cache memory formula with a concrete Llama-3 70B calculation (2MB/token, 8GB for 4K context), growth factors, LRU eviction and its system-prompt problem, H2O attention-score-based eviction with the "heavy hitter" mechanism, PagedAttention with the OS virtual memory analogy and the fragmentation problem it solves, prefix caching with the system-prompt sharing use case, and disaggregated prefill/decode as a scaling strategy. Options A, C, D each mention 1-2 mechanisms without the memory calculation, fragmentation problem, or H2O algorithm.
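The memory formula behind the per-token figure is short enough to compute directly. This sketch assumes the usual accounting of one key and one value vector per layer per KV head, stored in FP16. The Llama-3 70B shape used here (80 layers, head dimension 128, 64 query heads, 8 KV heads under grouped-query attention) comes from the published model configuration; the function name is illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes stored per token: keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


MIB = 1024 ** 2
GIB = 1024 ** 3
CONTEXT = 4096

# Same model shape, with and without grouped-query attention.
configs = {
    "GQA, 8 KV heads": kv_bytes_per_token(80, 8, 128),
    "MHA-style, 64 KV heads": kv_bytes_per_token(80, 64, 128),
}

for name, per_token in configs.items():
    per_sequence = per_token * CONTEXT
    print(f"{name}: {per_token / MIB:.4f} MiB/token, "
          f"{per_sequence / GIB:.2f} GiB at {CONTEXT} tokens")
```

With 8 KV heads the formula gives about 0.31 MiB per token and 1.25 GiB at 4K context; with one KV head per query head it gives 2.5 MiB per token and 10 GiB. The quoted 2 MB/token and 8 GB figures sit close to the second case, so it is worth stating in an interview which attention layout and precision a number assumes.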
5 / 5
The interviewer asks: "Explain disaggregated prefill and decode in LLM serving — why are they separated, and how does this affect your infrastructure design?" Which answer best covers serving infrastructure architecture?
Option B covers all six dimensions: prefill (compute-bound, single forward pass) vs decode (memory-bandwidth-bound, sequential) characteristics, the disaggregation architecture (router + prefill pool + decode pool + KV transfer), provisioning guidance (compute vs HBM bandwidth, H100 3.35 TB/s number), KV cache transfer latency calculation (8GB / 400 Gbps = 160ms, pipelining as mitigation), production systems implementing PD disaggregation (Mooncake, TetriInfer, vLLM), and continuous batching as the simpler alternative. Options A, C, D each identify the prefill/decode distinction but don't cover the transfer latency, provisioning guidance, or production system examples.
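The transfer latency figure is plain unit arithmetic, shown here together with a rough model of the pipelining mitigation. The pipelined estimate is an idealization: it assumes each layer's KV is sent as soon as that layer finishes prefill and that the link keeps up with compute, so only the last layer's transfer is exposed. The function names and the 80-layer count are illustrative assumptions.

```python
def kv_transfer_ms(kv_gigabytes, link_gbps):
    """Time to move the whole KV cache over one link, in milliseconds."""
    # gigabytes * 8 = gigabits; gigabits / Gbps = seconds; * 1000 = ms.
    return kv_gigabytes * 8 * 1000 / link_gbps


def pipelined_exposed_ms(kv_gigabytes, link_gbps, num_layers):
    """Idealized exposed latency when transfer overlaps layer-by-layer prefill."""
    return kv_transfer_ms(kv_gigabytes, link_gbps) / num_layers


bulk = kv_transfer_ms(8, 400)
overlapped = pipelined_exposed_ms(8, 400, num_layers=80)

print(f"bulk transfer after prefill: {bulk:.0f} ms")
print(f"layer-wise pipelined (idealized): {overlapped:.0f} ms")
```

The bulk number reproduces the 160 ms quoted for 8 GB over 400 Gbps. The pipelined number is a best case and grows once the link, rather than prefill compute, becomes the bottleneck.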