Build fluency in the vocabulary of training a small student model to mimic a larger teacher model's outputs.
1 / 5
At standup, a dev mentions training a small, fast model to mimic the output probabilities of a much larger, more accurate model, so the small model captures most of the larger model's behavior at a fraction of its inference cost. What is this technique called?
Knowledge distillation trains a small, fast student model to mimic the output probabilities of a much larger, more accurate teacher model, so the student captures most of the teacher's behavior at a fraction of its size and inference cost. A hash collision is an unrelated hash-table concept in which two keys map to the same bucket. Because the student learns from the teacher's output distribution, knowledge distillation can deploy near-teacher-quality behavior on hardware too small to run the full teacher model.
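To make "mimic the output probabilities" concrete, here is a minimal sketch of one common way to score how far a student is from its teacher: the KL divergence between temperature-softened output distributions. The function names, the example logits, and the temperature value are illustrative assumptions, not details taken from the scenario above.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature softens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Assumption: KL divergence with a shared temperature is used here as one
    common choice of mimicry objective; other choices exist.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # prefers a different class

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pushes the student toward the teacher's behavior.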
2 / 5
During a design review, the team distills a large model into a small student model for mobile deployment, specifically because mimicking the teacher's output probabilities lets the student capture nuanced behavior that training on raw labels alone would miss. Which capability does this provide?
Knowledge distillation here provides near-teacher-quality behavior at a fraction of the inference cost. The student learns from the teacher's full output distribution, which carries a richer signal than hard labels alone, yet it stays small enough to fit mobile hardware. Running the full teacher model directly on mobile hardware is often infeasible given memory and latency constraints. Learning from the full distribution is why knowledge distillation is a standard way to compress a large model for constrained deployment.
3 / 5
In a code review, a dev notices a mobile-deployment pipeline trains a small model from scratch on raw hard labels alone, with no reference to a larger, more accurate teacher model's output probabilities, instead of using knowledge distillation to transfer the teacher's nuanced behavior. What does this represent?
This is a missed knowledge-distillation opportunity: training the small model against the teacher's output probabilities would transfer richer, more nuanced behavior than raw hard labels alone can provide. A cache eviction policy is an unrelated concept about which entries a cache discards. Training from scratch on labels only is the kind of missed quality gain a reviewer flags once a stronger teacher model already exists.
4 / 5
An incident report shows a mobile app's small on-device model performed noticeably worse than expected, because it was trained from scratch on raw hard labels alone with no reference to the far more accurate teacher model already available on the server. What practice would prevent this?
Applying knowledge distillation trains the small model against the teacher's output probabilities, transferring its nuanced behavior. Continuing to train the small model from scratch on raw hard labels alone, even though a far more accurate teacher model is available, is what caused the quality gap described in this incident. Distilling from an existing teacher is the standard fix once a stronger teacher model is already available.
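As a rough illustration of the fix, the sketch below blends the usual hard-label cross-entropy with a term that pulls the student toward the teacher's softened probabilities. The weighting parameter `alpha`, the temperature, and the function names are hypothetical choices for this example; the incident description does not specify a particular objective.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature softens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def combined_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Blend a hard-label term with a teacher-matching term.

    alpha = 1.0 reproduces training on hard labels alone;
    alpha = 0.0 uses only the teacher's softened distribution.
    (alpha and temperature are assumed, tunable hyperparameters.)
    """
    # Hard-label term: cross-entropy against the single correct class.
    hard_term = -math.log(softmax(student_logits)[label])
    # Soft term: KL(teacher || student) at the shared temperature.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    soft_term = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return alpha * hard_term + (1.0 - alpha) * soft_term

# Hypothetical logits; class 0 is the labeled class.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [0.0, 3.0, 0.0]
print(combined_loss(student_logits, teacher_logits, label=0))
```

Setting `alpha=1.0` corresponds to the labels-only training described in the incident, which makes it easy to compare both regimes with the same function.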
5 / 5
During a PR review, a teammate asks why the team reaches for knowledge distillation instead of simply training the small model directly on the same raw labeled dataset the teacher model used. What is the reasoning?
Knowledge distillation trains the small model against the teacher's full output probability distribution, which encodes richer relative-confidence information between classes than a single hard label can. Training directly on raw labels alone discards that signal and typically yields a less accurate small model. This is why knowledge distillation is a standard way to compress a large model while retaining as much of its quality as possible.
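A small sketch can show what "richer relative-confidence information" means in practice. It compares a made-up teacher distribution with the one-hot hard label it collapses to; the class names and probabilities are invented for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero means no information about other classes."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def to_hard_label(probs):
    """Collapse a distribution to a one-hot vector at its most likely class."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == top else 0.0 for i in range(len(probs))]

# Hypothetical classes and teacher output for a single image.
classes = ["cat", "dog", "truck"]
teacher_probs = [0.70, 0.25, 0.05]

hard = to_hard_label(teacher_probs)

# The soft target says "dog" is a far more plausible alternative than "truck";
# the hard label assigns both exactly zero and loses that distinction.
for name, soft_p, hard_p in zip(classes, teacher_probs, hard):
    print(f"{name}: soft={soft_p:.2f} hard={hard_p:.2f}")

print("soft-target entropy:", entropy(teacher_probs))
print("hard-label entropy:", entropy(hard))
```

The hard label has zero entropy, so it says nothing about how the wrong classes relate to each other, while the soft target preserves that ranking for the student to learn from.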