Knowledge Distillation Vocabulary

Build fluency in the vocabulary of training a small student model to mimic a larger teacher model's outputs.
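
As a rough illustration of what "mimicking the teacher's outputs" can look like in practice, the sketch below computes one common form of distillation loss for a single example: a KL-divergence term between temperature-softened teacher and student probabilities, blended with ordinary cross-entropy on the hard label. The function names and the `temperature` and `alpha` defaults are assumptions chosen for illustration; they do not come from this quiz or from any particular library.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities from logits, softened by dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating, for stability
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend of a soft-target term and a hard-label term (illustrative values).

    alpha weights the soft term; (1 - alpha) weights the hard-label term.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)

    # Soft term: KL(teacher || student) on the temperature-softened outputs.
    soft = sum(math.exp(lt) * (lt - ls)
               for lt, ls in zip(log_p_teacher, log_p_student))

    # Hard term: cross-entropy against the true label at temperature 1.
    hard = -log_softmax(student_logits)[hard_label]

    # The soft term is scaled by temperature squared so that its gradient
    # magnitude stays comparable as the temperature changes.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

# Hypothetical logits for a 3-class problem where class 2 is correct.
teacher = [0.5, 1.5, 4.0]
matching_student = [0.5, 1.5, 4.0]
confused_student = [4.0, 1.5, 0.5]

print(distillation_loss(matching_student, teacher, hard_label=2))
print(distillation_loss(confused_student, teacher, hard_label=2))
```

A student whose logits match the teacher's contributes nothing through the soft term, so its loss comes only from the hard-label term, while a student that disagrees with the teacher is penalized by both terms.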

1 / 5

At standup, a dev mentions training a small, fast model to mimic the output probabilities of a much larger, more accurate model, so the small model captures most of the larger model's behavior at a fraction of its inference cost. What is this technique called?

2 / 5

During a design review, the team distills a large model into a small student model for mobile deployment, because mimicking the teacher's output probabilities lets the student capture nuanced behavior that training on raw labels alone would miss. Which capability does distillation provide here?

3 / 5

In a code review, a dev notices that a mobile-deployment pipeline trains a small model from scratch on hard labels alone. The pipeline never references the output probabilities of a larger, more accurate teacher model, so it does not use knowledge distillation to transfer the teacher's nuanced behavior. What does this represent?

4 / 5

An incident report shows a mobile app's small on-device model performed noticeably worse than expected, because it was trained from scratch on raw hard labels alone with no reference to the far more accurate teacher model already available on the server. What practice would prevent this?

5 / 5

During a PR review, a teammate asks why the team reaches for knowledge distillation instead of simply training the small model directly on the same raw labeled dataset the teacher model used. What is the reasoning?