Build fluency in the vocabulary of a model built from sparsely activated expert sub-networks.
1 / 5
At standup, a dev mentions a model built from many separate 'expert' sub-networks, where only a small subset is activated for any given input rather than the whole model running every time. What is this architecture called?
A mixture-of-experts architecture is built from many separate expert sub-networks, activating only a small subset for a given input rather than running the whole model every time. A single dense network activates every parameter for every input, which costs far more compute per request at a comparable total parameter count. This sparse activation is what lets an MoE model grow its total parameter count without a proportional increase in per-token compute cost.
2 / 5
During a design review, the team wants a small network to decide, per input token, which handful of experts should actually process it. Which capability supports this?
A gating, or router, network decides per token which handful of experts should actually process it, typically selecting a small top-k subset out of the full expert pool. Sending every token to every expert would defeat the purpose of sparse activation and cost as much compute as a dense model of the same total size. The router is the component that makes the efficiency of a mixture-of-experts architecture work in practice.
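The top-k selection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's router: the function names `softmax` and `route_top_k`, the choice of k = 2, the renormalization of the selected weights, and the example scores are all assumptions made for this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs. The weights are the softmax
    probabilities of the chosen experts, renormalized to sum to 1
    (one common convention, assumed here).
    """
    probs = softmax(router_logits)
    # Indices of the k largest probabilities, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

# Hypothetical router scores for one token over a pool of 8 experts.
logits = [0.1, 2.3, -1.0, 0.4, 1.9, -0.5, 0.0, 0.2]
assignment = route_top_k(logits, k=2)
print(assignment)  # two (expert_index, weight) pairs; the other 6 experts stay idle
```

Only the experts returned by `route_top_k` would run for this token, and their outputs would be combined using the returned weights.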
3 / 5
In a code review, a dev notices an auxiliary loss term is added during training specifically to keep token traffic spread evenly across experts, rather than letting the router collapse onto a favorite few. What does this represent?
A load-balancing loss keeps token traffic spread evenly across experts during training, preventing the router from collapsing onto a small favored subset while the other experts go undertrained. Training with no such term risks exactly that collapse, wasting the capacity of the underused experts. This balancing loss is what keeps an MoE model's full expert pool genuinely useful rather than leaving most of it idle.
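One common way to write such a term, in the style of the auxiliary loss used by Switch Transformer, multiplies each expert's share of dispatched tokens by its average router probability and sums over experts. The sketch below assumes top-1 routing, and the function name and example batches are hypothetical; real training code would compute this on tensors and scale it by a small coefficient before adding it to the main loss.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary balancing term: num_experts * sum_i(frac_i * mean_prob_i).

    router_probs: one list of per-expert probabilities for each token.
    assignments:  the expert index each token was dispatched to (top-1).
    """
    num_tokens = len(assignments)
    # frac[i]: fraction of tokens actually dispatched to expert i.
    frac = [0.0] * num_experts
    for expert in assignments:
        frac[expert] += 1.0 / num_tokens
    # mean_prob[i]: average router probability placed on expert i.
    mean_prob = [
        sum(token_probs[i] for token_probs in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Balanced batch: 4 tokens spread evenly over 4 experts.
balanced = load_balancing_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)

# Collapsed batch: every token goes to expert 0.
collapsed = load_balancing_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)
print(balanced, collapsed)  # roughly 1.0 versus roughly 3.88
```

The term is smallest when traffic is uniform and grows toward the number of experts as the router collapses, so minimizing it pushes the router toward an even spread.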
4 / 5
An incident report shows a mixture-of-experts model's quality was noticeably uneven across topics because the router had collapsed onto a handful of experts during training, leaving most of the pool essentially untrained. What practice would prevent this?
Applying a load-balancing loss during training spreads token traffic across the full expert pool, rather than letting the router settle onto a small favored subset. Training with no such term risks exactly the uneven quality this incident describes, since most of the pool never receives enough training signal. A balancing term of this kind is a standard safeguard whenever a mixture-of-experts model is trained from scratch.
5 / 5
During a PR review, a teammate asks why the team adopts a mixture-of-experts architecture instead of just building one larger dense model with the same total parameter count. What is the reasoning?
A dense model activates every one of its parameters for every input, so its per-token compute cost scales directly with its total size. An MoE model activates only a small subset of experts per token, reaching a much larger total parameter count without that same proportional compute cost. The tradeoff is the added complexity of training a stable router and keeping traffic balanced across the expert pool.
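The gap between total and per-token active parameters can be made concrete with some back-of-the-envelope arithmetic. The sketch below is a simplification under stated assumptions: the sizes are invented for illustration, the router's own parameters are ignored, and per-token compute is treated as roughly proportional to the number of active parameters.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simple MoE.

    shared_params covers everything every token passes through
    (for example attention and embeddings); only top_k experts run per token.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Hypothetical sizes, chosen only for illustration.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    num_experts=8,
    top_k=2,
)
print(total, active)  # 42000000000 total, 12000000000 active per token

# A dense model with the same total size would activate all of it per token.
dense_active = total
print(dense_active / active)  # 3.5
```

Under these assumed numbers, the MoE model holds 42 billion parameters but touches only 12 billion per token, while a dense model of equal size would touch all 42 billion.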