Learn the vocabulary of computing every parameter's gradient in one backward pass through a neural network.
1 / 5
At standup, a dev mentions computing the gradient of a neural network's loss with respect to every parameter by applying the chain rule backward from the output layer to the input layer, reusing intermediate derivatives along the way. What is this algorithm called?
This is backpropagation. It computes the gradient of a neural network's loss with respect to every parameter by applying the chain rule backward from the output layer to the input layer, reusing each layer's intermediate derivatives so that all gradients come out of a single backward pass. A hash collision is an unrelated hash-table concept: two distinct keys hashing to the same bucket. Reusing intermediate derivatives is why backpropagation can compute gradients for a network with millions of parameters at a cost within a small constant factor of one forward pass.
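To make the reuse concrete, here is a minimal sketch on a hypothetical two-layer scalar network (h = tanh(w1*x + b1), y = w2*h + b2, squared-error loss). The network shape, the function names `forward` and `backward`, and the input values are assumptions chosen for illustration, not part of the quiz.

```python
import math

def forward(params, x, target):
    """Run the network once and keep the intermediates the backward pass needs."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1                   # hidden pre-activation
    h = math.tanh(z)                  # hidden activation
    y = w2 * h + b2                   # network output
    loss = 0.5 * (y - target) ** 2
    return loss, (z, h, y)

def backward(params, x, target, cache):
    """Apply the chain rule from the loss back to the input, one layer at a time."""
    w1, b1, w2, b2 = params
    z, h, y = cache
    dy = y - target                   # dL/dy
    dw2 = dy * h                      # output-layer gradients reuse dy
    db2 = dy
    dh = dy * w2                      # pass the derivative back to the hidden layer
    dz = dh * (1.0 - h * h)           # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x                      # hidden-layer gradients reuse dz
    db1 = dz
    return [dw1, db1, dw2, db2]

params = [0.5, -0.1, 1.5, 0.2]        # illustrative values only
loss, cache = forward(params, x=0.7, target=1.0)
grads = backward(params, x=0.7, target=1.0, cache=cache)
print(loss, grads)
```

Note that `dy` is computed once and feeds both output-layer gradients and the hidden layer, and `dz` is computed once and feeds both hidden-layer gradients. That sharing is the reuse the answer describes.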
2 / 5
During a design review, the team relies on backpropagation to train a deep network with many layers, specifically because reusing each layer's intermediate derivatives during the backward pass avoids recomputing a full forward pass for every single parameter. Which capability does this provide?
Backpropagation provides efficient gradient computation for every parameter in one backward pass. The chain rule lets each layer's derivative be reused by the layers before it, so the total cost stays within a small constant factor of a single forward pass instead of adding one extra pass per parameter. Recomputing a full forward pass for every parameter, as naive finite-difference estimation requires, makes the total cost grow as the parameter count times the cost of one forward pass. This derivative reuse is why backpropagation is the standard way to train deep neural networks.
3 / 5
In a code review, a dev notices a neural-network training feature estimates each parameter's gradient by perturbing that one parameter slightly and re-running a full forward pass to see how the loss changed, instead of reusing intermediate derivatives in one backward pass. What does this represent?
This is a missed backpropagation opportunity: applying the chain rule backward and reusing each layer's intermediate derivatives would compute every parameter's gradient in one backward pass instead of a separate forward pass per parameter. A cache eviction policy is an unrelated concept about which entries a cache discards when it is full. The perturb-and-rerun pattern is the kind of inefficiency a reviewer flags once the network has enough parameters for the per-parameter cost to matter.
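The perturb-and-rerun pattern can be sketched as follows, with a counter that shows how many forward passes it spends. The chain-of-tanh network, the helper names `make_counted_loss` and `finite_difference_grad`, and the weight values are all hypothetical, picked only to keep the example small.

```python
import math

def make_counted_loss():
    """Return a loss function plus a counter of how many forward passes it has run."""
    calls = {"forward": 0}

    def loss(weights, x=0.5, target=1.0):
        calls["forward"] += 1
        a = x
        for w in weights:             # one scalar tanh layer per weight
            a = math.tanh(w * a)
        return 0.5 * (a - target) ** 2

    return loss, calls

def finite_difference_grad(loss, weights, eps=1e-6):
    """Estimate each gradient by nudging one weight and re-running the forward pass."""
    base = loss(weights)              # one baseline forward pass
    grad = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        grad.append((loss(bumped) - base) / eps)   # one more forward pass per weight
    return grad

loss, calls = make_counted_loss()
weights = [0.3, -0.8, 1.2, 0.5, 0.9]  # illustrative values only
grad = finite_difference_grad(loss, weights)
print(len(weights), "weights needed", calls["forward"], "forward passes")
```

With forward differences the count is one baseline pass plus one pass per weight, so it grows linearly with the parameter count, and each of those passes itself gets more expensive as the network grows.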
4 / 5
An incident report shows a training job's cost scaled directly with the number of parameters in the network, because it estimated each parameter's gradient with a separate perturb-and-rerun forward pass instead of one backward pass reusing intermediate derivatives. What practice would prevent this?
Switching to backpropagation replaces the per-parameter forward passes with one shared backward pass that reuses intermediate derivatives. The incident was caused by estimating each parameter's gradient with its own perturb-and-rerun forward pass, so the cost grew with every parameter added to the network. A single backward pass is the standard fix once the parameter count makes per-parameter estimation impractical.
5 / 5
During a PR review, a teammate asks why the team reaches for backpropagation instead of numerical finite-difference gradient checking, given that finite differences are simpler to implement. What is the reasoning?
Backpropagation computes gradients for every parameter in one backward pass via the chain rule, and the result is exact up to floating-point rounding. Finite-difference checking is simple to reason about, but it needs at least one extra forward pass per parameter and yields only an approximation, which makes it far too slow for training a large network. Its main use is verifying a backpropagation implementation on a small scale. That is why backpropagation is the standard training method, while finite differences remain a debugging tool rather than a training technique.
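A gradient check of the kind described here might look like the sketch below, which compares a hand-written backward pass against central differences on a small network. The chain-of-tanh network, the function names, the step size `eps`, and the weight values are assumptions for illustration only.

```python
import math

def loss_fn(weights, x, target):
    """Forward pass through a chain of scalar tanh layers with squared-error loss."""
    a = x
    for w in weights:
        a = math.tanh(w * a)
    return 0.5 * (a - target) ** 2

def backprop_grad(weights, x, target):
    """One forward pass that stores activations, then one backward pass."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    upstream = acts[-1] - target                      # dL/d(final activation)
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        dz = upstream * (1.0 - acts[i + 1] ** 2)      # through tanh
        grads[i] = dz * acts[i]                       # dL/dw_i
        upstream = dz * weights[i]                    # reused by the layer before
    return grads

def central_difference_grad(weights, x, target, eps=1e-5):
    """Two forward passes per weight: slow, but independent of the backward code."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += eps
        down = list(weights)
        down[i] -= eps
        grads.append((loss_fn(up, x, target) - loss_fn(down, x, target)) / (2 * eps))
    return grads

def max_abs_error(weights, x, target):
    """Largest disagreement between the analytic and numerical gradients."""
    analytic = backprop_grad(weights, x, target)
    numeric = central_difference_grad(weights, x, target)
    return max(abs(a - n) for a, n in zip(analytic, numeric))

print(max_abs_error([0.3, -0.8, 1.2, 0.5], x=0.5, target=1.0))
```

A small maximum error suggests the backward pass is implemented correctly. A large one usually points to a bug in a derivative rather than to the numerical estimate.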