Learn the vocabulary of trading extra compute for reduced memory usage during model training.
1 / 5
At standup, a dev mentions a training technique that discards a layer's intermediate activations during the forward pass and recomputes them during the backward pass, trading extra compute for reduced memory usage. What is this technique called?
Activation checkpointing, also called gradient checkpointing, discards a layer's intermediate activations during the forward pass and recomputes them during the backward pass, trading extra compute time for lower memory usage. Keeping every intermediate activation in memory until the backward pass needs it uses far more memory, which can make a larger model or batch size infeasible on the available hardware. This tradeoff is what lets a team fit a bigger model or batch into the same fixed memory budget.
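To make the mechanism concrete, here is a minimal pure-Python sketch, not any framework's implementation. It assumes a toy chain of scalar `tanh` layers; the names `grad_store_all`, `grad_checkpointed`, and `segment` are hypothetical and chosen only for illustration. The checkpointed version saves one input per segment, discards the rest, and recomputes them segment by segment during the backward pass.

```python
import math

def layer(w, x):
    """One toy 'layer': a scalar tanh unit."""
    return math.tanh(w * x)

def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_store_all(weights, x):
    """Baseline: keep every layer input alive for the backward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)

def grad_checkpointed(weights, x, segment):
    """Keep only every `segment`-th input; recompute the others on demand."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x      # saved at a segment boundary
        x = layer(w, x)             # all other inputs are discarded
    grad = 1.0
    peak = 0
    for start in sorted(checkpoints, reverse=True):
        seg = weights[start:start + segment]
        x_in = checkpoints[start]
        inputs = []
        for w in seg:               # recompute this segment's forward pass
            inputs.append(x_in)
            x_in = layer(w, x_in)
        # Values alive now: every checkpoint plus the recomputed inputs
        # (the first recomputed input is the checkpoint itself).
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for w, x_seg in zip(reversed(seg), reversed(inputs)):
            grad *= layer_grad(w, x_seg)
    return grad, peak

weights = [0.9, -1.1, 0.7, 1.3, -0.8, 0.6, 1.2, -0.5]
full_grad, full_stored = grad_store_all(weights, 0.4)
ckpt_grad, ckpt_peak = grad_checkpointed(weights, 0.4, segment=4)
print(full_stored, ckpt_peak)  # 8 stored values vs. a peak of 5
```

Both functions return the same gradient. The checkpointed one holds fewer values at its peak, at the price of running each segment's forward computation a second time.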
2 / 5
During a design review, the team wants to decide specifically which layers act as checkpoint boundaries, balancing how much recomputation happens during the backward pass against how much memory is actually saved. Which capability supports this?
Checkpoint boundary placement deliberately chooses which layers act as checkpoints, balancing how much recomputation happens during the backward pass against how much memory is actually saved. Checkpointing every layer uniformly, with no deliberate placement, can recompute far more than necessary and slow training down more than needed. This balanced placement is what keeps the memory-for-compute tradeoff worthwhile rather than excessively costly.
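One way to picture deliberate placement is a greedy selection over per-layer estimates. The sketch below is only an illustration under stated assumptions: the layer names, memory figures, and recompute costs are made up, and the model ignores the input that a checkpointed layer must still keep. It checkpoints the layers that free the most memory per unit of recomputation, and stops as soon as the footprint fits a hypothetical budget.

```python
def choose_checkpointed_layers(layers, memory_budget):
    """Pick layers to checkpoint until activations fit the budget.

    layers: list of (name, activation_memory, recompute_cost) tuples,
            in arbitrary but consistent units (hypothetical numbers).
    Returns (chosen names, remaining activation memory, extra compute).
    If the budget cannot be met, every layer ends up chosen.
    """
    remaining = sum(mem for _, mem, _ in layers)
    chosen = []
    extra_compute = 0.0
    # Best "memory freed per unit of recompute" first.
    ranked = sorted(layers, key=lambda l: l[1] / l[2], reverse=True)
    for name, mem, cost in ranked:
        if remaining <= memory_budget:
            break                   # already fits: stop adding recompute
        chosen.append(name)
        remaining -= mem
        extra_compute += cost
    return chosen, remaining, extra_compute

layers = [
    ("embed", 2.0, 0.5),
    ("attn_1", 6.0, 1.0),
    ("mlp_1", 4.0, 2.0),
    ("attn_2", 6.0, 1.0),
    ("mlp_2", 4.0, 2.0),
]
chosen, mem_after, extra_cost = choose_checkpointed_layers(layers, memory_budget=12.0)
print(chosen, mem_after, extra_cost)  # ['attn_1', 'attn_2'] 10.0 2.0
```

With these made-up numbers, checkpointing two layers meets the budget for 2.0 units of extra compute, whereas checkpointing all five would cost 6.5 units.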
3 / 5
In a code review, a dev notices the recomputed activations during the backward pass are generated using the exact same numeric precision as the original forward pass, rather than a different precision that could silently introduce a mismatch. What does this represent?
Precision consistency means the recomputation during the backward pass runs at the same numeric precision as the original forward pass, so the recomputed activations match the values that were originally discarded. Recomputing at a different precision risks a subtle numerical mismatch between the original and recomputed values, which can quietly corrupt a gradient calculation. This consistency matters most when checkpointing is combined with mixed-precision training.
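The mismatch is easy to reproduce in a small simulation. The sketch below is an illustration only and does not show how any training framework handles precision: it emulates half precision by round-tripping values through the standard library's `struct` format `"e"`, and the helper names `to_half`, `to_double`, and `run_segment` are hypothetical.

```python
import math
import struct

def to_half(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_double(x):
    """Leave the value at Python's native double precision."""
    return x

def run_segment(x, weights, cast):
    """Forward pass over a toy tanh chain, rounding with `cast` at each step."""
    for w in weights:
        x = cast(math.tanh(cast(w) * x))
    return x

weights = [0.3, -1.2, 0.7]
x0 = to_half(0.5)

original = run_segment(x0, weights, to_half)           # forward pass, half precision
recomputed_same = run_segment(x0, weights, to_half)    # recompute, same precision
recomputed_other = run_segment(x0, weights, to_double) # recompute, different precision

mismatch = abs(original - recomputed_other)
print(original == recomputed_same, mismatch > 0)
```

Recomputing at the same precision reproduces the discarded value exactly in this toy setup, while recomputing at double precision yields a slightly different number, which is the kind of drift the explanation warns about.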
4 / 5
An incident report shows training slowed down dramatically after activation checkpointing was enabled, because checkpoints had been placed at nearly every layer, forcing the backward pass to recompute almost the entire forward pass. What practice would prevent this?
Placing checkpoint boundaries deliberately at a sparser, well-chosen set of layers balances the memory saved against the extra recomputation cost incurred during the backward pass. Placing a checkpoint at nearly every layer forces the backward pass to redo almost the entire forward computation, exactly as this incident describes. This deliberate, sparser placement is what keeps activation checkpointing's tradeoff actually worthwhile.
5 / 5
During a PR review, a teammate asks why the team uses activation checkpointing instead of just buying more memory or using more GPUs to fit the same model and batch size. What is the reasoning?
Checkpointing trades a modest, bounded increase in compute time, often cited as roughly twenty to thirty percent depending on the model and how many layers are checkpointed, for a meaningfully smaller memory footprint. Buying more memory or additional GPUs solves the same problem but costs real money and often takes longer to provision than simply enabling a training-time technique. The cost is the added training time checkpointing introduces, which needs to be weighed against the cost of scaling up hardware instead.
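A back-of-the-envelope estimate shows where an overhead figure of this size comes from. The sketch below assumes the common rule of thumb that a backward pass costs about twice a forward pass; that ratio, the function name `checkpoint_overhead`, and its parameters are assumptions for illustration, and real overhead depends on the model and hardware.

```python
def checkpoint_overhead(fraction_recomputed, backward_to_forward=2.0):
    """Estimate extra step time as a fraction of the uncheckpointed step.

    fraction_recomputed: share of the forward pass that is run a second time
                         (1.0 means every layer is checkpointed).
    backward_to_forward: assumed cost of backward relative to forward.
    """
    baseline = 1.0 + backward_to_forward   # one forward plus one backward
    extra = fraction_recomputed            # the recomputed part of the forward
    return extra / baseline

full = checkpoint_overhead(1.0)      # everything recomputed: about one third
partial = checkpoint_overhead(0.75)  # three quarters recomputed
print(round(full, 3), partial)       # 0.333 0.25
```

Under this simple model, recomputing the whole forward pass adds about a third to step time, and recomputing less of it lands lower, which is roughly in line with the range quoted above.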