Build fluency in the vocabulary of tolerating a short-notice reclaim of discounted compute capacity.
0 / 5 completed
1 / 5
At standup, a dev mentions a cloud provider reclaiming a discounted compute instance on short notice whenever it needs that capacity back for a full-price customer. What is this event called?
A spot instance interruption is a cloud provider reclaiming a discounted compute instance on short notice, typically just a couple of minutes, whenever it needs that capacity back for a full-price, on-demand customer. A scheduled maintenance window announced weeks in advance is a completely different, much more predictable kind of disruption. This short-notice reclaiming is the fundamental tradeoff that comes with a spot instance's steep discount compared to an on-demand instance.
2 / 5
During a design review, the team wants the application to receive an interruption notice a couple of minutes before an instance is actually reclaimed, so it can gracefully drain its in-flight work. Which capability supports this?
The interruption notice warning is delivered a couple of minutes before a spot instance is actually reclaimed, giving the running application a short window to gracefully drain its in-flight work before it's cut off. A spot instance being reclaimed with absolutely no warning would give an application no chance at all to save its progress or hand off its work. This notice is the mechanism that makes running a genuinely interruptible workload on spot capacity practical rather than reckless.
3 / 5
In a code review, a dev notices a batch job checkpoints its progress periodically to durable storage, so it can resume from the last checkpoint rather than restarting from scratch after an interruption. What does this represent?
Checkpointing progress to durable storage lets a batch job resume from its last saved checkpoint rather than restarting from scratch after a spot instance interruption. Keeping all progress only in the interrupted instance's own memory loses everything the moment that instance is reclaimed. This checkpointing discipline is what makes a long-running batch workload resilient to the unpredictable, short-notice nature of a spot interruption.
4 / 5
An incident report shows a long-running batch job lost several hours of processed work every time a spot instance was reclaimed, because the job kept all of its progress only in that instance's own memory with no checkpoint ever written elsewhere. What practice would prevent this?
Checkpointing the job's progress periodically to durable storage lets it resume from the last saved point after an interruption, rather than losing everything processed since the job started. Keeping all progress only in the instance's own memory with no checkpoint is exactly what caused the hours of lost work this incident describes. This checkpointing discipline is essential for any long-running workload that intends to run economically on interruptible spot capacity.
5 / 5
During a PR review, a teammate asks why the team runs a batch workload on spot instances with checkpointing instead of just using a full-price on-demand instance that's never subject to interruption. What is the reasoning?
A spot instance costs meaningfully less than an on-demand instance, often a significant discount, and checkpointing lets the batch job tolerate the occasional short-notice interruption without losing significant progress toward its overall work. The tradeoff is the added engineering effort of building the job to checkpoint and resume cleanly, which an on-demand-only workload doesn't need to bother with at all.