Learn the vocabulary of selecting a fixed number of uniformly random items from a stream of unknown length.
0 / 5 completed
1 / 5
At standup, a dev mentions selecting a fixed number of random items from a stream of unknown, possibly enormous length, without ever storing the whole stream, and with every item having an exactly equal chance of ending up in the final selection. What is this technique called?
Reservoir sampling is exactly this: it selects a fixed number of items uniformly at random from a stream of unknown, possibly enormous length, deciding whether to keep or discard each item the moment it's seen, without ever storing the whole stream, while still guaranteeing every item that ever passed through had an exactly equal chance of ending up in the final selection. A hash collision is an unrelated hash-table concept about two keys sharing a bucket. This single-pass, constant-memory guarantee is exactly why reservoir sampling is the standard technique for sampling from a live, unbounded, or streaming data source.
2 / 5
During a design review, the team picks reservoir sampling specifically so a random sample of a fixed size can be drawn from a stream whose total length isn't known in advance and is too large to store entirely in memory. Which capability does this provide?
Reservoir sampling here provides a uniformly random, fixed-size sample using only constant memory, since it only ever needs to hold the current sample of the chosen fixed size, deciding for each new item whether to swap it into the sample based on a simple probability calculation, regardless of how long the stream ultimately turns out to be. Storing every item from the entire stream first and sampling afterward would require memory proportional to the stream's full length, which may not even be possible for a truly unbounded or extremely large stream. This constant-memory guarantee is exactly what makes reservoir sampling practical for sampling from a live or massive data stream.
3 / 5
In a code review, a dev notices a feature meant to sample a fixed number of random log lines from an unbounded log stream instead buffers the entire stream into memory first and then randomly selects from that fully buffered collection. What does this represent?
This is a missed reservoir-sampling opportunity, since buffering the entire unbounded log stream into memory first risks running out of memory long before the stream ends, when reservoir sampling would maintain only a fixed-size sample using constant memory, updating it as each new line arrives without ever needing to store the whole stream. A cache eviction policy is an unrelated concept about discarded cache entries. This buffer-everything-first pattern is exactly the kind of scalability risk reservoir sampling is designed to eliminate for a genuinely unbounded or extremely large stream.
4 / 5
An incident report shows a log-sampling feature crashed with an out-of-memory error, because it buffered an entire, effectively unbounded log stream into memory before randomly selecting a fixed number of sample lines from that buffered collection. What practice would prevent this?
Using reservoir sampling to maintain a fixed-size sample with constant memory as each log line arrives means the feature never needs to buffer the full stream at all, which directly eliminates the out-of-memory crash described in this incident. Continuing to buffer the entire stream into memory before sampling regardless of the stream's length is exactly what caused the crash once the stream ran long enough. This constant-memory, single-pass sampling approach is the standard fix for drawing a random sample from any stream whose full length can't be safely assumed to fit in memory.
5 / 5
During a PR review, a teammate asks why the team reaches for reservoir sampling instead of just recording the stream's total length first and then making a second pass to randomly pick items at that point, given that both approaches can produce a uniformly random sample. What is the reasoning?
Reservoir sampling only ever needs a single pass over the data and constant memory, updating its fixed-size sample as each item arrives, while a two-pass approach requires either storing the entire stream so it can be replayed, defeating the memory savings, or a data source that can genuinely be read a second time, which isn't possible for a live, one-time stream like a real-time feed of incoming events. The tradeoff is that reservoir sampling's per-item probability calculation is slightly more involved to implement correctly than a naive random pick, but that complexity is exactly what buys the single-pass, constant-memory guarantee.