Build fluency in the vocabulary of estimating a distinct count over a huge stream with a small, fixed memory footprint.
0 / 5 completed
1 / 5
At standup, a dev mentions estimating the number of distinct items in a huge stream using only a small, fixed amount of memory, by hashing each item and tracking the longest run of leading zero bits observed across many hashed buckets. What is this structure called?
HyperLogLog is exactly this: HyperLogLog estimates the number of distinct items in a huge stream using only a small, fixed amount of memory, by hashing each item into one of many buckets and tracking, within each bucket, the longest run of leading zero bits observed, since a longer run statistically implies a larger number of distinct items have been hashed. A hash collision is an unrelated hash-table concept about two keys sharing a bucket. This fixed-memory, leading-zero-run tracking is exactly why HyperLogLog can estimate a distinct count in the billions using only a few kilobytes of memory.
2 / 5
During a design review, the team uses HyperLogLog to estimate the number of distinct visitors to a high-traffic service, specifically because tracking leading-zero-bit runs across a fixed number of buckets uses only a small, constant amount of memory, unlike storing every distinct visitor identifier seen. Which capability does this provide?
HyperLogLog here provides Constant-memory distinct-count estimation regardless of the true cardinality, since the number of buckets and the size of each bucket's counter stay fixed no matter how many distinct items are actually hashed, trading a small percentage estimation error for a memory footprint that never grows with the true distinct count. Storing an exact set of every distinct item ever seen costs memory that grows in direct proportion to the true distinct count. This bounded-memory, small-percentage-error tradeoff is exactly why HyperLogLog is favored for estimating distinct counts over an enormous stream where exact tracking would be impractical.
3 / 5
In a code review, a dev notices a distinct-visitor-counting feature over a high-traffic service stores an exact set of every visitor identifier ever seen, letting its memory usage grow in direct proportion to the true distinct count, instead of using a fixed-memory probabilistic estimator. What does this represent?
This is a missed HyperLogLog opportunity, since using HyperLogLog's fixed number of buckets and bounded per-bucket counters would estimate the distinct count using constant memory, at the cost of only a small percentage error. A cache eviction policy is an unrelated concept about discarded cache entries. This exact-set-storage pattern is exactly the kind of unbounded-memory risk a reviewer flags once the true distinct count is confirmed to be extremely large.
4 / 5
An incident report shows a distinct-visitor-counting service's memory usage grew without bound and became unsustainable, because it stored an exact set of every visitor identifier ever seen instead of using a fixed-memory probabilistic estimator. What practice would prevent this?
Switching to HyperLogLog's fixed-bucket, leading-zero-run estimation removes the unbounded memory growth entirely. Continuing to store an exact set of every distinct visitor identifier ever seen regardless of how large the true distinct visitor count actually grows is exactly what caused the issue described in this incident. This HyperLogLog approach is the standard fix once the true distinct count is confirmed large enough to make exact set storage impractical.
5 / 5
During a PR review, a teammate asks why the team reaches for HyperLogLog instead of an exact hash set of every distinct item, given that an exact hash set gives a perfectly accurate distinct count with no estimation error at all. What is the reasoning?
HyperLogLog bounds memory usage to a small, fixed footprint regardless of the true distinct count, at the cost of a small percentage estimation error, while an exact hash set gives a perfectly accurate count but its memory usage grows in direct proportion to the true distinct count. This is exactly why HyperLogLog is favored for estimating distinct counts over an enormous stream, while an exact hash set remains preferable when perfect accuracy is required and the true count stays manageable.