Build fluency in the vocabulary of retrieving data by a hash of its own contents instead of an assigned path.
1 / 5
At standup, a dev mentions a storage system where a piece of data is retrieved using a hash computed from its own contents, rather than a file path or an assigned identifier chosen ahead of time. What is this storage approach called?
This is content-addressable storage: data is retrieved by a hash computed from its own contents rather than by a file path or an identifier assigned ahead of time. Identical data therefore always ends up with the same address, and any change to the data produces a different address. A hash collision is a different concept: two distinct inputs producing the same hash value. Git and systems like it use this contents-derived addressing to store and identify objects.
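As a minimal sketch of the idea, the address can be computed from nothing but the bytes themselves. The function name `content_address` and the choice of SHA-256 are assumptions for illustration; this is not git's exact object format, which hashes a small header together with the content.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a hex address derived only from the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


# Identical bytes always map to the same address.
a = content_address(b"hello world\n")
b = content_address(b"hello world\n")

# Changing a single character yields a different address.
c = content_address(b"hello world!\n")

print(a == b)  # same contents, same address
print(a == c)  # different contents, different address
```

No path or externally assigned name appears anywhere in the computation, which is the defining property described above.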
2 / 5
During a design review, the team relies on content-addressable storage specifically so two identical pieces of data, uploaded independently by different users, are automatically stored only once instead of duplicated. Which capability does this provide?
Content-addressable storage provides automatic deduplication here. Identical content always produces the same hash-derived address no matter who uploads it or when, so a second upload of data that is already stored is recognized immediately and does not need to be written again. Assigning every upload a fresh identifier regardless of its contents would store the same data twice, or many times, without ever detecting the duplication. This is why content-addressable storage is favored for systems holding large amounts of overlapping content, such as version-control object stores.
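The deduplication behavior can be sketched with a small in-memory store. The `ContentStore` class and its method names are hypothetical, and SHA-256 is an assumed hash choice; a real system would persist blobs to disk or object storage.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory store keyed by the SHA-256 of each blob."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # setdefault keeps the existing copy if this address is already present,
        # so a repeated upload adds nothing new.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"quarterly-report")   # uploaded by one user
second = store.put(b"quarterly-report")  # uploaded independently by another
```

Both calls return the same address, and the store holds a single blob afterward.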
3 / 5
In a code review, a dev notices a file-upload feature assigns every uploaded file a freshly generated, random identifier with no relationship to the file's actual contents, so uploading the exact same file twice stores two full, separate copies. What does this represent?
This is a missed content-addressable-storage opportunity. With a random identifier unrelated to the contents, the system has no way to recognize that a second upload is identical to something already stored. Deriving the identifier from a hash of the file's own contents would let an identical re-upload be recognized immediately and stored only once. A cache eviction policy is an unrelated concept: it decides which entries a cache discards to make room. The random-identifier pattern is the kind of storage waste content-addressable storage is designed to eliminate.
4 / 5
An incident report shows a file-storage service's disk usage grew far faster than expected, because uploaded files were assigned freshly generated, random identifiers unrelated to their contents, so the same popular file, re-uploaded by many different users, was stored as a full, separate copy every single time. What practice would prevent this?
Switching to content-addressable storage, where each file's identifier is derived from a hash of its own contents, means an identical re-upload of the popular file is recognized as already stored and is not duplicated, which addresses the disk-usage growth described in the incident. Continuing to assign every upload a fresh, random identifier is what let the same file be stored as a full, separate copy on every re-upload. Contents-derived addressing is the standard fix for this kind of avoidable duplication in a file-upload service.
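To make the incident concrete, the sketch below tallies the bytes each scheme would have to hold when the same file is uploaded repeatedly. The file size, the upload count, and the use of `uuid4` and SHA-256 are all assumptions chosen for illustration, not figures from the incident.

```python
import hashlib
import uuid

popular_file = b"x" * 1_000_000  # stand-in for a 1 MB popular file
uploads = [popular_file] * 50    # 50 users upload the same bytes

random_id_store: dict[str, bytes] = {}
content_store: dict[str, bytes] = {}

for data in uploads:
    # Random identifier: every upload gets a new key, so every copy is kept.
    random_id_store[uuid.uuid4().hex] = data
    # Content-derived identifier: identical bytes land on the same key.
    content_store[hashlib.sha256(data).hexdigest()] = data

# Logical bytes each store would need to hold on disk.
random_bytes = sum(len(blob) for blob in random_id_store.values())
content_bytes = sum(len(blob) for blob in content_store.values())

print(random_bytes, content_bytes)
```

Under these assumptions the random-identifier store grows linearly with the number of uploads, while the content-addressed store stays at the size of one copy.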
5 / 5
During a PR review, a teammate asks why the team derives every stored file's identifier from a hash of its contents instead of just letting the uploading client choose whatever identifier it wants. What is the reasoning?
Letting clients choose identifiers gives no guarantee that identical content in two different uploads is recognized as the same data, since two clients could easily pick different identifiers for byte-for-byte identical files. It also lets one client's identifier accidentally, or maliciously, collide with a completely different file that another client already stored under the same name. Deriving the identifier from a hash of the file's contents guarantees that identical data always maps to the same address and, with a cryptographic hash, makes it practically impossible for different data to collide. The tradeoff is the computational cost of hashing every upload, a small price for both automatic deduplication and collision safety.
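The contrast between the two schemes can be sketched as follows. The helpers `put_by_name` and `put_by_content`, the file name, and the SHA-256 choice are hypothetical, and the plain dictionaries stand in for whatever backing store a real service would use.

```python
import hashlib


def put_by_name(store: dict[str, bytes], name: str, data: bytes) -> None:
    """Client-chosen identifier: a reused name silently replaces the old file."""
    store[name] = data


def put_by_content(store: dict[str, bytes], data: bytes) -> str:
    """Content-derived identifier: the address depends only on the bytes."""
    address = hashlib.sha256(data).hexdigest()
    store.setdefault(address, data)
    return address


# Two clients pick the same name for unrelated files.
named: dict[str, bytes] = {}
put_by_name(named, "report.pdf", b"alice's report")
put_by_name(named, "report.pdf", b"bob's unrelated file")

# The same two files stored by content hash.
addressed: dict[str, bytes] = {}
alice_addr = put_by_content(addressed, b"alice's report")
bob_addr = put_by_content(addressed, b"bob's unrelated file")
```

In the name-keyed store the first file is overwritten, whereas the content-addressed store keeps both files under distinct addresses.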