Learn the vocabulary of a sorted array of every suffix's starting position, enabling fast substring search.
0 / 5 completed
1 / 5
At standup, a dev mentions a sorted array listing the starting position of every suffix of a string, letting any substring search be answered with a binary search over that sorted array instead of scanning the whole string. What is this structure called?
A suffix array is exactly this: a sorted array listing the starting position of every suffix of a string, so any substring search can be answered by binary-searching this sorted array for the range of suffixes that begin with the target substring, instead of scanning the entire original string directly. A hash collision is an unrelated hash-table concept about two keys sharing a bucket. This sorted-suffixes-plus-binary-search structure is exactly why a suffix array supports fast substring search over a large, fixed piece of text.
2 / 5
During a design review, the team builds a suffix array over a large, static text corpus specifically so a substring search can be answered in logarithmic time via binary search instead of scanning the entire corpus for every query. Which capability does this provide?
A suffix array here provides much faster repeated substring searches, since once it's built, each query only needs a binary search over the sorted suffixes to find the range matching the target substring, which costs logarithmic time in the corpus's size instead of a full linear scan through the entire corpus for every single query. Scanning the entire corpus directly for every query would cost far more, especially once many searches are run against the same, unchanging corpus. This binary-search-driven speedup is exactly why a suffix array pays off for a large, static text corpus that's queried repeatedly.
3 / 5
In a code review, a dev notices a substring-search feature over a large, unchanging document scans the entire document directly for every single query, with no precomputed structure at all supporting faster lookups. What does this represent?
This is a missed suffix-array opportunity, since scanning the entire document directly on every single query costs time proportional to the document's size for each search, when a precomputed suffix array would let the same query be answered with a binary search costing only logarithmic time. A cache eviction policy is an unrelated concept about discarded cache entries. This scan-every-time pattern is exactly the kind of avoidable inefficiency a suffix array is designed to eliminate for a large, unchanging document that's searched repeatedly.
4 / 5
An incident report shows a document-search feature over a large, unchanging corpus slowed sharply as search volume grew, because every substring query scanned the entire corpus directly instead of using any precomputed structure to speed up repeated lookups. What practice would prevent this?
Building a suffix array over the corpus once lets every subsequent substring query be answered with a fast binary search instead of a full scan of the entire corpus, which is exactly the fix for the slowdown described in this incident. Continuing to scan the entire corpus directly on every query regardless of growing search volume is exactly what caused performance to degrade sharply. This precomputed-structure fix is the standard approach once a large, unchanging corpus is confirmed to be searched frequently enough that repeated full scans no longer scale.
5 / 5
During a PR review, a teammate asks why the team builds a suffix array instead of just using a simple hash-based index of substrings, given that a hash-based index can also speed up some kinds of lookups. What is the reasoning?
A suffix array supports searching for any substring of any length with a single binary search over the sorted suffixes, since a substring is simply a prefix of one or more of those suffixes, while a hash-based index of substrings would need to either fix a specific substring length in advance or store a separate index entry for every possible length, which grows impractical the moment the search needs to support arbitrary, unspecified substring lengths. The tradeoff is the suffix array's own construction cost and memory footprint, which pays off precisely for the kind of general, arbitrary-length substring search a fixed-length hash index can't handle as cleanly.