Learn the vocabulary of computing the minimum number of edits needed to transform one string into another.
0 / 5 completed
1 / 5
At standup, a dev mentions computing the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another, by building up a table of best answers for every pair of prefixes of the two strings. What is this concept called?
Edit distance is exactly this: edit distance is the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another, and its standard dynamic-programming solution builds up a table of the minimum cost for every pair of prefixes of the two strings. A hash collision is an unrelated hash-table concept about two keys sharing a bucket. This build-up-over-every-prefix-pair approach is exactly why edit distance can be computed in time proportional to the product of the two strings' lengths, instead of trying every possible sequence of edits.
2 / 5
During a design review, the team computes edit distance to power a fuzzy, typo-tolerant search feature, specifically because building up a table over every pair of prefixes lets nearby but not-identical strings be scored by how many edits separate them, without trying every possible sequence of edits. Which capability does this provide?
Edit distance here provides Efficient fuzzy-matching by scoring exactly how many edits separate two strings, since the minimum edit cost for a given pair of prefixes only depends on minimum costs already computed for slightly shorter prefixes, so dynamic programming solves each distinct sub-problem once and reuses it, instead of trying every possible sequence of edits. Trying every possible sequence of insertions, deletions, and substitutions costs exponential time as the strings get longer. This reuse-of-overlapping-prefix-sub-problems behavior is exactly why edit distance underlies practical spell-checkers and typo-tolerant search features.
3 / 5
In a code review, a dev notices a typo-tolerant search feature tries every possible sequence of insertions, deletions, and substitutions to see how close a query is to a stored term, instead of building up a table of minimum edit costs over every pair of prefixes. What does this represent?
This is a missed edit-distance opportunity, since building up a table over every pair of prefixes, the way the edit-distance dynamic-programming solution does, would reuse overlapping sub-problems instead of trying every possible sequence of edits. A cache eviction policy is an unrelated concept about discarded cache entries. This brute-force-enumeration pattern is exactly the kind of avoidable exponential cost a reviewer flags once queries and terms are long enough for it to matter.
4 / 5
An incident report shows a typo-tolerant search feature's response time exploded on moderately long queries, because it tried every possible sequence of edits individually instead of building up a table of minimum edit costs over prefix pairs. What practice would prevent this?
Rewriting the comparison logic using edit distance's dynamic-programming approach eliminates the exponential edit-sequence enumeration. Continuing to try every possible sequence of edits to compare a query and a stored term regardless of how long the queries and stored terms are is exactly what caused the issue described in this incident. This dynamic-programming approach is the standard fix once brute-force edit-sequence enumeration is shown to be the actual bottleneck.
5 / 5
During a PR review, a teammate asks why the team reaches for edit distance instead of the longest-common-subsequence algorithm, given that longest common subsequence also compares two strings with a similar dynamic-programming table. What is the reasoning?
Edit distance counts the minimum number of insertions, deletions, and substitutions needed to transform one string into another, which is exactly what a typo-tolerance score needs, while the longest common subsequence instead finds the longest shared, in-order sequence of characters, a related but different question better suited to highlighting unchanged content in a line-based diff. This is exactly why the two dynamic-programming techniques, despite similar tables, serve different practical purposes.