Learn the vocabulary of splitting raw text into subword units mapped to numeric identifiers a model can process.
1 / 5
At standup, a dev mentions splitting raw input text into smaller units, such as subword pieces or whole words, each mapped to a numeric identifier a language model can actually process. What is this step called?
Tokenization splits raw input text into smaller units, often subword pieces rather than whole words, and maps each unit to a numeric identifier from a fixed vocabulary that a language model can process as input. A hash collision is an unrelated hash-table concept in which two distinct keys land in the same bucket. Splitting into subword units is also why a language model can handle rare or unseen words: it breaks them into familiar pieces instead of treating each one as a single unknown token.
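The split-then-map step can be sketched in a few lines. This is a minimal illustration, not any particular library's algorithm: the vocabulary, the `[UNK]` entry, and the function names `tokenize` and `encode` are all hypothetical, and the greedy longest-match rule is a simplifying assumption (real tokenizers learn their vocabulary from data and differ in detail).

```python
# Toy vocabulary (hypothetical): each subword piece maps to a numeric ID.
VOCAB = {"[UNK]": 0, "token": 1, "ization": 2, "un": 3, "seen": 4, "s": 5, "word": 6}

def tokenize(word, vocab=VOCAB):
    """Split one word into vocabulary pieces using greedy longest-match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matches at this position: give up on the whole word.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces

def encode(text, vocab=VOCAB):
    """Lowercase, split on whitespace, then map every piece to its ID."""
    ids = []
    for word in text.lower().split():
        ids.extend(vocab[piece] for piece in tokenize(word, vocab))
    return ids

print(tokenize("tokenization"))              # ['token', 'ization']
print(encode("Tokenization unseen words"))   # [1, 2, 3, 4, 6, 5]
```

Neither "tokenization" nor "unseen" is a vocabulary entry on its own, yet both are fully represented by pieces that are.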
2 / 5
During a design review, the team picks a subword tokenizer for a language model, specifically because splitting rare or unseen words into familiar subword pieces avoids the vocabulary gaps a whole-word tokenizer would hit. Which capability does this provide?
A subword tokenizer provides graceful handling of rare and unseen words: a rare word is decomposed into smaller, familiar subword pieces already in the vocabulary, instead of being mapped to a single generic unknown-word token that discards its internal structure. A whole-word tokenizer has no such fallback and must treat any word missing from its fixed vocabulary as an unknown token. This ability to decompose words into familiar pieces is why subword tokenization is the standard choice for modern language models.
3 / 5
In a code review, a dev notices a language-model input pipeline maps every word to a whole-word vocabulary entry, replacing any word missing from that vocabulary with a single generic unknown-word token, instead of decomposing rare words into familiar subword pieces. What does this represent?
This is a missed subword-tokenization opportunity: decomposing rare words into familiar subword pieces would preserve meaningful structure instead of collapsing every out-of-vocabulary word into one generic unknown-word token. A cache eviction policy is an unrelated concept about which cache entries to discard. Collapsing words to an unknown token is the kind of information loss a reviewer should flag when rare or domain-specific words are common in the input.
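A small sketch makes the information loss concrete. The vocabulary, the `[UNK]` entry, and the `encode_whole_word` helper are hypothetical, chosen only to show how two different rare words become indistinguishable.

```python
# Hypothetical whole-word vocabulary with a single unknown-word entry.
WORD_VOCAB = {"[UNK]": 0, "the": 1, "patient": 2, "has": 3}

def encode_whole_word(text, vocab=WORD_VOCAB):
    """Map each whitespace-separated word to its ID, or to [UNK] if missing."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

a = encode_whole_word("the patient has bradycardia")
b = encode_whole_word("the patient has tachycardia")

print(a)        # [1, 2, 3, 0]
print(a == b)   # True: the model sees identical input for both sentences
```

The two sentences describe opposite conditions, but after encoding nothing distinguishes them.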
4 / 5
An incident report shows a language model performed poorly on domain-specific text full of rare technical terms, because its tokenizer mapped every missing word to a single generic unknown-word token instead of decomposing rare words into familiar subword pieces. What practice would prevent this?
Switching to a subword tokenizer preserves meaningful structure by decomposing rare terms into familiar pieces instead of discarding them. Continuing to map every missing word to a single generic unknown-word token, no matter how many rare technical terms appear in the input, is what caused this incident. Subword decomposition is the standard fix once rare or domain-specific vocabulary is known to be common in the input.
5 / 5
During a PR review, a teammate asks why the team reaches for subword tokenization instead of simple whitespace-based whole-word tokenization, given that whole-word tokenization is simpler to implement. What is the reasoning?
Subword tokenization decomposes rare or unseen words into smaller, familiar pieces, which keeps the vocabulary size manageable while still representing almost any input. Whole-word tokenization is simpler to implement, but it must either use an enormous vocabulary to cover every word or collapse missing words into a generic unknown-word token. That trade-off is why subword tokenization is the standard choice for modern language models: the whole-word approach is simpler but far less robust.
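The vocabulary-size argument can be checked with simple counting. The stems and suffixes below are hypothetical, and the sketch assumes every word is exactly a stem or a stem plus one suffix, which is far tidier than real language.

```python
# Hypothetical stems and suffixes used only to compare vocabulary sizes.
stems = ["token", "vector", "normal"]
suffixes = ["ize", "ized", "izer", "ization"]

# Whole-word vocabulary: one entry for every stem and every stem+suffix form.
whole_word_vocab = set(stems) | {stem + suffix for stem in stems for suffix in suffixes}

# Subword vocabulary: stems and suffixes are stored once and recombined.
subword_vocab = set(stems) | set(suffixes)

print(len(whole_word_vocab))  # 15
print(len(subword_vocab))     # 7
```

Under these assumptions, adding one new stem costs the whole-word vocabulary five new entries but the subword vocabulary only one, and the gap widens as more stems and suffixes are added.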