Intermediate | Vocabulary | #data-science-ml #backend #developer-tools

Tokenization Vocabulary

Learn the vocabulary of splitting raw text into subword units mapped to numeric identifiers a model can process.


1 / 5

At standup, a dev describes splitting raw input text into smaller units, such as subword pieces or whole words, and mapping each unit to a numeric identifier that a language model can process. What is this step called?
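The step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `VOCAB` table and the `encode` helper are hypothetical names invented for this example, and real pipelines learn their vocabulary from data rather than hard-coding it.

```python
# Hypothetical toy vocabulary: each text unit maps to a numeric identifier.
VOCAB = {"the": 0, "model": 1, "reads": 2, "text": 3}


def encode(text: str) -> list[int]:
    """Split text on whitespace and map each unit to its numeric ID."""
    return [VOCAB[unit] for unit in text.lower().split()]


ids = encode("The model reads text")
print(ids)  # [0, 1, 2, 3]
```

The model never sees the strings themselves, only the list of identifiers.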

2 / 5

During a design review, the team picks a subword tokenizer for a language model, specifically because splitting rare or unseen words into familiar subword pieces avoids the vocabulary gaps a whole-word tokenizer would hit. Which capability does this provide?
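To make the scenario concrete, here is a minimal sketch of splitting a word into known pieces. It assumes a simplified greedy longest-match strategy and a hand-written piece set; `SUBWORDS` and `split_subwords` are hypothetical names, and production tokenizers such as BPE or WordPiece learn their pieces from a corpus and differ in detail.

```python
# Hypothetical, hand-picked subword pieces (real ones are learned from data).
SUBWORDS = {"token", "ization", "un", "break", "able", "s"}


def split_subwords(word: str, pieces: set[str]) -> list[str]:
    """Greedily take the longest known piece at each position."""
    out = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a known piece.
        while end > start and word[start:end] not in pieces:
            end -= 1
        if end == start:
            # No known piece starts here, so the word cannot be covered.
            return ["<unk>"]
        out.append(word[start:end])
        start = end
    return out


print(split_subwords("unbreakable", SUBWORDS))   # ['un', 'break', 'able']
print(split_subwords("tokenization", SUBWORDS))  # ['token', 'ization']
```

Neither word appears whole in the piece set, yet both are fully represented by pieces that do.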

3 / 5

In a code review, a dev notices that a language-model input pipeline maps every word to a whole-word vocabulary entry and replaces any word missing from that vocabulary with a single generic unknown-word token, rather than decomposing rare words into familiar subword pieces. What does this represent?
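The pipeline in this scenario might look roughly like the sketch below. The names `UNK`, `WORD_VOCAB`, and `encode_words` are hypothetical, and the vocabulary is a toy chosen for illustration.

```python
UNK = "<unk>"  # generic unknown-word token

# Hypothetical whole-word vocabulary.
WORD_VOCAB = {UNK: 0, "the": 1, "patient": 2, "has": 3}


def encode_words(text: str) -> list[int]:
    """Map each whitespace-separated word to its ID, or to UNK if missing."""
    return [WORD_VOCAB.get(word, WORD_VOCAB[UNK]) for word in text.lower().split()]


print(encode_words("The patient has tachycardia"))  # [1, 2, 3, 0]
print(encode_words("The patient has bradycardia"))  # [1, 2, 3, 0]
```

Two sentences with different rare terms produce identical ID sequences, so whatever distinguished those terms is lost before the model sees the input.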

4 / 5

An incident report shows a language model performed poorly on domain-specific text full of rare technical terms, because its tokenizer mapped every missing word to a single generic unknown-word token instead of decomposing rare words into familiar subword pieces. What practice would prevent this?

5 / 5

During a PR review, a teammate asks why the team reaches for subword tokenization instead of simple whitespace-based whole-word tokenization, given that whole-word tokenization is simpler to implement. What is the reasoning?