AdvancedVocabulary#data-science-ml#backend#developer-tools

Transformer Architecture Vocabulary

Build fluency in the vocabulary of processing a sequence in parallel using self-attention across every token pair.

0 / 5 completed

1 / 5

At standup, a dev mentions a neural network architecture that processes an entire sequence at once using self-attention to weigh how much every token should attend to every other token, instead of processing tokens one at a time in order. What is this architecture called?

2 / 5

During a design review, the team picks a transformer architecture for a long-sequence language task, specifically because self-attention lets every token directly relate to every other token in one layer, instead of information having to pass step by step through a long sequential chain. Which capability does this provide?

3 / 5

In a code review, a dev notices a long-sequence language feature processes tokens one at a time through a strictly sequential recurrent layer, forcing distant relationships to propagate step by step, instead of using self-attention to connect every pair of tokens directly. What does this represent?

4 / 5

An incident report shows a language model struggled to capture relationships between distant tokens in long input sequences, because it used a strictly sequential recurrent layer instead of self-attention connecting every pair of tokens directly. What practice would prevent this?

5 / 5

During a PR review, a teammate asks why the team reaches for a transformer architecture instead of a recurrent neural network, given that recurrent networks were the standard for sequence modeling for years. What is the reasoning?