Master context window limits, KV cache mechanics, attention complexity, tokenization impact, and long-context strategies like RAG.
1 / 5
What does the context window of an LLM define?
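Context window: the maximum number of tokens (not words or characters) a model can process in a single request, counting both the input prompt and the generated output. It acts as the model's "working memory". A 128K-token context window holds roughly 100,000 words of English text, since a common rule of thumb is about 0.75 English words per token. Once the limit is reached, the application must drop or truncate earlier content, or the request fails; long-context strategies like chunking or retrieval address this.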
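Because the limit is counted in tokens and shared between prompt and output, a chat application has to decide what to drop when history grows too long. The sketch below keeps the most recent messages that fit a token budget after reserving room for the reply. It is a minimal illustration, not any particular library's API: `fit_to_window` and `whitespace_token_count` are hypothetical names, and counting whitespace-separated words is an assumed stand-in for a real tokenizer, which would usually report more tokens.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    # Hypothetical stand-in: real models count subword tokens, not words.
    return len(text.split())


def fit_to_window(
    messages: List[str],
    context_window: int,
    reserved_for_output: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Keep the newest messages whose total token count fits the prompt budget."""
    budget = context_window - reserved_for_output  # output tokens share the window
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so the most recent turns survive.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # stop here so the kept history stays contiguous
        kept.append(msg)
        used += cost
    return list(reversed(kept))


history = ["a b c d", "e f", "g h i"]
print(fit_to_window(history, context_window=8, reserved_for_output=2))
```

With a window of 8 and 2 tokens reserved for output, the budget is 6, so the oldest message (4 tokens) is dropped and `["e f", "g h i"]` is kept.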
2 / 5
What is a KV cache in transformer inference?
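KV cache: during autoregressive generation, the model generates one token at a time, and each new token attends to the key (K) and value (V) vectors of all previous tokens in every attention layer. Because the model is causal, those vectors do not change once computed, so the KV cache stores them instead of recomputing them. Without the cache, every step would rerun the whole prefix, costing O(n²) attention work per step; with it, only the new token's query, key, and value are computed, and its attention over the n cached entries costs O(n) per step. The trade-off is memory: the cache grows linearly with sequence length and can become a large share of GPU memory for long contexts.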
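A minimal single-layer, single-head sketch in pure Python, assuming toy 2-dimensional token vectors and hypothetical projection matrices `W_Q`, `W_K`, `W_V`. It checks that attending over cached keys and values gives the same outputs as recomputing the whole prefix at every step. Real implementations keep one cache per layer and head, stored as tensors on the accelerator.

```python
import math
from typing import List

Vector = List[float]

# Toy single-head projection weights (hypothetical values, 2-dimensional).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.5], [0.5, -0.5]]
W_V = [[1.0, 1.0], [0.0, 2.0]]


def matvec(w: List[Vector], x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


class KVCache:
    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, x: Vector) -> Vector:
        # Project only the new token; earlier keys/values are reused as-is.
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.keys, self.values)


def step_without_cache(prefix: List[Vector]) -> Vector:
    # Recompute keys and values for the entire prefix at every step.
    keys = [matvec(W_K, x) for x in prefix]
    values = [matvec(W_V, x) for x in prefix]
    return attend(matvec(W_Q, prefix[-1]), keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
cache = KVCache()
cached = [cache.step(x) for x in tokens]
uncached = [step_without_cache(tokens[:i + 1]) for i in range(len(tokens))]
```

For n tokens, the cached path performs n key/value projections in total, while the uncached path performs n(n + 1)/2 because each step re-projects its whole prefix; both produce the same attention outputs.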
3 / 5
What challenge does attention mechanism scaling pose for long contexts?
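Quadratic attention complexity: every token attends to every other token (in a causal decoder, to every earlier token), so the attention score matrix has n² entries and both compute and naive memory use grow as n². Doubling the context length roughly quadruples this cost. FlashAttention computes exact attention in tiles so the full n × n matrix is never materialised in GPU high-bandwidth memory, which makes memory use linear in n and reduces memory traffic, although compute remains quadratic. Sparse attention and sliding window attention (used in Mistral 7B) instead restrict which tokens each token attends to, reducing compute as well, at the cost of not attending directly to every earlier token.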
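Both families of fixes can be illustrated in pure Python under simplifying assumptions: one query, scalar values, no 1/√d scaling, and hypothetical function names. `attended_pairs` counts how many query-key scores a causal model computes with full versus sliding window attention, showing n(n + 1)/2 versus at most n·w. `tiled_attention` uses the online-softmax trick that underlies FlashAttention's tiling: it processes keys block by block and never holds all scores at once, yet returns the same value as `naive_attention`. It omits what makes FlashAttention fast on a GPU, namely fusing the computation into one kernel and keeping tiles in on-chip SRAM.

```python
import math
from typing import List, Optional


def attended_pairs(n: int, window: Optional[int] = None) -> int:
    """Count the (query, key) pairs a causal model scores for n tokens.

    window=None is full causal attention; otherwise each token sees itself
    and at most window - 1 earlier tokens (a convention assumed here).
    """
    return sum(i + 1 if window is None else min(i + 1, window) for i in range(n))


def naive_attention(q: List[float], keys: List[List[float]], values: List[float]) -> float:
    # Materialises every score for this query before normalising.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def tiled_attention(q: List[float], keys: List[List[float]], values: List[float],
                    block: int = 2) -> float:
    # Online softmax: visit keys block by block, keeping only a running max,
    # a running normaliser, and a running weighted sum of values.
    running_max, running_sum, acc = -math.inf, 0.0, 0.0
    for start in range(0, len(keys), block):
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier blocks
        running_sum *= correction
        acc *= correction
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


print(attended_pairs(8), attended_pairs(8, window=3))
```

For 8 tokens, full causal attention scores 36 pairs while a window of 3 scores 21; the gap widens as n grows because the windowed count grows linearly.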
4 / 5
What is tokenization and why does it matter for context window usage?
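Tokenization: the tokenizer splits text into units (tokens) drawn from a fixed vocabulary, typically subwords learned with methods such as byte-pair encoding. Common English words are often a single token, but rare words, code symbols, and non-Latin scripts may need several tokens per word or character. As a result, the same content written in Chinese often consumes more tokens than its English equivalent, depending on the tokenizer, so a 128K-token window holds proportionally less of it. Understanding tokenization helps estimate whether content will fit within a model's window.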
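To see why token counts vary by content, here is a toy greedy longest-match tokenizer with a hypothetical eight-entry vocabulary. Real tokenizers use learned vocabularies with tens of thousands of entries and algorithms such as byte-pair encoding; this sketch only mimics two of their behaviours: known subwords become single tokens, and characters outside the vocabulary fall back to one token per UTF-8 byte (a byte-fallback scheme assumed here for illustration).

```python
from typing import List, Set

# Hypothetical vocabulary; real vocabularies are learned from data.
VOCAB: Set[str] = {"the", "cat", "sat", "on", "mat", "token", "ization", " "}


def greedy_tokenize(text: str, vocab: Set[str] = VOCAB) -> List[str]:
    max_len = max(len(piece) for piece in vocab)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No entry matches: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


for sample in ["the cat sat", "tokenization", "你好"]:
    print(repr(sample), len(greedy_tokenize(sample)), greedy_tokenize(sample))
```

Here "the cat sat" costs 5 tokens, "tokenization" costs 2, and the two Chinese characters cost 6, because each falls back to its three UTF-8 bytes.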
5 / 5
What is Retrieval-Augmented Generation (RAG) as a long-context strategy?
RAG: instead of putting all knowledge into the context (which may not fit), documents are split into chunks, converted to embedding vectors, and stored in a vector database. At query time the query is embedded the same way, the most semantically similar chunks are retrieved, and only the top-k results are injected into the prompt. This keeps context usage efficient while giving the model access to a much larger knowledge base than any single context window allows.
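A minimal retrieval sketch, assuming a bag-of-words vector as a stand-in for a learned embedding model and a Python list as a stand-in for a vector database; `embed`, `InMemoryIndex`, and `build_prompt` are hypothetical names. The point is the shape of the pipeline: embed the chunks once, embed the query at request time, rank by cosine similarity, and inject only the top-k chunks into the prompt.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Hypothetical stand-in for an embedding model: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


class InMemoryIndex:
    """Stand-in for a vector database: stores one vector per chunk."""

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = chunks
        self.vectors = [embed(c) for c in chunks]  # embedded once, up front

    def top_k(self, query: str, k: int = 2) -> List[str]:
        q = embed(query)
        ranked = sorted(zip(self.chunks, self.vectors),
                        key=lambda cv: cosine(q, cv[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, index: InMemoryIndex, k: int = 2) -> str:
    # Only the retrieved chunks enter the context window.
    context = "\n".join(f"- {c}" for c in index.top_k(query, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


CHUNKS = [
    "the kv cache stores keys and values",
    "flashattention tiles the attention computation",
    "mistral uses sliding window attention",
    "rag retrieves chunks from a vector index",
]
index = InMemoryIndex(CHUNKS)
print(build_prompt("what does the kv cache store", index, k=2))
```

A production system would swap in a real embedding model and an approximate nearest-neighbour index, but the retrieve-then-prompt flow stays the same.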