5 exercises — tokenization, pre-training vs. fine-tuning, LoRA adapters, KV cache mechanics, and LLM benchmarks (MMLU, HumanEval, GSM8K).
1 / 5
What is tokenization in the context of LLMs?
LLM tokenization vocabulary:
A token is the basic unit of text that a language model processes. Tokens are not always whole words — they can be sub-word pieces, characters, or even punctuation.
Examples (GPT-style BPE tokenizer; exact splits depend on the tokenizer's vocabulary):
• "hello" → 1 token
• "unbelievable" → e.g. ["un", "believable"] → 2 tokens
• " the" (with a leading space) is a different token from "the"
• Code and non-English text often tokenize less efficiently (more tokens per word)
Why tokenization matters:
• API cost: most LLM APIs charge per token (input + output)
• Context window: the maximum number of tokens a model can process at once
• Latency: more tokens to read and generate means slower responses
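The cost and context-window points can be made concrete with a little arithmetic. The sketch below is illustrative only: the per-token prices and the 8,192-token window are hypothetical placeholders, not any provider's real figures.

```python
# Hypothetical numbers for illustration; real prices and limits vary by model.
PRICE_PER_1K_INPUT = 0.01    # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.03   # USD per 1,000 output tokens (assumed)
CONTEXT_WINDOW = 8192        # maximum tokens per request (assumed)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request when input and output tokens are billed separately."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT


def fits_in_context(input_tokens: int, max_output_tokens: int) -> bool:
    """Prompt and generated tokens share the same context window."""
    return input_tokens + max_output_tokens <= CONTEXT_WINDOW


cost = request_cost(input_tokens=1500, output_tokens=500)
print(f"${cost:.3f}")
print(fits_in_context(input_tokens=8000, max_output_tokens=500))
```

Note that the prompt and the generated output count against the same window, so a long prompt leaves less room for the answer.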
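Common tokenizers:
• BPE (Byte Pair Encoding): used by GPT models; builds a vocabulary of sub-word units by repeatedly merging the most frequent adjacent pair
• WordPiece: used by BERT; similar to BPE, but chooses merges by how much they improve the likelihood of the training data rather than by raw frequency
• SentencePiece: language-agnostic library that works on raw text without whitespace pre-splitting; used by T5 and the original LLaMA models
• tiktoken: OpenAI's fast BPE tokenizer library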
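The "merge the most frequent pair" idea behind BPE can be sketched in a few lines. This is a toy trainer on a four-word corpus, assuming character-level starting symbols and no byte-level handling or end-of-word markers; production tokenizers are considerably more involved. The function names are illustrative.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words: list, num_merges: int) -> list:
    """Learn merge rules: repeatedly merge the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # word -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word: str, merges: list) -> list:
    """Tokenize a word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

The word "lowest" never appears in the toy corpus, yet it is encoded from the learned pieces "low" and "est", which is the point of sub-word vocabularies.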
Vocabulary:
• vocabulary size: number of unique tokens the model knows (GPT-4: ~100k)
• tokenize: convert text to tokens; detokenize: convert tokens back to text
• special tokens: [BOS] (beginning of sequence), [EOS] (end of sequence), [PAD] (padding), [SEP] (separator)
• tokens per second: throughput metric for LLM inference speed
2 / 5
What is the difference between pre-training and fine-tuning an LLM?
LLM training stages vocabulary:
Pre-training
Training a model from scratch on a massive corpus (trillions of tokens from the web, books, code). The model learns general language understanding: grammar, facts, reasoning patterns, world knowledge.
• Objective: next-token prediction (causal LM) or masked token prediction (BERT-style)
• Cost: extremely expensive; GPT-4 pre-training is estimated at $100M+
• Result: a "base model" (e.g. LLaMA-3, Mistral-7B-base)
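The next-token prediction objective amounts to cross-entropy between the model's predicted distribution at position t and the actual token at position t+1. The sketch below assumes the logits are already given (a real model would produce them) and uses a made-up three-token vocabulary.

```python
import math


def softmax(logits: list) -> list:
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position: list, token_ids: list) -> float:
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    losses = []
    for logits, target in zip(logits_per_position, token_ids[1:]):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


token_ids = [0, 2, 1]  # a toy sequence over a 3-token vocabulary

# A model with no knowledge: uniform logits at every position.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# A model that strongly favours the correct next token (2, then 1).
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]

print(next_token_loss(uniform, token_ids))    # equals ln(3) for a uniform guess
print(next_token_loss(confident, token_ids))  # much lower
```

Pre-training minimises this quantity averaged over the whole corpus; masked-token objectives use the same loss but on masked positions instead of the next position.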
Fine-tuning
Continuing the training of a pre-trained model on a smaller, task-specific dataset. Adapts the model's behavior without starting from scratch.
• Supervised fine-tuning (SFT): train on input/output pairs (instruction following)
• Domain fine-tuning: adapt to medical, legal, or code-specific text
• Cost: much lower than pre-training
RLHF (Reinforcement Learning from Human Feedback)
A fine-tuning step used to align models with human preferences. Humans rank outputs → a reward model is trained on those rankings → the LLM is optimised against the reward model, typically using PPO.
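The "reward model is trained" step is often implemented with a pairwise ranking loss. The text does not say which loss is used, so the Bradley-Terry style formulation below is an assumption: the loss is small when the human-preferred answer gets the higher reward.

```python
import math


def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The reward values here are made up; a real reward model would compute them.
print(pairwise_ranking_loss(2.0, 0.0))  # preferred answer scored higher: low loss
print(pairwise_ranking_loss(0.0, 0.0))  # no preference learned: ln(2)
print(pairwise_ranking_loss(0.0, 2.0))  # ranking inverted: high loss
```

Minimising this loss over many human-ranked pairs pushes the reward model to score preferred outputs higher, which is the signal the later optimisation step relies on.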
Vocabulary:
• base model: result of pre-training only; not instruction-tuned
• instruction-tuned model: fine-tuned on instruction/response pairs (e.g. GPT-4, Claude, Llama-3-Instruct)
• checkpoint: saved model weights at a point during training
• catastrophic forgetting: fine-tuning can cause a model to lose previously learned capabilities
• continual pre-training: additional pre-training on domain-specific data before fine-tuning
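The problem LoRA solves: Full fine-tuning updates all model parameters, so for a 7B-parameter model the weights, gradients, and optimizer state together need far more GPU memory than consumer hardware offers. LoRA trains only a small set of added parameters, which makes fine-tuning feasible on consumer hardware.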
How LoRA works: For each weight matrix W it is applied to (shape d_out × d_in), LoRA adds two small matrices B (d_out × r) and A (r × d_in), where the rank r is much smaller than the model dimension. Only A and B are trained; W stays frozen.
• ΔW = B × A (low-rank decomposition), so the adapted layer computes (W + ΔW) x
• Typical rank r = 4, 8, or 16 (vs. a full rank in the hundreds or thousands)
• Parameter reduction: a 7B model fine-tuned with LoRA might train only ~4M parameters instead of 7B
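The ~4M figure can be reproduced with simple counting. The configuration below is an assumption chosen for illustration (32 layers, hidden size 4096, adapters on two square projection matrices per layer, r = 8); the text does not specify one.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, num_matrices: int) -> int:
    """A is (rank x d_in) and B is (d_out x rank), so each adapter adds rank * (d_in + d_out)."""
    return num_matrices * rank * (d_in + d_out)


# Assumed 7B-class shape: 32 layers, hidden size 4096, 2 adapted matrices per layer.
num_layers, hidden, rank = 32, 4096, 8
adapted = num_layers * 2

lora_params = lora_trainable_params(hidden, hidden, rank, adapted)
full_params_same_matrices = adapted * hidden * hidden

print(lora_params)                 # trainable parameters with LoRA
print(full_params_same_matrices)   # parameters in those same matrices if fully trained
print(full_params_same_matrices // lora_params)
```

Under these assumptions LoRA trains 4,194,304 parameters, and each adapted 4096 × 4096 matrix is represented by 256 times fewer trainable values.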
LoRA variants:
• QLoRA: LoRA applied to a quantized (4-bit) base model; enables fine-tuning 65B models on a single 48 GB GPU
• DoRA: decomposes weights into magnitude and direction and adapts them separately; often improves on plain LoRA
• LoRA+: uses different learning rates for the A and B matrices
Vocabulary:
• rank (r): the inner dimension of the LoRA matrices; higher rank = more parameters = more expressive but more expensive
• alpha: scaling factor for the LoRA update; ΔW is multiplied by alpha / r, so alpha acts like a learning-rate multiplier for the adapter
• adapter: generic term for small trainable modules added to a frozen base model
• merge LoRA: add the LoRA update into the base weights (W + ΔW) so inference has no extra overhead
• PEFT: Parameter-Efficient Fine-Tuning; the umbrella category including LoRA, prefix tuning, adapters
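Merging can be checked numerically: running the frozen layer plus the adapter gives the same output as a single merged matrix. The sketch uses tiny hand-picked matrices and the common alpha / r scaling convention; all values are made up for illustration.

```python
def matmul(X: list, Y: list) -> list:
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M: list, v: list) -> list:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight, d_out=2, d_in=3
A = [[1.0, 2.0, 0.0]]                     # LoRA A, shape (r x d_in) with r=1
B = [[0.5], [-1.0]]                       # LoRA B, shape (d_out x r)
alpha, r = 2.0, 1
scale = alpha / r
x = [1.0, 1.0, 1.0]

# Adapter path: base output plus the scaled low-rank correction B(Ax).
base = matvec(W, x)
correction = matvec(B, matvec(A, x))
y_adapter = [b + scale * c for b, c in zip(base, correction)]

# Merged path: fold scale * B A into W once, then do a single matvec.
delta_W = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta_W)]
y_merged = matvec(W_merged, x)

print(y_adapter, y_merged)
```

Both paths produce the same vector, which is why a merged adapter adds no inference overhead.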
Background (self-attention): In each transformer attention layer, every token produces three vectors: Query (Q), Key (K), and Value (V). Attention is computed as softmax(QKᵀ / √d) · V, where d is the dimension of the key vectors.
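The attention formula can be written out directly. This is a minimal single-head sketch in plain Python, without the causal mask or learned projections a real transformer layer would have.

```python
import math


def softmax(xs: list) -> list:
    """Normalise scores into weights that sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: list, K: list, V: list) -> list:
    """softmax(Q K^T / sqrt(d)) V, one query row at a time."""
    d = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


# A zero query scores every key equally, so the output is the mean of the values.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
result = attention(Q, K, V)
print(result)
```

Each output row is a weighted average of the value vectors, with weights determined by how well the query matches each key.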
The problem without KV cache: During autoregressive generation (one token at a time), the model would have to recompute K and V for all previous tokens on every step, so each step costs O(n²) attention computation for a sequence of n tokens.
KV cache solution: After computing K and V for each token, store them in a cache (GPU memory). On the next generation step, only the new token's K and V need to be computed and appended to the cache.
• Reduces attention work per generated token from O(n²) to O(n)
• Enables fast streaming of long responses
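The saving can be seen in a toy decoder. In the sketch below, `project` is a hypothetical stand-in for the learned K and V projections (it only scales the input), and the raw token vector is used as the query; the point is to compare how many tokens get projected with and without a cache.

```python
import math


def project(x: list, scale: float) -> list:
    """Stand-in for a learned K or V projection (hypothetical: just scales x)."""
    return [scale * xi for xi in x]


def attend(query: list, keys: list, values: list) -> list:
    """Single-query scaled dot-product attention over all given positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]


def decode_without_cache(tokens: list):
    """Recompute K and V for every token seen so far at each step."""
    outputs, projected = [], 0
    for step in range(1, len(tokens) + 1):
        seen = tokens[:step]
        keys = [project(x, 0.5) for x in seen]
        values = [project(x, 2.0) for x in seen]
        projected += len(seen)
        outputs.append(attend(seen[-1], keys, values))
    return outputs, projected


def decode_with_cache(tokens: list):
    """Compute K and V once per token and append them to the cache."""
    keys, values, outputs, projected = [], [], [], 0
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projected += 1
        outputs.append(attend(x, keys, values))
    return outputs, projected


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
slow_out, slow_work = decode_without_cache(tokens)
fast_out, fast_work = decode_with_cache(tokens)
print(slow_out == fast_out, slow_work, fast_work)
```

Both decoders produce identical outputs, but for three tokens the uncached version projects 1 + 2 + 3 = 6 tokens while the cached version projects 3; the gap grows quadratically with sequence length.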
KV cache trade-offs:
• Memory cost: the KV cache grows linearly with sequence length and batch size; it is the main memory bottleneck for long-context inference
• Context window: the memory available for the KV cache caps how long the context (and how large the batch) can be
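The memory cost follows from a simple product: two tensors (K and V) per layer, per head, per position. The model shape below is an assumption for illustration (32 layers, 32 KV heads, head dimension 128, 16-bit values), not a figure from the text.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache K and V for every layer, head, and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed 7B-class shape with 16-bit (2-byte) cache entries.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
full_context = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
batched = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(per_token / 2**20, "MiB per token")
print(full_context / 2**30, "GiB for 4096 tokens")
print(batched / 2**30, "GiB for a batch of 8")
```

Under these assumptions each token costs 0.5 MiB, a 4,096-token sequence costs 2 GiB, and a batch of eight such sequences costs 16 GiB, which is why long contexts and large batches compete for the same memory.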
Vocabulary:
• prefill: processing the entire input prompt at once (in parallel); this populates the KV cache
• decode phase: generating output tokens one at a time using the KV cache
• KV cache eviction: discarding old KV entries when the cache is full (as in sliding window attention)
• paged attention: vLLM's technique for managing the KV cache in fixed-size blocks, like OS virtual memory, enabling efficient batching
• flash attention: exact attention computed in tiles so the full n × n score matrix is never written to GPU memory; cuts memory traffic and complements the KV cache
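Sliding-window eviction is easy to picture with a fixed-size queue. The class below is a hypothetical sketch, not any library's API: once the window is full, appending a new entry drops the oldest one.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps K/V entries for only the most recent `window` positions."""

    def __init__(self, window: int):
        # deque(maxlen=...) discards the oldest item automatically when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value) -> None:
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)


cache = SlidingWindowKVCache(window=3)
for position in range(5):
    # Strings stand in for the real key/value tensors.
    cache.append(key=f"k{position}", value=f"v{position}")

print(list(cache.keys))
print(len(cache))
```

After five appends with a window of three, only positions 2 to 4 remain, so memory stays constant while older context becomes invisible to attention.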
5 / 5
What does MMLU measure?
LLM benchmarks vocabulary:
MMLU (Massive Multitask Language Understanding)
A benchmark of 57 subjects including mathematics, history, law, medicine, computer science, ethics, and more. Tests factual knowledge and reasoning. Results are reported as the percentage of multiple-choice questions answered correctly.
Other key LLM benchmarks:
HumanEval: coding ability benchmark; the model must write Python functions that pass unit tests; reports pass@k (the probability that at least one of k sampled solutions passes all tests)
GSM8K: grade-school math word problems; tests multi-step arithmetic reasoning
HellaSwag: commonsense reasoning; pick the most plausible continuation of a passage from 4 options
TruthfulQA: measures whether a model gives truthful answers rather than repeating common misconceptions
MT-Bench: multi-turn conversation quality, judged by GPT-4
LMSYS Chatbot Arena: human preference ranking via pairwise comparisons (Elo rating)
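For HumanEval, pass@k is the probability that at least one of k sampled solutions passes the unit tests. It is usually estimated by generating n ≥ k samples per problem, counting the c that pass, and computing 1 - C(n-c, k) / C(n, k). A minimal sketch of that estimator, with made-up sample counts:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed the tests."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems; `results` holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcome: 10 samples per problem, with 3, 0, and 10 passing.
results = [(10, 3), (10, 0), (10, 10)]
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
print(benchmark_pass_at_k(results, k=1))
```

With 3 of 10 samples passing, pass@1 is about 0.3 while pass@5 rises to about 0.92, which shows why the value of k must always be reported alongside the score.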
Vocabulary:
• benchmark: standardized test for comparing model capabilities
• few-shot: benchmark evaluated with a few worked examples provided in the prompt (e.g. 5-shot MMLU)
• zero-shot: benchmark evaluated with no examples in the prompt
• leaderboard: ranking of models by benchmark scores (e.g. the Open LLM Leaderboard on HuggingFace)
• contamination: when benchmark questions appear in the training data, inflating scores
• evals: general term for evaluation datasets and metrics used to assess model capability
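One simple way to screen for contamination is to measure how many word n-grams from a benchmark item also appear in a training document. This is only a rough heuristic sketched under assumptions (whitespace tokenization, 5-grams, made-up texts); real contamination checks differ in their details.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that also occur in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


question = "What is the capital of France and its population"
leaked_doc = "Quiz answers: what is the capital of France and its population today"
clean_doc = "The weather in the mountains changes quickly during the spring months"

print(overlap_ratio(question, leaked_doc))
print(overlap_ratio(question, clean_doc))
```

A ratio near 1.0 suggests the item was seen verbatim during training; a ratio near 0.0 does not prove the item is clean, since paraphrased leaks share few exact n-grams.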