Temperature, top-p nucleus sampling, top-k, repetition penalty, frequency penalty, presence penalty, max tokens, and stop sequences.
Key vocabulary
Temperature — controls randomness; low values (e.g., 0.1) make output more deterministic, high values (e.g., 1.2) make it more creative and varied.
Top-p (nucleus sampling) — restricts sampling to the smallest set of tokens whose cumulative probability exceeds p; balances diversity and coherence.
Top-k sampling — restricts sampling to the k most probable next tokens at each step.
Repetition penalty — reduces the probability of tokens that have already appeared in the output, discouraging repetitive text.
Stop sequences — strings that cause the model to halt generation when encountered (e.g., "\n\n", "END").
1 / 5
A colleague sets temperature = 0.1 for a code generation task. What effect does this have?
Temperature scales the logits before the softmax step. A low temperature (near 0) sharpens the distribution, making the model almost always pick the highest-probability token — producing consistent, predictable output. This is ideal for code generation, data extraction, and structured tasks. High temperature (above 1) flattens the distribution, encouraging more diverse and creative but potentially less accurate responses.
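The scaling step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation; the function name `softmax_with_temperature` and the example logits are made up for the demonstration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At T=0.1 nearly all of the probability mass lands on the first token, while at T=2.0 the three probabilities move closer together.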
2 / 5
What does top-p = 0.9 (nucleus sampling) mean in practice?
Top-p nucleus sampling (Holtzman et al., 2020) dynamically adjusts the candidate set. With top-p = 0.9, the sampler sorts tokens from most to least probable and keeps adding them until the cumulative probability reaches 0.9, then renormalizes and samples from that nucleus. This adapts to the model's confidence: when it is very sure, only a few tokens form the nucleus; when uncertain, more tokens are included.
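A minimal sketch of the nucleus construction, assuming the next-token distribution is already available as a list of probabilities. The name `top_p_filter` and both example distributions are hypothetical.

```python
def top_p_filter(probs, p=0.9):
    """Return {token_index: renormalized_prob} for the smallest nucleus
    whose cumulative probability reaches p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for idx, prob in ranked:
        nucleus.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    return {idx: prob / total for idx, prob in nucleus}

# Hypothetical distributions: one confident, one uncertain.
confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.28, 0.24, 0.20, 0.16, 0.12]

print(len(top_p_filter(confident, 0.9)))  # nucleus size for the confident case
print(len(top_p_filter(uncertain, 0.9)))  # nucleus size for the uncertain case
```

The confident distribution yields a one-token nucleus, whereas the uncertain one needs all five tokens before the cumulative probability reaches 0.9.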
3 / 5
A developer sets a high repetition penalty. What problem are they solving?
A repetition penalty lowers the probability of tokens that have already appeared in the generated text. This discourages the degenerate "looping" behaviour where models get stuck repeating the same phrase. The frequency penalty and presence penalty in the OpenAI API are related controls: the frequency penalty grows with how often a token has appeared, while the presence penalty applies a flat, one-time discount to any token seen at least once.
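The difference between the three penalties is easier to see on raw logits. The sketch below assumes two commonly described formulations: a multiplicative repetition penalty (positive logits are divided by the penalty, negative ones multiplied) and additive frequency and presence penalties subtracted from the logit. The function names and example values are illustrative, and real APIs may differ in detail.

```python
from collections import Counter

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Multiplicative form: make every already-seen token less likely."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def apply_additive_penalties(logits, generated_ids,
                             frequency_penalty=0.0, presence_penalty=0.0):
    """Additive form: frequency scales with the count, presence is flat."""
    out = list(logits)
    for tok, count in Counter(generated_ids).items():
        out[tok] -= count * frequency_penalty + presence_penalty
    return out

# Hypothetical 4-token vocabulary; token 0 was generated twice, token 2 once.
logits = [2.0, 1.0, -1.0, 0.5]
generated = [0, 0, 2]

print(apply_repetition_penalty(logits, generated, penalty=2.0))
print(apply_additive_penalties(logits, generated,
                               frequency_penalty=0.5, presence_penalty=0.1))
```

Token 0 is hit harder than token 2 by the frequency penalty because it appeared twice, while the presence penalty subtracts the same amount from both.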
4 / 5
What are stop sequences used for in LLM API calls?
Stop sequences are strings you pass in the API call (e.g., stop=["\n\n", "User:"]). When the model generates one of these strings, generation halts immediately; many APIs also omit the stop string itself from the returned text. They are useful for structured output tasks: for example, stopping at "\n" to get a single-line completion, or stopping at "```" to close a code block. They complement max_tokens for output length control.
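The effect of a stop sequence can be imitated on the client side by truncating text at the earliest match. The helper below, `truncate_at_stop`, is a hypothetical illustration rather than part of any API; it assumes the stop string itself is excluded from the result, which is a behaviour that varies between providers.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:  # ignore empty strings, which would match at position 0
            continue
        pos = text.find(stop)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut]

# Hypothetical raw model output that runs past the intended answer.
raw = "Paris\n\nUser: next question"
print(repr(truncate_at_stop(raw, ["\n\n", "User:"])))  # 'Paris'
```

A server-side implementation would stop decoding as soon as the match appears instead of trimming afterwards, which also saves the tokens that would otherwise be generated.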
5 / 5
How does top-k sampling differ from top-p (nucleus) sampling?
Top-k keeps only the k most probable tokens regardless of their probability values, so with k=40 you always get exactly 40 candidates. Top-p adapts: if the model is very confident, the nucleus might contain only 3 tokens; if uncertain, it might include 200. In practice, many systems combine both (e.g., top-k=40 AND top-p=0.9) by applying the filters one after the other, so the final candidate set is never larger than either filter alone would allow.
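A small sketch of the two filters applied in sequence, assuming top-k runs first and top-p is then computed on the renormalized survivors. The ordering, the function names, and the example distributions are assumptions made for illustration; actual systems may order or renormalize differently.

```python
def top_k_filter(probs, k=40):
    """Keep the k most probable (index, prob) pairs, most probable first."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

def top_k_then_top_p(probs, k=40, p=0.9):
    """Apply top-k, renormalize, then keep the smallest nucleus reaching p."""
    kept = top_k_filter(probs, k)
    total = sum(prob for _, prob in kept)
    nucleus, cumulative = [], 0.0
    for idx, prob in kept:
        nucleus.append(idx)
        cumulative += prob / total
        if cumulative >= p:
            break
    return nucleus

# Hypothetical distributions over a 5-token vocabulary.
peaked = [0.90, 0.04, 0.03, 0.02, 0.01]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]

print(top_k_then_top_p(peaked, k=3, p=0.9))  # top-p is the binding constraint
print(top_k_then_top_p(flat, k=3, p=0.9))    # top-k is the binding constraint
```

For the peaked distribution the nucleus shrinks to a single token even though k=3 would allow more, while for the flat distribution the fixed k=3 cutoff is what limits the candidates.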