Nucleus Sampling (Top-p)
Nucleus sampling, commonly called top-p sampling, is a decoding strategy used when a language model generates text. At each step the model produces a probability distribution over the next token, and the decoding strategy decides how to turn that distribution into an actual choice. Nucleus sampling draws the next token from the smallest set of most probable tokens whose cumulative probability reaches a threshold p, such as 0.9 or 0.95. Because the cutoff adapts to the distribution at each step, it can balance diversity and coherence better than a cutoff that always keeps a fixed number of tokens.
How it works
After the model computes probabilities for every token in its vocabulary, nucleus sampling sorts the tokens from most to least likely and accumulates their probabilities until the running total reaches the threshold p. The tokens gathered up to that point form the nucleus, and the next token is sampled from only this set after renormalising its probabilities. Everything outside the nucleus is discarded for that step.
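The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `nucleus` and `sample_top_p` and the five-token example distribution are assumptions made for this sketch, and real decoders operate on tensors over vocabularies of tens of thousands of tokens.

```python
import random


def nucleus(probs, p):
    """Return {token_index: renormalised_prob} for the top-p nucleus.

    `probs` is assumed to be a list of probabilities summing to 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Sort token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:  # stop as soon as the running total reaches p
            break
    # Renormalise so the kept probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


def sample_top_p(probs, p, rng=random):
    """Draw one token index from the renormalised nucleus."""
    dist = nucleus(probs, p)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


# Hypothetical five-token vocabulary.
probs = [0.5, 0.3, 0.1, 0.05, 0.05]
print(sorted(nucleus(probs, 0.85)))  # indices of the tokens in the nucleus
print(sample_top_p(probs, 0.85))     # one sampled token index
```

With p = 0.85 the first two tokens total 0.8, which is short of the threshold, so the third token is included as well and the two least likely tokens are discarded for this step.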
The important property is that the size of the nucleus adapts to the shape of the distribution. When the model is confident and one or two tokens dominate, the nucleus is small and generation stays focused. When the model is uncertain and probability is spread across many plausible tokens, the nucleus grows and more options remain in play. This adaptivity is what distinguishes nucleus sampling from methods that keep a fixed number of candidates.
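To make the adaptivity concrete, the sketch below counts how many tokens fall inside the nucleus for a peaked distribution and for a spread-out one. The helper name `nucleus_size` and both example distributions are invented for illustration.

```python
def nucleus_size(probs, p):
    """Count how many of the most probable tokens are needed to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)  # rounding left the total just short of p


confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # one token dominates
uncertain = [0.22, 0.20, 0.20, 0.19, 0.19]  # probability is spread out

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 5
```

The same threshold of 0.9 keeps a single candidate in the first case and all five in the second.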
Comparison with other strategies
Nucleus sampling is usually understood in relation to three neighbouring techniques.
| Strategy | How it selects tokens |
| --- | --- |
| Greedy decoding | Always picks the single most probable token |
| Beam search | Explores several high-probability sequences in parallel |
| Top-k sampling | Samples from a fixed number k of most probable tokens |
| Top-p (nucleus) | Samples from a variable set reaching cumulative probability p |
Top-k sampling keeps a constant number of candidates regardless of context, which can be too restrictive when many tokens are reasonable and too permissive when only a few are. Nucleus sampling avoids both failure modes by letting the candidate set expand and contract. Greedy decoding and beam search, by contrast, are largely deterministic and tend to produce repetitive or generic text in open-ended generation, which is why sampling methods are preferred for creative and conversational output.
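The contrast between a fixed and a variable candidate set can be illustrated with a small comparison. This is a sketch under stated assumptions: the helper names (`ranked`, `greedy`, `top_k_set`, `top_p_set`), the choice of k = 3 and p = 0.85, and the two distributions are all hypothetical. Beam search is omitted because it scores whole sequences rather than single steps.

```python
def ranked(probs):
    """Token indices sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def greedy(probs):
    """Greedy decoding: always the single most probable token."""
    return ranked(probs)[0]


def top_k_set(probs, k):
    """Top-k: a fixed number of candidates, whatever the distribution."""
    return ranked(probs)[:k]


def top_p_set(probs, p):
    """Top-p: as many candidates as needed to reach cumulative mass p."""
    kept, total = [], 0.0
    for i in ranked(probs):
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept


peaked = [0.90, 0.06, 0.02, 0.01, 0.01]  # only one token is reasonable
flat = [0.1] * 10                        # ten equally reasonable tokens

for name, probs in (("peaked", peaked), ("flat", flat)):
    k_set = top_k_set(probs, 3)
    p_set = top_p_set(probs, 0.85)
    print(name, "top-k keeps", len(k_set), "| top-p keeps", len(p_set))
```

On the peaked distribution top-k still admits two low-probability tokens while top-p keeps only the dominant one; on the flat distribution top-k discards seven equally plausible tokens while top-p keeps nine.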
Interaction with temperature
Nucleus sampling is frequently combined with a temperature parameter, which divides the model's logits before the softmax and so rescales the probability distribution before sampling. A higher temperature flattens the distribution and increases randomness, while a lower temperature sharpens it toward the most likely tokens. The two parameters play complementary roles: temperature dials randomness up or down, while top-p helps keep the output coherent by excluding the long tail of implausible tokens. Production settings commonly pair a moderate temperature with a top-p value of around 0.9 to 0.95.
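The interaction can be seen by applying temperature to a set of logits and then measuring the nucleus. The sketch below assumes the common convention of dividing logits by the temperature before the softmax. The logit values and the function names are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def nucleus_size(probs, p):
    """Count how many of the most probable tokens are needed to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)


logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # hypothetical model outputs

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top prob {max(probs):.3f}, nucleus size {nucleus_size(probs, 0.9)}")
```

As the temperature rises the top probability falls and the same p = 0.9 threshold admits more tokens, so the two settings are best tuned together rather than in isolation.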
Practical considerations
Choosing p involves a trade-off. Values near 1.0 admit almost the entire distribution and maximise diversity at the risk of incoherence, while low values approach greedy behaviour and can become repetitive. For factual question answering, developers often lower both temperature and top-p to favour reliability; for brainstorming, storytelling, or dialogue, they raise them to encourage variety. Because these parameters directly shape output quality, they are among the most commonly tuned knobs when deploying large language models.
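The trade-off can be seen by sweeping p over a single distribution. The distribution and the values of p below are illustrative assumptions, not recommended settings.

```python
def nucleus_size(probs, p):
    """Count how many of the most probable tokens are needed to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)  # rounding left the total just short of p


probs = [0.4, 0.3, 0.15, 0.1, 0.05]  # hypothetical next-token distribution

for p in (0.3, 0.6, 0.8, 0.9, 1.0):
    print(f"p={p}: {nucleus_size(probs, p)} candidate(s)")
```

At p = 0.3 only the most probable token survives, which is equivalent to greedy decoding for this step, while p = 1.0 keeps every token and leaves the distribution unfiltered.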