What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Nucleus Sampling (Top-p)

Nucleus sampling, or top-p sampling, is a decoding strategy for language models that samples from the smallest set of most probable tokens whose cumulative probability reaches a threshold p.

5 min read | Last updated July 2026 | Foundations

Nucleus sampling, commonly called top-p sampling, is a decoding strategy used when a language model generates text. At each step a language model produces a probability distribution over the next possible token, and a decoding strategy decides how to turn that distribution into an actual choice. Nucleus sampling selects from the smallest set of the most probable tokens whose cumulative probability reaches a threshold p, such as 0.9 or 0.95. Because the cutoff adapts to the distribution at each step, nucleus sampling can balance diversity and coherence better than a fixed-size cutoff such as top-k.
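
The selection rule can also be written compactly. The following is a sketch in the notation of Holtzman et al. (2019); the symbols V (the vocabulary), x_{1:i-1} (the tokens generated so far), and p' (the probability mass of the nucleus) are introduced here for illustration and are not used elsewhere in this article.

```latex
V^{(p)} = \text{the smallest } V' \subseteq V \text{ such that }
\sum_{x \in V'} P(x \mid x_{1:i-1}) \ge p

P'(x \mid x_{1:i-1}) =
\begin{cases}
P(x \mid x_{1:i-1}) / p' & \text{if } x \in V^{(p)} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{where } p' = \sum_{x \in V^{(p)}} P(x \mid x_{1:i-1})
```

The next token is then drawn from P' rather than from the original distribution P.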

How it works

After the model computes probabilities for every token in its vocabulary, nucleus sampling sorts the tokens from most to least likely and accumulates their probabilities until the running total reaches the threshold p. The tokens gathered up to that point form the nucleus, and the next token is sampled from only this set after renormalising its probabilities. Everything outside the nucleus is discarded for that step.
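
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names `nucleus` and `sample_top_p` and the toy distribution are assumptions made for this example, and real inference libraries operate on tensors of logits rather than dictionaries.

```python
import random

def nucleus(probs, p):
    """Return the nucleus as {token: renormalised probability}."""
    # Sort tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:  # stop once the running total reaches the threshold
            break
    # Renormalise so the kept probabilities sum to 1.
    return {token: prob / total for token, prob in kept}

def sample_top_p(probs, p, rng=random):
    """Sample one token from the nucleus only."""
    nuc = nucleus(probs, p)
    return rng.choices(list(nuc), weights=list(nuc.values()), k=1)[0]

# Toy next-token distribution (made-up numbers for illustration).
next_token_probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "xylophone": 0.05}

print(nucleus(next_token_probs, p=0.9))       # "xylophone" is discarded
print(sample_top_p(next_token_probs, p=0.9))  # one of "cat", "dog", "bird"
```

With p = 0.9 the running total passes the threshold after the third token (0.5 + 0.3 + 0.15 = 0.95), so the fourth token can never be sampled at this step.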

The important property is that the size of the nucleus adapts to the shape of the distribution. When the model is confident and one or two tokens dominate, the nucleus is small and generation stays focused. When the model is uncertain and probability is spread across many plausible tokens, the nucleus grows and more options remain in play. This adaptivity is what distinguishes nucleus sampling from methods that keep a fixed number of candidates.
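
A small sketch makes this adaptivity concrete. The two distributions below are made-up numbers chosen for illustration, and `nucleus_size` is a hypothetical helper that only counts how many tokens are needed to reach p.

```python
def nucleus_size(probs, p):
    """Count how many of the most probable tokens are needed to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)  # rounding kept the total just under p: keep everything

confident = [0.85, 0.10, 0.03, 0.01, 0.01]  # one token dominates
uncertain = [0.22, 0.20, 0.20, 0.19, 0.19]  # probability is spread out

print(nucleus_size(confident, p=0.9))  # 2
print(nucleus_size(uncertain, p=0.9))  # 5
```

The same threshold yields a nucleus of two tokens in the confident case and five in the uncertain case.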

Comparison with other strategies

Nucleus sampling is usually understood in relation to three neighbouring techniques.

| Strategy | How it selects tokens |
| --- | --- |
| Greedy decoding | Always picks the single most probable token |
| Beam search | Explores several high-probability sequences in parallel |
| Top-k sampling | Samples from a fixed number k of most probable tokens |
| Top-p (nucleus) | Samples from a variable set reaching cumulative probability p |

Top-k sampling keeps a constant number of candidates regardless of context, which can be too restrictive when many tokens are reasonable and too permissive when only a few are. Nucleus sampling avoids both failure modes by letting the candidate set expand and contract. Greedy decoding and beam search, by contrast, are largely deterministic and tend to produce repetitive or generic text in open-ended generation, which is why sampling methods are preferred for creative and conversational output.
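
The contrast between a fixed and an adaptive candidate set can be sketched as follows. The helper names `top_k_set` and `top_p_set` and both toy distributions are assumptions made for this example.

```python
def top_k_set(probs, k):
    """Keep the k most probable tokens, regardless of their probabilities."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def top_p_set(probs, p):
    """Keep the most probable tokens until their total probability reaches p."""
    kept, total = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        total += probs[token]
        if total >= p:
            break
    return kept

# Made-up distributions: one sharply peaked, one nearly flat.
peaked = {"Paris": 0.92, "Lyon": 0.04, "a": 0.02, "the": 0.01, "located": 0.01}
flat = {"happy": 0.18, "tired": 0.17, "hungry": 0.17,
        "late": 0.16, "here": 0.16, "lost": 0.16}

for name, dist in (("peaked", peaked), ("flat", flat)):
    print(name, "top-k (k=3):", top_k_set(dist, 3))
    print(name, "top-p (p=0.9):", top_p_set(dist, 0.9))
```

With k = 3, top-k admits two unlikely tokens in the peaked case and drops half of the plausible tokens in the flat case. With p = 0.9, top-p keeps one token in the peaked case and all six in the flat case.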

Interaction with temperature

Nucleus sampling is frequently combined with a temperature parameter, which rescales the model's logits, and therefore the probability distribution, before sampling. A higher temperature flattens the distribution and increases randomness, while a lower temperature sharpens it toward the most likely tokens. The two parameters are complementary: temperature controls how much probability is spread across alternatives, while top-p helps keep the output coherent by excluding the long tail of implausible tokens. Typical production settings pair a moderate temperature with a top-p value of around 0.9 to 0.95.
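
The interaction can be sketched by applying temperature to a set of logits and then measuring the nucleus. The logits are made-up values, the helper names are hypothetical, and applying temperature before the top-p cutoff is an assumption that follows the description above; individual libraries may order or implement these steps differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then convert them to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Count how many of the most probable tokens are needed to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # toy scores for five tokens
for temperature in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, round(probs[0], 3), nucleus_size(probs, p=0.9))
```

As the temperature rises, the top token's probability falls and the nucleus at p = 0.9 grows, so the two settings should be tuned together rather than in isolation.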

Practical considerations

Choosing p involves a trade-off. Values near 1.0 admit almost the entire distribution and maximise diversity at the risk of incoherence, while low values approach greedy behaviour and can become repetitive. For factual question answering, developers often lower both temperature and top-p to favour reliability; for brainstorming, storytelling, or dialogue, they raise them to encourage variety. Because these parameters directly shape output quality, they are among the most commonly tuned knobs when deploying large language models.
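
The two extremes of this trade-off can be seen on a single toy distribution. The numbers and the helper name `top_p_candidates` are assumptions made for this example.

```python
def top_p_candidates(probs, p):
    """Return the tokens that remain eligible for sampling at threshold p."""
    kept, total = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        total += probs[token]
        if total >= p:
            break
    return kept

# Made-up next-token distribution with a long tail of unlikely tokens.
probs = {"the": 0.40, "a": 0.25, "this": 0.15,
         "that": 0.12, "zebra": 0.05, "qux": 0.03}

for p in (0.1, 0.5, 0.9, 1.0):
    print(p, top_p_candidates(probs, p))
```

At p = 0.1 only the single most probable token survives, which is equivalent to greedy decoding for this step. At p = 1.0 every token, including the implausible tail, remains a candidate.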

Malaysian Context — Deploying Generative AI Responsibly

Decoding parameters such as nucleus sampling are practical levers for Malaysian organisations building applications on top of large language models. Banks including Maybank and CIMB, telecommunications operators such as TM and Maxis, and government service providers that deploy chatbots and document assistants must tune top-p and temperature to match the risk profile of each use case, favouring conservative settings for regulated financial or medical guidance and more permissive settings for marketing and content generation.

This tuning intersects with Malaysia's governance expectations. Malaysia's National Guidelines on AI Governance and Ethics (AIGE) and the work of the National AI Office (NAIO) emphasise reliability and reduced hallucination, and constraining a model's sampling toward higher-probability tokens is one concrete technique teams use to help limit unpredictable or fabricated output in customer-facing systems.

For Malaysia's multilingual environment, decoding settings also affect performance across Bahasa Malaysia, English, Mandarin, and Tamil, where token distributions differ and code-switching is common. Developers at local startups and at research bodies such as MIMOS experiment with these parameters when adapting models for Malaysian languages and mixed-language prompts.

Training programmes supported by MDEC and HRD Corp increasingly cover prompt engineering and inference configuration, so that Malaysian engineers understand not only which model to use but how to control its generation behaviour for safe, on-brand results.

References

Holtzman, A., et al. (2019). The Curious Case of Neural Text Degeneration. arXiv.
Raschka, S. (2024). How do temperature, top-k, and top-p sampling differ? sebastianraschka.com.
Machine Learning Plus. (2025). LLM Temperature, Top-P, and Top-K Explained. machinelearningplus.com.

Tags: decoding, text generation, large language model, sampling

Also known as	Top-p sampling
Type	Decoding strategy
Key parameter	p (e.g., 0.9)
Introduced	2019 (Holtzman et al.)
Key use	Controlling text generation quality
Related	Temperature, top-k, beam search