Tokenisation

Tokenisation is the process of breaking text into discrete units called tokens — which may represent words, subwords, characters, or symbols — that serve as the fundamental input units for language models and other natural language processing systems.

6 min read · Last updated May 2026 · Foundations

Tokenisation (also spelled "tokenization" in American English) is the process of decomposing a text string into a sequence of discrete units called tokens, which serve as the atomic inputs to language models and most natural language processing systems. Because machine learning models operate on numerical representations rather than raw text, tokenisation is the essential bridge between human-readable language and the numerical computations performed inside a model: the tokeniser assigns each token a unique integer identifier from a fixed vocabulary, and that integer is then converted to a vector embedding for downstream processing.[^1] The design of the tokenisation scheme — which granularity of unit to use, how to handle unknown words, and how to represent languages with different morphological structures — has substantial practical consequences for model vocabulary size, training efficiency, multilingual capability, and inference cost.
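The path from token to integer ID to embedding vector can be sketched in a few lines. This is a minimal illustration, not how any particular model implements it: the vocabulary, the pre-segmented input, and the random embedding values are all hypothetical, and real models learn their embeddings during training.

```python
import random

# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"<unk>": 0, "token": 1, "isation": 2, " is": 3, " useful": 4}

def encode(pieces):
    """Map already-segmented token strings to integer IDs (unknown -> <unk>)."""
    return [vocab.get(p, vocab["<unk>"]) for p in pieces]

# One vector per vocabulary entry. Random values stand in for learned ones.
random.seed(0)
EMBED_DIM = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab
]

def embed(ids):
    """Look up the embedding vector for each token ID."""
    return [embedding_table[i] for i in ids]

ids = encode(["token", "isation", " is", " useful"])
vectors = embed(ids)
print(ids)                            # [1, 2, 3, 4]
print(len(vectors), len(vectors[0]))  # 4 4
```

The model itself never sees the strings, only the IDs and the vectors they index.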

Why Text Must Be Tokenised

Neural language models do not process text as a continuous stream of characters the way a human reader might. Instead, they require a fixed-size vocabulary of discrete symbols, each associated with a learnable vector embedding. If each unique word in the training corpus were a separate vocabulary entry, vocabularies would number in the millions — too large to train efficiently — and any word not seen during training would be unrepresentable. If each character were a separate token, sequences would be extremely long, straining the model's context window and requiring the model to learn word-level semantics entirely from character combinations. Tokenisation strategies balance these competing pressures by operating at an intermediate level of granularity — typically the subword level — that keeps vocabularies manageable while handling rare and out-of-vocabulary words gracefully.[^2]
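The trade-off between the three granularities can be made concrete with a small comparison. The sentence, the word-level vocabulary, and the subword segmentation below are hypothetical examples chosen for illustration; a real tokeniser would produce its own splits.

```python
text = "tokenisers handle unfamiliar words"

# Word-level: short sequence, but any unseen word is unrepresentable.
word_vocab = {"tokenisers", "handle", "words"}  # hypothetical training vocabulary
word_tokens = text.split()
unknown_words = [w for w in word_tokens if w not in word_vocab]

# Character-level: nothing is ever unknown, but the sequence is much longer.
char_tokens = list(text)

# Subword-level: a hypothetical segmentation between the two extremes.
subword_tokens = ["token", "isers", " handle", " un", "familiar", " words"]

print(len(word_tokens), unknown_words)  # 4 ['unfamiliar']
print(len(char_tokens))                 # 34
print(len(subword_tokens))              # 6
```

The subword sequence is slightly longer than the word sequence but covers the unseen word without an unknown token.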

Tokenisation Algorithms

Byte Pair Encoding (BPE)

BPE, originally developed for data compression, is the most widely used tokenisation algorithm in large language models. Beginning with a vocabulary of individual characters or bytes, BPE iteratively merges the most frequent pair of adjacent tokens into a single new token, repeating until the vocabulary reaches the desired size. The result is a vocabulary in which common words become single tokens and rarer words are assembled from subword fragments. GPT-4 and Llama 3 use byte-level variants of BPE, with vocabularies of roughly 100,000 and 128,000 tokens respectively.
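The merge loop described above can be sketched as follows. This is a simplified character-level version over a hypothetical word-frequency table; production tokenisers typically operate on bytes, apply pre-tokenisation rules, and use far more efficient data structures.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict; returns the merge list."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to first seen
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + freq
        corpus = new_corpus
    return merges

# Hypothetical word counts, for illustration only.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(toy_counts, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each learned merge adds one entry to the vocabulary, so the number of merges directly controls the final vocabulary size.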

WordPiece

WordPiece is BPE's closest relative and is used in BERT, the influential Transformer encoder model developed by Google. Rather than merging the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training corpus under a language model. Pieces that continue a word rather than begin it are prefixed with ## (e.g., "running" → "run" + "##ning"). BERT's WordPiece vocabularies contain approximately 30,000 tokens.
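The ## convention is easiest to see at encoding time. The sketch below shows greedy longest-match-first segmentation of a single word, the approach used by BERT-style tokenisers when applying an already trained vocabulary. It does not cover the likelihood-based training step, and the vocabulary is a hypothetical toy.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"run", "##ning", "##s", "walk", "##ing"}
print(wordpiece_encode("running", toy_vocab))  # ['run', '##ning']
print(wordpiece_encode("walks", toy_vocab))    # ['walk', '##s']
print(wordpiece_encode("jump", toy_vocab))     # ['[UNK]']
```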

SentencePiece

SentencePiece, developed by Google, treats the input as a raw stream of Unicode characters in which whitespace is an ordinary symbol, so it does not depend on any prior whitespace-based word segmentation. This makes it language-agnostic: it suits languages such as Chinese, Japanese, and Thai, which do not separate words with spaces, as well as space-delimited languages such as Malay and English. SentencePiece is used in many multilingual models and in models designed for low-resource languages. It supports both BPE and Unigram language model tokenisation strategies.
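SentencePiece makes whitespace an ordinary symbol by replacing each space with the visible character "▁" (U+2581) before segmentation, which lets text be reconstructed exactly by concatenating the pieces. The sketch below imitates only that convention in plain Python; it does not call the SentencePiece library, and the piece boundaries are a hypothetical segmentation.

```python
WS = "\u2581"  # visible meta symbol that stands in for a space

def escape_whitespace(text):
    """Mark spaces explicitly and add a leading marker, mimicking the convention."""
    return WS + text.replace(" ", WS)

def detokenise(pieces):
    """Concatenate pieces and restore spaces; no language-specific rules needed."""
    return "".join(pieces).replace(WS, " ")[1:]  # drop the leading marker

text = "Saya suka nasi lemak"
escaped = escape_whitespace(text)

# Hypothetical segmentation that a trained model might produce.
pieces = [WS + "Saya", WS + "suka", WS + "nasi", WS + "le", "mak"]

print(escaped)
print(detokenise(pieces) == text)  # True
```

Because the space markers travel inside the pieces, the same procedure works unchanged for scripts that do not use spaces at all.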

Unigram Language Model

The Unigram tokenisation algorithm takes a different approach: starting with a large initial vocabulary, it iteratively removes the tokens whose removal least reduces the training corpus likelihood, until the vocabulary reaches the target size. Because every token carries an explicit probability, a Unigram tokeniser can score alternative segmentations of the same text and select the most probable one, which is useful for morphologically complex languages.
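Once a Unigram vocabulary exists, choosing a segmentation is a shortest-path style search: the best split is the one whose token probabilities multiply to the largest value. The sketch below shows that search with dynamic programming (Viterbi). The token probabilities are invented for illustration, and the vocabulary pruning step used in training is not shown.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-probability segmentation of text under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i] = best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i] = start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

# Hypothetical token probabilities, for illustration only.
probs = {"makan": 0.05, "an": 0.04, "ma": 0.02, "kan": 0.03,
         "m": 0.01, "a": 0.01, "k": 0.01, "n": 0.01}
log_probs = {t: math.log(p) for t, p in probs.items()}

tokens, score = viterbi_segment("makanan", log_probs)
print(tokens)  # ['makan', 'an']
```

Here "makan" + "an" wins over "ma" + "kan" + "an" because two moderately likely tokens outscore three.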

Tokens in Practice

The relationship between characters, words, and tokens varies by language and content type. In English, one token typically corresponds to approximately four characters or three-quarters of a word, meaning that 100 tokens represent roughly 75 words of English text. Code tends to tokenise less efficiently than prose because programming syntax includes many special characters and unconventional spacing. Non-Latin scripts — including Arabic, Chinese, Japanese, Korean, and Thai — may tokenise less efficiently than English in models whose vocabulary was trained predominantly on English data, meaning that equivalent passages in these languages consume more tokens and thus more of the model's context window.
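The rules of thumb above translate into a quick estimator. This sketch applies only the English ratios quoted in the text; it is an approximation for planning purposes, and the function names are hypothetical. For exact counts, run the tokeniser of the specific model in use.

```python
CHARS_PER_TOKEN = 4      # English rule of thumb quoted above
WORDS_PER_TOKEN = 0.75   # English rule of thumb quoted above

def estimate_tokens_from_chars(text):
    """Rough token count from character length (English prose only)."""
    return round(len(text) / CHARS_PER_TOKEN)

def estimate_tokens_from_words(word_count):
    """Rough token count from a word count (English prose only)."""
    return round(word_count / WORDS_PER_TOKEN)

print(estimate_tokens_from_words(75))         # 100
print(estimate_tokens_from_chars("a" * 400))  # 100
```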

This asymmetry has practical pricing implications for users of commercial language model APIs, which charge per token of input and output. A business deploying a language model for Malay-language customer service may incur higher costs per conversation than an equivalent English deployment if the tokeniser is not optimised for Malay vocabulary.
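A back-of-envelope comparison shows how tokenisation efficiency feeds into cost. Every number below is an assumption for illustration: the price, the conversation length, and especially the Malay tokens-per-word ratio are placeholders rather than measurements, and a single blended price is used even though providers often price input and output tokens differently.

```python
def conversation_cost(words, tokens_per_word, price_per_million_tokens):
    """Estimated cost of one conversation, in the currency unit of the price."""
    tokens = words * tokens_per_word
    return tokens * price_per_million_tokens / 1_000_000

PRICE = 2.00  # hypothetical price per million tokens
WORDS = 600   # hypothetical words exchanged per conversation

english = conversation_cost(WORDS, 4 / 3, PRICE)  # about 1.33 tokens/word (0.75 rule)
malay = conversation_cost(WORDS, 2.0, PRICE)      # assumed ratio, not a measurement

print(f"English: {english:.4f}  Malay: {malay:.4f}  ratio: {malay / english:.2f}")
# English: 0.0016  Malay: 0.0024  ratio: 1.50
```

Replacing the assumed ratios with tokens-per-word figures measured on real transcripts turns this into a usable estimate.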

Tokenisation and Context Windows

The context window of a language model — the maximum amount of text it can process in a single forward pass — is measured in tokens, not words or characters. A model with a 128,000-token context window can process approximately 96,000 words of English text. Understanding tokenisation is therefore essential for developers building applications that need to reason about how much content can fit in a single model call, or for estimating the API cost of a given workload.
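The arithmetic above, plus a simple budget check, can be sketched as follows. The sketch assumes the window is shared between the prompt and the generated output, which holds for many models but should be confirmed for the one in use; the function names are hypothetical.

```python
def approx_english_words(context_window, words_per_token=0.75):
    """Approximate English word capacity of a context window."""
    return int(context_window * words_per_token)

def fits_in_context(prompt_tokens, max_output_tokens, context_window=128_000):
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

print(approx_english_words(128_000))    # 96000
print(fits_in_context(120_000, 4_000))  # True
print(fits_in_context(126_000, 4_000))  # False
```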

Malaysian Context — Multilingual Tokenisation and Bahasa Malaysia

Tokenisation has direct implications for the quality and cost of AI applications in Malaysia, where the primary written languages include Bahasa Malaysia (Malay), English, Mandarin Chinese, and Tamil. Most large language models were pre-trained on corpora heavily dominated by English, and their tokenisers were optimised accordingly. This means that Bahasa Malaysia text typically tokenises less efficiently than English — requiring more tokens per word — because Malay vocabulary is less well-represented in BPE merge tables.
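Tokens per word is a simple way to quantify this gap. The sketch below measures it with a toy longest-match tokeniser and a deliberately English-skewed vocabulary, both hypothetical, so the resulting numbers are exaggerated and illustrate the mechanism only. For a real comparison, swap in the tokeniser of the model under evaluation and use representative text.

```python
def greedy_tokenise(word, vocab):
    """Longest-match segmentation with single-character fallback."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Accept the longest vocabulary match, or fall back to one character.
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

def tokens_per_word(sentence, vocab):
    """Average number of tokens needed per whitespace-separated word."""
    words = sentence.lower().split()
    return sum(len(greedy_tokenise(w, vocab)) for w in words) / len(words)

# Hypothetical vocabulary: whole English words, but mostly fragments for Malay.
vocab = {"the", "customer", "wants", "to", "return", "item",
         "pe", "lang", "gan", "mem", "kan", "barang"}

en = tokens_per_word("The customer wants to return the item", vocab)
ms = tokens_per_word("Pelanggan mahu memulangkan barang", vocab)
print(en, ms)  # 1.0 3.0
```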

Research conducted at Universiti Teknologi Malaysia (UTM) and Universiti Kebangsaan Malaysia (UKM) has examined the performance gap between English and Malay in standard language models, attributing part of the gap to sub-optimal tokenisation. Malay's agglutinative morphology, in which prefixes and suffixes are extensively combined (e.g., "mempermasalahkan"), means that individual tokens often capture only fragments of meaningful morphemes, degrading downstream task performance. Research groups at UTM and Malaysian AI startups supported through MDEC have worked on Malay-optimised tokenisers, using SentencePiece and BPE trained on large Malay corpora.

The Malaysian government's push for sovereign AI capability, articulated in the National AI Roadmap, includes a focus on developing locally trained models and tokenisers that handle Bahasa Malaysia, Sabah and Sarawak regional languages, and Malaysian English (Manglish) effectively. MIMOS Berhad, the national applied research centre, has worked on Malay-language NLP infrastructure including tokenisation tools distributed to researchers through its open-source initiatives.

For Malaysian businesses using commercial LLM APIs, including Azure OpenAI, Google Vertex AI, and AWS Bedrock, tokenisation efficiency is a commercial consideration. Conducting customer service or document processing in Bahasa Malaysia using models with English-optimised tokenisers can result in significantly higher per-transaction costs. Selecting a model whose tokeniser was trained on multilingual data that includes Malay, for example a SentencePiece-based tokeniser built from a multilingual corpus, is a practical optimisation available to cost-conscious deployments.

References

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909. ACL 2016.
Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226. EMNLP 2018.
Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. arXiv:1804.10959. ACL 2018.
DataCamp. (2024). Tokenization in NLP: How It Works, Challenges, and Use Cases. DataCamp Blog.

Tags: tokenisation, tokenization, nlp, language-models

Type: Text pre-processing technique
Output unit: Token (word, subword, character, or byte)
Common algorithms: BPE, WordPiece, SentencePiece, Unigram
Used in: All language models, search engines, NLP pipelines
Related: Embedding, context window, natural language processing