AIWiki
Malaysia

Tokenisation

Tokenisation is the process of breaking text into discrete units called tokens — which may represent words, subwords, characters, or symbols — that serve as the fundamental input units for language models and other natural language processing systems.

6 min read | Last updated May 2026 | Foundations

Tokenisation (also spelled "tokenization" in American English) is the process of decomposing a text string into a sequence of discrete units called tokens, which serve as the atomic inputs to language models and most natural language processing systems. Because machine learning models operate on numerical representations rather than raw text, tokenisation is the essential bridge between human-readable language and the numerical computations performed inside a model: the tokeniser assigns each token a unique integer identifier from a fixed vocabulary, and that integer is then converted to a vector embedding for downstream processing.[^1] The design of the tokenisation scheme — which granularity of unit to use, how to handle unknown words, and how to represent languages with different morphological structures — has substantial practical consequences for model vocabulary size, training efficiency, multilingual capability, and inference cost.
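The token-to-integer-to-vector pipeline described above can be sketched in a few lines of Python. This is a minimal illustration only: the vocabulary, the `<unk>` fallback token, the embedding size, and the random vectors are all assumptions made for the example, standing in for a real tokeniser's vocabulary and a model's learned embedding table.

```python
import random

# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"<unk>": 0, "token": 1, "isation": 2, "is": 3, "fun": 4}

# One small vector per vocabulary entry. Random values stand in for
# the embeddings a real model would learn during training.
random.seed(0)
EMBED_DIM = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab
]

def encode(tokens):
    """Map each token to its integer ID, using <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def embed(ids):
    """Look up the embedding vector for each token ID."""
    return [embedding_table[i] for i in ids]

tokens = ["token", "isation", "is", "fun"]
ids = encode(tokens)
vectors = embed(ids)
print(ids)                             # [1, 2, 3, 4]
print(len(vectors), len(vectors[0]))   # 4 4
```

Only the integer IDs cross the boundary between tokeniser and model; the embedding lookup belongs to the model itself.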

Why Text Must Be Tokenised

Neural language models do not process text as a continuous stream of characters the way a human reader might. Instead, they require a fixed-size vocabulary of discrete symbols, each associated with a learnable vector embedding. If each unique word in the training corpus were a separate vocabulary entry, vocabularies would number in the millions — too large to train efficiently — and any word not seen during training would be unrepresentable. If each character were a separate token, sequences would be extremely long, straining the model's context window and requiring the model to learn word-level semantics entirely from character combinations. Tokenisation strategies balance these competing pressures by operating at an intermediate level of granularity — typically the subword level — that keeps vocabularies manageable while handling rare and out-of-vocabulary words gracefully.[^2]
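The trade-off between the two extremes can be seen on a single sentence. The sketch below is illustrative only; the two-word vocabulary and the `<unk>` placeholder are assumptions chosen to make the out-of-vocabulary problem visible.

```python
text = "Tokenisation handles unseen words"

# Word-level: the sequence is short, but any word missing from the
# vocabulary collapses to a placeholder and its content is lost.
word_vocab = {"handles", "words"}  # hypothetical tiny vocabulary
word_tokens = [w if w in word_vocab else "<unk>" for w in text.split()]

# Character-level: nothing is ever unknown, but the sequence is far longer.
char_tokens = list(text)

print(word_tokens)                         # ['<unk>', 'handles', '<unk>', 'words']
print(len(word_tokens), len(char_tokens))  # 4 33
```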

Tokenisation Algorithms

Byte Pair Encoding (BPE)

BPE, originally developed for data compression, is the most widely used tokenisation algorithm in large language models. Beginning with a vocabulary of individual characters or bytes, BPE repeatedly merges the most frequent pair of adjacent tokens into a single new token, stopping when the vocabulary reaches the desired size. The result is a vocabulary in which common words become single tokens and rarer words are spelled out from subword fragments. GPT-4 and Llama 3 use byte-level variants of BPE, with vocabularies of roughly 100,000 and 128,000 tokens respectively; earlier Llama models used a much smaller BPE vocabulary of 32,000 tokens.
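The merge loop can be sketched as follows. This is a simplified teaching version, not the implementation used by any particular model: it starts from characters rather than bytes, takes a small hypothetical word-frequency table as its corpus, and breaks ties between equally frequent pairs by first-seen order.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} mapping."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus, replacing each occurrence of the best pair.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_freqs, num_merges=2)
print(merges)
print(list(corpus))
```

After two merges on this toy corpus, the frequent ending "est" has become a single symbol, while "low" and "lower" are still spelled out character by character.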

WordPiece

WordPiece is closely related to BPE and is used in BERT, the influential Transformer encoder model developed by Google. Rather than merging the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training corpus under a language model. Pieces that continue a word, rather than start one, are prefixed with ## (for example, "running" might be split into "run" + "##ning"). BERT's WordPiece vocabulary contains approximately 30,000 tokens.
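Once a WordPiece vocabulary exists, a word is commonly segmented by greedy longest-match-first lookup, as in BERT's tokeniser. The sketch below shows that segmentation step only, not the likelihood-based training procedure; the five-entry vocabulary is hypothetical and the function name is chosen for this example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"run", "##ning", "##s", "walk", "##ing"}  # hypothetical vocabulary
print(wordpiece_tokenize("running", vocab))  # ['run', '##ning']
print(wordpiece_tokenize("walking", vocab))  # ['walk', '##ing']
print(wordpiece_tokenize("xyz", vocab))      # ['[UNK]']
```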

SentencePiece

SentencePiece, developed by Google, operates directly on raw text as a sequence of Unicode characters, without assuming any prior whitespace-based word segmentation; whitespace is treated as an ordinary symbol and encoded with a visible marker (▁) so that the original text can be recovered exactly. This makes it language-agnostic: it suits languages such as Chinese, Japanese, and Thai, whose written forms do not separate words with spaces, and it works equally well for space-delimited languages such as Malay. SentencePiece is used in many multilingual models and in models designed for low-resource languages. It supports both BPE and Unigram language model tokenisation strategies.
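SentencePiece makes segmentation reversible by escaping each space as the visible marker "▁" (U+2581) before splitting, so that joining the pieces and restoring the spaces yields the original text. The pure-Python sketch below imitates that idea without calling the SentencePiece library; the greedy longest-match segmenter and both vocabularies are assumptions standing in for a trained BPE or Unigram model.

```python
META = "\u2581"  # the visible whitespace marker

def to_pieces(text, vocab):
    """Escape spaces, then segment by greedy longest match.

    Any character not covered by the vocabulary becomes its own piece.
    """
    escaped = text.replace(" ", META)
    pieces, i = [], 0
    while i < len(escaped):
        for j in range(len(escaped), i, -1):
            if escaped[i:j] in vocab or j == i + 1:
                pieces.append(escaped[i:j])
                i = j
                break
    return pieces

def from_pieces(pieces):
    """Detokenise: concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(META, " ")

# Hypothetical vocabulary for a space-delimited (Malay) sentence.
malay_vocab = {"saya", META + "suka", META + "nasi", META + "le", "mak"}
text = "saya suka nasi lemak"
pieces = to_pieces(text, malay_vocab)
print(pieces)
print(from_pieces(pieces) == text)  # True

# The same code handles text written without spaces.
print(to_pieces("東京都", {"東京", "都"}))  # ['東京', '都']
```

Because the space is carried inside the pieces themselves, no language-specific rules are needed to put the text back together.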

Unigram Language Model

The Unigram tokenisation algorithm takes the opposite approach to BPE: starting with a large initial vocabulary, it iteratively removes the tokens whose removal least reduces the likelihood of the training corpus, until the vocabulary reaches the target size. Because every token carries a probability, a Unigram tokeniser can score alternative segmentations of the same text and choose the most likely one, which can be particularly useful for morphologically complex languages.
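Given a Unigram vocabulary with per-token probabilities, the most likely segmentation of a string can be found with a Viterbi-style dynamic programme. The sketch below covers that segmentation step only, not the vocabulary-pruning training loop, and the probabilities are invented for illustration.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-probability segmentation and its log-probability."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best log-prob of text[:i]
    back = [0] * (n + 1)                  # start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]

# Hypothetical token probabilities.
probs = {"un": 0.1, "related": 0.05, "rel": 0.03, "ated": 0.03,
         "unrelated": 0.001}
log_probs = {tok: math.log(p) for tok, p in probs.items()}

pieces, score = viterbi_segment("unrelated", log_probs)
print(pieces)  # ['un', 'related']
```

Here "un" + "related" (probability 0.005) beats both the single token "unrelated" (0.001) and the three-piece split "un" + "rel" + "ated" (0.00009).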

Tokens in Practice

The relationship between characters, words, and tokens varies by language and content type. In English, one token typically corresponds to approximately four characters or three-quarters of a word, meaning that 100 tokens represent roughly 75 words of English text. Code tends to tokenise less efficiently than prose because programming syntax includes many special characters and unconventional spacing. Non-Latin scripts — including Arabic, Chinese, Japanese, Korean, and Thai — may tokenise less efficiently than English in models whose vocabulary was trained predominantly on English data, meaning that equivalent passages in these languages consume more tokens and thus more of the model's context window.
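The English rules of thumb above translate into a quick estimator. This is a rough sketch that applies only the two ratios quoted in the text; real counts depend on the specific tokeniser and content, and the function names are chosen for this example.

```python
CHARS_PER_TOKEN = 4.0   # English rule of thumb
WORDS_PER_TOKEN = 0.75  # English rule of thumb

def estimate_tokens_from_words(n_words):
    """Approximate token count for n_words of English prose."""
    return round(n_words / WORDS_PER_TOKEN)

def estimate_tokens_from_chars(n_chars):
    """Approximate token count for n_chars of English prose."""
    return round(n_chars / CHARS_PER_TOKEN)

print(estimate_tokens_from_words(75))   # 100
print(estimate_tokens_from_chars(400))  # 100
```

For code or for languages that tokenise less efficiently, these ratios underestimate the true count and should be replaced with figures measured on representative text.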

This asymmetry has practical pricing implications for users of commercial language model APIs, which charge per token of input and output. A business deploying a language model for Malay-language customer service may incur higher costs per conversation than an equivalent English deployment if the tokeniser is not optimised for Malay vocabulary.
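The pricing effect can be made concrete with a small calculation. Every number below other than the English ratio is a placeholder: the Malay tokens-per-word figure and the price are hypothetical and should be replaced with values measured on a real tokeniser and taken from a real price list.

```python
def conversation_cost(n_words, tokens_per_word, price_per_1k_tokens):
    """Estimated cost of processing n_words of text."""
    return n_words * tokens_per_word / 1000 * price_per_1k_tokens

ENGLISH_TOKENS_PER_WORD = 1 / 0.75  # from the three-quarters-of-a-word rule
MALAY_TOKENS_PER_WORD = 2.0         # hypothetical; measure on your own data
PRICE_PER_1K_TOKENS = 0.01          # hypothetical price, arbitrary currency

english = conversation_cost(600, ENGLISH_TOKENS_PER_WORD, PRICE_PER_1K_TOKENS)
malay = conversation_cost(600, MALAY_TOKENS_PER_WORD, PRICE_PER_1K_TOKENS)
print(f"cost ratio (Malay / English): {malay / english:.2f}")
```

Under these assumed figures the same 600-word conversation costs one and a half times as much in Malay; the ratio depends only on tokens per word, not on the price.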

Tokenisation and Context Windows

The context window of a language model — the maximum amount of text it can process in a single forward pass — is measured in tokens, not words or characters. A model with a 128,000-token context window can process approximately 96,000 words of English text. Understanding tokenisation is therefore essential for developers building applications that need to reason about how much content can fit in a single model call, or for estimating the API cost of a given workload.
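A developer can turn this into a simple budget check before making a model call. The sketch below assumes English prose and the three-quarters-of-a-word rule of thumb; the function name and the idea of reserving part of the window for the model's output are illustrative choices, not a feature of any particular API.

```python
WORDS_PER_TOKEN = 0.75  # English rule of thumb

def fits_in_context(prompt_words, reserved_output_tokens,
                    context_window=128_000):
    """Check whether a prompt plus reserved output fits in the window."""
    prompt_tokens = round(prompt_words / WORDS_PER_TOKEN)
    return prompt_tokens + reserved_output_tokens <= context_window

print(int(128_000 * WORDS_PER_TOKEN))  # 96000 words of English, approximately
print(fits_in_context(90_000, 4_000))  # True
print(fits_in_context(96_000, 4_000))  # False: no room left for output
```

For a production system, counting tokens with the model's actual tokeniser is more reliable than any word-based estimate.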

References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909. ACL 2016.
  2. Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226. EMNLP 2018.
  3. Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. arXiv:1804.10959. ACL 2018.
  4. DataCamp. (2024). Tokenization in NLP: How It Works, Challenges, and Use Cases. DataCamp Blog.