What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Token

A token is the smallest unit of text processed by a large language model, typically representing a word, subword, or character used as the fundamental input and output element during inference.

6 min readLast updated June 2026Foundations

A token is the fundamental unit of text that a large language model (LLM) reads and produces. Before any text can be processed by a neural network, it must be converted from a raw string into a sequence of discrete numerical identifiers — each of these identifiers corresponds to a token. The process of converting text into tokens is called tokenisation, and the reverse process is called detokenisation.

Definition and Scope

In the context of LLMs, a token does not map neatly onto a human reading unit such as a word or sentence. Instead, a token is a contiguous substring of text defined by a vocabulary that the model was trained on. Common tokenisation schemes — such as Byte Pair Encoding (BPE) or SentencePiece — learn a vocabulary of tokens by iteratively merging the most frequent character pairs in a training corpus. The result is that common English words are typically a single token ("the", "is", "run"), while longer or less frequent words are split into multiple tokens ("un" + "expected" + "ly", for example). Punctuation marks and whitespace are often encoded as separate tokens as well.

On average, one token corresponds to roughly three to four characters of English text, or approximately 0.75 words. A sentence of ten words therefore typically maps to roughly 13-15 tokens. This ratio varies considerably across languages: languages with large character sets or agglutinative morphology — such as Arabic, Finnish, or many Southeast Asian languages — tend to require more tokens per word than English, which has practical implications for multilingual applications.

How Tokens Are Processed

When a user submits a prompt to an LLM, the model does not receive the raw text. Instead, a tokeniser converts the string into a list of integer identifiers drawn from the model vocabulary, which may range from a few thousand entries in early models to over 100,000 entries in more recent systems. These integers are looked up in an embedding table to produce dense vector representations, which are then passed through the layers of the transformer architecture.

The model generates output one token at a time in a process called autoregressive decoding. At each step, the model assigns a probability distribution over the entire vocabulary and samples or selects the next token. The chosen token is appended to the input, and the process repeats until a special end-of-sequence token is produced or a maximum length is reached.

Context Window and Token Limits

The context window of a model defines the maximum total number of tokens — combining both the input prompt and the generated output — that the model can consider at once. Early GPT-class models supported 2,048 tokens; contemporary models support context windows ranging from 8,192 tokens to over 1 million tokens, enabling document-length analysis and extended conversations.

Staying within the context window is a practical constraint for developers. If a conversation history or document exceeds the token limit, earlier content must be truncated or summarised, potentially causing the model to lose relevant context.

Tokens and Pricing

Commercial LLM providers — including OpenAI, Anthropic, Google, and Cohere — price their APIs on a per-token basis, typically distinguishing between input tokens (the prompt sent to the model) and output tokens (the text generated in response). Input tokens are generally cheaper than output tokens because generation requires additional computational passes. Understanding token counts is therefore essential for cost estimation and budget management when deploying LLM-based applications.

As of 2025, representative pricing for mid-tier models falls in the range of USD 0.50 to USD 5.00 per million input tokens and USD 1.50 to USD 15.00 per million output tokens, though these figures change frequently as competition intensifies.

Special Tokens

Beyond tokens representing ordinary text, LLM vocabularies include a set of special tokens that carry structural meaning. Common examples include beginning-of-sequence markers, end-of-sequence markers, padding tokens used to align batches to a uniform length, and unknown-word markers for characters outside the vocabulary. Instruction-tuned and chat models add further special tokens to delimit the roles of user, assistant, and system in multi-turn dialogues. These role-delimiter tokens are essential for the model to correctly interpret the structure of conversational input.

Tokens in Multimodal Models

As AI systems expand beyond text to handle images, audio, and video, the concept of a token generalises accordingly. Vision-language models such as GPT-4o and Gemini encode image patches as visual tokens, which are interleaved with text tokens in a shared sequence. Audio models such as Whisper convert mel-spectrogram frames into tokens before passing them to a transformer decoder. In each case, the tokenisation step serves the same function: converting a continuous signal into a discrete sequence that a transformer can process uniformly.

Malaysian Context — Tokens and Localisation

The token-centric design of LLMs has direct relevance for Malaysia's multilingual computing landscape. Malay (Bahasa Malaysia), the national language, is generally well-served by modern tokenisers because it uses the Latin script and has relatively consistent morphology. However, affixed forms such as "mempertimbangkan" may span three or more tokens. Malaysian users who write in Jawi (the Arabic-script form of Malay), Tamil, or Chinese characters face higher token counts per word, raising both cost and latency for applications that serve these communities.

Malaysian developers and AI researchers working under the MyDigital Blueprint and MDEC-supported programmes have identified tokenisation quality as a key consideration for building Bahasa Malaysia-native models. Institutions such as Universiti Malaya and Universiti Teknologi Malaysia have explored corpus-based tokenisation to improve the efficiency of Malay-language NLP systems. The HRD Corp-accredited AI training programmes offered through industry partners increasingly include modules on token budgeting and prompt optimisation as practical skills for enterprise deployments.

For businesses in Malaysia deploying LLM-based customer service, legal summarisation, or financial report analysis — sectors where BNM and SC Malaysia have issued AI-related guidance — token efficiency directly translates to operational cost. A customer support chatbot serving Malay, English, and Mandarin simultaneously may incur substantially different per-query costs depending on language mix, making token-aware prompt engineering a non-trivial engineering and financial consideration.

References

Sennrich, R., Haddow, B., and Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of ACL 2016.
Kudo, T., and Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of EMNLP 2018.
Brown, T. et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33.
OpenAI. (2023). GPT-4 Technical Report. OpenAI.
The New Stack. (2024). What Is an LLM Token: Beginner-Friendly Guide for Developers. thenewstack.io.

Tags:token tokenisation large language models nlp

Type	Fundamental unit of text
Used by	Large language models (LLMs)
Typical length	3-4 characters on average
Related	Tokenisation, context window, embedding
Significance	Determines model cost and capacity