Token
A token is the smallest unit of text processed by a large language model, typically representing a word, subword, or character used as the fundamental input and output element during inference.
A token is the fundamental unit of text that a large language model (LLM) reads and produces. Before any text can be processed by a neural network, it must be converted from a raw string into a sequence of discrete numerical identifiers — each of these identifiers corresponds to a token. The process of converting text into tokens is called tokenisation, and the reverse process is called detokenisation.
Definition and Scope
In the context of LLMs, a token does not map neatly onto a human reading unit such as a word or sentence. Instead, a token is a contiguous substring of text defined by a vocabulary that the model was trained on. Common tokenisation schemes — such as Byte Pair Encoding (BPE) or SentencePiece — learn a vocabulary of tokens by iteratively merging the most frequent character pairs in a training corpus. The result is that common English words are typically a single token ("the", "is", "run"), while longer or less frequent words are split into multiple tokens ("un" + "expected" + "ly", for example). Punctuation marks and whitespace are often encoded as separate tokens as well.
On average, one token corresponds to roughly three to four characters of English text, or approximately 0.75 words. A sentence of ten words therefore typically maps to roughly 13-15 tokens. This ratio varies considerably across languages: languages with large character sets or agglutinative morphology — such as Arabic, Finnish, or many Southeast Asian languages — tend to require more tokens per word than English, which has practical implications for multilingual applications.
How Tokens Are Processed
When a user submits a prompt to an LLM, the model does not receive the raw text. Instead, a tokeniser converts the string into a list of integer identifiers drawn from the model vocabulary, which may range from a few thousand entries in early models to over 100,000 entries in more recent systems. These integers are looked up in an embedding table to produce dense vector representations, which are then passed through the layers of the transformer architecture.
The model generates output one token at a time in a process called autoregressive decoding. At each step, the model assigns a probability distribution over the entire vocabulary and samples or selects the next token. The chosen token is appended to the input, and the process repeats until a special end-of-sequence token is produced or a maximum length is reached.
Context Window and Token Limits
The context window of a model defines the maximum total number of tokens — combining both the input prompt and the generated output — that the model can consider at once. Early GPT-class models supported 2,048 tokens; contemporary models support context windows ranging from 8,192 tokens to over 1 million tokens, enabling document-length analysis and extended conversations.
Staying within the context window is a practical constraint for developers. If a conversation history or document exceeds the token limit, earlier content must be truncated or summarised, potentially causing the model to lose relevant context.
Tokens and Pricing
Commercial LLM providers — including OpenAI, Anthropic, Google, and Cohere — price their APIs on a per-token basis, typically distinguishing between input tokens (the prompt sent to the model) and output tokens (the text generated in response). Input tokens are generally cheaper than output tokens because generation requires additional computational passes. Understanding token counts is therefore essential for cost estimation and budget management when deploying LLM-based applications.
As of 2025, representative pricing for mid-tier models falls in the range of USD 0.50 to USD 5.00 per million input tokens and USD 1.50 to USD 15.00 per million output tokens, though these figures change frequently as competition intensifies.
Special Tokens
Beyond tokens representing ordinary text, LLM vocabularies include a set of special tokens that carry structural meaning. Common examples include beginning-of-sequence markers, end-of-sequence markers, padding tokens used to align batches to a uniform length, and unknown-word markers for characters outside the vocabulary. Instruction-tuned and chat models add further special tokens to delimit the roles of user, assistant, and system in multi-turn dialogues. These role-delimiter tokens are essential for the model to correctly interpret the structure of conversational input.
Tokens in Multimodal Models
As AI systems expand beyond text to handle images, audio, and video, the concept of a token generalises accordingly. Vision-language models such as GPT-4o and Gemini encode image patches as visual tokens, which are interleaved with text tokens in a shared sequence. Audio models such as Whisper convert mel-spectrogram frames into tokens before passing them to a transformer decoder. In each case, the tokenisation step serves the same function: converting a continuous signal into a discrete sequence that a transformer can process uniformly.
References
- Sennrich, R., Haddow, B., and Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of ACL 2016.
- Kudo, T., and Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of EMNLP 2018.
- Brown, T. et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33.
- OpenAI. (2023). GPT-4 Technical Report. OpenAI.
- The New Stack. (2024). What Is an LLM Token: Beginner-Friendly Guide for Developers. thenewstack.io.