What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

BM25

BM25 (Best Matching 25) is a probabilistic ranking function used in information retrieval that scores documents based on query term frequency, inverse document frequency, and document length normalisation.

7 min readLast updated June 2026Foundations

BM25 (Best Matching 25) is a probabilistic ranking function used in information retrieval to rank documents in response to a query. Developed by Stephen Robertson, Karen Sparck Jones, and colleagues at City University London and Microsoft Research in the 1990s, BM25 remains one of the most widely deployed retrieval algorithms in production search systems three decades after its introduction. It is the default ranking function in Apache Lucene and, by extension, Elasticsearch and OpenSearch — the backbone of enterprise search infrastructure worldwide.

Historical Context

BM25 belongs to the family of probabilistic retrieval models rooted in the Probability Ranking Principle (PRP), which states that documents should be ranked in decreasing order of the probability of relevance given the query. The Okapi BM25 variant, which gave the algorithm its name, was first deployed in the Okapi information retrieval system developed at City University London and received its current name from the 25th iteration of experiments refining its parameters.

Before BM25, TF-IDF was the dominant retrieval function. TF-IDF scores a document based on the product of term frequency — how often a query term appears in the document — and inverse document frequency — how rare the term is across all documents. BM25 extends this foundation by introducing two important modifications that improve ranking quality on real-world corpora.

Scoring Formula

The BM25 score for a document D given query Q is the sum over query terms of per-term relevance scores. For each term q_i, the contribution is the product of its inverse document frequency and a saturated, length-normalised term frequency component. The normalised TF component saturates as term frequency increases, controlled by parameter k1 (typically 1.2 to 2.0). The length normalisation is governed by parameter b (typically 0.75), which penalises documents longer than the corpus average and rewards shorter, focused documents that contain the query term.

The IDF component is computed as log((N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1), where N is the total number of documents and n(q_i) is the number of documents containing term q_i. Rare terms receive high IDF values; ubiquitous terms receive low IDF values.

Key Innovations over TF-IDF

Term frequency saturation is the first key innovation. In vanilla TF-IDF, the contribution of a term grows linearly with its frequency in a document — a document mentioning a term one hundred times scores one hundred times higher than one mentioning it once. BM25 introduces diminishing returns: each additional occurrence of a term contributes less than the previous one, eventually saturating. This prevents documents that artificially repeat query terms from dominating rankings.

Document length normalisation is the second key innovation. Without normalisation, longer documents score higher simply because they are more likely to contain any given term by chance. BM25 normalises by the ratio of document length to average document length, penalising unusually long documents and rewarding concise, focused matches.

BM25 Variants

Several variants of BM25 address specific retrieval scenarios. BM25+ adds a lower bound on term frequency weight to ensure that documents containing a query term always receive some credit, even for very long documents where the length normalisation would otherwise suppress the score. BM25F extends BM25 to structured documents with multiple fields — title, body, abstract — allowing different weights and length normalisations for each field. BM25-Adpt adaptively sets k1 per term based on corpus statistics rather than using a fixed global value.

Strengths and Limitations

BM25 is computationally efficient because it operates on pre-built inverted indices — data structures that map each vocabulary term to the list of documents containing it. Retrieval involves looking up query terms in the index and computing scores using stored frequency statistics, which is orders of magnitude faster than computing dense vector comparisons across large corpora. BM25 requires no GPU, no pre-trained neural model, and no embedding computation, making it deployable on minimal infrastructure.

BM25 excels at exact keyword matching, product codes, proper nouns, identifiers, and technical terms that must appear verbatim. It is transparent and interpretable — it is straightforward to explain why a particular document ranked highly. It is language-agnostic and works for any tokenisable language without language-specific pre-training.

However, BM25 has well-known limitations. It cannot capture semantic equivalence: a query for "automobile" will not retrieve documents about "cars" unless there is lexical overlap. It treats each term independently and ignores word order and phrase-level semantics. It cannot handle morphological variation unless stemming or lemmatisation is applied in preprocessing. These limitations motivate the use of BM25 as the sparse component in hybrid search pipelines, where it is combined with dense semantic retrieval to compensate for its vocabulary mismatch problem.

Implementation

BM25 is implemented in Apache Lucene, the foundational Java library underlying Elasticsearch, OpenSearch, and Apache Solr. Python implementations include rank_bm25 and BM25Okapi in the nltk-based ecosystem. Pyserini provides a Python interface to Lucene-based BM25 indexing and retrieval, widely used in academic information retrieval research. Most vector databases that support hybrid search maintain an internal BM25 sparse index alongside the dense vector index.

Malaysian Context — BM25 in Malaysian Search Infrastructure

BM25 forms the retrieval backbone of many Malaysian enterprise and government search systems, largely through deployments of Elasticsearch and OpenSearch, both of which use Lucene's BM25 implementation as their default ranking function. Elasticsearch is widely deployed in Malaysia across e-commerce, financial services, and government sectors for powering search functionality in internal applications and customer-facing portals.

Malaysian e-commerce platforms, including Lazada Malaysia, Shopee Malaysia, and local marketplace operators, rely on inverted index retrieval — BM25 at its core — for product search, where exact matching of product names, SKUs, model numbers, and brand names is essential. A customer searching for a specific phone model must retrieve documents containing that exact model string, a task where BM25's precision advantages over purely semantic retrieval are pronounced.

Government digital services in Malaysia, including MyGovUC and portals operated by agencies under the Malaysia Communications and Multimedia Commission (MCMC) and the Economic Planning Unit (EPU), use document retrieval systems where BM25-based search provides the foundation for locating forms, regulations, and procedural guides. The transition from pure BM25 to hybrid search augmented with neural embeddings is an ongoing modernisation effort aligned with the MyDigital Blueprint's emphasis on AI-enhanced public services.

The challenge of BM25 in the Malaysian context is shared with all Malay-language retrieval: Bahasa Malaysia is an agglutinative language with rich morphological variation, and BM25's term-matching approach benefits from effective tokenisation and normalisation. Research groups at Universiti Teknologi Malaysia and Universiti Malaya have contributed to Malay tokenisers and stemmers that improve BM25 retrieval quality for Malay text. The National Language Council (Dewan Bahasa dan Pustaka) maintains authoritative Malay vocabulary resources that inform such normalisation pipelines.

References

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., & Gatford, M. (1994). Okapi at TREC-3. NIST Special Publication 500-225.
Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
Lin, J., & Nogueira, R. (2021). Pretrained Transformers for Text Ranking: BERT and Beyond. Synthesis Lectures on Human Language Technologies.
Kamphuis, C., de Vries, A. P., Boytsov, L., & Lin, J. (2020). Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants. ECIR 2020.
Apache Software Foundation. (2024). Apache Lucene Documentation: Similarity. Apache Lucene Project.

Tags:information-retrieval search nlp bm25 sparse-retrieval

Full name	Best Matching 25
Type	Probabilistic Ranking Function
Proposed by	Robertson et al., 1994
Parameters	k1 (term saturation), b (length normalisation)
Implementations	Elasticsearch, Lucene, OpenSearch, rank_bm25
Related	TF-IDF, Hybrid Search, Semantic Search, RAG