AIWiki
Malaysia

BM25

BM25 (Best Matching 25) is a probabilistic ranking function used in information retrieval that scores documents based on query term frequency, inverse document frequency, and document length normalisation.

7 min readLast updated June 2026Foundations

BM25 (Best Matching 25) is a probabilistic ranking function used in information retrieval to rank documents in response to a query. Developed by Stephen Robertson, Karen Sparck Jones, and colleagues at City University London and Microsoft Research in the 1990s, BM25 remains one of the most widely deployed retrieval algorithms in production search systems three decades after its introduction. It is the default ranking function in Apache Lucene and, by extension, Elasticsearch and OpenSearch — the backbone of enterprise search infrastructure worldwide.

Historical Context

BM25 belongs to the family of probabilistic retrieval models rooted in the Probability Ranking Principle (PRP), which states that documents should be ranked in decreasing order of the probability of relevance given the query. The Okapi BM25 variant, which gave the algorithm its name, was first deployed in the Okapi information retrieval system developed at City University London and received its current name from the 25th iteration of experiments refining its parameters.

Before BM25, TF-IDF was the dominant retrieval function. TF-IDF scores a document based on the product of term frequency — how often a query term appears in the document — and inverse document frequency — how rare the term is across all documents. BM25 extends this foundation by introducing two important modifications that improve ranking quality on real-world corpora.

Scoring Formula

The BM25 score for a document D given query Q is the sum over query terms of per-term relevance scores. For each term q_i, the contribution is the product of its inverse document frequency and a saturated, length-normalised term frequency component. The normalised TF component saturates as term frequency increases, controlled by parameter k1 (typically 1.2 to 2.0). The length normalisation is governed by parameter b (typically 0.75), which penalises documents longer than the corpus average and rewards shorter, focused documents that contain the query term.

The IDF component is computed as log((N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1), where N is the total number of documents and n(q_i) is the number of documents containing term q_i. Rare terms receive high IDF values; ubiquitous terms receive low IDF values.

Key Innovations over TF-IDF

Term frequency saturation is the first key innovation. In vanilla TF-IDF, the contribution of a term grows linearly with its frequency in a document — a document mentioning a term one hundred times scores one hundred times higher than one mentioning it once. BM25 introduces diminishing returns: each additional occurrence of a term contributes less than the previous one, eventually saturating. This prevents documents that artificially repeat query terms from dominating rankings.

Document length normalisation is the second key innovation. Without normalisation, longer documents score higher simply because they are more likely to contain any given term by chance. BM25 normalises by the ratio of document length to average document length, penalising unusually long documents and rewarding concise, focused matches.

BM25 Variants

Several variants of BM25 address specific retrieval scenarios. BM25+ adds a lower bound on term frequency weight to ensure that documents containing a query term always receive some credit, even for very long documents where the length normalisation would otherwise suppress the score. BM25F extends BM25 to structured documents with multiple fields — title, body, abstract — allowing different weights and length normalisations for each field. BM25-Adpt adaptively sets k1 per term based on corpus statistics rather than using a fixed global value.

Strengths and Limitations

BM25 is computationally efficient because it operates on pre-built inverted indices — data structures that map each vocabulary term to the list of documents containing it. Retrieval involves looking up query terms in the index and computing scores using stored frequency statistics, which is orders of magnitude faster than computing dense vector comparisons across large corpora. BM25 requires no GPU, no pre-trained neural model, and no embedding computation, making it deployable on minimal infrastructure.

BM25 excels at exact keyword matching, product codes, proper nouns, identifiers, and technical terms that must appear verbatim. It is transparent and interpretable — it is straightforward to explain why a particular document ranked highly. It is language-agnostic and works for any tokenisable language without language-specific pre-training.

However, BM25 has well-known limitations. It cannot capture semantic equivalence: a query for "automobile" will not retrieve documents about "cars" unless there is lexical overlap. It treats each term independently and ignores word order and phrase-level semantics. It cannot handle morphological variation unless stemming or lemmatisation is applied in preprocessing. These limitations motivate the use of BM25 as the sparse component in hybrid search pipelines, where it is combined with dense semantic retrieval to compensate for its vocabulary mismatch problem.

Implementation

BM25 is implemented in Apache Lucene, the foundational Java library underlying Elasticsearch, OpenSearch, and Apache Solr. Python implementations include rank_bm25 and BM25Okapi in the nltk-based ecosystem. Pyserini provides a Python interface to Lucene-based BM25 indexing and retrieval, widely used in academic information retrieval research. Most vector databases that support hybrid search maintain an internal BM25 sparse index alongside the dense vector index.

See Also

References

  1. Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., & Gatford, M. (1994). Okapi at TREC-3. NIST Special Publication 500-225.
  2. Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
  3. Lin, J., & Nogueira, R. (2021). Pretrained Transformers for Text Ranking: BERT and Beyond. Synthesis Lectures on Human Language Technologies.
  4. Kamphuis, C., de Vries, A. P., Boytsov, L., & Lin, J. (2020). Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants. ECIR 2020.
  5. Apache Software Foundation. (2024). Apache Lucene Documentation: Similarity. Apache Lucene Project.