Reranking

Reranking is a two-stage information retrieval technique in which a fast first-stage retriever generates candidate documents, and a more accurate but computationally expensive model re-scores and reorders them.

Last updated June 2026

Reranking is a two-stage information retrieval strategy. A computationally inexpensive first-stage retriever, such as BM25 or a bi-encoder dense retriever, rapidly selects a candidate set of potentially relevant documents from a large corpus; a more accurate but slower second-stage model then re-scores and reorders that candidate set to produce the final ranked list. This decouples the scalability requirements of retrieval from the accuracy requirements of relevance scoring, so production systems can apply expensive relevance models at query time without scoring every document in the corpus.

Motivation

Large document corpora make exhaustive pairwise relevance scoring impractical. A cross-encoder model that jointly processes a query and a document to compute their relevance score achieves high accuracy, but it requires one forward pass through a large Transformer for each query-document pair, so scoring a corpus of N documents costs N forward passes per query. At corpus scale, this is computationally infeasible during real-time search. First-stage retrievers, by contrast, use pre-computed document representations (an inverted index or an approximate nearest-neighbour index) to retrieve hundreds or thousands of candidates in milliseconds, at the cost of lower precision.

The two-stage pipeline reconciles these constraints: the first stage narrows the search space from millions of documents to tens or hundreds of candidates, and the reranker applies its more accurate scoring only to that small set, which keeps latency manageable.
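
The control flow of such a pipeline can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names, the parameters k and top_n, and both toy scoring functions are assumptions introduced here, standing in for a real first-stage retriever and a real reranking model.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    k: int = 100,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1 keeps the k best documents under cheap_score;
    stage 2 re-scores only those k documents with expensive_score."""
    # Stage 1: cheap scoring over the whole corpus (a real system would query an index).
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    # Stage 2: expensive scoring over the k candidates only.
    rescored = [(d, expensive_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


# Toy stand-ins for the two models (hypothetical, for illustration only).
def term_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


corpus = [
    "reranking reorders retrieved documents",
    "documents about cooking pasta at home",
    "a reranker scores query and documents jointly for reranking",
    "weather forecast for tomorrow",
]
results = retrieve_then_rerank(
    "reranking documents", corpus, term_overlap, overlap_ratio, k=3, top_n=2
)
```

The expensive scorer is called k times per query regardless of corpus size, which is the property the two-stage design relies on.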

First-Stage Retrievers

First-stage retrieval is typically performed by either sparse retrievers or dense bi-encoders.

BM25 is the canonical sparse retriever, scoring documents using term frequency, inverse document frequency, and document length normalisation. It is fast, interpretable, and requires no GPU.
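
As a rough sketch of how those three ingredients combine, the following Python function computes one common variant of the BM25 score. The parameter values k1 = 1.5 and b = 0.75 and the non-negative IDF form are assumptions chosen for illustration; implementations differ in these details, and a real system would score through an inverted index rather than loop over every document.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score each tokenised document against the tokenised query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Inverse document frequency: rare terms weigh more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer-than-average documents are penalised.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            # Term frequency saturates as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


bm25_docs = [
    "the cat sat on the mat".split(),
    "dogs chase the cat".split(),
    "stock markets fell sharply today".split(),
]
bm25_result = bm25_scores("cat mat".split(), bm25_docs)
```

In this toy corpus the first document matches both query terms, the second matches one, and the third matches none, so the scores come out in that order.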

Bi-encoder dense retrievers independently embed queries and documents into a shared vector space and retrieve by approximate nearest-neighbour search. Models such as DPR (Dense Passage Retrieval), Contriever, and E5 are common choices. Because query and document encodings are computed independently, document embeddings can be pre-computed and indexed, allowing fast retrieval at query time.
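
The split between offline indexing and query-time search can be illustrated with a toy example. Everything here is an assumption made for the sketch: the embed function is a hand-built stand-in for a learned encoder, the vocabulary is invented, and the brute-force loop takes the place of an approximate nearest-neighbour index.

```python
import math
from typing import Dict, List, Tuple

VOCAB = ["rerank", "retrieval", "pasta", "recipe", "weather"]  # toy vocabulary (assumption)


def embed(text: str) -> List[float]:
    """Toy stand-in for a learned encoder: L2-normalised term counts over VOCAB."""
    tokens = text.lower().split()
    vec = [float(tokens.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(docs: Dict[str, str]) -> Dict[str, List[float]]:
    """Offline step: embed each document once, independently of any query."""
    return {doc_id: embed(text) for doc_id, text in docs.items()}


def search(query: str, index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Query-time step: embed the query, then rank documents by dot product
    (equal to cosine similarity because the vectors are normalised).
    Brute force here; an ANN index would replace this scan at scale."""
    q = embed(query)
    scored = [(doc_id, sum(a * b for a, b in zip(q, vec))) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


index = build_index({
    "d1": "rerank retrieval rerank",
    "d2": "pasta recipe",
    "d3": "weather retrieval",
})
hits = search("retrieval rerank", index, k=2)
```

Only the query is encoded at search time; the document vectors are reused across all queries.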

Hybrid search, which combines BM25 and dense retrieval results, commonly by Reciprocal Rank Fusion (RRF), is increasingly used as the first stage to maximise the recall of the candidate set passed to the reranker.
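
Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. A minimal sketch follows; the constant k = 60 is the value commonly used in the literature, and the document identifiers are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document receives the sum, over the lists that
    contain it, of 1 / (k + rank), with rank starting at 1."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


bm25_ranking = ["d1", "d2", "d3"]    # hypothetical sparse results
dense_ranking = ["d3", "d1", "d4"]   # hypothetical dense results
fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that appear in both lists (d1 and d3 here) accumulate two contributions and rise above documents found by only one retriever.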

Cross-Encoder Rerankers

The dominant reranking architecture is the cross-encoder. A cross-encoder concatenates the query and the candidate document into a single input sequence, typically formatted as [CLS] query [SEP] document [SEP], and passes it through a Transformer encoder. The final [CLS] token embedding is projected to a scalar relevance score. Because the query and document are processed jointly, attention can capture fine-grained token-level interactions between query terms and document content, which bi-encoders cannot do because they encode each text independently.
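
The reranking step around such a model can be sketched as follows. The scorer is passed in as a function so the example runs without model weights; stub_scorer is a hypothetical placeholder, and format_pair only illustrates the input layout, since real tokenizers insert the special tokens themselves.

```python
from typing import Callable, List, Sequence, Tuple


def format_pair(query: str, document: str) -> str:
    """Illustrates the joint input layout fed to the encoder."""
    return f"[CLS] {query} [SEP] {document} [SEP]"


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and sort by descending score."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Hypothetical stub scorer: counts query-term occurrences in the document.
# With the sentence-transformers package installed, the predict method of a
# CrossEncoder instance could be passed as score_pairs instead.
def stub_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    return [float(sum(doc.lower().count(t) for t in q.lower().split())) for q, doc in pairs]


candidates = [
    "bm25 is a sparse retriever",
    "cross-encoder rerankers score pairs",
    "pasta recipe",
]
top = rerank("cross-encoder rerankers", candidates, stub_scorer, top_n=2)
```

Each candidate costs one scorer call on a (query, document) pair, which is why the candidate set has to stay small.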

Cross-encoders are typically initialised from pre-trained language models such as BERT, RoBERTa, or DeBERTa and fine-tuned on labelled relevance datasets. MS MARCO, a large-scale dataset of Bing search queries with passage relevance labels, is the most widely used training resource. Publicly available rerankers fine-tuned on MS MARCO include the cross-encoder/ms-marco family on Hugging Face and the monoT5 series; monoT5 is a sequence-to-sequence variant built on T5 rather than an encoder-only cross-encoder, and derives its relevance score from the probability of a generated target token.

LLM-Based Reranking

Large language models have been applied to reranking through pointwise and listwise approaches. In pointwise LLM reranking, the model is prompted to judge the relevance of each candidate document to the query separately, producing a relevance score or binary judgement. In listwise reranking, the model receives the full candidate list in one prompt and is asked to output a reordered ranking. Research has reported that LLMs such as GPT-4 can serve as zero-shot rerankers competitive with fine-tuned cross-encoders on some benchmarks, at substantially higher cost per query.
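
The two prompting styles, and the parsing that listwise output requires, might look like the sketch below. The prompt wording and the "[2] > [1] > [3]" output convention are assumptions made for illustration, and the LLM call itself is left out; a hard-coded string stands in for the model response.

```python
import re
from typing import List


def build_pointwise_prompt(query: str, document: str) -> str:
    """One prompt per candidate; the answer is a binary relevance judgement."""
    return (
        f"Query: {query}\nDocument: {document}\n"
        "Is the document relevant to the query? Answer 'yes' or 'no'."
    )


def build_listwise_prompt(query: str, documents: List[str]) -> str:
    """One prompt for the whole candidate list, with numbered identifiers."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        f"Query: {query}\nCandidates:\n{numbered}\n"
        "Rank the candidates from most to least relevant, e.g. [2] > [1] > [3]."
    )


def parse_listwise_output(output: str, n_docs: int) -> List[int]:
    """Turn '[2] > [3] > [1]' into zero-based indices [1, 2, 0].
    Out-of-range and repeated identifiers are dropped; candidates the model
    omitted are appended in their original order."""
    order: List[int] = []
    for match in re.findall(r"\[(\d+)\]", output):
        idx = int(match) - 1
        if 0 <= idx < n_docs and idx not in order:
            order.append(idx)
    missing = [i for i in range(n_docs) if i not in order]
    return order + missing


llm_docs = ["doc about pasta", "doc about reranking", "doc about retrieval"]
listwise_prompt = build_listwise_prompt("what is reranking?", llm_docs)
# Stand-in for a model response, including a repeat and an invalid identifier.
ranking = parse_listwise_output("[2] > [3] > [2] > [7]", len(llm_docs))
reordered = [llm_docs[i] for i in ranking]
```

Defensive parsing matters in the listwise case because a generated ranking can repeat, omit, or invent identifiers.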

Role in RAG Pipelines

Reranking has become a standard component of retrieval-augmented generation (RAG) systems. A RAG pipeline retrieves k candidate documents, optionally fusing BM25 and dense retrieval results, and passes them to a reranker that selects the top-n most relevant for inclusion in the language model prompt. Because language models have finite context windows and only the top-n documents reach the prompt, the quality of those documents directly affects the factual accuracy of generated answers. Studies on open-domain question answering have reported that adding a cross-encoder reranker between retrieval and generation reduces hallucination rates and improves answer correctness.
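
The hand-off from reranker to prompt can be sketched as a selection under a context budget. The function names, the prompt template, and the whitespace-based token count are all assumptions for illustration; a real system would count tokens with the generator model's own tokenizer.

```python
from typing import List, Tuple


def select_context(
    reranked: List[Tuple[str, float]], top_n: int, max_tokens: int
) -> List[str]:
    """Keep the highest-scoring passages, up to top_n, skipping any passage
    that would push the total past max_tokens.
    Token counts are approximated by whitespace splitting (an assumption)."""
    selected: List[str] = []
    used = 0
    for text, _score in sorted(reranked, key=lambda pair: pair[1], reverse=True):
        if len(selected) == top_n:
            break
        cost = len(text.split())
        if used + cost > max_tokens:
            continue  # does not fit; a shorter, lower-ranked passage still might
        selected.append(text)
        used += cost
    return selected


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a simple grounded-answer prompt (hypothetical template)."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


reranked = [
    ("Reranking reorders candidates with a stronger model.", 0.92),
    ("BM25 is a sparse retriever.", 0.35),
    ("A very long passage " + "filler " * 50, 0.80),
    ("Cross-encoders score query and document jointly.", 0.88),
]
passages = select_context(reranked, top_n=2, max_tokens=20)
rag_prompt = build_prompt("What does a reranker do?", passages)
```

With top_n=2 the two highest-scoring passages fit the budget and are selected; raising top_n to 3 would skip the long passage and fall through to the BM25 one.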

The latency of cross-encoder reranking depends on model size, candidate set size, and document length, since each candidate requires its own forward pass. As a rough guide, typical production rerankers score 50–100 candidates in 50–200 milliseconds on a modern GPU, which is acceptable for most interactive applications.

Commercial Reranking Services

Several AI providers offer reranking as a managed API service. Cohere Rerank is widely used in enterprise RAG deployments and supports multilingual reranking across over 100 languages. Jina AI offers the jina-reranker family, optimised for long documents, both through an API and as open-weight models. NVIDIA provides a reranking microservice within its NIM inference platform. These services allow teams to add reranking to existing search pipelines without maintaining their own model infrastructure.

References

  1. Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
  2. Nogueira, R., Yang, W., Lin, J., & Cho, K. (2020). Document Ranking with a Pretrained Sequence-to-Sequence Model. Findings of EMNLP 2020.
  3. Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021 Datasets Track.
  4. Cohere. (2025). Cohere Rerank API Documentation. Cohere Inc.
  5. NVIDIA. (2025). Reranking Microservice in NVIDIA NIM. NVIDIA Corporation.