
BERT Passage Re-ranking

  • The paper demonstrates that fine-tuning a pre-trained BERT as a cross-encoder over concatenated query–passage pairs yields significant improvements in ranking accuracy, raising MRR@10 on the MS MARCO dev set from 16.7% (BM25) to 36.5%.
  • Passage re-ranking with BERT is a technique that refines an initial set of candidates using deep, all-to-all token interactions to compute learned relevance scores, surpassing traditional term-based methods.
  • This approach leverages both supervised and weakly supervised fine-tuning along with efficiency optimizations like late-interaction models, offering state-of-the-art results on benchmarks like TREC Deep Learning.

Passage re-ranking with BERT refers to the process of refining an initial set of candidate passages, retrieved by a conventional information retrieval system (typically BM25 or another term-based ranker), using a BERT-based neural model that computes fine-grained, learned relevance scores for ranking. In this paradigm, BERT functions as a cross-encoder over the concatenated query–passage pairs, allowing for deep, contextual, and bidirectional interactions between the query and passage tokens. BERT-based re-ranking has established state-of-the-art effectiveness on public passage ranking benchmarks, substantially outperforming traditional lexical models.

1. Standard BERT-Based Passage Re-Ranking Architecture

The canonical approach to BERT-based passage re-ranking initializes from a pre-trained BERT model (typically BERT-Base or BERT-Large) and fine-tunes it for passage ranking using supervised or weakly supervised learning on large-scale query–passage relevance datasets. Each candidate passage is paired with the input query, and the joint token sequence is encoded as

[CLS] q₁ q₂ … qₙ [SEP] p₁ p₂ … pₘ [SEP]

where $q_1, \ldots, q_n$ and $p_1, \ldots, p_m$ are the WordPiece tokens of the query and the passage, respectively. Segment embeddings are used to distinguish the query from passage tokens, and the sequence is truncated to a maximum of 512 tokens, with common query lengths up to 64 and passages filling the remainder (Zerveas et al., 2020, Nogueira et al., 2019).
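A minimal sketch of this input construction, assuming the HuggingFace transformers library and the bert-base-uncased tokenizer (the query and passage strings are illustrative):

```python
# Build [CLS] query [SEP] passage [SEP] inputs with segment ids,
# truncating the passage so the pair fits BERT's 512-token limit.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

query = "what causes ocean tides"
passage = "Tides are caused by the gravitational pull of the moon and the sun."

enc = tokenizer(
    query,
    passage,
    max_length=512,            # BERT's hard sequence limit
    truncation="only_second",  # truncate the passage, never the query
    return_tensors="pt",
)
# token_type_ids are 0 over [CLS] q ... [SEP] and 1 over p ... [SEP]
print(enc["input_ids"].shape, enc["token_type_ids"][0])
```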

The output is BERT's final hidden state at the [CLS] position, $h_{[\mathrm{CLS}]} \in \mathbb{R}^H$. A ranking head—typically a single fully-connected layer with optional nonlinearity—projects this vector to a scalar relevance score:

$$s(q, p) = \sigma\left(w^\top h_{[\mathrm{CLS}]} + b\right)$$

where $\sigma(\cdot)$ is the logistic sigmoid when optimizing with binary classification losses (Zerveas et al., 2020, Zhuang et al., 2021, Padigela et al., 2019). The final passage ranking for a query is obtained by sorting candidate passages in descending order of $s(q, p)$.
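A minimal PyTorch sketch of this architecture, assuming HuggingFace transformers; the class name and single linear head are illustrative choices consistent with the description above, not a reference implementation:

```python
import torch
from transformers import BertModel, BertTokenizerFast

class BertReranker(torch.nn.Module):
    """Cross-encoder: BERT over the joint sequence + linear scoring head."""

    def __init__(self, name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        # Project h_[CLS] (dimension H) to a scalar relevance logit.
        self.head = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, **enc) -> torch.Tensor:
        out = self.bert(**enc)
        h_cls = out.last_hidden_state[:, 0]  # hidden state at [CLS]
        return self.head(h_cls).squeeze(-1)  # raw logit; sigmoid applied later

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertReranker()
enc = tokenizer("what causes ocean tides", "Tides arise from lunar gravity.",
                truncation="only_second", max_length=512, return_tensors="pt")
score = torch.sigmoid(model(**enc))  # s(q, p) in (0, 1)
```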

2. Training Paradigms and Loss Functions

Fine-tuning for passage re-ranking employs either a supervised objective using labeled relevance judgments or weak supervision by aggregating multiple noisy label sources. For binary relevance, the cross-entropy loss is standard:

$$\mathcal{L}(q, p) = -\left[\, y \log s(q, p) + (1 - y) \log \left(1 - s(q, p)\right) \right]$$

where $y \in \{0, 1\}$ denotes the binary relevance label (Zerveas et al., 2020, Padigela et al., 2019, Nogueira et al., 2019).
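In PyTorch, this pointwise objective is typically computed on the raw logits for numerical stability; a sketch with illustrative values:

```python
import torch

# Raw logits w^T h_[CLS] + b for three query-passage pairs (illustrative).
logits = torch.tensor([2.3, -1.1, 0.4])
labels = torch.tensor([1.0, 0.0, 1.0])  # binary relevance y

# Fuses the sigmoid with the cross-entropy (numerically stable form).
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
```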

Weakly supervised setups generate pseudo-labels by combining multiple noisy signals—unsupervised retrieval models (e.g., BM25, TF-IDF) and representational similarities from models such as the Universal Sentence Encoder or a frozen BERT—and aggregating them with a generative model into probabilistic or majority-vote pseudo-labels. BERT is then fine-tuned on these pseudo-labels, typically with a pairwise hinge loss:

$$\ell(q, p_+, p_-; \theta) = \max \left\{ 0,\; \epsilon - \left[ S(q, p_+; \theta) - S(q, p_-; \theta) \right] \right\}$$

where $(q, p_+)$ and $(q, p_-)$ are positive and negative pairs defined by the weak supervision framework (Xu et al., 2019).
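A sketch of this hinge loss in PyTorch, with the margin $\epsilon$ and the scores as illustrative values:

```python
import torch

s_pos = torch.tensor([1.8, 0.2])  # S(q, p+; theta) for two training pairs
s_neg = torch.tensor([0.5, 0.9])  # S(q, p-; theta)
epsilon = 1.0                     # margin hyperparameter (assumed value)

# max{0, epsilon - [S(q,p+) - S(q,p-)]}, averaged over the batch.
loss = torch.clamp(epsilon - (s_pos - s_neg), min=0.0).mean()
```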

Learning-to-rank objectives can be pointwise, pairwise, or listwise (e.g., softmax cross-entropy over lists of candidate passages); explicit listwise implementations have been shown to provide marginal improvements, with further gains when ensembled (Han et al., 2020).
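For the listwise case, a sketch of softmax cross-entropy over one candidate list (scores and labels are illustrative):

```python
import torch

scores = torch.tensor([0.3, 2.1, -0.7, 1.0])  # one logit per candidate passage
labels = torch.tensor([0.0, 1.0, 0.0, 0.0])   # relevance per candidate

# Cross-entropy between the softmax over scores and the normalized labels.
log_probs = torch.log_softmax(scores, dim=-1)
loss = -(labels / labels.sum() * log_probs).sum()
```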

Hyperparameters—including learning rates ($2 \times 10^{-5}$ to $5 \times 10^{-5}$), batch size (16–128 pairs per accelerator), and number of epochs (1–4)—are selected within ranges typical for BERT fine-tuning, with validation-based early stopping (Zerveas et al., 2020, Nogueira et al., 2019, Mass et al., 2019).

3. Pipeline Design and Efficiency Considerations

Because BERT’s cross-encoder computation is expensive, passage re-ranking is universally applied as a second-stage step: only a fixed set of top-$K$ candidate passages (typically $K = 1000$) per query is re-scored by the neural model, following a fast retrieval (Zerveas et al., 2020, Nogueira et al., 2019, Rau et al., 2022). This design keeps compute requirements tractable and supports batch inference on modern GPUs or TPUs.

In practice, the pipeline operates as follows (a code sketch appears after the list):

  1. Retrieve top-K candidates for the query using BM25 or a similar ranker.
  2. Form joint [CLS]–query–[SEP]–passage–[SEP] inputs for each $(q, d_k)$ pair.
  3. Run BERT forward passes to obtain scores $s(q, d_k)$.
  4. Rank the candidates by $s(q, d_k)$ in descending order.
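A sketch of these four steps, assuming a `bm25_search(query, k)` function as a hypothetical stand-in for any first-stage retriever and the `model`/`tokenizer` pair sketched in Section 1:

```python
import torch

def rerank(query, bm25_search, model, tokenizer, k=1000, batch_size=32):
    candidates = bm25_search(query, k)  # step 1: fast top-K lexical retrieval
    scores = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(candidates), batch_size):
            batch = candidates[i:i + batch_size]
            enc = tokenizer([query] * len(batch), batch,  # step 2: joint inputs
                            truncation="only_second", max_length=512,
                            padding=True, return_tensors="pt")
            scores.extend(model(**enc).tolist())          # step 3: BERT scoring
    # step 4: sort candidates by descending relevance score
    return sorted(zip(candidates, scores), key=lambda x: -x[1])
```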

Empirical studies show this approach is dominant in large-scale evaluation settings such as MS MARCO (~8.8M passages, ~1M queries), TREC Deep Learning Track, and TREC-CAR (Zerveas et al., 2020, Nogueira et al., 2019).

Efficiency remains a key concern, driving the development of late-interaction models such as ColBERT (which decouple encoding and use max-sim aggregation), hybrid retrieval pipelines, and sparse-contextual matchers aiming to close the gap between effectiveness and runtime cost (Khattab et al., 2020, Zhuang et al., 2021).
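To make the late-interaction idea concrete, here is a minimal sketch of MaxSim-style scoring over precomputed token embeddings; the dimensions and cosine normalization are assumptions for illustration, not ColBERT's reference code:

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """q_emb: (n_q, dim) query token embeddings; d_emb: (n_d, dim)."""
    q = torch.nn.functional.normalize(q_emb, dim=-1)  # cosine via unit vectors
    d = torch.nn.functional.normalize(d_emb, dim=-1)
    sim = q @ d.T  # (n_q, n_d) all-pairs token similarities
    # Each query token keeps its best-matching document token; sum over query.
    return sim.max(dim=-1).values.sum()

score = maxsim_score(torch.randn(8, 128), torch.randn(180, 128))
```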

4. Empirical Results and Effectiveness

Fine-tuned BERT cross-encoders consistently yield state-of-the-art performance for passage re-ranking, providing >27% relative improvements over competitive neural models (e.g., IRNet) and more than doubling MRR@10 compared to BM25 on the MS MARCO dev set (BERT-Large: 36.5% vs. BM25: 16.7%) (Nogueira et al., 2019, Zerveas et al., 2020). On TREC Deep Learning 2019, a BERT re-ranker placed 2nd among re-ranking submissions and 3rd overall; for nDCG and MAP, 70–75% of topics fell in the median-to-best range (Zerveas et al., 2020).

These gains extend to scenarios with weak supervision; BERT-PR models trained solely on noisy pseudo-labels outperform BM25 by large margins and can even surpass state-of-the-art fully supervised models on some datasets (Xu et al., 2019).

BERT’s semantic matching capabilities—enabled by deep, all-to-all cross-attention—allow for strong paraphrase and meaning-based retrieval, as well as high tolerance to surface-level variation between queries and relevant passages (Rau et al., 2022, Padigela et al., 2019, Qiao et al., 2019). However, for longer queries or those dominated by entity/numerical constraints, performance gains may taper, and hybrid scoring remains advisable (Padigela et al., 2019, Askari et al., 2023).

5. Model Variants and Recent Innovations

Cross-Encoder Variants

  • Standard Cross-Encoder: Concatenate and jointly encode (q,p); project [CLS] with a linear or MLP head for scoring (Zerveas et al., 2020, Nogueira et al., 2019, Zhuang et al., 2021).
  • Interaction-based Extensions: Multi-layer interaction/fusion or token-level translation (e.g., Term-Trans) have been studied, but gains over the [CLS]-linear setup are modest (Qiao et al., 2019).
  • Listwise Ranking: Incorporating listwise loss in the ranking head (e.g., TF-Ranking with softmax cross-entropy over groups of passages) marginally improves metrics and supports efficient ensembling (Han et al., 2020).
  • BM25-Injection: Explicitly appending the lexical retrieval score as text to the BERT input yields consistent, statistically significant gains over both vanilla BERT and linear interpolation approaches (Askari et al., 2023).

Efficiency-Oriented and Hybrid Models

  • ColBERT: Late interaction that decomposes full cross-encoding by independently encoding query and document, projecting to low-dimensional token embeddings, and scoring with MaxSim-plus-sum token matching—yielding ~170x latency reduction with only minor effectiveness loss (Khattab et al., 2020).
  • TILDEv2: Contextualized exact term matching with efficient passage expansion, using only sparse query-passage overlaps and storing learned per-term weights, allows for CPU-only re-ranking with sub-100ms latency and index reduction by 99% compared to full dense representations (Zhuang et al., 2021).
  • Intra-Document Cascade: Hierarchical models use fast, low-cost passage-level models (student) to prune candidates before a small set is re-ranked by expensive cross-encoder BERT (teacher), maintaining effectiveness with 4x lower latency (Hofstätter et al., 2021).

6. Practical Issues, Limitations, and Robustness

  • Robustness to Input Noise: BERT re-rankers suffer reductions in effectiveness in the presence of typos or hard keyword mismatch, but explicitly augmenting the training data with typographical corruptions can substantially regain robustness without loss on clean queries (Zhuang et al., 2021).
  • Handling Long Passages: Sequence length limitations (≤512 tokens) require either truncation, fixed-window sliding, or hierarchical attention over chunks (see the sketch after this list). Best overall performance on non-factoid re-ranking is observed with a truncation length of 256; for longer passages, chunk-and-attend strategies are effective with only marginal performance degradation (Mass et al., 2019).
  • Label Noise and Supervision: Passage-level relevance labels can suffer significant noise when transferred directly from document-level labels, especially for longer documents. Weak-supervision methods that filter or re-label using a BERT QA model avoid this issue, improving both effectiveness and training efficiency (Rudra et al., 2021).
  • Aggregation Strategies: In document re-ranking with passage-level BERT scoring, aggregating the passage representations (e.g., via Transformer/CNN pooling across passage vectors) recovers global context, which is essential for broad-relevance tasks. For pinpoint retrieval, simple max-pooling suffices (Li et al., 2020, Leonhardt et al., 2021).
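As referenced above, a sketch of fixed-window sliding with max-pooled chunk scores; the window and stride values are illustrative assumptions, and `model`/`tokenizer` are the cross-encoder pieces from Section 1:

```python
import torch

def score_long_passage(query, passage_tokens, model, tokenizer,
                       window=256, stride=128):
    """Score a passage longer than BERT's limit via sliding windows."""
    starts = range(0, max(1, len(passage_tokens) - window + stride), stride)
    chunks = [" ".join(passage_tokens[i:i + window]) for i in starts]
    scores = []
    model.eval()
    with torch.no_grad():
        for chunk in chunks:
            enc = tokenizer(query, chunk, truncation="only_second",
                            max_length=512, return_tensors="pt")
            scores.append(model(**enc).item())
    return max(scores)  # max-pool the per-chunk relevance scores
```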

7. Theoretical Insights and Analysis

A series of ablation and analysis studies indicate that BERT’s gains in passage re-ranking derive not from explicit modeling of syntax, but from deep cross-attention that allows for powerful, contextual, bag-of-words matching within high-dimensional embeddings. Disrupting token order or ablating position encodings has minimal impact on effectiveness after adaptation, confirming an order-invariant, context-driven matching mechanism as the principal source of gain over term-based rankers (Rau et al., 2022, Qiao et al., 2019).

Fine-tuning mostly benefits the [CLS] representation and scoring head; explicit modeling of sentence- or token-level representations (e.g., via sentence-level DMNs) can further boost performance and provide training/inference efficiency via parameter “freezing” (Leonhardt et al., 2021).

Empirical and interpretability studies underscore that BERT cross-encoders are highly sensitive to a few critical matching terms and that non-interaction-based architectures (e.g., dual encoder with separate [CLS] embeddings) perform near-random on passage re-ranking (Qiao et al., 2019, Padigela et al., 2019).

