
Cross-Encoder Reranking

Updated 5 October 2025
  • Cross-encoder reranking is a neural retrieval technique that jointly encodes queries and candidates in a single transformer pass, enabling detailed token-level interactions.
  • It leverages full cross-attention between query and candidate tokens to effectively capture contextual dependencies for tasks like entity linking and text ranking.
  • Despite its superior accuracy, the method incurs high computational cost due to multiple transformer evaluations, motivating hybrid and efficiency-enhancing strategies.

A cross-encoder reranker is a neural retrieval or ranking module that jointly encodes a query and each candidate item in a single transformer-based forward pass, allowing full cross-attention between every token in both query and candidate. This architectural design enables modeling of intricate, token-level interactions between query context and candidate, surpassing the representational bottlenecks of independent encoders. Cross-encoder rerankers are widely employed in multi-stage pipelines across information retrieval and entity linking, often serving as a second-stage model to provide fine-grained reordering of a shortlist delivered by a more efficient (but less expressive) dual-encoder or first-stage retriever.

1. Cross-Encoder Reranking Principles

The cross-encoder paradigm fundamentally differs from dual-encoder (bi-encoder) methods. In the dual-encoder, queries and candidates (e.g., documents, passages, entities) are mapped independently into dense embedding spaces, with the ranking typically determined by similarity metrics such as cosine or dot product. In contrast, the cross-encoder concatenates the query and candidate (with boundaries like [SEP] tokens), and passes the combined sequence through a transformer to allow every token in the query to attend to every token in the candidate.

Denote the query as $q$ and the candidate as $d$. The transformer receives as input $[\text{CLS}], q_1, \ldots, q_n, [\text{SEP}], d_1, \ldots, d_m, [\text{SEP}]$, processes it via $\ell$ layers of self-attention, and the output hidden state at the [CLS] token, $h_{\text{[CLS]}}$, is typically mapped via a scoring head to a real-valued relevance score:

$$s(q, d) = W \cdot h_{\text{[CLS]}} + b$$

where $W, b$ are learned parameters. Self-attention within the transformer is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

with $Q, K, V$ the query, key, and value matrices, here built from the whole concatenated input.

This joint encoding enables the model to capture cross-input dependencies (such as word-to-word, phrase-to-entity, or partial paraphrase alignment), which are essential for tasks where subtle context matters (e.g., disambiguating entity mentions, distinguishing between passages with similar lexical content, or re-ranking candidates based on global document evidence).
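A minimal pointwise scoring sketch with Hugging Face transformers follows, assuming a public MS MARCO cross-encoder checkpoint; any sequence-classification model with a single relevance logit follows the same pattern.

```python
# A minimal sketch of pointwise cross-encoder scoring, assuming the public
# "cross-encoder/ms-marco-MiniLM-L-6-v2" checkpoint is available.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "what is cross-encoder reranking?"
candidate = "A cross-encoder jointly encodes a query and a document in one pass."

# Passing the texts as a pair yields [CLS] q_1..q_n [SEP] d_1..d_m [SEP],
# so self-attention spans both segments.
inputs = tokenizer(query, candidate, truncation=True, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1)  # s(q, d) = W · h_[CLS] + b
print(float(score))
```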

2. Architecture, Feature Engineering, and Input Construction

Cross-encoder rerankers typically support flexible, information-rich input formatting. In entity linking, the input encodes the mention, its local context (e.g., the full sentence), document-level context (salient keywords), candidate entity descriptions, and potentially meta-features such as the initial retrieval rank, marked with special tokens (e.g., [unused0], [unused1]) (Agarwal et al., 2020). For text ranking, both the query and the document text are included in their entirety, sometimes supplemented by metadata or external scores appended and verbalized as statements (Upadhyay et al., 2023), e.g., "Credibility score of the document is 0.9876".

This setup allows the model to exploit inter-segment dependencies; tokens in the query portion may directly attend to relevant evidence in the candidate or document, enabling context-aware and positionally sensitive ranking decisions.
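For illustration, metadata verbalization in the style of (Upadhyay et al., 2023) can be as simple as appending a templated statement to the candidate text; the template wording and field names here are assumptions, not the paper's exact format.

```python
# Illustrative only: verbalize an external score as natural language so the
# cross-encoder can attend to it like any other token sequence.
def build_candidate_text(doc_text: str, credibility: float) -> str:
    # Template wording is an assumption for this sketch.
    return f"{doc_text} Credibility score of the document is {credibility:.4f}."

candidate = build_candidate_text("The moon landing occurred in 1969.", 0.9876)
```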

For advanced retrieval scenarios, cross-encoder architectures may be extended to process multiple candidates in a permutation-invariant or listwise way—examples include the Set-Encoder (Schlatt et al., 10 Apr 2024), where each (query, candidate) pair is encoded individually but inter-passage attention is performed via exchange of [CLS] tokens, and joint models like CROSS-JEM (Paliwal et al., 15 Sep 2024), which encode multiple candidates and the query together, leveraging token overlaps for computation savings and richer modeling of competition among candidates.

3. Training Objectives and Optimization

The main loss functions for cross-encoder rerankers are:

  • Contrastive loss: Enforces that relevant (positive) candidates receive higher scores than non-relevant (negative) ones, typically via a softmax over positives and sampled negatives (Pezzuti et al., 28 Mar 2025). Example:

$$\mathcal{L}_{\text{LCE}}(q) = -\log \frac{e^{s(q, d^+)}}{e^{s(q, d^+)} + \sum_{d_i \in \mathcal{H}} e^{s(q, d_i)}}$$

where $s(q, d^+)$ is the score for a positive and $\mathcal{H}$ is the set of hard negatives; a PyTorch sketch of this objective follows the list below.

  • Binary cross-entropy or margin ranking: For tasks where a binary relevance label is available, minimizing the BCE loss between predicted and ground-truth labels is standard.
  • Knowledge distillation: The cross-encoder is sometimes distilled from rankings produced by larger teacher models (such as LLMs), via a RankNet or cross-entropy loss calibrated to the teacher's soft or order-based labels (Pezzuti et al., 28 Mar 2025).
  • Listwise ranking losses: For listwise architectures, such as CROSS-JEM or LLM-based listwise rerankers, objectives like Ranking Probability Loss (a ListNet variant) or weighted pairwise loss are used to directly optimize the ranking over lists of candidates (Paliwal et al., 15 Sep 2024, Reddy et al., 21 Jun 2024).
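The following PyTorch sketch implements the contrastive (LCE) objective above for a single query; the tensor layout (positive at index 0, followed by hard negatives) is an assumption for illustration.

```python
# Minimal sketch of the L_LCE loss above: -log softmax probability of the
# positive among [positive, hard negatives].
import torch
import torch.nn.functional as F

def lce_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: shape (1 + num_negatives,), positive score at index 0."""
    # Cross-entropy with target class 0 equals -log softmax(scores)[0],
    # which is exactly the L_LCE term in the equation above.
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))

scores = torch.tensor([2.3, 0.1, -0.4, 1.0])  # s(q, d+), then hard negatives
loss = lce_loss(scores)
```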

The optimizer selection significantly affects effectiveness and resource utilization. For models supporting long contexts (e.g., ModernBERT, GTE), the Lion optimizer has demonstrated improved NDCG@10 and GPU efficiency over AdamW (Kumar et al., 23 Jun 2025), though model–optimizer interactions are nontrivial and context dependent.
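As a sketch of the optimizer swap described above (assuming the third-party lion-pytorch package; the hyperparameter values are illustrative, not taken from the paper):

```python
# Sketch: choosing between Lion and AdamW for reranker fine-tuning.
# `lion_pytorch` is an assumed third-party dependency; AdamW ships with PyTorch.
import torch
from lion_pytorch import Lion  # assumed dependency

def make_optimizer(model: torch.nn.Module, use_lion: bool = True):
    if use_lion:
        # Lion is commonly run with a smaller learning rate and larger
        # weight decay than AdamW; values here are illustrative.
        return Lion(model.parameters(), lr=1e-5, weight_decay=1e-2)
    return torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```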

4. Efficiency, Scalability, and Computational Trade-offs

The expressive power of cross-encoder reranking comes at high computational cost. Scoring $N$ candidates per query requires $N$ transformer passes, since each query–candidate pair must be jointly encoded. While this is tractable for reranking a shortlist (e.g., 50–200 candidates), it is impractical for large-scale retrieval.
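For the tractable shortlist case, a minimal reranking loop might look like the following, using the sentence-transformers CrossEncoder wrapper and a public MS MARCO checkpoint (both assumed available):

```python
# Sketch of second-stage reranking: one joint transformer pass per
# (query, candidate) pair, so cost grows linearly in the shortlist size.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint

query = "effects of caffeine on sleep"
shortlist = [  # e.g., top candidates from a first-stage retriever
    "Caffeine consumed late in the day delays sleep onset.",
    "Brazil is the largest producer of coffee beans.",
    "Adenosine receptor antagonism underlies caffeine's effect on alertness.",
]

scores = model.predict([(query, doc) for doc in shortlist])  # batched internally
reranked = [doc for _, doc in sorted(zip(scores, shortlist), key=lambda t: -t[0])]
```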

Hybrid architectures and acceleration strategies have emerged to mitigate this:

  • CUR Decomposition: The full cross-encoder score matrix over all queries and candidates is approximated via CUR decomposition, enabling efficient nearest-neighbor retrieval directly based on cross-encoder scores without fallback to a dual-encoder (Yadav et al., 2022).
  • Uni-Encoder and Joint Modeling: Uni-Encoder concatenates the context and all candidates, applying specialized attention masking ("Arrow Attention") so that each candidate interacts only with the context and not with other candidates, encoding and scoring all at once (Song et al., 2021); a mask-construction sketch follows this list. Similar joint scoring approaches, such as CROSS-JEM, exploit token overlap among short candidate items to reduce redundant computation (Paliwal et al., 15 Sep 2024).
  • ED2LM: Converts encoder–decoder models into efficient inference pipelines by decoupling document encoding (done offline) from runtime query decoding, shifting most compute to the offline phase and reducing inference FLOPs by up to $6.8\times$ with minimal loss in ranking quality (Hui et al., 2022).
  • LLM-Based Listwise Reranking: LLM-based rerankers (e.g., LRL (Ma et al., 2023), FIRST (Reddy et al., 21 Jun 2024)) use a listwise prompt to jointly score candidates. FIRST improves efficiency by inferring rankings from first-token logits (rather than full sequence generation), halving latency with comparable nDCG@10.
  • Permutation-Invariance: Set-Encoder designs allow scoring of up to 100 passages per query efficiently, using inter-passage attention via [CLS] token exchange and fully reset positional encodings for robustness against input order (Schlatt et al., 10 Apr 2024).
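As one concrete illustration of these masking strategies, the following sketch builds an "Arrow Attention"-style boolean mask in the spirit of Uni-Encoder; the segment layout (context first, then candidates) and the True-means-may-attend convention are assumptions of this sketch.

```python
# Sketch of an Arrow Attention mask: every token attends to the shared
# context, each candidate also attends to itself, and candidates never
# attend to one another.
import torch

def arrow_attention_mask(ctx_len: int, cand_lens: list[int]) -> torch.Tensor:
    total = ctx_len + sum(cand_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)  # True = may attend
    mask[:, :ctx_len] = True  # all rows (context and candidates) see the context
    offset = ctx_len
    for n in cand_lens:
        mask[offset:offset + n, offset:offset + n] = True  # candidate self-block
        offset += n
    return mask

mask = arrow_attention_mask(ctx_len=4, cand_lens=[3, 3])
```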

5. Extensions: Multilingual, Parameter-Efficient, and Specialized Cross-Encoder Reranking

Cross-encoder rerankers have been adapted for low-resource and multilingual scenarios using parameter-efficient strategies:

  • Adapters and Sparse Fine-Tuning Masks (SFTMs): Rather than fine-tuning all transformer parameters, adapters or SFTMs are inserted/trained for language-specific and ranking-specific capacities. At inference, language and ranking adaptations are composed modularly, leading to improved generalization and performance in both high- and low-resource languages with reduced computation (Litschko et al., 2022).
  • Long-Context and Multilingual Encoders: Models such as mGTE extend the input context to 8192 tokens using rotary position embeddings together with efficiency-oriented training and attention optimizations, improving multilingual reranking for long documents (Zhang et al., 29 Jul 2024). ViRanker demonstrates cross-encoder adaptation (using RoPE and blockwise parallel transformers) specialized for Vietnamese, yielding strong NDCG@3 and other early-rank metrics (Dang et al., 11 Sep 2025).
  • Cross-Modal and Image-Text Retrieval: For cross-modal scenarios, cross-encoder rerankers yield concentrated score distributions (binary match probabilities), requiring specialized distillation techniques (e.g., CPRD—Contrastive Partial Ranking Distillation) to successfully transfer hard-negative orderings to efficient dual-encoders (Chen et al., 10 Jul 2024).

6. Empirical Performance, Generalization, and Methodological Comparisons

Empirical evaluations consistently show that cross-encoder rerankers achieve state-of-the-art accuracy on entity linking (Agarwal et al., 2020), passage/document re-ranking (Déjean et al., 15 Mar 2024, Zhang et al., 29 Jul 2024), dialogue response selection (Song et al., 2021), and low-resource lexicon induction (Li et al., 2022). Key observations include:

  • Cross-encoders substantially boost recall@k and accuracy over first-stage or dual-encoder-only systems, with improvements on entity linking (TACKBP-2010: 92.05% accuracy with cross-encoder reranker over dual-encoder candidates (Agarwal et al., 2020)) and early-rank accuracy (e.g., ViRanker NDCG@3 of 0.6815 (Dang et al., 11 Sep 2025)).
  • Parameter-efficient reranker designs close the gap further in low-resource settings, with improvements of up to 8 MAP points over translation pipelines (Litschko et al., 2022).
  • Multi-stage fine-tuning offers no significant advantage over strong single-stage contrastive learning for cross-encoder rerankers, emphasizing the sufficiency of robust hard negative sampling and contrastive optimization (Pezzuti et al., 28 Mar 2025).
  • Zero-shot LLM-based rerankers approach or match the effectiveness of traditional cross-encoders in some out-of-domain benchmarks, but at greater cost; practical deployments often use hybrid pipelines (SPLADE→cross-encoder→LLM reranker) (Déjean et al., 15 Mar 2024).

7. Interpretability, Mechanistic Insights, and Practical Implications

Recent work elucidates that cross-encoder rerankers, even those based on models such as MiniLM, internally reconstruct classic IR heuristics (e.g., BM25 scoring mechanisms) (Lu et al., 7 Feb 2025). Transformer attention heads effectively compute soft term frequency—capturing synonymy and paraphrase effects—with built-in term saturation and document length normalization, while the embedding matrix encodes inverse document frequency in its dominant singular vector. The final scoring circuit can be approximated by:

$$\text{score}(q, d) = \sum_{i} \left[ -U_0(q_i) + MS_{\text{total}}(q_i, d) + \left( -U_0(q_i) \cdot MS_{\text{total}}(q_i, d) \right) \right]$$

where $U_0$ is the low-rank IDF vector, and $MS_{\text{total}}$ is the aggregate score over matching heads.
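A tiny worked example of this circuit, with illustrative per-term values rather than values extracted from a real model:

```python
# Worked sketch of the reconstructed scoring circuit above: u0 holds the
# low-rank IDF-like value per query term, ms the aggregate matching-head
# score per query term. All numbers are illustrative.
def circuit_score(u0: list[float], ms: list[float]) -> float:
    return sum(-u + m + (-u * m) for u, m in zip(u0, ms))

# Two query terms: a strongly matched term and a weakly matched one.
print(circuit_score(u0=[-1.2, 0.8], ms=[2.0, 0.3]))
```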

This mechanistic understanding supports model editing (e.g., personalization via U0U_0 manipulation), safety interventions, and scalable retraining, as core relevance signals are semantically interpretable and disentangled.

A plausible implication is that future cross-encoder designs may increasingly expose or control these internal circuits for greater transparency, robustness, and adaptability in real-world, safety-critical IR applications.


Summary Table: Key Cross-Encoder Reranking Variants and Features

| Variant | Key Features | Performance/Context |
| --- | --- | --- |
| Pointwise Cross-Encoder | Scores query–item pairs independently | High accuracy on standard ranking tasks; cost grows linearly with $k$ |
| Joint/Listwise (e.g., Set-Encoder, CROSS-JEM) | Joint input encoding; inter-candidate interactions | Permutation invariance, lower latency, more effective on early-rank metrics |
| Parameter-Efficient | Adapters/SFTMs for modular multilinguality | Improved zero-shot generalization, reduced retraining cost (Litschko et al., 2022) |
| LLM-based Listwise | Listwise zero-shot ranking via LLMs | Competitive accuracy at high cost; efficiency via single-token decoding or windowing |
| CUR Decomposition | Score-matrix factorization; query/item anchors | Scalable CE-based retrieval; superior recall-vs-cost for $k > 10$ (Yadav et al., 2022) |

Cross-encoder reranking remains a critical tool for accurate ranking and disambiguation across information retrieval, entity linking, and cross-lingual/multilingual tasks, with ongoing advances focused on computational efficiency, interpretability, and robustness across domains and languages.
