Bi-Encoder and Cross-Encoder Architectures
- Bi-encoder and cross-encoder architectures are neural models that independently encode or jointly process paired inputs, balancing efficiency and interaction depth.
- Bi-encoders precompute embeddings for scalable retrieval while cross-encoders fuse inputs for detailed cross-attention and improved accuracy.
- Hybrid strategies, including knowledge distillation and cascaded architectures, combine the strengths of both models to enhance retrieval performance.
A bi-encoder (also referred to as a dual-encoder or twin-encoder) and a cross-encoder are foundational neural architectures for modeling relationships between pairs (or sets) of inputs—such as sentences, queries and documents, or image–text pairs—in information retrieval, natural language processing, and multi-modal learning. The distinction between these approaches has significant implications for computational cost, modeling power, scalability, and transferability across domains.
1. Architectural Principles and Definitions
Bi-encoders consist of two encoders (often identical or with tied parameters) that map each member of a pair—such as a query and a document, or two sentences—independently into embedding vectors. The similarity between these embeddings, typically measured with a dot product or cosine similarity, serves as the model's output:

$$
s(q, d) \;=\; \mathrm{sim}\!\big(E_Q(q),\, E_D(d)\big), \qquad \mathrm{sim}(u, v) = u^{\top} v \;\;\text{or}\;\; \frac{u^{\top} v}{\lVert u \rVert\,\lVert v \rVert}.
$$
The cross-encoder, by contrast, concatenates both inputs into a single sequence (or otherwise fuses them), processes them jointly (e.g., via transformer attention across both), and outputs a score based on the joint representation—typically via the [CLS] token or a projection layer.
Key architectural distinctions:
| Aspect | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Input handling | Inputs encoded independently | Inputs fused and encoded jointly |
| Similarity metric | Dot product or cosine similarity of embeddings | Score from joint network output |
| Interaction scope | Only at retrieval/scoring stage | Full attention over both inputs |
| Efficiency | High (embeddings can be precomputed/indexed) | Low (every pair must be processed) |
| Modeling power | Limited to what the embeddings can capture | High (captures fine-grained interaction) |
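The contrast can be made concrete in a few lines. The following is a minimal sketch of the two scoring paths, assuming the sentence-transformers library; the checkpoint names are illustrative choices, not prescribed by the works cited here.

```python
# Minimal sketch of bi-encoder vs. cross-encoder scoring, assuming the
# sentence-transformers library; checkpoint names are illustrative.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "what causes rainbows"
docs = [
    "Rainbows are caused by the refraction of light in water droplets.",
    "The stock market closed higher today.",
]

# Bi-encoder: encode query and documents independently, then compare embeddings.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
d_emb = bi_encoder.encode(docs, convert_to_tensor=True)
bi_scores = util.cos_sim(q_emb, d_emb)  # (1, num_docs); document embeddings could be precomputed

# Cross-encoder: score each (query, document) pair with full joint attention.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = cross_encoder.predict([(query, d) for d in docs])  # one forward pass per pair

print(bi_scores, ce_scores)
```

The operational difference is visible here: the bi-encoder's document embeddings can be computed once and indexed, whereas the cross-encoder requires a forward pass for every query–document pair.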
2. Training Objectives and Multi-Tasking
Bi-encoder models are typically trained using contrastive or softmax-based losses over similarity scores between pairs. For instance, given a batch of queries $q_i$ with associated positive candidates $d_i^+$ and negative candidates, a softmax over similarity scores is computed and the cross-entropy loss over correct pairs is minimized:

$$
\mathcal{L} \;=\; -\sum_{i} \log \frac{\exp\!\big(s(q_i, d_i^+)/\tau\big)}{\sum_{d \in \mathcal{D}_i} \exp\!\big(s(q_i, d)/\tau\big)},
$$

where $\mathcal{D}_i$ contains the positive and negative candidates for query $q_i$ and $\tau$ is a temperature hyperparameter.
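A common instantiation uses in-batch negatives, where the positives of the other queries in the batch serve as negatives for the current query. A minimal PyTorch sketch of this variant (the temperature value is an illustrative choice):

```python
# Sketch of an in-batch softmax/contrastive loss for bi-encoder training,
# assuming PyTorch; embeddings are taken to be L2-normalized.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor,
                              d_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """q_emb, d_emb: (B, dim). Row i of d_emb is the positive for query i;
    the remaining rows act as in-batch negatives."""
    scores = q_emb @ d_emb.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, targets)                 # softmax over candidates + NLL
```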
Bi-encoders have also been extended to multi-task learning frameworks, in which several related objectives (e.g., semantic retrieval, translation-pair ranking, natural language inference) are optimized jointly. Translation-based bridge tasks, for instance, align semantically similar texts across languages in a shared vector space, which is essential for multilingual or cross-modal models (Yang et al., 2019).
Cross-encoders use regression or classification losses directly on the output of the joint encoding (e.g., classification via [CLS] for semantic similarity or ranking).
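As a hedged sketch (assuming the Hugging Face transformers library; the checkpoint name is illustrative), a cross-encoder concatenates the pair and reads a relevance score off a classification or regression head over the joint encoding:

```python
# Sketch of cross-encoder scoring: the (query, document) pair is encoded
# jointly and a single relevance logit is read off the classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tokenizer("what causes rainbows",
                   "Rainbows are caused by the refraction of light in water droplets.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze()    # relevance score from the joint representation
print(float(score))
```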
3. Empirical Trade-Offs and Limitations
Efficiency vs. Modeling Power
- Bi-encoders are highly scalable—embeddings for one or both sides (e.g., documents) can be precomputed and stored, allowing sub-millisecond retrieval over millions of candidates using approximate nearest neighbor search (see the sketch after this list). This makes them ideal for first-stage retrieval in large-scale systems (Yang et al., 2019).
- Cross-encoders are computationally expensive, as every query–candidate pair must be processed at inference. This cost can become prohibitive for large candidate sets, restricting practical use to reranking of a small list or cases where input pairs are few (Rosa et al., 2022).
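As a sketch of the first point, precomputed document embeddings can be placed in a nearest-neighbor index (FAISS is used here purely as an illustration); retrieval then reduces to an index lookup, and the random arrays stand in for the outputs of any bi-encoder:

```python
# Sketch of first-stage retrieval over precomputed bi-encoder embeddings
# using FAISS; the random arrays are placeholders for real embeddings.
import numpy as np
import faiss

dim = 384
doc_embs = np.random.rand(100_000, dim).astype("float32")  # placeholder document embeddings
faiss.normalize_L2(doc_embs)                                # inner product == cosine similarity

index = faiss.IndexFlatIP(dim)   # exact inner-product index; ANN variants (e.g., IndexHNSWFlat) scale further
index.add(doc_embs)              # documents are indexed once, offline

q_emb = np.random.rand(1, dim).astype("float32")            # placeholder query embedding
faiss.normalize_L2(q_emb)
scores, ids = index.search(q_emb, 10)                       # top-10 candidates per query
```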
Retrieval Effectiveness
Cross-encoders consistently achieve higher accuracy and relevance ranking, particularly for fine-grained interactions that require token-level attention across both inputs. Empirical studies confirm that cross-encoders outperform state-of-the-art dense bi-encoders in both in-domain and, especially, out-of-domain (zero-shot) settings, by margins exceeding 4 points nDCG@10 (Rosa et al., 2022). This superiority is attributed to early and effective query–candidate interactions that bi-encoders cannot model.
Generalization and Zero-Shot Transfer
Out-of-domain generalization is a persistent challenge for bi-encoders. While model scaling benefits both architectures to an extent, cross-encoders derive much larger generalization gains from their richer attention-based representations (Rosa et al., 2022). In zero-shot retrieval, even strong bi-encoder retrievers have been found to offer no substantial advantage over traditional keyword-based methods (e.g., BM25) as first-stage candidate generators—in contrast with the much larger improvements provided by strong cross-encoder rerankers in the second stage.
4. Knowledge Distillation and Hybrid Methods
To harness bi-encoder efficiency and cross-encoder modeling power, recent research develops hybrid approaches that distill the richer knowledge of cross-encoders into bi-encoder students. Losses for distillation include:
- Logit distillation (matching soft label distributions): Less effective for bi-encoders because their similarity scores are spread out (approximately normally distributed), whereas cross-encoder outputs are sharply peaked near 0/1 (Chen et al., 10 Jul 2024).
- Ranking distillation: Emphasizes alignment in the ranking of hard negatives. Only the order of hard negatives (those with highest cross-encoder similarity among negatives) is meaningful for distillation; the order among easy negatives is less informative and may even be harmful if forced (Chen et al., 10 Jul 2024).
- Partial contrastive ranking distillation (CPRD): A loss that selectively enforces the cross-encoder's ranking on valid hard negatives—only negatives above a certain similarity threshold under the cross-encoder teacher influence the ranking order enforced on the student dual-encoder (Chen et al., 10 Jul 2024). A simplified sketch of this idea follows the list.
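The following is an assumption-laden illustration of ranking distillation on hard negatives, not the exact CPRD loss of (Chen et al., 10 Jul 2024): pairs are formed only among negatives that the cross-encoder teacher scores above a threshold, and the student bi-encoder is penalized for disagreeing with the teacher's ordering.

```python
# Illustrative pairwise ranking-distillation term restricted to hard negatives
# (those the cross-encoder teacher scores above a threshold); PyTorch sketch.
import torch
import torch.nn.functional as F

def hard_negative_ranking_distillation(student_scores: torch.Tensor,
                                       teacher_scores: torch.Tensor,
                                       threshold: float = 0.5) -> torch.Tensor:
    """student_scores, teacher_scores: (N,) scores of N negatives for one query.
    Only pairs (i, j) of hard negatives where the teacher ranks i above j
    contribute; the student is pushed to reproduce that ordering."""
    hard = teacher_scores > threshold                       # valid hard negatives
    idx = hard.nonzero(as_tuple=True)[0]
    if idx.numel() < 2:
        return student_scores.new_zeros(())                 # nothing to distill
    s, t = student_scores[idx], teacher_scores[idx]
    diff_t = t.unsqueeze(1) - t.unsqueeze(0)                # (H, H) teacher score gaps
    mask = diff_t > 0                                       # teacher prefers i over j
    diff_s = s.unsqueeze(1) - s.unsqueeze(0)                # corresponding student gaps
    # RankNet-style logistic loss on the student's score differences.
    return F.softplus(-diff_s[mask]).mean() if mask.any() else student_scores.new_zeros(())
```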
Methods such as TRMD (two-ranker multi-teacher distillation) further combine information from both a strong bi-encoder and a cross-encoder teacher by training the student with multiple ranking heads and additional MSE loss terms on intermediate representations (Choi et al., 2021).
Experimental results consistently show that distillation methods targeting the ranking of hard negatives lead to significant improvement in bi-encoder retrieval accuracy, closing a portion of the gap to cross-encoders on standard benchmarks (Chen et al., 10 Jul 2024, Choi et al., 2021).
5. Extensions, Cascades, and Hybrid Architectures
- Bi-encoder cascades: To address the computation–quality trade-off, cascades of bi-encoders of increasing quality and cost have been proposed (Hönig et al., 2023). The system first ranks with a small encoder, then reranks a smaller pool with a higher-quality (more expensive) encoder; a minimal sketch of this pattern follows the list. The “p-small-world” hypothesis is exploited: only a small fraction of the corpus ever appears in top-$k$ candidate lists over a system's lifetime, permitting substantial computation savings. Mathematical analysis in that work establishes lifetime cost and latency reduction factors.
- Dual–cross encoder hybridization: Systems like LoopITR jointly train both a dual encoder and a cross encoder, with a feedback (“loop”) mechanism where the stronger cross encoder distills knowledge (typically over mined hard negatives) into the dual encoder (Lei et al., 2022).
- Chunk- or component-aware bi-encoders: Specialized architectures decompose input into meaningful chunks (such as address components in geographic reranking) and add learned attention matrices over chunk representations. Asynchronous update mechanisms may be employed to accelerate the learning of chunk importance (Cao et al., 2023).
- Encoding–searching separation: A recently articulated perspective separates the encoding and searching modules. Under this model, the encoder is optimized to produce generic, task-agnostic representations, while an explicit searching module adapts these for specific search tasks. This mitigates the “encoding-for-search” bottleneck and enhances modularity, transferability, and robustness (Tran et al., 2 Aug 2024).
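As referenced in the first item above, a two-stage bi-encoder cascade can be sketched as follows (assuming sentence-transformers; the models, pool size, and toy corpus are illustrative):

```python
# Sketch of a two-stage bi-encoder cascade: a small, cheap encoder ranks the
# whole corpus; a larger, more accurate encoder reranks only a small pool.
from sentence_transformers import SentenceTransformer, util

small = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")   # cheap first stage
large = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # costlier second stage

corpus = ["Rainbows are caused by refraction of light.",
          "The stock market closed higher today.",
          "Water droplets disperse sunlight into a spectrum."]
query = "what causes rainbows"

# Stage 1: cheap embeddings for the full corpus (precomputable and indexable).
corpus_small = small.encode(corpus, convert_to_tensor=True)
q_small = small.encode(query, convert_to_tensor=True)
pool_ids = util.cos_sim(q_small, corpus_small)[0].topk(2).indices.tolist()

# Stage 2: rerank only the small pool with the expensive encoder.
pool = [corpus[i] for i in pool_ids]
pool_large = large.encode(pool, convert_to_tensor=True)
q_large = large.encode(query, convert_to_tensor=True)
order = util.cos_sim(q_large, pool_large)[0].argsort(descending=True).tolist()
ranked_docs = [pool[i] for i in order]
```

The savings come from the fact that only the small pool, never the whole corpus, is processed by the expensive encoder—consistent with the observation that a small fraction of the corpus recurs in top-$k$ candidate lists.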
6. Applications and Domain-Specific Designs
Bi- and cross-encoder methods are used in numerous domains:
- Multilingual semantic retrieval: Multi-task trained dual-encoders (e.g., Multilingual Universal Sentence Encoder) leverage translation-based bridge tasks to embed sentences from multiple languages into a unified space (Yang et al., 2019).
- Matching job candidates to vacancies: Multilingual bi-encoder BERT models with cosine similarity log loss are applied for semantic CV–vacancy matching, overcoming language and style disparities (Lavi, 2021).
- Named entity recognition: Bi-encoder frameworks for NER map candidate spans and entity types into a joint embedding space and introduce dynamic thresholding losses for robust entity-class discrimination, especially under partial supervision or for overlapping entities (Zhang et al., 2022).
- Image–text retrieval: Dual/cross encoder hybrids, cascade methods, and distillation frameworks improve scalability and retrieval precision in multi-modal search (Lei et al., 2022, Hönig et al., 2023, Chen et al., 10 Jul 2024).
- Out-of-distribution detection: Bi-encoder-based detectors, trained with cosine-similarity losses on in-domain/OOD pairs, outperform other methods across standard OOD benchmarks, demonstrating a favorable efficiency–robustness trade-off (Owen et al., 2023).
- Canonical relation extraction: Bi-encoder-decoder architectures improve both the quality of learned entity representations and the ability to handle novel entities via direct textual encoding (Zheng et al., 2023).
- Permutation-invariant passage re-ranking: Set-Encoder augments cross-encoders with inter-passage attention over [CLS] tokens, ensuring permutation invariance and improved ranking stability at high scale (Schlatt et al., 10 Apr 2024).
7. Current Limitations and Future Directions
Despite being a workhorse for large-scale retrieval, bi-encoders remain constrained by their inability to model fine-grained inter-input interaction, the “encoding-for-search” limitation, and lower zero-shot robustness relative to cross-encoders on out-of-domain tasks (Rosa et al., 2022, Tran et al., 2 Aug 2024). Distillation from cross-encoder teachers narrows this gap but does not close it; residual deficits on key metrics such as accuracy and mean nDCG persist.
Recent advances—in particular, encoding–searching separation (Tran et al., 2 Aug 2024), partial contrastive ranking distillation (Chen et al., 10 Jul 2024), and robust permutation-invariant architectures (Schlatt et al., 10 Apr 2024)—define promising research trajectories. These approaches treat representation learning and task-adaptation as modular problems and advocate for explicit, intermediate adaptation layers rather than monolithic encoders. Directions for further investigation include exploring searching modules capable of selective alignment, dynamic thresholding mechanisms, and more powerful forms of unsupervised distillation—all aimed at balancing scalability, robustness, and expressive power.
In summary, bi-encoder and cross-encoder architectures offer fundamentally different trade-offs in efficiency and modeling capacity. The choice and configuration of these architectures, as well as emerging strategies to hybridize or modularize them, are central to the current engineering and scientific progress in information retrieval, matching, and related pairwise or setwise tasks. Recent empirical and theoretical work continues to refine the understanding of the core limitations, best practices for knowledge transfer, and new paradigms for model decomposition and modular adaptation.