Two-Tower Bi-Encoder
- The two-tower bi-encoder is a dual neural architecture that independently maps two distinct inputs into a shared embedding space to enable efficient similarity computation.
- Its design variants, including Siamese and asymmetric configurations, optimize parameter sharing and projection alignment to enhance retrieval performance.
- Advanced training methodologies, such as contrastive losses and negative sampling, further improve its scalability and accuracy in multi-modal and retrieval tasks.
A two-tower bi-encoder is a neural architecture comprising two distinct encoders (“towers”) that independently map two different types of input into a shared or comparable vector space, enabling efficient affinity computation via a simple similarity metric. In retrieval, classification, or matching scenarios, the two encoders typically process distinct modalities or roles such as query–document, user–item, audio–text, or span–type. Key architectural variants differ in how parameters are shared between towers and how projection to the final embedding space is realized. Two-tower designs are prized for decomposing representations across modalities or roles, for supporting scalable offline embedding and sublinear retrieval, and for admitting efficient negative sampling and contrastive training approaches.
1. Core Architecture and Variants
The canonical two-tower bi-encoder defines two encoders: $E_q$ for “queries” and $E_d$ for “documents” (or, more generically, the two modalities or roles). Each encoder maps its input to a vector, $\mathbf{q} = E_q(q)$ and $\mathbf{d} = E_d(d)$, and affinity is computed by a simple function $s(q, d)$ such as the dot product or cosine similarity:

$$s(q, d) = \langle E_q(q),\, E_d(d) \rangle \qquad \text{or} \qquad s(q, d) = \frac{\langle E_q(q),\, E_d(d) \rangle}{\lVert E_q(q) \rVert \, \lVert E_d(d) \rVert}.$$
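A minimal PyTorch sketch of this setup follows; the tower modules, feature dimensions, and the `TwoTowerBiEncoder` wrapper are illustrative stand-ins rather than a reference implementation from any cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerBiEncoder(nn.Module):
    """Two independent towers mapping their inputs into a shared d-dimensional space."""
    def __init__(self, query_encoder: nn.Module, doc_encoder: nn.Module):
        super().__init__()
        self.query_encoder = query_encoder      # E_q
        self.doc_encoder = doc_encoder          # E_d

    def score(self, query_feats, doc_feats, metric: str = "dot"):
        q = self.query_encoder(query_feats)     # (B, d)
        d = self.doc_encoder(doc_feats)         # (B, d)
        if metric == "cosine":
            q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
        return q @ d.T                          # (B, B) pairwise affinities

# Toy towers; in practice each is a transformer over its own modality or role.
query_tower = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
doc_tower = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 16))
model = TwoTowerBiEncoder(query_tower, doc_tower)
scores = model.score(torch.randn(4, 32), torch.randn(4, 128))   # 4x4 score matrix
```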
The principal architectural variants are defined by parameter sharing (a wiring sketch follows the list):
- Siamese Dual Encoder (SDE): $E_q$ and $E_d$ share all parameters (usually a single transformer plus projection used for both sides). SDEs map both input types into a fully shared space, ensuring maximum alignment.
- Asymmetric Dual Encoder (ADE): $E_q$ and $E_d$ are independent, with separate parameters in both the base encoders and the projections. This allows adaptation to modality- or role-specific statistics, but can result in “unaligned” embedding spaces and impaired retrieval under dot-product scoring.
- Hybrid ADE variants: Parameter sharing is restricted to specific subcomponents, chiefly the projection layer (ADE-SPL), the token embedding table (ADE-STE), or a frozen token embedder (ADE-FTE) (Dong et al., 2022).
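The sketch below illustrates how the parameter-sharing choices differ; the stand-in encoder, dimensions, and the `make_towers` helper are hypothetical.

```python
import torch.nn as nn

def make_towers(variant: str, in_dim: int = 32, hidden: int = 64, proj_dim: int = 16):
    """Illustrative wiring of the parameter-sharing variants (dimensions are arbitrary)."""
    def base():  # stand-in for a transformer encoder
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())

    if variant == "SDE":        # one encoder and one projection reused on both sides
        enc, proj = base(), nn.Linear(hidden, proj_dim)
        return nn.Sequential(enc, proj), nn.Sequential(enc, proj)
    if variant == "ADE":        # fully independent parameters on each side
        return (nn.Sequential(base(), nn.Linear(hidden, proj_dim)),
                nn.Sequential(base(), nn.Linear(hidden, proj_dim)))
    if variant == "ADE-SPL":    # independent encoders, shared projection layer
        proj = nn.Linear(hidden, proj_dim)
        return nn.Sequential(base(), proj), nn.Sequential(base(), proj)
    raise ValueError(f"unknown variant: {variant}")

q_tower, d_tower = make_towers("ADE-SPL")   # towers share only the final projection
```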
A further class encompasses cross-modal and pre-ranking architectures in which the towers process heterogeneous inputs (e.g., images and texts in CLIP-style systems (Hönig et al., 2023, Vasilakis et al., 2024), or users and items in recommendation (Li et al., 2022, Chen et al., 2024)), often with special handling of information bottlenecks (projection MLPs, interaction modules).
2. Training Methodologies and Loss Formulations
Two-tower bi-encoders are predominantly trained with contrastive objectives based on batches of paired positive examples and synthesized negatives. The prototypical contrastive loss is the in-batch InfoNCE: for a batch of $B$ query–document (or equivalent) pairs, each query $q_i$ is paired with its positive $d_i$, and all other documents in the batch serve as negatives,

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(s(q_i, d_i)/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(s(q_i, d_j)/\tau\big)},$$

with temperature $\tau$. This formulation ensures positive pairs are mapped closer in embedding space than random negatives.
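A compact implementation of this loss over precomputed tower outputs (the function name and temperature value are illustrative; embeddings may optionally be L2-normalized first):

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(q_emb: torch.Tensor, d_emb: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: row i's positive is column i; every other column is a negative."""
    logits = (q_emb @ d_emb.T) / tau                          # (B, B) similarity matrix
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, targets)

loss = in_batch_infonce(torch.randn(8, 16), torch.randn(8, 16))
```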
Advanced methods:
- Cross-Batch Negative Sampling (CBNS): Maintains a bank of cached item embeddings from previous batches to enlarge the effective negative pool beyond batch size, leveraging the empirical stability of deep representations to reduce computational overhead without sacrificing negative diversity (Wang et al., 2021).
- Same-Tower Negatives (SamToNe): Augments the denominator of the contrastive loss with samples drawn from the same tower (e.g., other queries as negatives for a given query), regularizing alignment and distributing representations more evenly around the unit sphere (Moiseev et al., 2023); a sketch of this variant follows this list.
- Unbiased Learning-to-Rank Losses: Incorporate explicit handling of presentation or selection bias by adding a bias logit derived from context features (e.g., position), either through a third (bias) tower or as an additive term in logit space. Risk is minimized with inverse propensity weighting where necessary (Hager et al., 25 Jun 2025).
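As an illustration of the SamToNe-style modification, the following sketch pools same-tower query similarities into the InfoNCE denominator; names, the temperature, and masking details are assumptions rather than the published implementation.

```python
import torch
import torch.nn.functional as F

def samtone_infonce(q_emb: torch.Tensor, d_emb: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE with same-tower negatives: other queries in the batch also enter the denominator."""
    B = q_emb.size(0)
    qd = (q_emb @ d_emb.T) / tau                      # query-document similarities (positives on diagonal)
    qq = (q_emb @ q_emb.T) / tau                      # query-query similarities (same tower)
    eye = torch.eye(B, dtype=torch.bool, device=q_emb.device)
    qq = qq.masked_fill(eye, float("-inf"))           # exclude each query's self-similarity
    logits = torch.cat([qd, qq], dim=1)               # (B, 2B): pooled negatives in the denominator
    targets = torch.arange(B, device=q_emb.device)    # positive remains column i of the qd block
    return F.cross_entropy(logits, targets)

loss = samtone_infonce(torch.randn(8, 16), torch.randn(8, 16))
```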
3. Parameter Sharing, Embedding Alignment, and Empirical Performance
Empirical studies place strong emphasis on the impact of parameter sharing—especially at the projection layer—on retrieval quality and on the geometric alignment of the two spaces:
- SDE vs. ADE: SDE almost uniformly exceeds ADE in QA retrieval and open-domain ranking tasks (Dong et al., 2022).
- Projection Sharing (ADE-SPL): Sharing only the projection layer achieves retrieval metrics nearly matching or even exceeding full SDE, as quantified on MS MARCO, Natural Questions, and MultiReQA (Dong et al., 2022).
- Embedding Analyses: t-SNE visualization and distance statistics reveal that projection sharing is crucial for overlap of the question and document embedding manifolds, directly controlling their suitability for dot-product nearest-neighbor retrieval. Separate projections induce misaligned, disjoint spaces (Dong et al., 2022, Moiseev et al., 2023).
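A simple numerical counterpart to such diagnostics is to compare within-space and cross-space cosine statistics over a sample of paired embeddings; the function and key names below are illustrative.

```python
import torch
import torch.nn.functional as F

def alignment_stats(q_emb: torch.Tensor, d_emb: torch.Tensor) -> dict:
    """Crude alignment diagnostics over paired (q_i, d_i) embeddings: well-aligned towers place
    paired items closer (higher cosine) than arbitrary cross-space or same-space pairs."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    return {
        "paired": (q * d).sum(-1).mean().item(),    # mean cosine of aligned pairs
        "cross": (q @ d.T).mean().item(),           # mean cosine over all query-document pairs
        "within_q": (q @ q.T).mean().item(),        # query-side spread
        "within_d": (d @ d.T).mean().item(),        # document-side spread
    }

stats = alignment_stats(torch.randn(64, 16), torch.randn(64, 16))
```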
Representative quantitative results for these parameterizations on retrieval tasks (Dong et al., 2022):
| Model | MS MARCO MRR | MS MARCO P@1 | NQ MRR | NQ P@1 |
|---|---|---|---|---|
| SDE | 28.49% | 15.92% | 61.15% | 48.87% |
| ADE | 26.31% | 14.20% | 59.38% | 47.83% |
| ADE-SPL | 28.20% | 15.46% | 61.92% | 50.06% |
4. Architectural Innovations for Efficiency and Interaction
While the two-tower paradigm is motivated by scalability—enabling offline computation and caching of one side’s embeddings, fast indexing, and large-scale retrieval—recent work has addressed its tendency to omit cross-entity interactions:
- Interaction modules: FE-Blocks in IntTower inject fine-grained, early, multi-head interactions between user and item features, capturing explicit alignment signals, while still enabling pre-computation and scalable inference (Li et al., 2022).
- Contrastive regularization: IntTower supplements binary prediction loss with intermediate InfoNCE-based contrastive signals, further aligning representations (Li et al., 2022).
- Bridge layers in vision–language: BridgeTower wires the top layers of each uni-modal encoder into the corresponding layers of the cross-modal encoder, overcoming the late-fusion limitation and yielding multi-level semantic alignment (Xu et al., 2022).
- Learning-to-rank bias handling: Additive bias terms for presentation (position, device), modeled as a separate tower, coupled optionally with explicit randomization or inverse propensity weighting, ensure unbiased estimation under practical logging policies (Hager et al., 25 Jun 2025, Hager et al., 29 Aug 2025).
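A minimal version of such an additive bias tower, with hypothetical names and a fixed number of ranked positions, might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasTowerClickModel(nn.Module):
    """Sketch of an additive bias tower: click logit = relevance(q, d) + bias(position).
    Class and attribute names, and the per-position embedding, are illustrative assumptions."""
    def __init__(self, num_positions: int = 10):
        super().__init__()
        self.position_bias = nn.Embedding(num_positions, 1)     # one logit offset per rank

    def forward(self, q_emb, d_emb, position):
        relevance = (q_emb * d_emb).sum(-1)                     # dot-product relevance from the two towers
        bias = self.position_bias(position).squeeze(-1)         # presentation (examination) bias
        return relevance + bias                                 # additive combination in logit space

model = BiasTowerClickModel()
logits = model(torch.randn(8, 16), torch.randn(8, 16), torch.randint(0, 10, (8,)))
clicks = torch.randint(0, 2, (8,)).float()
loss = F.binary_cross_entropy_with_logits(logits, clicks)       # optionally IPW-weighted per example
```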
Cascading two-tower encoders of increasing cost enables large-scale efficiency: a weak model is used to prefilter candidates, followed by a stronger model on a reduced set, achieving 3–6× lower lifetime encoding cost at near-uncompromised accuracy (Hönig et al., 2023).
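The cascade can be sketched as a two-stage retrieval routine in which the expensive tower only touches the survivors of the cheap pass; the encoders, index layout, and cutoff values below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def cascade_retrieve(query, weak_tower, strong_tower, docs, weak_index,
                     prefilter_k: int = 100, final_k: int = 10):
    """Two-stage cascade: a cheap tower prefilters against a precomputed index,
    then a stronger tower re-encodes and re-scores only the survivors."""
    weak_scores = weak_index @ weak_tower(query)                           # cheap first pass over all docs
    candidates = weak_scores.topk(min(prefilter_k, len(docs))).indices
    strong_scores = strong_tower(docs[candidates]) @ strong_tower(query)   # expensive pass on survivors only
    top = strong_scores.topk(min(final_k, candidates.numel())).indices
    return candidates[top]

# Toy usage with random document feature vectors; the weak index would be built offline.
docs = torch.randn(1000, 32)
weak_tower, strong_tower = nn.Linear(32, 8), nn.Linear(32, 64)
hits = cascade_retrieve(torch.randn(32), weak_tower, strong_tower, docs, weak_tower(docs))
```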
5. Modalities and Domain-Specific Extensions
Two-tower architectures are now standard in multi-modal retrieval, zero-shot recognition, recommender systems, and more:
- Multimodal pairing: Audio–text systems for instrument recognition and vision–language retrieval pretrain separate encoders and project to a joint space, using contrastive InfoNCE loss (Vasilakis et al., 2024, Xu et al., 2022). In music, the main bottleneck remains in the text encoder, highlighting the need for domain adaptation or more expressive projections (Vasilakis et al., 2024).
- NER with contrastive bi-encoders: Named entity spans and type descriptors are embedded into a shared space via decoupled transformer towers, and a dynamic thresholding loss learns optimal boundaries for predicting entities without over-penalizing non-entity (“O”-class) spans (Zhang et al., 2022); a scoring sketch follows this list.
- Recommendation: Standard user–item towers can be trained with asymmetric updates, e.g., moving-aggregation or “one backpropagation” schemes, leading to better metric performance and reduced computation. IntTower introduces attention (Light-SE), FE-Block, and CIR modules for more effective user–item match scoring (Chen et al., 2024, Li et al., 2022).
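To make the span–type matching concrete, the following sketch scores candidate span embeddings against type embeddings and leaves low-scoring spans unlabeled; the fixed threshold stands in for the learned, dynamic one described above, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def label_spans(span_emb: torch.Tensor, type_emb: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """Score every candidate span against every entity-type embedding; spans whose best
    cosine score falls below the threshold stay unlabeled (the "O" class, encoded as -1)."""
    scores = F.normalize(span_emb, dim=-1) @ F.normalize(type_emb, dim=-1).T   # (S, T)
    best_score, best_type = scores.max(dim=-1)
    return torch.where(best_score >= threshold, best_type, torch.full_like(best_type, -1))

labels = label_spans(torch.randn(20, 16), torch.randn(5, 16))
```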
6. Analysis of Identifiability, Bias, and Robustness
Recent theoretical work has clarified the identifiability of additive bi-encoder models—i.e., the extent to which the underlying parameters can be uniquely decomposed from observed click or interaction data:
- Additive shift invariance: The observable sum of bias and relevance logits is unchanged when a constant is added to one term and subtracted from the other; to anchor a unique solution, either explicit randomization of presentation positions or systematic feature overlap across positions is required (Hager et al., 29 Aug 2025, Hager et al., 25 Jun 2025); a numeric illustration follows this list.
- Feature overlap and graph connectivity: If the feature distributions at each presentation position have overlapping support and the relevance MLP is continuous, the model’s parameters are uniquely identifiable up to a global shift (Hager et al., 25 Jun 2025).
- Confounding by logging policies: Misspecification can interact with deterministic or strongly biased logging, causing systematic distortion of learned relevance and bias parameters. Inverse propensity weighting is proposed for mitigation, but requires positivity (every item must be observable at every position) (Hager et al., 25 Jun 2025, Hager et al., 29 Aug 2025).
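A toy numeric check of the shift-invariance point (values are arbitrary):

```python
import torch

# Moving a constant from relevance to position bias leaves every observable click logit
# unchanged, so the decomposition is identifiable only up to a global shift unless
# randomization or feature overlap anchors it.
relevance = torch.tensor([2.0, 0.5, -1.0])          # per-document relevance logits
bias = torch.tensor([0.0, -0.7, -1.5])              # per-position examination bias logits
c = 3.2                                             # arbitrary constant shift
original = relevance[:, None] + bias[None, :]       # logits for every (document, position) pair
shifted = (relevance + c)[:, None] + (bias - c)[None, :]
assert torch.allclose(original, shifted)            # identical observable logits
```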
Best practices emphasize: monitoring residual–propensity correlations, explicit data collection for randomization or overlap, matching logging and model features, controlled regularization of model expressivity, and careful validation of both accuracy and parameter recovery.
7. Practical Guidelines and Current Limitations
Best practices for two-tower bi-encoder design and deployment, converging from multiple empirical and theoretical fronts, include:
- Parameter sharing: Prefer at least projection-level sharing (ADE-SPL) unless strong role heterogeneity exists.
- Loss selection: Use in-batch InfoNCE with sufficiently large negative pools (augmented by CBNS or SamToNe for difficult cases).
- Alignment diagnostics: Assess representation overlap and metric spread via t-SNE or similarity histograms; embedding collapse or misalignment is a red flag (Dong et al., 2022, Moiseev et al., 2023).
- Interaction: Add multi-head cross-tower modules (e.g., FE-Block) or bridge connections (vision–language) when pure late fusion underperforms.
- Bias correction: Incorporate inverse propensity weighting and encourage randomization or feature overlap for unbiased click modeling.
- Domain adaptation: For cross-modal or specialized domains (e.g., music, NER), fine-tune under domain-specific constraints and consider joint or dynamic thresholding of decision boundaries.
Current limitations include the challenge of semantic alignment for bottlenecked modalities (e.g., text in music retrieval (Vasilakis et al., 2024)), quadratic enumeration costs in span-based NER (Zhang et al., 2022), and sensitivity to data distribution and feature support in unbiased learning-to-rank. Future research targets enriched compositionality in modality-poor branches, scalable negative mining, and more expressive, interaction-rich architectures that preserve the efficiency benefits of the bi-encoder paradigm.