Two-Tower Dual Encoder Models

Updated 13 April 2026

Two-tower dual encoder models are architectures that use separate networks to encode distinct inputs into embeddings which are then compared via dot product for fast similarity scoring.
They employ contrastive loss with advanced negative sampling techniques, such as hard negatives and cross-batch examples, to enhance ranking performance.
Recent advances integrate early interaction modules and multimodal extensions to balance richer cross-tower signals with the efficiency of decoupled retrieval.

A two-tower (dual-encoder) model consists of two separate neural encoding networks—each operating on a different input (e.g., a query and a candidate document, user and item, text and image)—whose outputs are combined by a simple interaction function (typically dot product), yielding a similarity or relevance score. This paradigm enables scalable retrieval and ranking across massive candidate sets, as one side’s representations may be precomputed and indexed for fast nearest neighbor search at query time. Two-tower models are foundational in information retrieval, recommendation, vision-language alignment, and unbiased learning-to-rank.

1. Architectural Definition and Core Principles

In the canonical two-tower setup, each input $x$ (e.g., query, user, audio, text) is encoded as $f_Q(x)$ and each counterpart $y$ (e.g., document, item, image, candidate label) as $f_D(y)$, yielding embeddings in $\mathbb{R}^d$. The model score is a symmetric function, often the inner product: $s(x, y) = f_Q(x)^\top f_D(y)$. In learning-to-rank with position bias, the model is extended to an additive form: $P(C=1\mid q,d,k) = \sigma(\theta_k + f_Q(q)^\top f_D(d))$, where $C$ is the click indicator, $k$ is the display position, $\theta_k$ models position-dependent bias, and $\sigma(\cdot)$ is the sigmoid function (Hager et al., 29 Aug 2025, Hager et al., 25 Jun 2025).
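The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the embeddings have already been produced by the towers; the vectors and the per-position bias values `theta` are made-up numbers, not taken from any cited work.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    """Logistic function sigma(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def score(query_emb, doc_emb):
    """s(x, y) = f_Q(x)^T f_D(y), applied to precomputed embeddings."""
    return dot(query_emb, doc_emb)

def click_probability(query_emb, doc_emb, position_bias, k):
    """P(C=1 | q, d, k) = sigmoid(theta_k + f_Q(q)^T f_D(d))."""
    return sigmoid(position_bias[k] + score(query_emb, doc_emb))

# Illustrative values only (assumptions, not from the source).
q = [0.5, -1.0, 2.0]          # stands in for f_Q(q)
d = [1.0, 0.0, 0.5]           # stands in for f_D(d)
theta = [0.0, -0.5, -1.0]     # hypothetical position-bias logits theta_k

s = score(q, d)
p_top = click_probability(q, d, theta, 0)
p_low = click_probability(q, d, theta, 2)
```

With these numbers the same query-document pair receives a lower click probability at the third position than at the first, purely because of the additive bias term.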

The major architectural variants are:

Siamese Dual Encoder (SDE): all encoder parameters are shared, forcing embeddings into a single latent space.
Asymmetric Dual Encoder (ADE): encoders for each modality have independent parameters, allowing for tuning to domain mismatch, at the cost of potential representation misalignment (Dong et al., 2022, Moiseev et al., 2023).

Parameter sharing, especially in the projection layer, plays a critical role in embedding-space alignment and retrieval efficacy (Dong et al., 2022).
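The difference between these variants comes down to which parameter objects the two towers share. The sketch below is a toy illustration under strong assumptions: each tower is reduced to a linear "encoder" followed by a linear projection, and the class `Tower` and helper names are hypothetical, not an API from the cited papers.

```python
import random

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng):
    """Random weights, standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

class Tower:
    """Toy tower: linear encoder followed by a linear projection layer."""
    def __init__(self, encoder, projection):
        self.encoder = encoder
        self.projection = projection

    def __call__(self, x):
        return matvec(self.projection, matvec(self.encoder, x))

rng = random.Random(0)
in_dim, hid_dim, out_dim = 4, 3, 2

# SDE: both towers reference the same encoder and projection parameters.
shared_enc = random_matrix(hid_dim, in_dim, rng)
shared_proj = random_matrix(out_dim, hid_dim, rng)
sde_q = Tower(shared_enc, shared_proj)
sde_d = Tower(shared_enc, shared_proj)

# ADE: every parameter is independent per tower.
ade_q = Tower(random_matrix(hid_dim, in_dim, rng), random_matrix(out_dim, hid_dim, rng))
ade_d = Tower(random_matrix(hid_dim, in_dim, rng), random_matrix(out_dim, hid_dim, rng))

# ADE with a shared projection layer: separate encoders, one projection.
spl_proj = random_matrix(out_dim, hid_dim, rng)
spl_q = Tower(random_matrix(hid_dim, in_dim, rng), spl_proj)
spl_d = Tower(random_matrix(hid_dim, in_dim, rng), spl_proj)

x = [1.0, 0.0, -1.0, 0.5]
```

In the shared-projection variant, both towers map their hidden states through the same final layer, which is one simple way to push the outputs toward a common embedding space while keeping the encoders specialized.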

2. Training Objectives: Contrastive Loss and Negative Sampling

Two-tower models are almost universally trained with contrastive losses over large volumes of pairwise (or higher-order) input data. For retrieval, the objective is to maximize the similarity of true pairs while minimizing it for negatives: $\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \log\frac{e^{\mathrm{sim}(q_i,p_i)/\tau}}{\sum_{j=1}^N e^{\mathrm{sim}(q_i,p_j)/\tau}}$, where $N$ is the batch size, $\mathrm{sim}$ is cosine similarity or dot product, $\tau$ is a temperature, and the denominator includes in-batch negatives (the other candidates in the minibatch) (Moiseev et al., 2023, Liu et al., 2022, Ni et al., 2021).
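As a rough illustration, the in-batch loss can be computed directly from two lists of embeddings. The sketch below assumes dot-product similarity and a temperature of 0.1 (both choices are assumptions for the example), and uses the log-sum-exp trick for numerical stability.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(queries, passages, tau=0.1):
    """Mean over i of -log softmax_j(sim(q_i, p_j) / tau)[i].

    passages[i] is the positive for queries[i]; every other passage in
    the batch serves as a negative for that query.
    """
    n = len(queries)
    total = 0.0
    for i in range(n):
        logits = [dot(queries[i], passages[j]) / tau for j in range(n)]
        m = max(logits)  # subtract the max before exponentiating
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Toy batch: aligned pairs give a small loss, swapped pairs a large one.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)

# When all embeddings are identical, the loss equals log(N).
loss_uniform = in_batch_contrastive_loss([[1.0, 0.0]] * 2, [[1.0, 0.0]] * 2)
```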

Refinements have emerged:

Hard negative mining: inclusion of negatives highly similar to queries to strengthen discriminative power (Bhowmik et al., 2021).
Same-tower negatives: augmenting the contrastive denominator with within-tower hard negatives (e.g., queries vs other queries) to regularize and align the spaces, as in SamToNe (Moiseev et al., 2023).
Cross-batch negative sampling (CBNS): leveraging embeddings persisted across recent minibatches to increase the negative pool size with negligible overhead (Wang et al., 2021).
Auxiliary regularization: InfoNCE-based interaction regularization for fine-grained alignment (Li et al., 2022).
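Several of these refinements share one mechanism: they add extra terms to the denominator of the contrastive loss. The sketch below illustrates that idea with a FIFO memory of passage embeddings from recent minibatches, in the spirit of cross-batch negative sampling. The queue size, the helper names, and the choice to reuse stored embeddings as-is are assumptions for illustration, not details taken from the cited papers.

```python
import math
from collections import deque

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def loss_with_extra_negatives(query, positive, in_batch_negatives,
                              extra_negatives, tau=0.1):
    """Contrastive loss for one query with an enlarged negative pool."""
    candidates = [positive] + list(in_batch_negatives) + list(extra_negatives)
    logits = [dot(query, c) / tau for c in candidates]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]  # index 0 is the positive

def update_memory(memory, batch_passages):
    """Append the newest embeddings; the deque evicts the oldest ones."""
    memory.extend(batch_passages)

# Hypothetical memory holding at most 4 embeddings from earlier batches.
memory = deque(maxlen=4)
update_memory(memory, [[0.9, 0.1], [0.0, 1.0]])
update_memory(memory, [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]])

query, positive = [1.0, 0.0], [1.0, 0.0]
in_batch = [[0.0, 1.0]]

loss_in_batch_only = loss_with_extra_negatives(query, positive, in_batch, [])
loss_with_memory = loss_with_extra_negatives(query, positive, in_batch, memory)
```

Hard negatives or same-tower negatives could be passed through the same `extra_negatives` argument; only the source of the extra embeddings changes.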

3. Identifiability and Bias in Learning-to-Rank

Two-tower architectures for unbiased learning-to-rank can suffer from non-identifiability: different parameterizations of positional and relevance scores can yield identical predictions if, for each query-document pair $(q, d)$, only a single rank is observed. Sufficient conditions for identifiability include:

Randomized swaps: placing the same document at multiple positions forms a connected positional graph, enabling unique recovery up to a global shift (Hager et al., 29 Aug 2025, Hager et al., 25 Jun 2025).
Overlapping feature distributions: for learned relevance functions, identifiability follows if feature support overlaps across positions and the relevance function is continuous.
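A short worked argument may help show where the non-identifiability comes from. It uses the additive click model from Section 1 and writes $r(q,d) = f_Q(q)^\top f_D(d)$ for the relevance logit. The symbols $r$, $k(q,d)$, and $c_k$ are introduced here for illustration only, and the argument assumes the relevance model is flexible enough to absorb a per-position offset.

$$
\theta'_k = \theta_k + c_k, \qquad r'(q,d) = r(q,d) - c_{k(q,d)} \quad\Longrightarrow\quad \theta'_{k(q,d)} + r'(q,d) = \theta_{k(q,d)} + r(q,d).
$$

If every pair $(q,d)$ is logged at a single position $k(q,d)$, any choice of offsets $c_k$ leaves all predicted click probabilities unchanged, so the data cannot distinguish the two parameterizations. Once some pair is observed at both positions $k$ and $k'$, matching the predictions at both positions forces $c_k = c_{k'}$. When such swaps connect all positions, every $c_k$ equals one constant $c$, which is the remaining global shift.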

The influence of the logging/display policy is characterized by:

Well-specified models: when the model class can realize the click probabilities exactly, the logging policy does not bias the optimum.
Misspecification: when the model cannot realize the true click probabilities, systematic correlation between model errors and the logging policy biases learned parameters, especially as exposure becomes deterministic (e.g., production logs from highly optimized rankers).
Inverse propensity weighting (IPS): reweighting samples by the inverse of their exposure propensity under the logging policy flattens exposure, mitigating bias amplification in the learned parameters of misspecified models (Hager et al., 29 Aug 2025, Hager et al., 25 Jun 2025).
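The reweighting idea can be sketched as a weighted click log loss. The version below is an assumption-laden illustration: it takes the propensity to be the probability that the logging policy showed the document at its logged position, clips small propensities, and self-normalizes the weights. None of these choices is confirmed by the cited work.

```python
import math

def ips_weighted_log_loss(clicks, predicted_probs, propensities, clip=1e-3):
    """Self-normalized, inverse-propensity-weighted binary cross-entropy.

    clicks:          0/1 click labels
    predicted_probs: model click probabilities, strictly inside (0, 1)
    propensities:    assumed logging-policy exposure probabilities
    clip:            lower bound on propensities to limit weight variance
    """
    total, weight_sum = 0.0, 0.0
    for c, p, rho in zip(clicks, predicted_probs, propensities):
        w = 1.0 / max(rho, clip)
        nll = -(c * math.log(p) + (1 - c) * math.log(1.0 - p))
        total += w * nll
        weight_sum += w
    return total / weight_sum

clicks = [1, 0]
probs = [0.9, 0.2]

# Equal propensities reduce to the ordinary mean log loss.
uniform = ips_weighted_log_loss(clicks, probs, [0.5, 0.5])

# A rarely exposed sample (propensity 0.1) is up-weighted.
skewed = ips_weighted_log_loss(clicks, probs, [0.9, 0.1])
```

Here the second sample has the larger individual loss and the smaller propensity, so the skewed weighting raises the overall loss relative to the uniform case.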

Empirical studies confirm that identifiability (via swaps or feature overlap) is necessary and sufficient for unbiased parameter recovery, and IPS weighting is effective in debiasing under stochastic logging.

4. Recent Advances in Cross-Tower Interaction and Efficiency

A limitation of classic two-tower models is the decoupled scoring function, which limits accuracy relative to cross-encoder (full-interaction) models that encode both inputs jointly. Contemporary research extends the paradigm as follows:

Early/fine-grained interaction: Modules such as FE-Block (multi-head, early projection) or dual-generator frameworks are inserted pre- or mid-encoding to enable richer cross-tower signal propagation, as in IntTower (Li et al., 2022) and HIT (Yang et al., 26 May 2025).
Hierarchical/multi-head matchers: Multi-head projections with “max-then-sum” scoring across subspaces (HIT), enabling modeling of complex, multi-faceted user-item or user-ad relations (Yang et al., 26 May 2025).
Implicit/contrastive regularization: Contrastive interaction regularization (CIR) penalizes misalignment between multi-head projections of positive and negative pairs (Li et al., 2022).
Diffusion-based modules: Estimating user next-intention embeddings via a learned denoising diffusion process, then cross-attending these synthetic representations to historical behaviors in T2Diff (Wang et al., 28 Feb 2025).

Despite these augmentations, leading architectures (e.g., HIT, IntTower, T2Diff) report inference costs and serving latency comparable to base two-tower models thanks to maintained decoupling for item-side embeddings and careful offline/online division of computation.
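To make the multi-head matching idea concrete, here is one plausible reading of "max-then-sum" scoring: each user-side head takes its best match among the item-side heads, and the per-head maxima are summed. This interpretation and the function name are assumptions; the exact formulation in HIT may differ.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def max_then_sum_score(user_heads, item_heads):
    """For each user head, take the max dot product over item heads, then sum.

    Item heads depend only on the item, so under this reading they could
    still be precomputed offline like ordinary item embeddings.
    """
    return sum(max(dot(u, v) for v in item_heads) for u in user_heads)

# Two heads per side, two dimensions per head (illustrative values).
user_heads = [[1.0, 0.0], [0.0, 1.0]]
item_heads = [[1.0, 0.0], [0.5, 0.5]]

total = max_then_sum_score(user_heads, item_heads)
```

In this toy case the first user head matches the first item head best (score 1.0) and the second user head matches the second item head best (score 0.5).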

5. Dual Encoder Extensions in Multimodal and Vision-Language Domains

Two-tower models are now foundational in vision-language (VL) and multimodal retrieval systems:

Standard VL form: Separate encoders for image (e.g., ViT, CNN, ResNet) and text (Transformer, RoBERTa), with similarity in joint space. Training is via InfoNCE on millions of paired samples (e.g., CLIP-style) (Cheng et al., 2024).
Model alignment and paraphrase consistency: Freezing a pretrained LLM for the text tower and appending alignment layers preserves semantic similarity and increases paraphrastic robustness, ensuring consistent retrieval across paraphrased queries (Cheng et al., 2024).
Cross-modal knowledge injection: Distillation of cross-modal attention (from fusion-encoders to dual-encoders) substantially closes the gap to joint models’ VQA performance while preserving inference efficiency (Wang et al., 2021).
Multi-level feature fusion: BridgeTower-style architectures inject features from all upper uni-modal layers into each cross-modal layer, achieving stronger hierarchical VL alignment than purely last-layer fusion (Xu et al., 2022).
Multimodal failures and recommendations: Instrument recognition studies reveal that the audio encoders of current two-tower architectures perform well while the text encoders fail to leverage text context compositionally; the authors advocate targeted fine-tuning of text encoders on domain-specific corpora to restore semantic structure (Vasilakis et al., 2024).

6. Efficiency, Negative Sampling, and Practical Considerations

The fast retrieval and efficient training of two-tower models rest on three pillars:

Decoupling: Precomputing and caching the item-side (“right-tower”) embeddings, so at test time only the user/query is encoded and compared via efficient inner product (Bhowmik et al., 2021, Ni et al., 2021).
Large Minibatch/Negative Pools: Scaling up contrastive learning with in-batch, cross-batch (CBNS), and memory bank-based negative sampling dramatically accelerates convergence and improves ranking performance (Wang et al., 2021, Moiseev et al., 2023).
One-backpropagation optimization: Asymmetric updating strategies (backpropagating only through the item tower, while aggregating user embeddings via moving-average) further reduce training time with positive impacts on retrieval quality (Chen et al., 2024).
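The decoupling pillar can be illustrated with a small offline/online split: item embeddings are computed once and stored, and at query time only the query tower runs. This sketch uses identity functions as stand-in towers and a brute-force scan; both are assumptions for the example, and a production system would typically replace the scan with an approximate nearest neighbor index.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def build_index(items, item_tower):
    """Offline step: encode every item once and cache the embedding."""
    return {item_id: item_tower(features) for item_id, features in items.items()}

def retrieve(query_features, query_tower, index, k=2):
    """Online step: encode the query, then rank cached item embeddings."""
    q = query_tower(query_features)
    scored = ((dot(q, emb), item_id) for item_id, emb in index.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Stand-in towers (hypothetical): identity mappings instead of networks.
item_tower = lambda features: list(features)
query_tower = lambda features: list(features)

items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
index = build_index(items, item_tower)

top_items = retrieve([1.0, 0.2], query_tower, index, k=2)
```

The item tower is never called inside `retrieve`, which is what keeps serving cost independent of item-tower complexity.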

Rigorous empirical ablation indicates that hard-negative sampling, embedding-space alignment (via projection-layer sharing), and efficient sample weighting are critical to high-performance two-tower retrievers (Moiseev et al., 2023, Dong et al., 2022).

7. Embedding-Space Alignment and Interpretability

Embedding-space alignment is central to retrieval efficacy. When the two towers’ output spaces are misaligned (as in fully asymmetric ADE), queries and items cluster separately, leading to suboptimal dot-product matching. Sharing the final projection layer (ADE-SPL) or using contrastive regularization (SamToNe) drives the two embedding spaces into close proximity—as evident in t-SNE plots and sharper similarity distributions—yielding both improved ranking and enhanced semantic consistency (Dong et al., 2022, Moiseev et al., 2023).

Interpretability is further improved through:

Attention pooling over tokens (for dialogue): Focusing on decisive tokens in context/response pairs and minimizing mutual information between unattended context and output (Li et al., 2020).
Residual word embeddings: Retaining a fraction of the token’s raw features at each prediction step, yielding sharper and more interpretable token-level influence maps (Li et al., 2020).
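As a generic illustration of attention pooling, the sketch below scores each token embedding against a single attention vector, normalizes the scores with a softmax, and returns both the pooled embedding and the weights. The attention vector and token values are made up, and this is a textbook form of attention pooling rather than the specific mechanism of the cited work.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(token_embeddings, attention_vector):
    """Return (pooled embedding, per-token attention weights)."""
    scores = [dot(t, attention_vector) for t in token_embeddings]
    weights = softmax(scores)
    dim = len(token_embeddings[0])
    pooled = [sum(w * t[i] for w, t in zip(weights, token_embeddings))
              for i in range(dim)]
    return pooled, weights

# Three toy token embeddings; the third aligns most with the attention vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
attention_vector = [1.0, 0.0]  # hypothetical learned parameter

pooled, weights = attention_pool(tokens, attention_vector)
```

The weights are what make the pooling inspectable: they indicate how much each token contributed to the final embedding.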

The two-tower (dual-encoder) paradigm continues to underlie advances in scalable retrieval, recommendation, and multimodal learning, with ongoing research addressing cross-tower interaction, identifiability, bias mitigation in learning-to-rank, and semantic alignment across heterogeneous input spaces.