
Two-tower Retrieval Overview

Updated 24 December 2025
  • Two-tower Retrieval is a dual-encoder architecture that independently encodes queries and items into a shared embedding space for scalable matching.
  • It is trained with contrastive objectives and negative sampling that pull query–positive pairs together while pushing negatives apart in the shared embedding space.
  • Extensions include cross-interactions, conditional retrieval, and domain adaptation strategies that enhance performance in large-scale, multimodal applications.

Two-tower retrieval is a dominant paradigm in neural candidate retrieval and large-scale matching systems across search, recommendation, and multimodal retrieval. It leverages a dual-encoder architecture in which queries and items are encoded independently, typically into a shared embedding space, enabling scalable and efficient approximate nearest neighbor (ANN) search. This architecture supports efficient large-scale inference with precomputed indexes for items and online embedding for queries, offering a practical tradeoff between retrieval accuracy and computational cost. Recent work has extended the classic two-tower design along multiple dimensions, including domain adaptation, multimodality, LLM scaling, explicit feature interaction, conditional retrieval, and theoretical analysis.

1. Architectural Foundations and Variants

The canonical two-tower architecture consists of two separate encoders: one for queries (or users) and one for items (or documents, products, passages, modalities). Given a query $x$ and an item $y$, their embeddings are computed as $q = f_q(x)$ and $p = f_p(y)$ (potentially with BERT, ResNet, or Transformer backbones) (Liang et al., 2020, Vasilakis et al., 25 Jul 2024, Kekuda et al., 3 May 2025). These encoders may share parameters (Siamese) or be decoupled ("two-tower"). The outputs are projected, sometimes via linear layers or MLPs, to a joint embedding space. Similarity is commonly scored by dot product or cosine similarity:

$$s(q, p) = \langle f_q(x), f_p(y) \rangle \quad \text{or} \quad \text{sim}(e_q, e_p) = \frac{e_q \cdot e_p}{\|e_q\|\,\|e_p\|}$$

This design supports precomputing and indexing item embeddings for fast ANN retrieval (Su et al., 2023, Osowska-Kurczab et al., 19 Jul 2025, Li et al., 2023).
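
As an illustration, the following is a minimal PyTorch sketch of a decoupled two-tower scorer; the MLP towers, layer sizes, and embedding dimension are placeholder assumptions standing in for the BERT/ResNet/Transformer backbones used in the cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """A small MLP tower; in practice this would wrap a BERT/ResNet/Transformer backbone."""
    def __init__(self, input_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that dot product equals cosine similarity.
        return F.normalize(self.net(x), dim=-1)

class TwoTowerModel(nn.Module):
    def __init__(self, query_dim: int, item_dim: int, embed_dim: int = 128):
        super().__init__()
        self.query_tower = Tower(query_dim, embed_dim)  # f_q
        self.item_tower = Tower(item_dim, embed_dim)    # f_p

    def score(self, query_feats: torch.Tensor, item_feats: torch.Tensor) -> torch.Tensor:
        q = self.query_tower(query_feats)   # (B, d) query embeddings
        p = self.item_tower(item_feats)     # (B, d) item embeddings
        return (q * p).sum(dim=-1)          # per-pair cosine similarity

# Illustrative usage with random features.
model = TwoTowerModel(query_dim=64, item_dim=32)
scores = model.score(torch.randn(8, 64), torch.randn(8, 32))
```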

Typical extensions adapt the encoder structures for modalities beyond text (e.g., audio (Vasilakis et al., 25 Jul 2024), video (Lan et al., 5 Sep 2025)), introduce domain-specific tokenization or field aggregation (Kekuda et al., 3 May 2025, Osowska-Kurczab et al., 19 Jul 2025), or apply alternative normalizations (e.g., $\ell_1$ normalization for chi-square kernels) (Li et al., 2023).

2. Training Objectives and Negative Sampling

Two-tower models are trained with a contrastive (InfoNCE) or cross-entropy loss formulated to pull query–positive item pairs together and push apart negatives. The standard loss is:

$$L = -\log \frac{\exp s(q, p^+)}{\sum_{i=1}^N \exp s(q, p_i)}$$

where $p^+$ is the relevant item and $\{p_i\}$ are negatives drawn from the batch or corpus (Liang et al., 2020). Bidirectional losses sum over query-to-item and item-to-query directions (Vasilakis et al., 25 Jul 2024, Moiseev et al., 2023). Advanced schemes include in-batch negatives, hard negatives, and, in SamToNe, augmenting the denominator with same-tower negatives, i.e., competing queries are explicitly penalized to prevent collapse and improve space alignment (Moiseev et al., 2023). In cross-modal contexts, weighted sampling and multi-field negatives enhance robustness (Vasilakis et al., 25 Jul 2024).
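
A minimal sketch of the in-batch bidirectional InfoNCE objective above, assuming L2-normalized embeddings; the temperature value is an illustrative assumption, and the SamToNe same-tower term and hard-negative mining from the cited papers are not reproduced here.

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: q[i] and p[i] form a positive pair;
    all other items in the batch act as negatives for q[i]."""
    logits = q @ p.T / temperature                    # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    loss_q2p = F.cross_entropy(logits, targets)       # query -> item direction
    loss_p2q = F.cross_entropy(logits.T, targets)     # item -> query direction (bidirectional loss)
    return 0.5 * (loss_q2p + loss_p2q)

# q = model.query_tower(query_feats); p = model.item_tower(item_feats)
# loss = in_batch_infonce(q, p)
```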

Models may also incorporate auxiliary objectives, such as synthetic query generation for zero-shot scenarios (Liang et al., 2020) or joint contrastive/semantic supervision (Vasilakis et al., 25 Jul 2024, Moiseev et al., 2023).

3. Scaling, Domain Adaptation, and Synthetic Data

Large-scale deployments often encounter sparse and long-tail distributions of queries, products, or interactions. To address data scarcity, synthetic positives may be generated using LLMs: e.g., LLaMA-based query generators create synthetic query–product pairs for tail products, substantially improving recall and conversion in e-commerce applications (Kekuda et al., 3 May 2025). Pretraining encoders on domain-specific corpora (e.g., proprietary catalogs or large query logs) provides strong initialization, while sequential fine-tuning strategies (such as query–query followed by query–product) further specialize the embedding space.

In high-scale production, merging models via weight averaging ("model soup") enables leveraging complementary strengths of multiple finetuned variants without increasing inference cost (Kekuda et al., 3 May 2025).
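
A minimal sketch of uniform weight averaging over fine-tuned checkpoints that share one architecture; the checkpoint filenames are hypothetical, and the cited system may use selective ("greedy") rather than uniform averaging.

```python
import torch

def uniform_soup(state_dicts):
    """Average the parameters of several fine-tuned variants of the same architecture."""
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged

# Hypothetical checkpoints from different fine-tuning runs of the same model:
# checkpoints = [torch.load(path, map_location="cpu") for path in ["run_a.pt", "run_b.pt", "run_c.pt"]]
# model.load_state_dict(uniform_soup(checkpoints))
```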

For cross-domain and zero-shot transfer, synthetic query generation pipelines (e.g., BART generation of queries given unlabeled passages) enable retrieval models to generalize to new domains and outperform classical lexical baselines such as BM25 on most established benchmarks (Liang et al., 2020).

4. Extensions: Cross-Interaction, Conditional Retrieval, and Hybrid Models

Classic two-tower models are limited in their expressivity, capturing only global (vector-level) similarity, and do not explicitly model fine-grained feature or cross-modal interactions. Multiple strategies extend this paradigm:

  • Fine-grained cross-interactions: SparCode introduces all-to-all interactions between quantized code embeddings for the query and tokenized items. Discrete codebooks and a sparse inverted index allow O(1) retrieval complexity while matching or exceeding classic two-tower accuracy, controlled via sparsity thresholding (Su et al., 2023).
  • Hybrid architectures: Hybrid-Tower models inject fine-grained pseudo-query interactions offline (e.g., pre-generating a text-like feature from video frames and patches, fused into the stored video representation). On retrieval, the model matches the speed of two-tower systems but approaches the accuracy of compute-intensive single-tower models (Lan et al., 5 Sep 2025).
  • Conditional retrieval: Instead of post-filtering, side conditions (e.g., topic, price-range, merchant) are concatenated into the user embedding at training and inference, yielding conditional retrieval models. This allows bootstrapping condition-specific feeds using only standard engagement logs and minimal serving overhead, with substantial gains on both topical and engagement metrics (Lin et al., 22 Aug 2025); a minimal sketch follows this list.
  • Cross-interaction decoupling: T2Diff leverages a diffusion module to reconstruct the user's next positive intention entirely within the user tower, complemented by explicit session-history fusion via mixed attention. This architecture outperforms earlier cross-tower interaction methods and closes the gap with heavy cross-encoder models, with negligible latency overhead (Wang et al., 28 Feb 2025).
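
To make the conditional retrieval idea concrete, the sketch below concatenates a learned condition embedding (e.g., a topic ID) with the user features before the user tower; the feature dimensions and the exact injection point are illustrative assumptions, not the design of the cited system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalUserTower(nn.Module):
    """User tower that takes an extra condition (e.g., a topic ID) as input.
    The condition is supplied both at training time (from engagement logs)
    and at serving time (the condition of the feed being generated)."""
    def __init__(self, user_dim: int, num_conditions: int, cond_dim: int = 16, embed_dim: int = 128):
        super().__init__()
        self.cond_emb = nn.Embedding(num_conditions, cond_dim)
        self.net = nn.Sequential(
            nn.Linear(user_dim + cond_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, user_feats: torch.Tensor, condition_ids: torch.Tensor) -> torch.Tensor:
        x = torch.cat([user_feats, self.cond_emb(condition_ids)], dim=-1)
        return F.normalize(self.net(x), dim=-1)

# tower = ConditionalUserTower(user_dim=64, num_conditions=1000)
# q = tower(torch.randn(4, 64), torch.tensor([3, 3, 17, 42]))
```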

5. Theoretical Analysis, Efficiency, and Scaling

Two-tower retrieval architectures achieve computational scalability by indexing (precomputing) all item embeddings and encoding queries (users) online. With approximate indexes such as HNSW or IVF-PQ (e.g., via Faiss), per-query retrieval cost is roughly $O(d \log |I|)$, sublinear in the corpus size $|I|$. For extremely sparse embeddings (induced via $\ell_1$ regularization or ReLU), similarity computations become much faster and allow further acceleration by hashing (e.g., with Sign Cauchy Random Projections, yielding up to 20× gains in large-scale systems) (Li et al., 2023).
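
A minimal sketch of the offline-index / online-query split with Faiss, assuming L2-normalized embeddings so that inner product equals cosine similarity; an exact flat index is used here for simplicity, whereas production systems would substitute an approximate index such as HNSW or IVF-PQ.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                     # embedding dimension
item_embs = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(item_embs)               # cosine similarity via inner product

# Offline: build the item index once from precomputed item-tower embeddings.
index = faiss.IndexFlatIP(d)                # exact search; swap for HNSW / IVF-PQ at scale
index.add(item_embs)

# Online: encode the query with the query tower and retrieve top-k candidates.
query_emb = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query_emb)
scores, item_ids = index.search(query_emb, 10)
```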

Recent theoretical results illuminate optimality and new training regimes:

  • Distillation and joint optimization: LT-TTD integrates two-tower retrieval with a transformer re-ranker via listwise and distillation objectives. Theoretical guarantees include a provable reduction in irretrievable relevant items (as a function of distillation strength), and global optimality of joint training over disjoint optimization, with bounded complexity (Abraich, 7 May 2025).
  • Scaling with LLMs: ScalingNote demonstrates that scaling both query and document towers with LLM backbones, followed by query-only distillation for online service, delivers outsized improvements: recall@K increases by 7%+ over previous best, with full LLM-stage power available offline and lightweight BERT-like towers for realtime inference. Empirical scaling laws for model and data size are established (Huang et al., 24 Nov 2024).
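
A minimal sketch of the query-only distillation pattern described above: a lightweight student query tower is trained to mimic embeddings produced offline by an LLM-based teacher, so only the student runs at serving time. The loss choice (cosine distance on embeddings) and matching dimensions are illustrative assumptions, not the exact ScalingNote recipe.

```python
import torch
import torch.nn.functional as F

def embedding_distillation_loss(student_q: torch.Tensor, teacher_q: torch.Tensor) -> torch.Tensor:
    """Match the student's query embeddings to precomputed teacher embeddings.
    teacher_q is produced offline by the large LLM tower; only the student serves online."""
    student_q = F.normalize(student_q, dim=-1)
    teacher_q = F.normalize(teacher_q, dim=-1)
    return (1.0 - (student_q * teacher_q).sum(dim=-1)).mean()   # mean cosine distance

# Hypothetical usage, assuming both towers project to the same dimension:
# loss = embedding_distillation_loss(student_tower(query_feats), teacher_embeddings_batch)
```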

6. Practical Applications and Domain-Specific Adaptations

Two-tower retrieval is extensively used in commercial systems, including product and content search (Best Buy (Kekuda et al., 3 May 2025), Allegro.com (Osowska-Kurczab et al., 19 Jul 2025)), music and audio retrieval (CLAP, MusCALL (Vasilakis et al., 25 Jul 2024)), document/dense passage retrieval (BART/BERT (Liang et al., 2020)), user-to-item recommendations in social media and e-commerce (Lin et al., 22 Aug 2025, Su et al., 2023), and text-to-video (Lan et al., 5 Sep 2025). In all settings, item representations are indexed for fast retrieval, and high-throughput candidate generation is performed online at low latency (typically sub-100 ms per query at scale).

Models generalize across domains via synthetic data and domain-adaptive pretraining, and deploy robustly even in cold-start scenarios, as content-based encoders obviate the need for pre-accumulated behavior data (Osowska-Kurczab et al., 19 Jul 2025).

Multi-tower variants, domain-specific text augmentation, prompt ensembling (for multimodal prompt robustness), and semantic supervision with ontologies (to inject hierarchical knowledge) are emerging best practices (Vasilakis et al., 25 Jul 2024).

7. Evaluation, Challenges, and Future Directions

Evaluation protocols rely on recall@k, NDCG@k, MR/PR AUC (for retrieval/classification), and online metrics such as conversion and click-through rates. The introduction of aggregate metrics—such as UPQE, which combines NDCG, error propagation, and computational costs—enables holistic assessment of unified ranking architectures (Abraich, 7 May 2025).
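
For reference, a minimal sketch of offline recall@k over retrieved candidate lists; the data layout (a ranked candidate list and a set of relevant item IDs per query) is an assumption about how evaluation logs are stored.

```python
def recall_at_k(retrieved, relevant, k):
    """Average, over queries, of the fraction of relevant items found in the top-k candidates."""
    per_query = []
    for candidates, relevant_ids in zip(retrieved, relevant):
        if relevant_ids:  # skip queries without labeled relevant items
            hits = len(set(candidates[:k]) & relevant_ids)
            per_query.append(hits / len(relevant_ids))
    return sum(per_query) / len(per_query)

# Example: two queries, k=3 -> (0.5 + 1.0) / 2 = 0.75
print(recall_at_k([[5, 2, 9, 1], [7, 8, 3]], [{2, 4}, {3}], k=3))
```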

Despite progress, important challenges remain:

  • Feature interaction limits: Many domains (e.g., music/audio, cross-modal) expose the limited semantic depth of standard text encoders, and current models often depend heavily on the prompt or label wording (Vasilakis et al., 25 Jul 2024).
  • Tradeoff in cross-interaction and efficiency: Methods combining “pseudo-query” offline fusion and sparse all-to-all interaction offer a promising balance, but full equivalence to cross-encoders is not yet achieved (Lan et al., 5 Sep 2025, Su et al., 2023).
  • Noise in synthetic data and “semantic drift” in domain adaptation remain open challenges (Liang et al., 2020).
  • Scalability with LLMs is practically constrained by online latency, mandating hybridization with distillation and multi-resolution indexing (Huang et al., 24 Nov 2024).
  • Cold-start and index maintenance are critical engineering considerations as catalog and user features evolve rapidly (Osowska-Kurczab et al., 19 Jul 2025).

Future work targets further improving prompt robustness and semantic coverage (e.g., ontology-driven contrastive training), integrating smarter negative mining, leveraging multi-task pretraining, and continuing advances in efficient cross-encoder hybridization. The modularity of two-tower systems allows easy incorporation of novel architectures and data pipelines, maintaining their central role in industrial-scale retrieval and matching.

