Dual-Tower Synergy for Consistent Indexing
- The paper introduces a dual-view indexing mechanism that aligns query and item embeddings, boosting coarse candidate selection and overall retrieval performance.
- It details a methodology where K-means clustering in query space and residual quantization in item space eliminate cross-tower spatial distortions.
- Empirical results show significant improvements in recall and ranking metrics on benchmarks like MS-MARCO and real-world e-commerce data.
Consistent Indexing with Dual-Tower Synergy Module (CI) is a framework developed to address limitations in large-scale dense retrieval systems, specifically those stemming from representational misalignment in dual-tower architectures. In conventional dense retrieval, dual-tower models encode queries and items separately, into distinct embedding spaces. When such representations are merged within a single retrieval index, the resulting spatial mismatch can degrade matching accuracy and retrieval stability and hurt performance on long-tail queries. The CI module introduces a dual-view indexing scheme that preserves semantic consistency between retrieval stages, integrates tightly with standard hierarchical indexing architectures (e.g., IVF-PQ), and supports billion-scale deployment without additional storage or online computational overhead (Wang et al., 15 Dec 2025).
1. Motivation and Problem Setting
Dense retrieval systems, which have become dominant in large-scale information retrieval due to their efficiency and accuracy, usually employ a coarse-to-fine hierarchical architecture. The prevalent dual-tower structure comprises two separate encoders, one for queries and one for items, which produce embeddings in potentially misaligned spaces. During index construction and retrieval, this asymmetry causes two primary issues:
- Space misalignment: Query and item embeddings are not guaranteed to share geometric consistency, distorting nearest neighbor retrieval.
- Index inconsistency: Clustering, residual quantization, and candidate selection may operate across heterogeneous embedding spaces, degrading both recall and ranking metrics.
These alignment issues become increasingly consequential in generative recommendation systems utilizing semantic identifiers, where conflicting geometry between training and inference reduces the capacity and generalization of downstream models (Wang et al., 15 Dec 2025).
2. Dual-View Indexing Strategy
To resolve representational inconsistencies, CI transforms the dual-tower pipeline into a two-view indexing mechanism. Let $f_Q$ denote the query-tower encoder and $f_I$ the item-tower encoder. Offline, each item $x_i$ in the corpus $\mathcal{D}$ is processed as follows:
- Query-tower encoding: $u_i = f_Q(x_i)$, providing a structural vector in the query embedding space.
- Item-tower encoding: $v_i = f_I(x_i)$, yielding a representation vector with potentially enriched, item-specific semantics.
K-means clustering is performed on $\{u_i\}$ to determine centroids $c_1, \dots, c_K$ in the query space. Every item is assigned to the nearest centroid in this space:
$$a_i = \arg\min_{k} \lVert u_i - c_k \rVert_2$$
Residual vectors, representing item-specific detail, are then computed in the item space:
$$r_i = v_i - c_{a_i}$$
The index (e.g., IVF-PQ) maintains centroids in the query space and per-item product-quantized codes on the item-space residuals. This strictly segregates the structural (query-space) and residual (item-space) aspects of indexing, ensuring that the coarse candidate selection is always aligned with query geometry, while the fine stage leverages item-specific expressivity (Wang et al., 15 Dec 2025).
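The offline construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `kmeans` and `build_dual_view_index`, the plain Lloyd's k-means, and the synthetic embeddings (where the item-tower view is assumed to be a small perturbation of the query-tower view) are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d2.argmin(axis=1)

def build_dual_view_index(u, v, k):
    """u: query-tower item embeddings, v: item-tower item embeddings (same shape)."""
    centroids, assign = kmeans(u, k)    # coarse structure comes from the query space
    residuals = v - centroids[assign]   # fine detail comes from the item space
    return centroids, assign, residuals

# Synthetic stand-ins for the two tower outputs (assumption: towers are well aligned).
rng = np.random.default_rng(0)
u = rng.normal(size=(200, 8))
v = u + 0.05 * rng.normal(size=(200, 8))
centroids, assign, residuals = build_dual_view_index(u, v, k=4)
```

Because the residual is taken against a query-space centroid, the two towers must output vectors of the same dimension; adding a centroid back to its residual recovers the item-tower embedding exactly (before quantization).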
3. Formalization and Search Procedure
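The CI search protocol is defined as follows, where $f_Q$ is the query-tower encoder and $f_I$ the item-tower encoder:
- Embeddings: $u_i = f_Q(x_i)$ (query-tower view of item $x_i$), $v_i = f_I(x_i)$ (item-tower view).
- Similarity metric: inner product, $s(q, v) = \langle q, v \rangle$.
- Clustering: Centroids $c_1$ to $c_K$ derived from the query-tower item representations $\{u_i\}$.
- Residuals: $r_i = v_i - c_{a_i}$, where $a_i = \arg\min_k \lVert u_i - c_k \rVert_2$.
- Indexing: ANN structures are built with $\{c_k\}$ as coarse centroids and per-item quantized codes encoding $r_i$.
At query time:
- A query $x_q$ is embedded via $q = f_Q(x_q)$.
- The $n_{\text{probe}}$ closest centroids to $q$ are selected.
- Items indexed under these centroids have their residuals decoded (typically via PQ).
- Each candidate item is scored by:
$$\mathrm{score}(q, x_i) = \langle q,\; c_{a_i} + \hat{r}_i \rangle$$
where $\hat{r}_i$ is the decoded (quantized) residual of item $x_i$.
This aligns the initial candidate selection tightly with the learned query geometry, eliminating cross-tower distortions, while the fine-grained step exploits the rich representation of $v_i$ (Wang et al., 15 Dec 2025).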
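The query-time steps can be sketched as follows. This is an illustrative toy under stated assumptions: the function name `ci_search` is hypothetical, similarity is taken to be the inner product, the probed lists are found by a linear scan rather than stored inverted lists, and the "decoded" residuals passed in are exact rather than PQ-reconstructed.

```python
import numpy as np

def ci_search(q, centroids, assign, residuals_hat, nprobe=1, topk=10):
    """Coarse-to-fine search over a dual-view index (toy version)."""
    # Coarse stage: pick the nprobe centroids closest to the query embedding.
    d2 = ((centroids - q) ** 2).sum(axis=1)
    probed = np.argsort(d2)[:nprobe]
    # Candidates are the items assigned to a probed centroid.
    # (A real IVF index stores inverted lists instead of scanning `assign`.)
    cand = np.flatnonzero(np.isin(assign, probed))
    # Fine stage: reconstruct each candidate as centroid + decoded residual,
    # then score it against the query by inner product.
    recon = centroids[assign[cand]] + residuals_hat[cand]
    scores = recon @ q
    order = np.argsort(-scores)[:topk]
    return cand[order], scores[order]

# Tiny hand-built index: two centroids, three items.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
assign = np.array([0, 0, 1])
v = np.array([[0.9, 0.1], [1.0, -0.2], [0.1, 0.95]])  # item-tower embeddings
residuals_hat = v - centroids[assign]                  # exact residuals for the demo

q = np.array([1.0, 0.0])
ids_1, scores_1 = ci_search(q, centroids, assign, residuals_hat, nprobe=1, topk=2)
ids_2, scores_2 = ci_search(q, centroids, assign, residuals_hat, nprobe=2, topk=3)
```

With `nprobe=1` only the two items under the first centroid are scored; raising `nprobe` to 2 adds the third item to the candidate set without changing the relative order of the first two.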
4. Theoretical Consistency
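CI’s retrieval consistency theorem concerns the two views of each item: the query-tower embedding $u_i$ and the item-tower embedding $v_i$. It establishes that, under the condition that $u_i$ and $v_i$ are well aligned ($\lVert u_i - v_i \rVert$ is small) and the query-tower embedding space is isotropic, the ANN search in the $u$-space attains the same coarse candidate coverage as the ideal objective $\max_i \langle q, v_i \rangle$. The argument proceeds as:
- If $\lVert u_i - v_i \rVert \le \epsilon$ for all $i$, then $\lvert \langle q, u_i \rangle - \langle q, v_i \rangle \rvert \le \epsilon \lVert q \rVert$.
- For normalized, isotropic embeddings, maximizing $\langle q, u_i \rangle$ over $i$ is equivalent to minimizing $\lVert q - u_i \rVert_2$.
- Clustering and coarse filtering in the $u$-space produces a quantized, yet consistent, approximation of nearest-neighbor objectives.
A corollary is that using $u_i$ for clustering while retaining $v_i$ for residual quantization preserves semantic consistency and enables finer discrimination between items, leveraging the greater expressiveness of the item tower (Wang et al., 15 Dec 2025).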
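The two inequalities behind this argument can be written out directly. The short derivation below is a sketch under the assumptions of unit-norm embeddings and inner-product similarity; the symbols $q$ (query embedding), $u_i$ (query-tower embedding of item $i$), $v_i$ (item-tower embedding of item $i$), and $\epsilon$ (alignment tolerance, with $\lVert u_i - v_i \rVert_2 \le \epsilon$) are notation chosen here for illustration.

$$
\begin{aligned}
\lVert q - u_i \rVert_2^2 &= \lVert q \rVert_2^2 + \lVert u_i \rVert_2^2 - 2\langle q, u_i \rangle = 2 - 2\langle q, u_i \rangle, \\
\lvert \langle q, u_i \rangle - \langle q, v_i \rangle \rvert &= \lvert \langle q, u_i - v_i \rangle \rvert \le \lVert q \rVert_2 \, \lVert u_i - v_i \rVert_2 \le \epsilon .
\end{aligned}
$$

The first line shows that ranking by inner product and ranking by Euclidean distance coincide for unit vectors. The second, an application of the Cauchy-Schwarz inequality, shows that switching views shifts every score by at most $\epsilon$, so two items can change relative order only if their item-tower scores differ by no more than $2\epsilon$.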
5. Implementation Workflow and Pseudocode
The end-to-end CI construction and retrieval process can be summarized as:
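| Step | Offline Index Construction | Online Retrieval |
|---|---|---|
| Input | Trained $f_Q$, $f_I$, corpus $\mathcal{D}$ | Query $x_q$ |
| Encoding | $u_i = f_Q(x_i)$, $v_i = f_I(x_i)$ | $q = f_Q(x_q)$ |
| Clustering | K-means on $\{u_i\}$: centroids $c_1, \dots, c_K$ | Select top $n_{\text{probe}}$ centroids nearest to $q$ |
| Assignment/Residual | $a_i = \arg\min_k \lVert u_i - c_k \rVert_2$, $r_i = v_i - c_{a_i}$ | As in index |
| Quantization/Indexing | PQ-encode $r_i$, assign to list $a_i$ | PQ-decode $\hat{r}_i$ for candidates |
| Ranking | Build IVF-PQ with centroids + PQ-codes | Score $\langle q, c_{a_i} + \hat{r}_i \rangle$ |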
No additional loss is introduced during indexing; the method depends on prior alignment of the towers (e.g., via input swapping in the SymmAligner module). The system is compatible with standard IVF-PQ codebooks and does not incur extra online latency (Wang et al., 15 Dec 2025).
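The quantization step in the workflow, product quantization of the item-space residuals, can be illustrated with a small NumPy sketch. It is a toy under stated assumptions: the helper names `train_pq`, `pq_encode`, and `pq_decode`, the codebook sizes, and the synthetic residuals are all hypothetical, and a production system would use a library implementation of IVF-PQ instead.

```python
import numpy as np

def train_pq(x, m, ks, iters=15, seed=0):
    """Train one k-means codebook per subspace; returns array (m, ks, d // m)."""
    n, d = x.shape
    assert d % m == 0, "dimension must split evenly into m subspaces"
    assert ks <= 256, "codes are stored as uint8"
    rng = np.random.default_rng(seed)
    sub = x.reshape(n, m, d // m)
    codebooks = np.empty((m, ks, d // m))
    for j in range(m):
        pts = sub[:, j, :]
        cb = pts[rng.choice(n, size=ks, replace=False)].copy()
        for _ in range(iters):
            near = ((pts[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)
            for c in range(ks):
                if np.any(near == c):  # leave empty clusters unchanged
                    cb[c] = pts[near == c].mean(axis=0)
        codebooks[j] = cb
    return codebooks

def pq_encode(x, codebooks):
    """Map each sub-vector to the index of its nearest codeword."""
    m, ks, ds = codebooks.shape
    sub = x.reshape(len(x), m, ds)
    # Distances of every sub-vector to every codeword in its subspace: (n, m, ks).
    d2 = ((sub[:, :, None, :] - codebooks[None, :, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=2).astype(np.uint8)

def pq_decode(codes, codebooks):
    """Reconstruct vectors by concatenating the selected codewords."""
    m = codebooks.shape[0]
    parts = [codebooks[j][codes[:, j]] for j in range(m)]
    return np.concatenate(parts, axis=1)

# Synthetic item-space residuals (illustrative only).
rng = np.random.default_rng(0)
residuals = 0.1 * rng.normal(size=(256, 8))
codebooks = train_pq(residuals, m=4, ks=16)
codes = pq_encode(residuals, codebooks)   # one uint8 per subspace per item
decoded = pq_decode(codes, codebooks)     # approximate residuals used at query time
```

Each item is stored as `m` one-byte codes, so the per-item footprint depends only on the number of subspaces, not on which embedding space the residual was computed in. This is why moving the residual to the item space leaves index size unchanged.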
6. Empirical Results and Engineering Considerations
In industrial-scale deployments, the CI module operates with cluster count $K$, probe number $n_{\text{probe}}$, and PQ codes of 64 bytes per item. Storage overhead is unchanged relative to conventional IVF-PQ, because CI stores the same two structures: cluster centroids (now in query space) and per-item PQ codes (now on item-space residuals). Offline computation requires one additional forward pass of the query tower per document, an acceptable cost because indexing runs offline. Online latency is identical to that of conventional IVF-PQ: the query-time procedure and its complexity in the embedding dimension $d$ are unchanged.
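To make the storage figure concrete, the following back-of-the-envelope calculation estimates index size at 64 bytes per item. Only the 64-byte code length comes from the text; the item count (one billion, as an order of magnitude for "billion-scale"), the number of centroids, and the embedding dimension are assumed values chosen purely for illustration.

```python
def index_bytes(n_items, code_bytes, n_centroids, dim, float_bytes=4):
    """Return (bytes for PQ codes, bytes for float32 coarse centroids)."""
    codes = n_items * code_bytes
    centroids = n_centroids * dim * float_bytes
    return codes, centroids

n_items = 1_000_000_000  # assumed: order of magnitude for "billion-scale"
code_bytes = 64          # from the text: 64-byte PQ codes per item
n_centroids = 65_536     # assumed cluster count
dim = 128                # assumed embedding dimension

codes_b, centroids_b = index_bytes(n_items, code_bytes, n_centroids, dim)
raw_b = n_items * dim * 4            # uncompressed float32 embeddings
compression = raw_b / codes_b

print(f"PQ codes:  {codes_b / 1e9:.1f} GB")
print(f"Centroids: {centroids_b / 1e6:.1f} MB")
print(f"Raw float32 embeddings: {raw_b / 1e9:.1f} GB ({compression:.0f}x larger)")
```

Under these assumptions the codes dominate the footprint and the centroid table is negligible, so changing which tower produces the centroids has no meaningful effect on index size.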
The reported empirical enhancements include:
- MS-MARCO (nprobe=1): Recall@10 improves from 0.2767 to 0.3268 (approx. 18% relative), MRR@100 from 0.1771 to 0.2157.
- MS-MARCO (nprobe=64): MRR@100 increases from 0.4353 to 0.4480.
- Production e-commerce (1M items; 10M interactions): Recall@100 rises by 4% relative, NDCG@100 by 9.3% relative after indexing (Wang et al., 15 Dec 2025).
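The relative gains implied by the MS-MARCO figures above can be recomputed from the absolute values. The snippet below is a simple arithmetic check; the helper name `relative_gain` is introduced here for illustration.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before

# (baseline, CI) pairs as reported in the list above.
reported = {
    "MS-MARCO nprobe=1 Recall@10": (0.2767, 0.3268),
    "MS-MARCO nprobe=1 MRR@100": (0.1771, 0.2157),
    "MS-MARCO nprobe=64 MRR@100": (0.4353, 0.4480),
}

gains = {name: relative_gain(b, a) for name, (b, a) in reported.items()}
for name, g in gains.items():
    print(f"{name}: {g:.1%}")
```

The Recall@10 gain comes out at about 18.1%, matching the stated figure, and the MRR@100 gain at `nprobe=1` is about 21.8%. At `nprobe=64` the MRR@100 gain shrinks to about 2.9%, which suggests the benefit of consistent coarse selection is largest when few lists are probed.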
7. Summary and Significance
Consistent Indexing with Dual-Tower Synergy leverages aligned dual-tower embeddings to construct a two-view hierarchical index, where the coarse clustering and candidate selection are performed strictly in query-tower geometry, eliminating cross-tower spatial inconsistencies. The fine stage retains the full representational richness of the item tower, enabling improved recall and ranking without additional inference latency or storage costs. This approach is provably consistent, lightweight for engineering at billion-item scales, and validated by significant empirical improvements across public and industrial datasets (Wang et al., 15 Dec 2025).