Additive Two-Tower Models
- Additive two-tower models are machine learning or stochastic process architectures where two distinct components encode entities and combine outputs additively.
- A key application is in recommender systems for efficient, scalable retrieval and unbiased learning-to-rank by separately modeling relevance and bias.
- Practical deployment requires careful data collection (like item swaps) and techniques like Inverse Propensity Scoring (IPS) to ensure identifiability and mitigate selection bias.
Additive two-tower models are a class of statistical, machine learning, and stochastic process architectures in which two distinct components—often called "towers"—encode entities (such as users and items, or separate populations), and their outputs interact primarily through additive mechanisms. While modern recommender systems exemplify this approach with neural network-based two-tower models for efficient large-scale retrieval and unbiased learning-to-rank, the additive two-tower paradigm also appears in stochastic multi-type particle systems and population models. Recent research has focused both on theoretical properties (such as identifiability, duality, and convergence) and practical considerations (such as debiasing, interaction modeling, and scalable serving).
1. Formal Definition and Core Principle
At the heart of additive two-tower models is the principle of additive combination of independently parameterized modules. In the canonical learning-to-rank (LTR) setting, two separate models output logits (scores) for relevance and for bias, which are then summed and transformed (typically via a sigmoid) to predict the probability of a user action (e.g., a click):

$$P(\text{click} = 1 \mid q, d, k) = \sigma\big(f(q, d) + g(k)\big)$$

Here,
- $q$, $d$, $k$ indicate the query, document/item, and rank/position;
- $g(k)$ is the output of the bias tower (e.g., position attention);
- $f(q, d)$ is the output of the relevance tower (e.g., content features);
- $\sigma$ denotes the sigmoid function.
This structure decouples the modeling of bias and relevance but links them additively, which underpins both interpretability and effective debiasing, particularly when training from observational (biased) feedback such as clicks (2506.20501).
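A minimal sketch (Python/NumPy) of this additive combination is given below; the bilinear relevance score and the per-position bias logits are illustrative assumptions, not the parameterization of any particular system.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relevance_tower(query_vec, doc_vec, W):
    """Hypothetical relevance tower: a bilinear score f(q, d) = q^T W d."""
    return query_vec @ W @ doc_vec

def bias_tower(position, position_logits):
    """Hypothetical bias tower: one learned logit g(k) per rank position."""
    return position_logits[position]

def click_probability(query_vec, doc_vec, position, W, position_logits):
    """Additive combination: P(click | q, d, k) = sigmoid(f(q, d) + g(k))."""
    return sigmoid(relevance_tower(query_vec, doc_vec, W)
                   + bias_tower(position, position_logits))

# Toy usage with random parameters.
rng = np.random.default_rng(0)
q, d = rng.normal(size=4), rng.normal(size=4)
W = rng.normal(size=(4, 4))
pos_logits = np.array([0.0, -0.7, -1.4, -2.0])  # bias logits decaying with rank
print(click_probability(q, d, position=2, W=W, position_logits=pos_logits))
```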
In the context of multi-type particle systems, additivity refers to the rule that the state evolution from superimposed initial conditions equals the supremum ("join") of the marginal evolutions; for example, for configurations $\eta$ and $\zeta$,

$$\xi_t^{\eta \vee \zeta} = \xi_t^{\eta} \vee \xi_t^{\zeta},$$

where $\vee$ denotes a coordinate-wise join operation (1410.4809).
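The join identity can be checked numerically for a toy OR-based update driven by shared randomness (a stand-in for a graphical representation); the synchronous update rule below is an illustrative assumption, not the specific dynamics studied in (1410.4809).

```python
import numpy as np

def step(config, arrows, survive):
    """One synchronous update built from shared randomness: site i is occupied
    next if it survives and some site j with an arrow j -> i is occupied now.
    OR-based updates of this kind are additive in the sense above."""
    n = config.size
    nxt = np.zeros(n, dtype=bool)
    for i in range(n):
        sources = np.where(arrows[:, i])[0]
        nxt[i] = bool(survive[i]) and bool(config[sources].any())
    return nxt

rng = np.random.default_rng(1)
n = 12
eta = rng.random(n) < 0.3          # configuration eta
zeta = rng.random(n) < 0.3         # configuration zeta
arrows = rng.random((n, n)) < 0.2  # shared random arrows j -> i
survive = rng.random(n) < 0.9      # shared random survival marks

evolved_join = step(eta | zeta, arrows, survive)
join_of_evolutions = step(eta, arrows, survive) | step(zeta, arrows, survive)
print(np.array_equal(evolved_join, join_of_evolutions))  # True: additivity holds
```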
2. Identifiability, Duality, and Theoretical Guarantees
A critical theoretical consideration is model identifiability: whether the parameters of the relevance and bias towers can be uniquely recovered from data. Additive two-tower architectures exhibit a shift invariance: for any constant $c$,

$$\sigma\big((f(q, d) + c) + (g(k) - c)\big) = \sigma\big(f(q, d) + g(k)\big).$$

This implies the model is generically only identifiable up to a constant shift unless further constraints are imposed, such as anchoring parameters (e.g., fixing one position's bias logit, such as $g(1) = 0$), ensuring overlapping feature distributions across positions, or introducing document swaps in the logging policy (2506.20501).
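The shift invariance is easy to verify numerically; the sketch below uses randomly generated tower logits purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
f_qd = rng.normal(size=1000)  # relevance logits f(q, d)
g_k = rng.normal(size=1000)   # bias logits g(k)
c = 3.7                       # an arbitrary constant shift

original = sigmoid(f_qd + g_k)
shifted = sigmoid((f_qd + c) + (g_k - c))
print(np.allclose(original, shifted))  # True: the shift is invisible in the click data
```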
In multi-type additive growth models, duality plays a pivotal role: any additive process has a corresponding dual process, a property enabling rich probabilistic analysis and exact characterization of invariant measures and phase transitions (1410.4809). Positive correlations (FKG property) and monotonicity are determined by specific constraints on transition mappings—systems preserve these properties if every event mapping is monotonic (either non-decreasing or non-increasing in all coordinates).
Complete convergence has also been established: under irreducibility, translation invariance, and the presence of only "productive/destructive" interactions, the process converges in distribution to a mixture of extinction and survival measures.
3. Logging Policies, Debiasing, and Sample Weighting
In unbiased LTR, real-world logs are subject to non-random exposure due to production ranking policies, which can confound parameter recovery. When models are well-specified and identifiable, the logging policy introduces no bias. However, if the model fails to capture true user behavior (misspecification), and the residual error is correlated with exposure probabilities, logging policies can amplify estimation bias (2506.20501).
To address this, sample weighting (Inverse Propensity Scoring, IPS) is employed, reweighting each logged interaction by the inverse of its exposure probability:

$$w(q, d, k) = \frac{1}{p(k \mid q, d)},$$

where $p(k \mid q, d)$ is the empirical or modeled probability that document $d$ appears in position $k$ for query $q$. This approach mitigates selection bias as long as all items have nonzero exposure across positions. The practical implication is that some randomness (swaps, shuffles) or feature overlap between positions is required for reliable debiasing and parameter identification.
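A minimal sketch of an IPS-weighted training objective follows; the binary cross-entropy per-example loss and the propensity clipping for variance control are common choices assumed here, not prescriptions from the cited work.

```python
import numpy as np

def ips_weighted_loss(clicks, p_click, propensity, clip=0.05):
    """IPS-weighted binary cross-entropy.

    clicks     : observed clicks (0/1)
    p_click    : model click probabilities sigmoid(f + g)
    propensity : p(k | q, d), probability the document was shown at its logged position
    clip       : lower bound on propensities to control variance (an assumption)
    """
    w = 1.0 / np.maximum(propensity, clip)
    bce = -(clicks * np.log(p_click) + (1 - clicks) * np.log(1 - p_click))
    return float(np.mean(w * bce))

# Toy usage with synthetic data.
rng = np.random.default_rng(3)
clicks = (rng.random(8) < 0.3).astype(float)
p_click = np.clip(rng.random(8), 1e-6, 1 - 1e-6)
propensity = rng.uniform(0.1, 1.0, size=8)
print(ips_weighted_loss(clicks, p_click, propensity))
```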
4. Structural Extensions: Multi-type, Dual, and Interaction-enhanced Models
Additive two-tower frameworks generalize to multi-type systems and settings where towers represent interacting populations, structured subgroups, or different semantic facets (e.g., user and item embeddings in recommendation, or subpopulations in epidemiological models) (1410.4809). In these cases:
- Each "tower" may be an additive growth model, and their coupling determines overall system dynamics.
- Interactions between towers can be encoded additively if they respect the join operation or additive event coupling.
- Duality theory enables joint analysis, especially for extinction/survival and invariant distributions in large systems.
Recent advances incorporate richer feature interactions and cross-tower information sharing while maintaining computational advantages. For example, modules such as mixed-attention, multi-head representers, and code-based sparse indices provide fine-grained user-item interaction modeling and improved representational flexibility, as in RankTower, HIT, and SparCode models.
5. Efficiency, Scalability, and Industrial Deployment
A central practical appeal of additive two-tower models is their scalability. Item-side (or one-tower) embeddings can be precomputed and cached, so online inference per request only requires user-side computation and efficient similarity evaluation (often a dot product or more refined additive similarity). This design is essential for web-scale retrieval, as seen in online advertising and recommendation systems where serving latency and throughput are critical constraints (2210.09890, 2505.19849).
Extensions such as sparse embedding representations, hierarchical additivity, and code-based indices further reduce memory and compute overhead, enabling real-time serving of billions of impressions. Successful deployment in large-scale systems (e.g., Tencent's ad platform (2505.19849)) validates the industrial viability of additive enhancements.
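The serving pattern can be sketched as follows; the brute-force dot-product scan stands in for the approximate nearest-neighbor index used in practice, and all shapes and values are illustrative.

```python
import numpy as np

# Offline: the item tower is run once over the catalog and its embeddings are cached.
rng = np.random.default_rng(4)
num_items, dim = 100_000, 64
item_embeddings = rng.normal(size=(num_items, dim)).astype(np.float32)

def retrieve_top_k(user_embedding, item_embeddings, k=10):
    """Online: one user-tower forward pass, then a dot-product scan over cached
    item embeddings. In production the scan is replaced by an approximate
    nearest-neighbor index; brute force is shown only to make the data flow explicit."""
    scores = item_embeddings @ user_embedding
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

user_embedding = rng.normal(size=dim).astype(np.float32)
print(retrieve_top_k(user_embedding, item_embeddings, k=5))
```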
6. Applications and Practical Recommendations
Additive two-tower models are used extensively for learning-to-rank with click bias correction, efficient retrieval in matching systems, and as frameworks for modeling multi-population interacting particle systems.
Practical recommendations for researchers and practitioners include:
- Ensure sufficient randomization or document swaps in data collection to guarantee identifiability (see the sketch after this list);
- Monitor feature overlap across positions/ranks, as high-dimensional feature spaces can undermine parameter recovery;
- Apply IPS or other sample reweighting to mitigate selection bias dictated by logging policies;
- Avoid using logging policies that utilize features unavailable to the model to prevent hidden omitted variable bias;
- Leverage domain structure (e.g., productive/destructive transitions, crowding interactions) to ensure convergence and interpretability (1410.4809, 2506.20501).
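As a concrete example of the first recommendation, a hypothetical swap-based logging policy might look like the following sketch; the adjacent-swap rule and swap probability are assumptions for illustration only.

```python
import random

def log_with_swaps(ranking, swap_prob=0.1, seed=0):
    """Hypothetical logging policy: with small probability, swap adjacent
    documents before display, so each document gets nonzero exposure at
    neighboring ranks (supporting propensity estimation and identifiability)."""
    rng = random.Random(seed)
    ranking = list(ranking)
    for i in range(len(ranking) - 1):
        if rng.random() < swap_prob:
            ranking[i], ranking[i + 1] = ranking[i + 1], ranking[i]
    return ranking

print(log_with_swaps(["d1", "d2", "d3", "d4", "d5"], swap_prob=0.5))
```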
7. Connections with Broader Modeling Frameworks
The additive two-tower principles integrate ideas from percolation theory, population biology models with crowding constraints, and duality theory in interacting stochastic systems. Architecturally, they underpin many deep learning frameworks in real-world industrial applications, especially where decoupled, scalable matching is required but bias and complex feature interplay must be rigorously addressed.
A summary of key theoretical and practical aspects is presented below:
| Aspect | Theoretical Result or Principle | Practical Implication |
|---|---|---|
| Additivity & Duality | Additivity ⇔ dual process exists | Enables efficient debiasing/analysis |
| Identifiability | Requires swaps or feature overlap | Collect randomized/swapped data |
| Logging Policy Effect | Amplifies bias when model is misspecified | Use sample weighting, instrument data |
| Positive Correlations | Preserved with monotonic event mappings | Enables coupling and attractiveness |
| Complete Convergence | Holds for productive/destructive interactions | Predicts long-term behavior |
Additive two-tower architectures provide a mathematically principled, efficient, and extensible toolkit for unbiased ranking, recommendation, and processes with structured multi-type populations, contingent on careful attention to identifiability, bias, and real-world serving constraints.