Additive Two-Tower Models
- Additive two-tower models are machine learning or stochastic process architectures where two distinct components encode entities and combine outputs additively.
- A key application is in recommender systems for efficient, scalable retrieval and unbiased learning-to-rank by separately modeling relevance and bias.
- Practical deployment requires careful data collection (like item swaps) and techniques like Inverse Propensity Scoring (IPS) to ensure identifiability and mitigate selection bias.
Additive two-tower models are a class of statistical, machine learning, and stochastic process architectures in which two distinct components—often called "towers"—encode entities (such as users and items, or separate populations), and their outputs interact primarily through additive mechanisms. While modern recommender systems exemplify this approach with neural network-based two-tower models for efficient large-scale retrieval and unbiased learning-to-rank, the additive two-tower paradigm also appears in stochastic multi-type particle systems and population models. Recent research has focused both on theoretical properties (such as identifiability, duality, and convergence) and practical considerations (such as debiasing, interaction modeling, and scalable serving).
1. Formal Definition and Core Principle
At the heart of additive two-tower models is the principle of additive combination of independently parameterized modules. In the canonical learning-to-rank (LTR) setting, two separate models output logits (scores) for relevance and for bias, which are then summed and transformed (typically via a sigmoid) to predict the probability of a user action (e.g., a click):

$$P(\text{click} = 1 \mid q, d, k) = \sigma\big(f(q, d) + g(k)\big)$$

Here,
- $q$, $d$, $k$ indicate the query, document/item, and rank/position;
- $g(k)$ is the output of the bias tower (e.g., position attention);
- $f(q, d)$ is the output of the relevance tower (e.g., content features);
- $\sigma$ denotes the sigmoid function.
This structure decouples the modeling of bias and relevance but links them additively, which underpins both interpretability and effective debiasing, particularly when training from observational (biased) feedback such as clicks (2506.20501).
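A minimal sketch (Python/NumPy) of this additive combination is given below; the bilinear relevance score and the per-position bias logits are illustrative assumptions, not the parameterization of any particular system.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relevance_tower(query_vec, doc_vec, W):
    """Hypothetical relevance tower: a bilinear score f(q, d) = q^T W d."""
    return query_vec @ W @ doc_vec

def bias_tower(position, position_logits):
    """Hypothetical bias tower: one learned logit g(k) per rank position."""
    return position_logits[position]

def click_probability(query_vec, doc_vec, position, W, position_logits):
    """Additive combination: P(click | q, d, k) = sigmoid(f(q, d) + g(k))."""
    return sigmoid(relevance_tower(query_vec, doc_vec, W)
                   + bias_tower(position, position_logits))

# Toy usage with random parameters.
rng = np.random.default_rng(0)
q, d = rng.normal(size=4), rng.normal(size=4)
W = rng.normal(size=(4, 4))
pos_logits = np.array([0.0, -0.7, -1.4, -2.0])  # bias logits decaying with rank
print(click_probability(q, d, position=2, W=W, position_logits=pos_logits))
```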
In the context of multi-type particle systems, additivity refers to the rule that the state evolution from superimposed initial conditions equals the supremum ("join") of the marginal evolutions; for example, for configurations $\eta$ and $\zeta$,

$$\xi_t^{\eta \vee \zeta} = \xi_t^{\eta} \vee \xi_t^{\zeta},$$

where $\vee$ denotes a coordinate-wise join operation (1410.4809).
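The join identity can be checked numerically for a toy OR-based update driven by shared randomness (a stand-in for a graphical representation); the synchronous update rule below is an illustrative assumption, not the specific dynamics studied in (1410.4809).

```python
import numpy as np

def step(config, arrows, survive):
    """One synchronous update built from shared randomness: site i is occupied
    next if it survives and some site j with an arrow j -> i is occupied now.
    OR-based updates of this kind are additive in the sense above."""
    n = config.size
    nxt = np.zeros(n, dtype=bool)
    for i in range(n):
        sources = np.where(arrows[:, i])[0]
        nxt[i] = bool(survive[i]) and bool(config[sources].any())
    return nxt

rng = np.random.default_rng(1)
n = 12
eta = rng.random(n) < 0.3          # configuration eta
zeta = rng.random(n) < 0.3         # configuration zeta
arrows = rng.random((n, n)) < 0.2  # shared random arrows j -> i
survive = rng.random(n) < 0.9      # shared random survival marks

evolved_join = step(eta | zeta, arrows, survive)
join_of_evolutions = step(eta, arrows, survive) | step(zeta, arrows, survive)
print(np.array_equal(evolved_join, join_of_evolutions))  # True: additivity holds
```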
2. Identifiability, Duality, and Theoretical Guarantees
A critical theoretical consideration is model identifiability: whether the parameters of the relevance and bias towers can be uniquely recovered from data. Additive two-tower architectures exhibit a shift invariance: for any constant $c$,

$$\sigma\big((f(q, d) + c) + (g(k) - c)\big) = \sigma\big(f(q, d) + g(k)\big).$$

This implies the model is generically only identifiable up to a constant shift unless further constraints are imposed, such as anchoring parameters (e.g., fixing one position's bias logit, such as $g(1) = 0$), ensuring overlapping feature distributions across positions, or introducing document swaps in the logging policy (2506.20501).
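The shift invariance is easy to verify numerically; the sketch below uses randomly generated tower logits purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
f_qd = rng.normal(size=1000)  # relevance logits f(q, d)
g_k = rng.normal(size=1000)   # bias logits g(k)
c = 3.7                       # an arbitrary constant shift

original = sigmoid(f_qd + g_k)
shifted = sigmoid((f_qd + c) + (g_k - c))
print(np.allclose(original, shifted))  # True: the shift is invisible in the click data
```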
In multi-type additive growth models, duality plays a pivotal role: any additive process has a corresponding dual process, a property enabling rich probabilistic analysis and exact characterization of invariant measures and phase transitions (1410.4809). Positive correlations (FKG property) and monotonicity are determined by specific constraints on transition mappings—systems preserve these properties if every event mapping is monotonic (either non-decreasing or non-increasing in all coordinates).
Complete convergence has also been established: under irreducibility, translation invariance, and the presence of only "productive/destructive" interactions, the process converges in distribution to a mixture of extinction and survival measures.
3. Logging Policies, Debiasing, and Sample Weighting
In unbiased LTR, real-world logs are subject to non-random exposure due to production ranking policies, which can confound parameter recovery. When models are well-specified and identifiable, the logging policy introduces no bias. However, if the model fails to capture true user behavior (misspecification), and the residual error is correlated with exposure probabilities, logging policies can amplify estimation bias (2506.20501).
To address this, sample weighting (Inverse Propensity Scoring, IPS) is employed, reweighting each logged interaction by the inverse of its exposure probability:

$$w(q, d, k) = \frac{1}{p(k \mid q, d)},$$

where $p(k \mid q, d)$ is the empirical or modeled probability that document $d$ appears in position $k$ for query $q$. This approach mitigates selection bias as long as all items have nonzero exposure across positions. The practical implication is that some randomness (swaps, shuffles) or feature overlap between positions is required for reliable debiasing and parameter identification.
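A minimal sketch of an IPS-weighted training objective follows; the binary cross-entropy per-example loss and the propensity clipping for variance control are common choices assumed here, not prescriptions from the cited work.

```python
import numpy as np

def ips_weighted_loss(clicks, p_click, propensity, clip=0.05):
    """IPS-weighted binary cross-entropy.

    clicks     : observed clicks (0/1)
    p_click    : model click probabilities sigmoid(f + g)
    propensity : p(k | q, d), probability the document was shown at its logged position
    clip       : lower bound on propensities to control variance (an assumption)
    """
    w = 1.0 / np.maximum(propensity, clip)
    bce = -(clicks * np.log(p_click) + (1 - clicks) * np.log(1 - p_click))
    return float(np.mean(w * bce))

# Toy usage with synthetic data.
rng = np.random.default_rng(3)
clicks = (rng.random(8) < 0.3).astype(float)
p_click = np.clip(rng.random(8), 1e-6, 1 - 1e-6)
propensity = rng.uniform(0.1, 1.0, size=8)
print(ips_weighted_loss(clicks, p_click, propensity))
```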
4. Structural Extensions: Multi-type, Dual, and Interaction-enhanced Models
Additive two-tower frameworks generalize to multi-type systems and settings where towers represent interacting populations, structured subgroups, or different semantic facets (e.g., user and item embeddings in recommendation, or subpopulations in epidemiological models) (1410.4809). In these cases:
- Each "tower" may be an additive growth model, and their coupling determines overall system dynamics.
- Interactions between towers can be encoded additively if they respect the join operation or additive event coupling.
- Duality theory enables joint analysis, especially for extinction/survival and invariant distributions in large systems.
Recent advances incorporate richer feature interactions and cross-tower information sharing while maintaining computational advantages. For example, modules such as mixed-attention, multi-head representers, and code-based sparse indices provide fine-grained user-item interaction modeling and improved representational flexibility, as in RankTower, HIT, and SparCode models.
5. Efficiency, Scalability, and Industrial Deployment
A central practical appeal of additive two-tower models is their scalability. Item-side (or one-tower) embeddings can be precomputed and cached, so online inference per request only requires user-side computation and efficient similarity evaluation (often a dot product or more refined additive similarity). This design is essential for web-scale retrieval, as seen in online advertising and recommendation systems where serving latency and throughput are critical constraints (2210.09890, 2505.19849).
Extensions such as sparse embedding representations, hierarchical additivity, and code-based indices further reduce memory and compute overhead, enabling real-time serving of billions of impressions. Successful deployment in large-scale systems (e.g., Tencent's ad platform (2505.19849)) validates the industrial viability of additive enhancements.
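The serving pattern can be sketched as follows; the brute-force dot-product scan stands in for the approximate nearest-neighbor index used in practice, and all shapes and values are illustrative.

```python
import numpy as np

# Offline: the item tower is run once over the catalog and its embeddings are cached.
rng = np.random.default_rng(4)
num_items, dim = 100_000, 64
item_embeddings = rng.normal(size=(num_items, dim)).astype(np.float32)

def retrieve_top_k(user_embedding, item_embeddings, k=10):
    """Online: one user-tower forward pass, then a dot-product scan over cached
    item embeddings. In production the scan is replaced by an approximate
    nearest-neighbor index; brute force is shown only to make the data flow explicit."""
    scores = item_embeddings @ user_embedding
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

user_embedding = rng.normal(size=dim).astype(np.float32)
print(retrieve_top_k(user_embedding, item_embeddings, k=5))
```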
6. Applications and Practical Recommendations
Additive two-tower models are used extensively for learning-to-rank with click bias correction, efficient retrieval in matching systems, and as frameworks for modeling multi-population interacting particle systems.
Practical recommendations for researchers and practitioners include:
- Ensure sufficient randomization or document swaps in data collection to guarantee identifiability (see the sketch after this list);
- Monitor feature overlap across positions/ranks, as high-dimensional feature spaces can undermine parameter recovery;
- Apply IPS or other sample reweighting to mitigate selection bias dictated by logging policies;
- Avoid using logging policies that utilize features unavailable to the model to prevent hidden omitted variable bias;
- Leverage domain structure (e.g., productive/destructive transitions, crowding interactions) to ensure convergence and interpretability (1410.4809, 2506.20501).
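As a concrete example of the first recommendation, a hypothetical swap-based logging policy might look like the following sketch; the adjacent-swap rule and swap probability are assumptions for illustration only.

```python
import random

def log_with_swaps(ranking, swap_prob=0.1, seed=0):
    """Hypothetical logging policy: with small probability, swap adjacent
    documents before display, so each document gets nonzero exposure at
    neighboring ranks (supporting propensity estimation and identifiability)."""
    rng = random.Random(seed)
    ranking = list(ranking)
    for i in range(len(ranking) - 1):
        if rng.random() < swap_prob:
            ranking[i], ranking[i + 1] = ranking[i + 1], ranking[i]
    return ranking

print(log_with_swaps(["d1", "d2", "d3", "d4", "d5"], swap_prob=0.5))
```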
7. Connections with Broader Modeling Frameworks
The additive two-tower principles integrate ideas from percolation theory, population biology models with crowding constraints, and duality theory in interacting stochastic systems. Architecturally, they underpin many deep learning frameworks in real-world industrial applications, especially where decoupled, scalable matching is required but bias and complex feature interplay must be rigorously addressed.
A summary of key theoretical and practical aspects is presented below:
| Aspect | Theoretical Result or Principle | Practical Implication |
|---|---|---|
| Additivity & Duality | Additivity ⇔ dual process exists | Enables efficient debiasing/analysis |
| Identifiability | Requires swaps or feature overlap | Collect randomized/swapped data |
| Logging Policy Effect | Amplifies bias when model is misspecified | Use sample weighting, instrument data |
| Positive Correlations | Preserved with monotonic event mappings | Enables coupling and attractiveness |
| Complete Convergence | Holds for productive/destructive interactions | Predicts long-term behavior |
Additive two-tower architectures provide a mathematically principled, efficient, and extensible toolkit for unbiased ranking, recommendation, and processes with structured multi-type populations, contingent on careful attention to identifiability, bias, and real-world serving constraints.