
User & Candidate DualTransformer (UCDT)

Updated 3 December 2025
  • The paper presents a dual-branch Transformer architecture that fuses user history and candidate item features via cross-attention.
  • Its methodology leverages Hierarchical Sequential Transducer Units (HSTUs) to create context-sensitive representations for improved CTR prediction.
  • Empirical results demonstrate significant gains in AUC and live CTR, validating UCDT’s effectiveness in modern ranking systems.

The User and Candidate DualTransformer (UCDT) is a dual-branch architecture designed for fine-grained modeling of user-item-context interactions within deep candidate ranking and reranking frameworks. Introduced as the foundational module within the RIA (Ranking-Infused Architecture) for listwise click-through rate (CTR) prediction, UCDT encapsulates two hierarchical Transformer-style encoding branches—one for users (with contextual history) and one for candidate items—followed by a cross-attention mechanism that fuses signals across user and candidate representations. Through this architectural split and targeted attention, UCDT enables rich, position-aware, and context-sensitive representations, supporting both pointwise CTR estimation and downstream listwise modeling for improved online ranking performance (Zhang et al., 26 Nov 2025).

1. Architecture Overview

UCDT comprises two parallel Transformer-inspired branches that process user–context history and candidate item lists concurrently. Formally, the model ingests:

  • $E^u \in \mathbb{R}^{T \times D}$: an embedding matrix for a sequence of user and context features over $T$ time steps.
  • $X \in \mathbb{R}^{n \times D}$: an embedding matrix for $n$ candidate items.

Each branch employs a stack of Hierarchical Sequential Transducer Unit (HSTU) blocks—miniature Transformers—applied independently:

  • $X' = \mathrm{HSTU}(X)$
  • $E^{u'} = \mathrm{HSTU}(E^u)$

An optional positional embedding matrix $P \in \mathbb{R}^{L \times D_p}$ can be added to inject order-sensitivity.

Subsequently, a target attention module allows each candidate $x'_i$ to perform multi-head cross-attention over the user/context sequence $E^{u'}$:

  • $x''_i = \mathrm{Attention}\big(x'_i, \{e^{u'}_j\}_{j=1}^{T}\big)$

The attended candidate vectors $\{x''_i\}_{i=1}^{n}$ feed both pointwise CTR prediction (via a feed-forward classifier) and deeper listwise modules, for which they serve as preconditioned representations.
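In code, this forward pass reduces to two branch encoders followed by cross-attention. The following minimal PyTorch sketch uses `nn.TransformerEncoderLayer` as a stand-in for an HSTU block and `nn.MultiheadAttention` for target attention; the class name `UCDTSketch`, the tensor shapes, and all hyperparameter values are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the UCDT forward pass: two independent HSTU-style
# branches followed by target (cross) attention from candidates to the
# encoded user/context sequence. Stand-in modules, illustrative sizes.
import torch
import torch.nn as nn


class UCDTSketch(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        # One HSTU-style block per branch; branch weights are not shared.
        self.user_branch = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, batch_first=True)
        self.cand_branch = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, batch_first=True)
        # Target attention: candidates query the encoded user/context history.
        self.target_attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True)

    def forward(self, e_u: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # e_u: (B, T, D) user/context history; x: (B, n, D) candidate items.
        e_u_enc = self.user_branch(e_u)      # E^{u'} = HSTU(E^u)
        x_enc = self.cand_branch(x)          # X'     = HSTU(X)
        x_fused, _ = self.target_attn(       # x''_i  = Attention(x'_i, E^{u'})
            query=x_enc, key=e_u_enc, value=e_u_enc)
        return x_fused                       # (B, n, D) attended candidates


# Example shapes: batch of 4, T = 50 history events, n = 20 candidates, D = 128.
model = UCDTSketch()
out = model(torch.randn(4, 50, 128), torch.randn(4, 20, 128))
print(out.shape)  # torch.Size([4, 20, 128])
```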

2. Mathematical Formalization

Input and Embedding:

  • Candidate list: $X \in \mathbb{R}^{n \times D}$
  • User/context sequence: $E^u \in \mathbb{R}^{T \times D}$
  • Optionally, positional context: $X \leftarrow X + P_{1:n}$, $E^u \leftarrow E^u + P_{1:T}$

HSTU Block (per branch):

  • Self-attention: $H = \mathrm{LayerNorm}(A(X) + X)$, where

$$A(X) = \mathrm{Concat}_h\big(\mathrm{Softmax}\big(QK^\top/\sqrt{d_k}\big)V\big)W^o$$

with per-head projections $Q = XW^q$, $K = XW^k$, $V = XW^v$.

  • Feed-forward: $X' = \mathrm{LayerNorm}(\mathrm{FFN}(H) + H)$
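For concreteness, a direct transcription of these two equations into a PyTorch module (post-norm residual layout) might look as follows; `HSTUBlockSketch` and its default sizes are assumptions for illustration, not the paper's exact HSTU design.

```python
# Transcription of the HSTU-style block equations above (post-norm layout);
# a sketch, not the paper's exact HSTU implementation.
import torch
import torch.nn as nn


class HSTUBlockSketch(nn.Module):
    def __init__(self, d: int = 128, h: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)  # A(.)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x)            # multi-head self-attention A(X)
        h = self.norm1(a + x)                # H  = LayerNorm(A(X) + X)
        return self.norm2(self.ffn(h) + h)   # X' = LayerNorm(FFN(H) + H)
```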

Cross-attention (Target Attention):

  • For candidate $i$:
    • Query: $q = x'_i W^q$
    • Keys/values: $K = E^{u'} W^k$, $V = E^{u'} W^v$
    • Attention weights: $\alpha^t = \mathrm{Softmax}\big(qK^\top/\sqrt{d_k}\big)$
    • Output: $x''_i = \alpha^t V$ (the multi-head variant is analogous)
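Written out for the single-head case, this target attention is only a few lines of code; the multi-head variant would split $D$ into $h$ sub-spaces of width $d_k$. The module name and the bias-free linear projections below are illustrative assumptions.

```python
# Single-head target attention exactly as formalized above; in the multi-head
# variant D is split into h sub-spaces of width d_k. Names are illustrative.
import math
import torch
import torch.nn as nn


class TargetAttentionSketch(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)   # W^q
        self.wk = nn.Linear(d, d, bias=False)   # W^k
        self.wv = nn.Linear(d, d, bias=False)   # W^v
        self.scale = math.sqrt(d)

    def forward(self, x_cand: torch.Tensor, e_user: torch.Tensor) -> torch.Tensor:
        # x_cand: (B, n, D) encoded candidates X'; e_user: (B, T, D) encoded E^{u'}.
        q = self.wq(x_cand)                                    # queries  (B, n, D)
        k = self.wk(e_user)                                    # keys     (B, T, D)
        v = self.wv(e_user)                                    # values   (B, T, D)
        alpha = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        return alpha @ v                                       # x''_i = alpha^t V
```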

Pointwise CTR Prediction:

  • $\hat{y}^p_i = \sigma(\mathrm{MLP}(x''_i))$ for $i = 1, \dots, n$
  • $\mathcal{L}_1 = -\sum_{i=1}^{n}\big[y_i \log \hat{y}^p_i + (1 - y_i)\log(1 - \hat{y}^p_i)\big]$
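A minimal sketch of this head and loss, assuming batched tensors of shape (B, n, D) and a two-layer MLP of hidden width 64 (both illustrative choices):

```python
# Pointwise CTR head and loss L_1: a sigmoid-activated MLP scores each
# attended candidate x''_i; training uses summed binary cross-entropy.
# Hidden width and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, n, D = 4, 20, 128
head = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1))

x_fused = torch.randn(B, n, D)                      # attended candidates x''_i
labels = torch.randint(0, 2, (B, n)).float()        # click labels y_i

y_hat = torch.sigmoid(head(x_fused)).squeeze(-1)    # \hat{y}^p_i, shape (B, n)
loss_1 = F.binary_cross_entropy(y_hat, labels, reduction="sum")  # L_1
```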

3. Hyper-parameters and Implementation Details

Key UCDT tunables include:

  • Embedding dimension $D$ (e.g., 64, 128, 256)
  • Number of HSTU layers (typically $\ell_u = 1$)
  • Number of attention heads $h$ (selected on a development set, e.g., 4, 8, 16)
  • Per-head sub-dimension $d_k = D/h$
  • Dropout probability $p \in [0.1, 0.3]$

All weights in the user and candidate branches remain disjoint except for the input embedding lookup, which may be shared. Computational cost is comparable to that of standard Transformer blocks, scaling with the embedding dimension $D$, the number of heads $h$, and the sequence lengths $n$ and $T$.
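For concreteness, these tunables can be gathered into a single configuration object; the defaults below are one plausible setting within the ranges listed above, not the values selected in the paper.

```python
# The UCDT tunables listed above gathered into one config object; the
# defaults are one plausible setting, not the paper's chosen values.
from dataclasses import dataclass


@dataclass
class UCDTConfig:
    d_model: int = 128             # embedding dimension D
    n_hstu_layers: int = 1         # HSTU layers (l_u), typically 1
    n_heads: int = 8               # attention heads h
    dropout: float = 0.1           # dropout probability p in [0.1, 0.3]
    share_embeddings: bool = True  # share only the input embedding lookup

    @property
    def d_k(self) -> int:
        # Per-head sub-dimension d_k = D / h.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads


cfg = UCDTConfig()
print(cfg.d_k)  # 16
```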

4. Integration within RIA and Downstream Modules

UCDT is designated as the initial encoding and fusion mechanism within the broader RIA pipeline (Zhang et al., 26 Nov 2025). Its outputs serve as the input for:

  • CUHT (Context-aware User History and Target) module: Applies further session-level and position-aware refinement via attention on cached HSTU outputs.
  • LMH (Listwise Multi-HSTU): For each candidate, an adapter MLP transforms $x''_i$ into $t_i$. These are concatenated with context vectors and processed by additional HSTU layers for hierarchical modeling of item dependencies. The final listwise output $m_{I,o}$ is fed through an MLP for listwise pCTR prediction.

The total training objective is the sum of pointwise loss from UCDT and listwise loss from LMH:

$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$$
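The sketch below traces this combined objective end to end under strong simplifying assumptions: a single pooled context vector concatenated to each adapted candidate, arbitrary module widths, and a placeholder for the pointwise loss $\mathcal{L}_1$ produced by the UCDT head. None of these choices are taken from the paper.

```python
# End-to-end sketch of the LMH data flow and the joint objective
# L = L_1 + L_2. The pooled context vector, module widths, and the
# placeholder for loss_1 are assumptions made purely for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, n, D = 4, 20, 128
x_fused = torch.randn(B, n, D)           # x''_i from UCDT
context = torch.randn(B, D)              # assumed pooled context vector

adapter = nn.Linear(D, D)                                     # x''_i -> t_i
listwise_hstu = nn.TransformerEncoderLayer(2 * D, 8, batch_first=True)
listwise_head = nn.Linear(2 * D, 1)

t = adapter(x_fused)                                          # (B, n, D)
t_ctx = torch.cat([t, context.unsqueeze(1).expand(-1, n, -1)], dim=-1)  # (B, n, 2D)
m = listwise_hstu(t_ctx)                                      # listwise HSTU modeling
y_hat_list = torch.sigmoid(listwise_head(m)).squeeze(-1)      # listwise pCTR, (B, n)

labels = torch.randint(0, 2, (B, n)).float()
loss_1 = torch.tensor(0.0)               # placeholder: pointwise loss from UCDT head
loss_2 = F.binary_cross_entropy(y_hat_list, labels, reduction="sum")
total_loss = loss_1 + loss_2             # L = L_1 + L_2
```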

5. Empirical Impact and Comparative Analysis

In large-scale production deployments (e.g., Meituan), integration of UCDT within RIA yields significant gains over baseline models. Notably:

  • On Avito data: AUC increases from 0.7340 (YOLOR baseline) to 0.7380 with RIA incorporating UCDT.
  • On Meituan data: AUC rises from 0.6634 to 0.6665.
  • Live A/B tests: RIA (with UCDT) achieves +1.69% CTR and +4.54% CPM improvements relative to existing production systems.

The reported improvements are attributed to UCDT’s capability for fine-grained user–item-context modeling and seamless bridging between candidate ranking and reranking stages. Note that no explicit “UCDT ablation” is provided; performance gains reflect the aggregated effect of RIA’s modular enhancements (Zhang et al., 26 Nov 2025).

6. Position within the Dual-Encoder and Bi-Encoder Landscape

While UCDT utilizes a dual-branch Transformer mechanism with cross-attention, previous dual-encoder ("bi-encoder") approaches, e.g., for multilingual job-candidate matching (Lavi, 2021), employed separate (potentially weight-shared) Transformer encoders for users and candidates, producing joint embedding spaces optimized via contrastive objectives. UCDT distinguishes itself by integrating deep cross-attention after independent branch encoding and by using hierarchical blocks specialized for sequential and set-based candidate structures. This architectural progression expands modeling capacity for context, order, and user history while respecting the scalability constraints of industrial recommender systems.

7. Concluding Remarks and Future Directions

UCDT exemplifies a new class of interacting dual-branch architectures in deep ranking, directly addressing the need for joint, contextually conditioned user–candidate representations within large-scale CTR and recommendation pipelines. Its modularity allows it to transfer contextual knowledge efficiently between ranking and reranking phases, with clear empirical benefits in both offline metrics and real-world serving environments. Further investigation of independent ablations, parameter sharing strategies, and adaptation to domains beyond advertising and recommendation remains an open trajectory for research in this class of architectures (Zhang et al., 26 Nov 2025).
