OneTrans: Unified Transformer for RecSys

Updated 31 October 2025
  • OneTrans is a unified Transformer-based framework that integrates user behavior sequences with static feature interactions into a single flat token sequence.
  • It utilizes a unified tokenizer and custom Transformer blocks with mixed parameterization to enable simultaneous sequence and feature modeling.
  • System-level innovations like cross-request key/value caching and pyramid stacking drive efficient scaling and statistically significant improvements in key business metrics.

OneTrans is a unified Transformer-based framework for industrial-scale recommender systems that simultaneously models both user-behavior sequences and feature interactions in a single architecture. OneTrans departs from conventional RecSys pipelines that use separate modules for sequence modeling (e.g., user click/purchase histories) and static/contextual feature interaction (e.g., user/item/context), and instead encodes both modalities as a flat token sequence processed by a shared stack of Transformer layers. This approach enables richer cross-modality interactions, more efficient scaling and optimization, and is supported by system-level innovations such as cross-request key/value caching. OneTrans demonstrates superior performance and scaling properties on massive real-world recommendation data, including statistically significant lifts in key business metrics.

1. Background: Feature Interaction and Sequence Modeling in Recommenders

Industrial recommender systems typically employ two separate module families: feature-interaction networks (e.g., Wukong, RankMixer) that model static attributes, and user-behavior sequence models (e.g., LONGER) that process chronological event histories. The "encode-then-interaction" paradigm introduces a modular bottleneck that restricts bidirectional information flow, limits joint optimization, and reduces hardware efficiency when scaling to very large models.

OneTrans unifies these paradigms by constructing a direct 1D token sequence representing both user behavior history and static/contextual features, encoded with a single backbone architecture. This allows sequence behaviors and feature tokens to interact within every Transformer layer, automatically cross-pollinating representations and unlocking joint optimization at scale.

2. Unified Tokenizer and Input Encoding

OneTrans utilizes a unified tokenizer that transforms both sequential and non-sequential attributes into a flat token sequence $\mathbf{X}^{(0)} \in \mathbb{R}^{(L_{\mathrm{S}} + L_{\mathrm{NS}})\times d}$, where $L_{\mathrm{S}}$ and $L_{\mathrm{NS}}$ are the counts of sequential (S-tokens) and non-sequential (NS-tokens) features, respectively.

  • S-tokens: Encode chronological event histories, such as user clicks and purchases, optionally including learnable [SEP] tokens to separate behavior modalities.
  • NS-tokens: Map static features (user, item, context) to dedicated tokens. Two strategies are provided:
    • Group-wise tokenization: Manual partitioning into feature groups and mapping each to an embedding.
    • Auto-Split tokenization: Concatenate all features, pass them through a multi-layer perceptron (MLP), and split the output into token embeddings; empirical results indicate this strategy performs best.

This encoding allows every input to be processed as a sequence, supporting homogeneous and heterogeneous features with equal efficiency.
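
As a concrete illustration of this encoding, the sketch below builds the flat sequence $\mathbf{X}^{(0)}$ from pre-embedded S-tokens and Auto-Split NS-tokens. The class name `AutoSplitTokenizer`, the layer sizes, and the ReLU MLP are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AutoSplitTokenizer(nn.Module):
    """Auto-Split tokenization sketch: concatenate all non-sequential (NS)
    feature embeddings, pass them through an MLP, and split the output into
    L_NS tokens of width d. Layer sizes and names are illustrative."""
    def __init__(self, ns_input_dim: int, num_ns_tokens: int, d_model: int, hidden: int = 512):
        super().__init__()
        self.num_ns_tokens, self.d_model = num_ns_tokens, d_model
        self.mlp = nn.Sequential(
            nn.Linear(ns_input_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_ns_tokens * d_model),
        )

    def forward(self, ns_features: torch.Tensor) -> torch.Tensor:
        # ns_features: [B, ns_input_dim] -> NS-tokens: [B, L_NS, d]
        return self.mlp(ns_features).view(-1, self.num_ns_tokens, self.d_model)

# Toy example: 32 pre-embedded behavior tokens (S-tokens) plus 8 NS-tokens, d = 64.
B, L_S, L_NS, d = 4, 32, 8, 64
s_tokens = torch.randn(B, L_S, d)                         # behavior-sequence embeddings
ns_tokens = AutoSplitTokenizer(256, L_NS, d)(torch.randn(B, 256))
x0 = torch.cat([s_tokens, ns_tokens], dim=1)              # X^(0): [B, L_S + L_NS, d]
```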

3. OneTrans Block Architecture: Parameter Sharing and Causal Attention

OneTrans processes the input token sequence with a stack of custom Transformer layers (OneTrans blocks) implementing "mixed parameterization":

  • S-tokens: Share Q/K/V and feed-forward (FFN) parameters across all sequential tokens (reflecting the homogeneous structure of behaviors).
  • NS-tokens: Assign token-specific Q/K/V and FFN parameters to model the heterogeneity of static/contextual features.

Mathematically:

$$
\mathbf{W}^{\Psi}_i =
\begin{cases}
\mathbf{W}^{\Psi}_{\mathrm{S}}, & i \leq L_{\mathrm{S}} \\
\mathbf{W}^{\Psi}_{\mathrm{NS},i}, & i > L_{\mathrm{S}}
\end{cases}
\qquad \text{for } \Psi \in \{Q, K, V\}
$$
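
Read concretely, the case expression above selects one shared projection for the first $L_{\mathrm{S}}$ positions and a position-specific projection for each NS position. A minimal sketch of such a mixed projection follows; the class name and shapes are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MixedQKVProjection(nn.Module):
    """One projection type (Q, K, or V) with mixed parameterization:
    a single shared weight W_S for every S-token and a separate weight
    W_{NS,i} for each NS-token. Names are illustrative."""
    def __init__(self, num_ns_tokens: int, d_model: int):
        super().__init__()
        self.shared_s = nn.Linear(d_model, d_model)          # W^Psi_S      (i <= L_S)
        self.per_ns = nn.ModuleList(                         # W^Psi_{NS,i} (i > L_S)
            nn.Linear(d_model, d_model) for _ in range(num_ns_tokens)
        )

    def forward(self, x: torch.Tensor, num_s_tokens: int) -> torch.Tensor:
        # x: [B, L_S + L_NS, d] -> projected tokens of the same shape
        s_proj = self.shared_s(x[:, :num_s_tokens])
        ns_proj = torch.stack(
            [proj(x[:, num_s_tokens + i]) for i, proj in enumerate(self.per_ns)],
            dim=1,
        )
        return torch.cat([s_proj, ns_proj], dim=1)

# Example: the Q projection for a 40-token sequence (32 S-tokens + 8 NS-tokens).
proj_q = MixedQKVProjection(num_ns_tokens=8, d_model=64)
q = proj_q(torch.randn(4, 40, 64), num_s_tokens=32)          # [4, 40, 64]
```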

Each block applies causal attention:

  • S-tokens attend only to previous S-tokens (autoregressive over behaviors).
  • NS-tokens attend to all S-tokens and earlier NS-tokens, aggregating behavioral and static context.

Block computation follows:

$$
\begin{aligned}
\mathbf{Z}^{(n)} &= \mathrm{MixedMHA}\!\left(\mathrm{Norm}\!\left(\mathbf{X}^{(n-1)}\right)\right) + \mathbf{X}^{(n-1)} \\
\mathbf{X}^{(n)} &= \mathrm{MixedFFN}\!\left(\mathrm{Norm}\!\left(\mathbf{Z}^{(n)}\right)\right) + \mathbf{Z}^{(n)}
\end{aligned}
$$
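
Combining the attention rules with the block equations, a minimal pre-norm block could look as follows. Because NS-tokens are placed after S-tokens in the flat sequence, a standard lower-triangular mask already realizes both rules. Single-head attention with shared projections and a shared FFN stand in for MixedMHA/MixedFFN here for brevity; all names are illustrative.

```python
import torch
import torch.nn as nn

def onetrans_attention_mask(L_S: int, L_NS: int) -> torch.Tensor:
    """Boolean mask (True = may attend). Since NS-tokens follow S-tokens in the
    flat sequence, a lower-triangular mask lets S-tokens attend only to earlier
    S-tokens, while NS-tokens see every S-token and earlier NS-tokens."""
    L = L_S + L_NS
    return torch.tril(torch.ones(L, L, dtype=torch.bool))

class OneTransBlockSketch(nn.Module):
    """Pre-norm residual block: Z = Attn(Norm(X)) + X, X' = FFN(Norm(Z)) + Z."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        scores = self.q(h) @ self.k(h).transpose(-2, -1) / x.size(-1) ** 0.5
        scores = scores.masked_fill(~attn_mask, float("-inf"))
        z = torch.softmax(scores, dim=-1) @ self.v(h) + x      # Z^(n)
        return self.ffn(self.norm2(z)) + z                     # X^(n)

# One layer over the toy 40-token sequence (32 S-tokens + 8 NS-tokens).
L_S, L_NS, d = 32, 8, 64
block = OneTransBlockSketch(d)
x1 = block(torch.randn(4, L_S + L_NS, d), onetrans_attention_mask(L_S, L_NS))
```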

"Pyramid stacking" progressively reduces S-token count at deeper layers to condense long-term behavioral information, improving compute efficiency.

4. Serving Optimization: Cross-Request KV Caching and LLM Hardware Techniques

In production environments, a single user request yields many recommendation candidates sharing the same sequential context. Cross-request key/value caching separates computation into two stages:

  • Stage I: Compute and cache key/value states for all S-tokens once per request.
  • Stage II: For each candidate NS-token set, only the required NS-token operations are computed, reusing the cached S-token states.

This reduces the serving-time cost of S-token computation from $O(C)$ (where $C$ is the number of candidates) to $O(1)$ per request. Incremental updates are supported for append-only behavioral streams.
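
The following sketch illustrates the two-stage split for a single attention layer. The cache layout, the `_Proj` stand-in module, and the function names are assumptions for illustration, not the production implementation.

```python
import torch
import torch.nn as nn

class _Proj(nn.Module):
    """Stand-in Q/K/V projections for one attention layer (illustration only)."""
    def __init__(self, d: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

def stage_one_cache_s_tokens(layer: _Proj, s_tokens: torch.Tensor) -> dict:
    """Stage I: compute and cache K/V for all S-tokens once per request."""
    return {"k": layer.k(s_tokens), "v": layer.v(s_tokens)}

def stage_two_attend_candidate(layer: _Proj, cache: dict, ns_tokens: torch.Tensor) -> torch.Tensor:
    """Stage II: per candidate, project only the NS-tokens and attend over the
    cached S-token K/V concatenated with the fresh NS-token K/V."""
    q_ns = layer.q(ns_tokens)                                    # [B, L_NS, d]
    k_all = torch.cat([cache["k"], layer.k(ns_tokens)], dim=1)   # [B, L_S + L_NS, d]
    v_all = torch.cat([cache["v"], layer.v(ns_tokens)], dim=1)
    scores = q_ns @ k_all.transpose(-2, -1) / ns_tokens.size(-1) ** 0.5
    # NS-tokens may see every S-token but only earlier/current NS-tokens,
    # so only the NS-vs-NS block of the score matrix needs a causal mask.
    L_S, L_NS = cache["k"].size(1), ns_tokens.size(1)
    ns_mask = torch.tril(torch.ones(L_NS, L_NS, dtype=torch.bool))
    mask = torch.cat([torch.ones(L_NS, L_S, dtype=torch.bool), ns_mask], dim=1)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v_all

# Per request: cache S-token K/V once (Stage I), then score each candidate (Stage II).
layer = _Proj(64)
cache = stage_one_cache_s_tokens(layer, torch.randn(1, 32, 64))
for _ in range(3):                                               # three candidates
    out = stage_two_attend_candidate(layer, cache, torch.randn(1, 8, 64))
```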

Further, OneTrans leverages serving and training techniques from the LLM stack (a minimal illustration follows the list below):

  • FlashAttention-2 for low-memory, fast attention computation.
  • Mixed-precision inference and activation recomputation for higher batch size and deeper models.
  • Memory-optimized pyramid stacks.
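
As one minimal illustration (not the paper's code), PyTorch's `scaled_dot_product_attention` dispatches to fused FlashAttention-style kernels on supported GPUs, and `torch.autocast` provides mixed-precision execution; the shapes below are toy assumptions.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch 4, 8 heads, 40 tokens (32 S + 8 NS), head dim 64.
device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(4, 8, 40, 64, device=device)
k, v = torch.randn_like(q), torch.randn_like(q)

# Mixed-precision inference: run the attention in bfloat16 under autocast.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    # is_causal=True applies a lower-triangular mask, matching the flat
    # [S-tokens ; NS-tokens] ordering; on supported GPUs this call uses
    # fused FlashAttention-style kernels.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```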

5. Empirical Results and Scaling Behavior

OneTrans was validated on extensive industrial logs (e.g., 29.1B impressions, 27.9M users), with key metrics including ATR and CVR AUC/UAUC. Compared to strong baselines (DCNv2+DIN, RankMixer+Transformer, LONGER), OneTrans exhibited better scaling, accuracy, and efficiency.

Table: Offline Performance Summary

Model        ATR AUC   ATR UAUC   CVR AUC   CVR UAUC
OneTrans-S   +1.13%    +1.77%     +0.90%    +1.66%
OneTrans-L   +1.53%    +2.79%     +1.14%    +3.23%

Accuracy scales near-log-linearly with model depth and width, following a scaling-law trend, in contrast to previous architectures.

Ablation studies confirm:

  • The Auto-Split tokenizer performs best.
  • Mixed parameterization is beneficial.
  • Pyramid stacking aids hardware efficiency with negligible loss in accuracy.

6. Production A/B Test Outcomes and Business Impact

Live A/B tests deployed OneTrans-L in major production scenarios against mature RankMixer+Transformer baselines:

Table: Online Business Metrics

Scenario   GMV/user   Order/user   Latency   Cold-start (order/user)
Feeds      +5.68%     +4.35%       -3.91%    +13.59%
Mall       +3.67%     +2.58%       -3.26%    n.a.

Statistical significance was confirmed ($p<0.05$ or $p<0.01$). The +5.68% GMV/user lift in Feeds is notable, indicating robust business impact.

Additional observations include strong generalization (e.g., increased user active days) and improved performance in cold-start scenarios.

7. Architectural Diagram

Below is an abstraction based on the structure described:

                     OneTrans Overview

  User Behavior Sequences  +  Context/Item/User Features
                |
                v
  Unified Tokenizer --> [ S-tokens ; NS-tokens ]   (flat 1D sequence)
                |
                v
  Stack of OneTrans Blocks
  (pyramid stacking, mixed parameterization, causal attention)
                |
                v
  Prediction Head(s)  <-- Cross-Request KV Caching

8. Significance and Implications

OneTrans establishes a paradigm wherein RecSys ranking and matching tasks are recast as unified Transformer modeling problems, yielding benefits in scalability, accuracy, optimization efficiency, and overall system flexibility. The ability to jointly model sequence and feature interactions enables richer representations and direct exploitation of hardware acceleration techniques. The demonstrated empirical and business impact on large-scale recommender environments positions OneTrans as a reference architecture for future developments integrating large Transformer backbones in industrial recommendation tasks.

A plausible implication is that further advances in recommender system performance may come from increased unification of modalities and cross-pollination from LLM system optimizations, as exemplified by OneTrans (Zhang et al., 30 Oct 2025).
