OneTrans: Unified Transformer for RecSys
- OneTrans is a unified Transformer-based framework that integrates user behavior sequences with static feature interactions into a single flat token sequence.
- It utilizes a unified tokenizer and custom Transformer blocks with mixed parameterization to enable simultaneous sequence and feature modeling.
- System-level innovations like cross-request key/value caching and pyramid stacking drive efficient scaling and statistically significant improvements in key business metrics.
OneTrans is a unified Transformer-based framework for industrial-scale recommender systems that simultaneously models both user-behavior sequences and feature interactions in a single architecture. OneTrans departs from conventional RecSys pipelines that use separate modules for sequence modeling (e.g., user click/purchase histories) and static/contextual feature interaction (e.g., user/item/context), and instead encodes both modalities as a flat token sequence processed by a shared stack of Transformer layers. This approach enables richer cross-modality interactions and more efficient scaling and optimization, and it is supported by system-level innovations such as cross-request key/value caching. OneTrans demonstrates superior performance and scaling properties on massive real-world recommendation data, including statistically significant lifts in key business metrics.
1. Background: Feature Interaction and Sequence Modeling in Recommenders
Industrial recommender systems typically employ two separate module families: feature-interaction networks (e.g., Wukong, RankMixer) that model static attributes, and user-behavior sequence models (e.g., LONGER) that process chronological event histories. The "encode-then-interaction" paradigm introduces a modular bottleneck that restricts bidirectional information flow, limits joint optimization, and reduces hardware efficiency when scaling to very large models.
OneTrans unifies these paradigms by constructing a direct 1D token sequence representing both user behavior history and static/contextual features, encoded with a single backbone architecture. This allows sequence behaviors and feature tokens to interact within every Transformer layer, automatically cross-pollinating representations and unlocking joint optimization at scale.
2. Unified Tokenizer and Input Encoding
OneTrans utilizes a unified tokenizer that transforms both sequential and non-sequential attributes into a flat token sequence of length $N_S + N_{NS}$, where $N_S$ and $N_{NS}$ are the counts of sequential (S-tokens) and non-sequential (NS-tokens) features, respectively.
- S-tokens: Encode chronological event histories, such as user clicks and purchases, optionally including learnable [SEP] tokens to separate behavior modalities.
- NS-tokens: Map static features (user, item, context) to dedicated tokens. Two strategies are provided:
- Group-wise tokenization: Manual partitioning into feature groups and mapping each to an embedding.
- Auto-Split tokenization: Concatenate all features, pass them through a multi-layer perceptron (MLP), and split the output into token embeddings; empirical results indicate this variant performs best.
This encoding allows every input to be processed as a sequence, supporting homogeneous and heterogeneous features with equal efficiency.
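To make the Auto-Split path concrete, here is a minimal PyTorch sketch; the class name `AutoSplitTokenizer`, the MLP shape, and all dimensions are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class AutoSplitTokenizer(nn.Module):
    """Hypothetical sketch: concatenate all static feature embeddings, pass them
    through an MLP, and split the output into NS-tokens (Auto-Split strategy)."""
    def __init__(self, ns_input_dim: int, n_ns_tokens: int, d_model: int):
        super().__init__()
        self.n_ns_tokens, self.d_model = n_ns_tokens, d_model
        self.mlp = nn.Sequential(
            nn.Linear(ns_input_dim, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, n_ns_tokens * d_model),
        )

    def forward(self, ns_features: torch.Tensor) -> torch.Tensor:
        # ns_features: [batch, ns_input_dim] = concatenation of all static feature embeddings
        out = self.mlp(ns_features)                          # [batch, n_ns_tokens * d_model]
        return out.view(-1, self.n_ns_tokens, self.d_model)  # [batch, n_ns_tokens, d_model]

def build_token_sequence(s_tokens: torch.Tensor, ns_tokens: torch.Tensor) -> torch.Tensor:
    """Flatten sequential (S) and non-sequential (NS) tokens into one 1D token sequence."""
    # s_tokens:  [batch, n_s, d_model]   (embedded behavior events, e.g. clicks/purchases)
    # ns_tokens: [batch, n_ns, d_model]  (output of the Auto-Split tokenizer)
    return torch.cat([s_tokens, ns_tokens], dim=1)           # [batch, n_s + n_ns, d_model]

# Illustrative usage with made-up sizes:
tok = AutoSplitTokenizer(ns_input_dim=256, n_ns_tokens=8, d_model=64)
ns_tokens = tok(torch.randn(32, 256))                                 # 8 NS-tokens per example
seq = build_token_sequence(torch.randn(32, 100, 64), ns_tokens)       # [32, 108, 64]
```

Splitting one wide MLP output into several NS-tokens avoids hand-designed feature groups while still giving the Transformer multiple tokens of static context to attend over.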
3. OneTrans Block Architecture: Parameter Sharing and Causal Attention
OneTrans processes the input token sequence with a stack of custom Transformer layers (OneTrans blocks) implementing "mixed parameterization":
- S-tokens: Share Q/K/V and feed-forward (FFN) parameters across all sequential tokens (reflecting the homogeneous structure of behaviors).
- NS-tokens: Assign token-specific Q/K/V and FFN parameters to model the heterogeneity of static/contextual features.
Mathematically, all S-tokens share one set of projections (and one FFN), while each NS-token has its own:

$$Q_t, K_t, V_t = x_t W^{Q}_{S},\; x_t W^{K}_{S},\; x_t W^{V}_{S} \;\;\text{(S-token } x_t\text{)}, \qquad Q_i, K_i, V_i = x_i W^{Q}_{i},\; x_i W^{K}_{i},\; x_i W^{V}_{i} \;\;\text{(NS-token } x_i\text{)}$$

Each block applies causal attention:
- S-tokens attend only to previous S-tokens (autoregressive over behaviors).
- NS-tokens attend to all S-tokens and earlier NS-tokens, aggregating behavioral and static context.
Block computation follows the usual attention-then-FFN pattern, with the parameterization above selecting shared weights for S-tokens and token-specific weights for NS-tokens. "Pyramid stacking" progressively reduces the S-token count at deeper layers to condense long-term behavioral information and improve compute efficiency.
4. Serving Optimization: Cross-Request KV Caching and LLM Hardware Techniques
In production environments, a single user request yields many recommendation candidates sharing the same sequential context. Cross-request key/value caching separates computation into two stages:
- Stage I: Compute and cache key/value states for all S-tokens once per request.
- Stage II: For each candidate NS-token set, only the required NS-token operations are computed, reusing the cached S-token states.
This reduces serving cost: instead of recomputing the long S-token prefix for each of the $B$ candidates in a request, it is computed once, and only the lightweight NS-token operations are repeated per candidate. Incremental updates are supported for append-only behavioral streams.
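A self-contained sketch of the two-stage flow under this scheme; the class `CachedCandidateScorer` and its single-projection attention are simplifications invented for illustration, not the production implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedCandidateScorer(nn.Module):
    """Minimal sketch of cross-request KV caching: S-token keys/values are computed
    once per request (Stage I) and reused for every candidate (Stage II)."""
    def __init__(self, d: int):
        super().__init__()
        self.kv_s = nn.Linear(d, 2 * d)   # shared S-token K/V projection
        self.q_ns = nn.Linear(d, d)       # NS-token query projection (simplified: shared)
        self.head = nn.Linear(d, 1)       # prediction head
        self.d = d

    def stage1_cache(self, s_tokens: torch.Tensor):
        # s_tokens: [1, n_s, d] -- the user's behavior sequence, shared by all candidates
        k, v = self.kv_s(s_tokens).chunk(2, dim=-1)
        return k, v

    def stage2_score(self, ns_tokens: torch.Tensor, cache) -> torch.Tensor:
        # ns_tokens: [1, n_ns, d] -- per-candidate work touches only the NS-tokens
        k, v = cache
        q = self.q_ns(ns_tokens)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        h = attn @ v                        # NS-tokens attend to cached S-token states
        return self.head(h.mean(dim=1))     # pooled candidate score

# Usage: one Stage I pass, then one cheap Stage II pass per candidate.
model = CachedCandidateScorer(d=64)
cache = model.stage1_cache(torch.randn(1, 512, 64))                               # Stage I: once per request
scores = [model.stage2_score(torch.randn(1, 8, 64), cache) for _ in range(100)]   # Stage II: per candidate
```

The expensive Stage I pass over the long behavior sequence runs once per request, while each candidate pays only for its small NS-token block.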
Further, OneTrans adopts system-level techniques from the LLM stack:
- FlashAttention-2 for low-memory, fast attention computation.
- Mixed-precision inference and activation recomputation for higher batch size and deeper models.
- Memory-optimized pyramid stacking.
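As a rough, generic illustration of how such techniques are commonly wired together in PyTorch (not OneTrans's actual training/serving stack), assuming `blocks` is any list of Transformer modules:

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_memory_savings(blocks, x):
    """Generic sketch: bf16 autocast for mixed precision plus activation
    recomputation, so deeper stacks and larger batches fit in memory.
    (Inside each block, F.scaled_dot_product_attention can dispatch to
    FlashAttention kernels when they are available.)"""
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        for block in blocks:
            # Recompute this block's activations in the backward pass instead of storing them.
            x = checkpoint(block, x, use_reentrant=False)
    return x
```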
5. Empirical Results and Scaling Behavior
OneTrans was validated on extensive industrial logs (e.g., 29.1B impressions, 27.9M users), with key metrics including ATR and CVR AUC/UAUC. Compared to strong baselines (DCNv2+DIN, RankMixer+Transformer, LONGER), OneTrans exhibited better scaling, accuracy, and efficiency.
Table: Offline Performance Summary (relative AUC/UAUC improvement over the baseline)
| Model | ATR AUC | ATR UAUC | CVR AUC | CVR UAUC |
|---|---|---|---|---|
| OneTrans-S | +1.13% | +1.77% | +0.90% | +1.66% |
| OneTrans-L | +1.53% | +2.79% | +1.14% | +3.23% |
Accuracy scales near-log-linearly with model depth and width, following an approximate scaling law that the baseline architectures do not match.
Ablation studies confirm:
- The Auto-Split tokenizer outperforms group-wise tokenization.
- Mixed parameterization is beneficial.
- Pyramid stacking improves hardware efficiency with negligible loss in accuracy.
6. Production A/B Test Outcomes and Business Impact
Live A/B tests deployed OneTrans-L in major production scenarios against mature RankMixer+Transformer baselines:
Table: Online Business Metrics
| Scenario | GMV/user | Order/user | Latency change | Cold-start (order/user) |
|---|---|---|---|---|
| Feeds | +5.68% | +4.35% | -3.91% | +13.59% |
| Mall | +3.67% | +2.58% | -3.26% | n.a. |
All reported lifts were statistically significant. The +5.68% lift in GMV/user in Feeds is particularly notable, indicating robust business impact.
Additional observations include strong generalization (e.g., increased user active days) and improved performance in cold-start scenarios.
7. Architectural Diagram
Below is an abstraction based on the structure described:
```
+---------------------------------------------------------+
|                    OneTrans Overview                     |
+---------------------------------------------------------+
|  User Behavior Sequences    --+                          |
|  Context/Item/User Features --+                          |
|                               v                          |
|  Unified Tokenizer --> [ S-tokens ; NS-tokens ] (1D seq) |
|                               |                          |
|                               v                          |
|  Stack of OneTrans Blocks                                |
|  (Pyramid Stacking, Mixed Params, Causal Attention)      |
|                               |                          |
|                               v                          |
|  Prediction Head(s)  <--  Cross-Request KV Caching       |
+---------------------------------------------------------+
```
8. Significance and Implications
OneTrans establishes a paradigm wherein RecSys ranking and matching tasks are recast as unified Transformer modeling problems, yielding benefits in scalability, accuracy, optimization efficiency, and overall system flexibility. The ability to jointly model sequence and feature interactions enables richer representations and direct exploitation of hardware acceleration techniques. The demonstrated empirical and business impact on large-scale recommender environments positions OneTrans as a reference architecture for future developments integrating large Transformer backbones in industrial recommendation tasks.
A plausible implication is that further advances in recommender system performance may come from increased unification of modalities and cross-pollination from LLM system optimizations, as exemplified by OneTrans (Zhang et al., 30 Oct 2025).