TokenMixer-Large: Scalable Recommender Backbone
- TokenMixer-Large is a scalable backbone for industrial recommender systems, leveraging symmetric mixing-and-reverting blocks and enhanced residual strategies.
- It employs Sparse-PerToken MoE and hardware-aware operator fusion to achieve significant ΔAUC improvements and reduced FLOPs in large-scale evaluations.
- Empirical results show that deeper, sparsely activated models enable efficient training and deployment, scaling up to 15 billion parameters.
TokenMixer-Large is a high-performance, highly scalable backbone for industrial ranking in large-scale recommender systems. It builds upon the initial TokenMixer (RankMixer) architecture, addressing design limitations in residual pathways, model depth, Mixture of Experts (MoE) sparsification, and scalability. TokenMixer-Large introduces a symmetric mixing-and-reverting block, enhanced residual and auxiliary loss strategies, Sparse-PerToken MoE with “Sparse Train–Sparse Infer”, and hardware-aware operator fusion, enabling deployment at scales up to 15 billion parameters with substantial gains in computational efficiency and empirical outcomes on both offline datasets and live traffic at ByteDance (Jiang et al., 6 Feb 2026).
1. Architectural Foundation and Innovations
1.1 Mixing-and-Reverting Operation
TokenMixer-Large modifies the original TokenMixer’s input mixing by implementing a symmetric two-stage block that ensures the input and output of each layer retain dimensions in $\mathbb{R}^{T \times d}$ (for $T$ tokens of width $d$), supporting effective residual connection alignment. The core stages are:
- Mixing: The input $X \in \mathbb{R}^{T \times d}$ is split into groups along the feature dimension, permuted across tokens, and concatenated into $X' = \mathrm{Mix}(X) \in \mathbb{R}^{T \times d}$. A parameter-isolated SwiGLU (pSwiGLU) nonlinearity is applied, followed by RMSNorm and an additive residual: $H = \mathrm{RMSNorm}(\mathrm{pSwiGLU}(X')) + X'$.
- Reverting: $H$ is reshaped back by the inverse permutation, $H' = \mathrm{Revert}(H) \in \mathbb{R}^{T \times d}$, followed by another pSwiGLU and residual: $Y = \mathrm{RMSNorm}(\mathrm{pSwiGLU}(H')) + H'$.
This design solves the semantic misalignment inherent to the original TokenMixer, where residuals could not meaningfully propagate unless the layer consistently preserved token counts and mixing structure.
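The shape-preserving mix/revert permutation can be sketched in numpy; the grouped-transpose pattern, token count, and dimensions below are illustrative assumptions, not the paper's exact operator:

```python
import numpy as np

def mix(x, T):
    # x: (T, d). Split each token's d dims into T groups and regroup so that
    # new token j gathers group j from every original token (all-to-all mix).
    Tn, d = x.shape
    assert Tn == T and d % T == 0
    g = x.reshape(T, T, d // T)                 # (token, group, chunk)
    return g.transpose(1, 0, 2).reshape(T, d)   # groups become tokens

def revert(x, T):
    # The grouped transpose is its own inverse, restoring token order,
    # so the block's input and output shapes always match for residuals.
    return mix(x, T)

T, d = 4, 16
x = np.arange(T * d, dtype=np.float32).reshape(T, d)
assert mix(x, T).shape == (T, d)
assert np.allclose(revert(mix(x, T), T), x)
```

Because both stages preserve the $(T, d)$ layout, a residual added after either stage connects semantically aligned positions.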
1.2 Inter-Layer Residuals and Auxiliary Loss
TokenMixer-Large supports training of deep stacks of blocks by augmenting per-block residuals with:
- Interval Residuals: Adding the block input from $k$ layers earlier to every $k$-th layer’s output (typically $k = 2$ or $3$), facilitating upward flow of low-level features.
- Auxiliary Losses: Layer-wise auxiliary classification losses imposed at intermediate logits, i.e.
$$\mathcal{L} = \mathcal{L}_{\mathrm{main}} + \lambda \sum_{l \in \mathcal{S}} \mathcal{L}_{\mathrm{CE}}(\hat{y}_l,\, y),$$
where $\mathcal{S}$ indexes layers with auxiliary heads and $\lambda$ is a small constant (e.g., $0.1$). This mitigates gradient decay and encourages intermediate layers to learn predictive features.
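The combined objective can be sketched numerically; the layer indices, per-head predictions, and the binary cross-entropy form below are illustrative assumptions:

```python
import numpy as np

def bce(p, y):
    # Binary cross-entropy for one predicted probability and binary label.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical CTR predictions: one from the final head, plus intermediate
# predictions from auxiliary heads attached at a subset of layers.
y = 1.0
p_final = 0.8
p_aux = {3: 0.6, 6: 0.7}   # layer index -> auxiliary head prediction
lam = 0.1                  # small auxiliary weight

loss = bce(p_final, y) + lam * sum(bce(p, y) for p in p_aux.values())
```

The small weight keeps the main loss dominant while still sending a direct gradient signal into intermediate layers.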
1.3 PerToken SwiGLU
Each token receives its own parameter-isolated SwiGLU activation:
$$o_i = W^i_{\mathrm{down}}\!\left(\mathrm{SiLU}(W^i_{\mathrm{gate}} t_i) \odot W^i_{\mathrm{up}} t_i\right),$$
where $W^i_{\mathrm{gate}}, W^i_{\mathrm{up}} \in \mathbb{R}^{ed \times d}$ and $W^i_{\mathrm{down}} \in \mathbb{R}^{d \times ed}$ are per-token fully connected layers with expansion ratio $e$. This setup boosts representation heterogeneity at the token level.
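A minimal numpy sketch of parameter isolation, where each token index owns a separate weight triple instead of sharing one FFN; the weight shapes and SiLU gating follow standard SwiGLU and are assumptions here:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def per_token_swiglu(tokens, W_gate, W_up, W_down):
    # tokens: (T, d); token i uses only its own weights W_*[i]
    # (parameter isolation), unlike a shared FFN applied to all tokens.
    out = np.empty_like(tokens)
    for i, t in enumerate(tokens):
        h = silu(t @ W_gate[i]) * (t @ W_up[i])   # gated expansion to e*d
        out[i] = h @ W_down[i]                    # project back to d
    return out

rng = np.random.default_rng(0)
T, d, e = 4, 8, 2
tokens = rng.normal(size=(T, d))
W_gate = rng.normal(size=(T, d, e * d))
W_up = rng.normal(size=(T, d, e * d))
W_down = rng.normal(size=(T, e * d, d))
assert per_token_swiglu(tokens, W_gate, W_up, W_down).shape == (T, d)
```

In production the per-token matmuls would be batched into one grouped GEMM rather than looped.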
1.4 Sparse-PerToken MoE (“S-P MoE”)
Each token’s SwiGLU is replaced by a local MoE comprising $N$ expert SwiGLUs (of reduced size) and a shared expert:
$$o_i = \sum_{j \in \mathrm{TopK}(g_i)} s \cdot g_{i,j}\, E^i_j(t_i) + E_{\mathrm{shared}}(t_i).$$
Only the top-$K$ experts are active per token for both training and inference (“Sparse Train, Sparse Infer”). Here $g_i$ is the router softmax over the token’s experts, the gate-value scale $s$ compensates for reduced expert utilization, and the shared expert ensures robust convergence. FP8 quantization, fused MoEGroupedGemm kernels, and token-parallel sharding bolster both throughput and scalability.
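A single-token sketch of the routing logic; the $N/k$ gate rescaling and the linear stand-in experts are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sparse_pertoken_moe(t, experts, shared, router_w, k):
    # t: (d,) one token; experts: this token's N local expert functions.
    # Only the top-k routed experts run ("Sparse Train, Sparse Infer");
    # gate values are rescaled by N/k to compensate for reduced expert
    # utilization (an assumed form of the gate-value scaling).
    gates = softmax(t @ router_w)        # (N,) router softmax
    topk = np.argsort(gates)[-k:]        # indices of the k largest gates
    n = len(experts)
    out = sum(gates[j] * (n / k) * experts[j](t) for j in topk)
    return out + shared(t)               # shared expert is always active

rng = np.random.default_rng(1)
d, N, k = 8, 4, 2
Ws = [rng.normal(size=(d, d)) for _ in range(N + 1)]
experts = [lambda t, W=W: t @ W for W in Ws[:N]]   # toy linear "experts"
shared = lambda t: t @ Ws[N]
t = rng.normal(size=d)
router_w = rng.normal(size=(d, N))
out = sparse_pertoken_moe(t, experts, shared, router_w, k)
assert out.shape == (d,)
```

Because the unselected experts are never evaluated, compute per token scales with $k$ rather than $N$ in both training and serving.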
2. Scaling Paradigm and Training Methodology
TokenMixer-Large expands model size by proportionally increasing embedding width, depth, expansion ratio, and expert count. Achieved configurations include 15B parameters (Feed-Ads), 7B (E-Commerce), and 2–4B (Live Streaming).
Training regimes employ Adagrad optimization, bfloat16 training, and FP8 inference. Convergence time scales with model size: 30–90M-parameter models converge on 14 days of data, while 500M–2B-parameter models require roughly 60 days of data.
Data and Hardware Infrastructure
- Data: Online deployments used 400M daily Douyin E-Commerce samples (two years), 300M/day Ads, and 17B/day Live-Streaming. Input features comprise sparse IDs, numeric values, and varied user behavior sequences.
- Hardware: Training utilized 64–256 A100 GPUs. Token-parallel sharding and custom FP8/Fused-MoE operators were essential for both dense and sparse model variants.
3. Empirical Performance and Evaluations
3.1 Offline Metrics
TokenMixer-Large achieves significant AUC improvements at various scales:
| Model Variant | ΔAUC (CTCVR, E-Com, ∼500M) | FLOPs (T) |
|---|---|---|
| TokenMixer-Large 500M | +0.94% | 4.2 |
| RankMixer | +0.84% | — |
| AutoInt | +0.75% | 138 |
| Wukong | +0.76% | — |
| HiFormer | +0.44% | 28.8 |
| DCNv2 | +0.49% | ∼126 |
Scaling up:
- 4B dense: +1.14% ΔAUC, 29.8T FLOPs
- 7B dense: +1.20% ΔAUC, 49.0T FLOPs
- 4.6B S-P MoE (1:2 sparsity): +1.14% ΔAUC, 15.1T FLOPs, 2.3B active parameters
Ablations (all 500M params):
- Removing mixing-and-reverting: –0.27% AUC
- Removing standard residuals: –0.15%
- Omitting interval residual plus aux-loss: –0.04%
- Replacing pertoken SwiGLU with global SwiGLU: –0.21%
- Switching to standard sparse MoE: –0.10%
3.2 Online A/B Testing
Live deployments replacing RankMixer with TokenMixer-Large yielded:
| Scenario | Baseline Size | TM-Large Size | ΔAUC | Business Metric | Gain |
|---|---|---|---|---|---|
| Feed-Ads | 1B | 7B | +0.35% | ADSS | +2.0% |
| E-Commerce | 150M | 4B | +0.51% | Order count / GMV | +1.66%, +2.98% |
| Live-Streaming | 500M | 2B | +0.70% | Total payment amount | +1.40% |
| Douyin App Metrics | -- | -- | -- | Active Days / Session / Likes / Finishes / Comments | +0.29%, +1.08%, +2.39%, +1.99%, +0.79% |
All gains were statistically significant across user-activity segments.
4. Complexity, Efficiency, and Practical Implementation
4.1 FLOPs and Memory
At 500M parameters, TokenMixer-Large requires only 4.2T FLOPs per 2048-sample batch—substantially lower than dense MLPs (~125T) or DCNv2 (~126T), and competitive with Transformer-style models (AutoInt 138T, HiFormer 28.8T). S-P MoE achieves further FLOP reductions by ~50% through sparsity. GPU memory for a 7B model peaks at ~40GB (bfloat16, 4-way token parallelism).
FP8 quantization, fused MoEGroupedGemm kernels, and Token-Parallel sharding enable 1.7× inference speed-ups and efficient multi-GPU scaling.
4.2 Ablation Findings and Model Utilization
- RMSNorm Pre-Norm is empirically superior (vs. Post-Norm or Sandwich-Norm).
- Gate-Value Scaling is critical; omitting it costs 0.02–0.05% ΔAUC.
- Down-Matrix Small Init (small-variance initialization of FC_down) provides +0.03% ΔAUC and improved convergence over Xavier initialization.
- “Pure-Model” Composition: Removing small, fragmented operators (e.g., DCN, LHUC, DHEN) at large scale (500M) yields identical performance, with Model-FLOPs-Utilization rising from ~30% to 60%.
5. Deployment Guidance and Best Practices
TokenMixer-Large demonstrates several deployment best practices for industrial-scale recommendation backbones:
- Prioritize semantic alignment in residual design via symmetric mixing/reverting blocks.
- Combine interval residuals with lightweight auxiliary losses to stabilize deep models.
- Utilize “first enlarge, then sparse” PerToken-MoE with gate scaling and shared experts for scalable, efficient parameterization.
- Pursue a pure-model approach for maximizing hardware utilization as model size grows.
- Leverage FP8 quantization, fused kernels, and token-level parallelization for efficient inference and training.
6. Limitations, Open Problems, and Future Directions
TokenMixer-Large’s Sparse-PerToken MoE experiences load imbalance at higher sparsity ratios (e.g., 1:8); more effective router losses may address this. Training extremely large models (15B) is data-intensive, demanding substantially longer logging periods (weeks to months). Extending TokenMixer-Large to multimodal or sequential recommendation remains an open direction with yet unexplored empirical outcomes (Jiang et al., 6 Feb 2026).