
TokenMixer-Large: Scalable Recommender Backbone

Updated 9 February 2026
  • TokenMixer-Large is a scalable backbone for industrial recommender systems, leveraging symmetric mixing-and-reverting blocks and enhanced residual strategies.
  • It employs Sparse-PerToken MoE and hardware-aware operator fusion to achieve significant ΔAUC improvements and reduced FLOPs in large-scale evaluations.
  • Empirical results show that deeper, sparsely activated models enable efficient training and deployment, scaling up to 15 billion parameters.

TokenMixer-Large is a high-performance, highly scalable backbone for industrial ranking in large-scale recommender systems. It builds upon the initial TokenMixer (RankMixer) architecture, addressing design limitations in residual pathways, model depth, Mixture of Experts (MoE) sparsification, and scalability. TokenMixer-Large introduces a symmetric mixing-and-reverting block, enhanced residual and auxiliary loss strategies, Sparse-PerToken MoE with “Sparse Train–Sparse Infer”, and hardware-aware operator fusion, enabling deployment at scales up to 15 billion parameters with lower computational complexity, higher efficiency, and strong empirical results on both offline datasets and live traffic at ByteDance (Jiang et al., 6 Feb 2026).

1. Architectural Foundation and Innovations

1.1 Mixing-and-Reverting Operation

TokenMixer-Large modifies the original TokenMixer’s input mixing by implementing a symmetric two-stage block that ensures the input and output of each layer retain dimensions in $\mathbb{R}^{T\times D}$, supporting effective residual-connection alignment. The core stages are:

  • Mixing: The input $X \in \mathbb{R}^{T \times D}$ is split into $H$ groups, permuted, and concatenated: $H = \mathrm{split{+}permute}(X) \in \mathbb{R}^{H \times (TD/H)}$. A parameter-isolated SwiGLU nonlinearity $\mathrm{pSwiGLU}(H)$ is applied, followed by RMSNorm and an additive residual: $H' = \operatorname{RMSNorm}(\mathrm{pSwiGLU}(H) + H)$.
  • Reverting: $H'$ is reshaped back to the token layout, $X_{\mathrm{revert}} = \mathrm{reshape}(H') \in \mathbb{R}^{T \times D}$, followed by another pSwiGLU and a residual from the block input: $X_{\mathrm{next}} = \operatorname{RMSNorm}(\mathrm{pSwiGLU}(X_{\mathrm{revert}}) + X)$.

This design solves the semantic misalignment inherent to the original TokenMixer, where residuals could not meaningfully propagate unless the layer consistently preserved token counts and mixing structure.
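
A minimal PyTorch sketch of this two-stage block under the definitions above. The class names (`MixRevertBlock`, `PerGroupSwiGLU`), the split/permute ordering, and the expansion factor are illustrative assumptions rather than the paper’s exact implementation; `nn.RMSNorm` requires PyTorch ≥ 2.4.

```python
import torch
import torch.nn as nn

class PerGroupSwiGLU(nn.Module):
    """Parameter-isolated SwiGLU: each of the `groups` slices has its own weights."""
    def __init__(self, groups: int, dim: int, expansion: int = 2):
        super().__init__()
        hidden = expansion * dim
        self.w_gate = nn.Parameter(torch.randn(groups, dim, hidden) * dim ** -0.5)
        self.w_up   = nn.Parameter(torch.randn(groups, dim, hidden) * dim ** -0.5)
        self.w_down = nn.Parameter(torch.randn(groups, hidden, dim) * 0.01)  # small init on FC_down
        self.bias   = nn.Parameter(torch.zeros(groups, dim))

    def forward(self, x):                                     # x: [B, groups, dim]
        gate = torch.einsum("bgd,gdh->bgh", x, self.w_gate)
        up   = torch.einsum("bgd,gdh->bgh", x, self.w_up)
        h    = nn.functional.silu(gate) * up                  # Swish(gate) ⊙ up
        return torch.einsum("bgh,ghd->bgd", h, self.w_down) + self.bias

class MixRevertBlock(nn.Module):
    """Symmetric mixing-and-reverting block; input and output are both [B, T, D]."""
    def __init__(self, tokens: int, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        group_dim = tokens * dim // heads
        self.mix_ffn     = PerGroupSwiGLU(heads, group_dim)   # one SwiGLU per mixed group
        self.revert_ffn  = PerGroupSwiGLU(tokens, dim)        # one SwiGLU per token
        self.norm_mix    = nn.RMSNorm(group_dim)              # needs PyTorch >= 2.4
        self.norm_revert = nn.RMSNorm(dim)

    def forward(self, x):                                     # x: [B, T, D]
        b, t, d = x.shape
        # Mixing: split each token's features into `heads` chunks and regroup across tokens.
        h = x.reshape(b, t, self.heads, d // self.heads)
        h = h.permute(0, 2, 1, 3).reshape(b, self.heads, -1)  # [B, H, T*D/H]
        h = self.norm_mix(self.mix_ffn(h) + h)                # H' = RMSNorm(pSwiGLU(H) + H)
        # Reverting: restore the [B, T, D] layout, then residual from the block input X.
        x_rev = h.reshape(b, self.heads, t, d // self.heads)
        x_rev = x_rev.permute(0, 2, 1, 3).reshape(b, t, d)
        return self.norm_revert(self.revert_ffn(x_rev) + x)   # RMSNorm(pSwiGLU(X_revert) + X)
```

Because the output of every block stays in $\mathbb{R}^{T \times D}$, both the per-block residuals and the interval residuals of Section 1.2 remain dimensionally well defined.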

1.2 Inter-Layer Residuals and Auxiliary Loss

TokenMixer-Large supports training of deep stacks ($L \geq 12$) by augmenting per-block residuals with:

  • Interval Residuals: Adding $X_{l-k}$ to $X_l$ every $k$ layers (typically $k = 2$ or $3$), facilitating upward flow of low-level features.
  • Auxiliary Losses: Layer-wise auxiliary classification losses imposed at intermediate logits, i.e.

$$\mathcal{L} = \mathcal{L}_{\mathrm{main}}\bigl(f_L(X_L), y\bigr) + \lambda \sum_{i\in \mathcal{I}} \mathcal{L}_{\mathrm{aux}}\bigl(f_i(X_i), y\bigr),$$

where $\mathcal{I}$ indexes layers with auxiliary heads and $\lambda$ is a small constant (e.g., $0.1$). This mitigates gradient decay and encourages intermediate layers to learn predictive features.
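
A hedged sketch of how the interval residual and the layer-wise auxiliary losses could be wired together. The token pooling, head placement (`aux_every`), and $\lambda = 0.1$ are illustrative choices, not settings confirmed by the paper.

```python
import torch
import torch.nn as nn

class DeepStack(nn.Module):
    """L blocks with an interval residual every k layers and layer-wise auxiliary heads."""
    def __init__(self, blocks: nn.ModuleList, dim: int, k: int = 2, aux_every: int = 4):
        super().__init__()
        self.blocks, self.k = blocks, k
        self.heads = nn.ModuleDict({                 # lightweight classification heads
            str(i): nn.Linear(dim, 1)
            for i in range(len(blocks))
            if (i + 1) % aux_every == 0 or i == len(blocks) - 1
        })

    def forward(self, x):                            # x: [B, T, D]
        logits, history = {}, [x]
        for i, block in enumerate(self.blocks):
            x = block(x)
            if (i + 1) % self.k == 0:                # interval residual: add X_{l-k}
                x = x + history[-self.k]
            history.append(x)
            if str(i) in self.heads:                 # auxiliary (or main) logit at layer i
                logits[i] = self.heads[str(i)](x.mean(dim=1)).squeeze(-1)
        return logits

def total_loss(logits: dict, y: torch.Tensor, lam: float = 0.1):
    """L = L_main(f_L(X_L), y) + lambda * sum_i L_aux(f_i(X_i), y)."""
    bce = nn.functional.binary_cross_entropy_with_logits
    main = max(logits)                               # deepest head is the main head
    loss = bce(logits[main], y)
    for i, z in logits.items():
        if i != main:
            loss = loss + lam * bce(z, y)
    return loss
```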

1.3 PerToken SwiGLU

Each token $x_t$ receives its own parameter-isolated SwiGLU activation:

$$\mathrm{pSwiGLU}(x_t) = W_{\text{down}}^{t}\left(\mathrm{Swish}(W_{\text{gate}}^{t}x_t)\odot (W_{\text{up}}^{t}x_t)\right) + b^t,$$

where $W_{\text{up}}^{t}$ and $W_{\text{gate}}^{t}$ are fully connected layers of size $D \rightarrow nD$ and $W_{\text{down}}^{t}$ maps back $nD \rightarrow D$. This setup boosts representation heterogeneity at the token level.
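
The formula maps directly onto code; a minimal, self-contained shape check for a single token (dimensions and weight initialization are illustrative):

```python
import torch

def pswiglu_single_token(x_t, W_gate, W_up, W_down, b_t):
    """pSwiGLU(x_t) = W_down^t ( Swish(W_gate^t x_t) ⊙ (W_up^t x_t) ) + b^t."""
    hidden = torch.nn.functional.silu(W_gate @ x_t) * (W_up @ x_t)    # Swish == SiLU
    return W_down @ hidden + b_t

# Shape check for one token with D = 64 and expansion n = 2 (values are illustrative).
D, n = 64, 2
x_t = torch.randn(D)
W_gate, W_up = torch.randn(n * D, D), torch.randn(n * D, D)          # D -> nD
W_down, b_t = torch.randn(D, n * D), torch.zeros(D)                  # nD -> D
assert pswiglu_single_token(x_t, W_gate, W_up, W_down, b_t).shape == (D,)
```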

1.4 Sparse-PerToken MoE (“S-P MoE”)

Each token’s SwiGLU is replaced by a local MoE comprising $E$ expert SwiGLUs (each of size $nD/E$) and a shared expert:

$$y_t = \alpha\sum_{j\in \mathrm{TopK}(g(x_t))} g_j(x_t)\, \mathrm{Expert}_j(x_t) + \mathrm{SharedExpert}(x_t)$$

Only $k \ll E$ experts are active per token for both training and inference (“Sparse Train, Sparse Infer”). $g(x_t)$ is the router softmax, $\alpha$ compensates for reduced expert utilization, and the shared expert ensures robust convergence. FP8 quantization, fused MoEGroupedGemm kernels, and token-parallel sharding bolster both throughput and scalability.
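
A hedged reference implementation of this routing for the inputs arriving at one token position. The naive per-expert dispatch loop stands in for the fused MoEGroupedGemm kernels, and the expert width, shared-expert size, and $\alpha = E/k$ are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up   = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SparsePerTokenMoE(nn.Module):
    """Top-k routed expert SwiGLUs plus an always-on shared expert for one token position."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2, expert_hidden: int = 32):
        super().__init__()
        self.k = k
        self.alpha = n_experts / k                      # gate-value scaling, ~ 1 / sparsity
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim, expert_hidden) for _ in range(n_experts))
        self.shared = SwiGLUExpert(dim, expert_hidden)

    def forward(self, x):                               # x: [N, dim], a batch of one token slot
        gates = F.softmax(self.router(x), dim=-1)       # g(x_t)
        topv, topi = gates.topk(self.k, dim=-1)         # only k << E experts stay active
        out = self.shared(x)                            # SharedExpert(x_t), always applied
        for e, expert in enumerate(self.experts):       # naive dispatch; production uses
            sel = topi == e                             # fused grouped-GEMM kernels instead
            rows = sel.any(dim=-1).nonzero(as_tuple=True)[0]
            if rows.numel() == 0:
                continue                                # expert e received no tokens
            gate = (topv * sel)[rows].sum(dim=-1, keepdim=True)
            out = out.index_add(0, rows, self.alpha * gate * expert(x[rows]))
        return out
```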

2. Scaling Paradigm and Training Methodology

TokenMixer-Large expands model size by proportionally increasing embedding width ($D$), depth ($L$), expansion factor ($n$), and expert count ($E$). Achieved configurations include 15B parameters (Feed-Ads, $L \approx 32$, $D \approx 4096$, $n \approx 4$), 7B (E-Commerce), and 2–4B (Live Streaming).

Training regimes employ Adagrad ($\text{lr}_{\text{dense}} = 0.01$, $\text{lr}_{\text{sparse}} = 0.05$), bfloat16 training, and FP8 inference. Convergence time scales with model size: 30–90M-parameter models converge on roughly 14 days of data, while 500M–2B models require about 60 days of data.
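
A minimal sketch of the two-learning-rate Adagrad setup described above; the rule used to separate dense backbone parameters from sparse embedding parameters is a hypothetical naming convention:

```python
import torch

def build_optimizer(model: torch.nn.Module):
    """Adagrad with separate learning rates for dense backbone vs. sparse embedding parameters.

    Splitting by the substring "embedding" in the parameter name is an illustrative
    convention, not the paper's actual parameter grouping.
    """
    dense, sparse = [], []
    for name, p in model.named_parameters():
        (sparse if "embedding" in name else dense).append(p)
    return torch.optim.Adagrad([
        {"params": dense,  "lr": 0.01},   # lr_dense
        {"params": sparse, "lr": 0.05},   # lr_sparse
    ])
```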

Data and Hardware Infrastructure

  • Data: Online deployments used 400M daily Douyin E-Commerce samples (two years), 300M/day Ads, and 17B/day Live-Streaming. Input features comprise sparse IDs, numeric values, and varied user behavior sequences.
  • Hardware: Training utilized 64–256 A100 GPUs. Token-parallel sharding and custom FP8/Fused-MoE operators were essential for both dense and sparse model variants.

3. Empirical Performance and Evaluations

3.1 Offline Metrics

TokenMixer-Large achieves significant AUC improvements at various scales:

| Model Variant | ΔAUC (CTCVR, E-Commerce, ~500M) | FLOPs (T) |
|---|---|---|
| TokenMixer-Large 500M | +0.94% | 4.2 |
| RankMixer | +0.84% | — |
| AutoInt | +0.75% | — |
| Wukong | +0.76% | — |
| HiFormer | +0.44% | — |
| DCNv2 | +0.49% | — |

Scaling up:

  • 4B dense: +1.14% ΔAUC, 29.8T FLOPs
  • 7B dense: +1.20% ΔAUC, 49.0T FLOPs
  • 4.6B S-P MoE (1:2 sparsity): +1.14% ΔAUC, 15.1T FLOPs, 2.3B active parameters

Ablations (all ~500M params):

  • Removing mixing-and-reverting: –0.27% AUC
  • Removing standard residuals: –0.15%
  • Omitting interval residual plus aux-loss: –0.04%
  • Replacing pertoken SwiGLU with global SwiGLU: –0.21%
  • Switching to standard sparse MoE: –0.10%

3.2 Online A/B Testing

Live deployments replacing RankMixer with TokenMixer-Large yielded:

| Scenario | Baseline Size | TM-Large Size | ΔAUC | Business Metric Gain |
|---|---|---|---|---|
| Feed-Ads | 1B | 7B | +0.35% | ADSS +2.0% |
| E-Commerce | 150M | 4B | +0.51% | Order count +1.66%, GMV +2.98% |
| Live-Streaming | 500M | 2B | +0.70% | Total payment amount +1.40% |
| Douyin App Metrics | — | — | — | Active Days +0.29%, Sessions +1.08%, Likes +2.39%, Finishes +1.99%, Comments +0.79% |

All gains were statistically significant across user-activity segments.

4. Complexity, Efficiency, and Practical Implementation

4.1 FLOPs and Memory

At 500M parameters, TokenMixer-Large requires only 4.2T FLOPs per 2048-sample batch—substantially lower than dense MLPs (~125T) or DCNv2 (~126T), and competitive with Transformer-style models (AutoInt 138T, HiFormer 28.8T). S-P MoE achieves further FLOP reductions by ~50% through sparsity. GPU memory for a 7B model peaks at ~40GB (bfloat16, 4-way token parallelism).

FP8 quantization, fused MoEGroupedGemm kernels, and Token-Parallel sharding enable 1.7× inference speed-ups and efficient multi-GPU scaling.

4.2 Ablation Findings and Model Utilization

  • RMSNorm Pre-Norm is empirically superior (vs. Post-Norm or Sandwich-Norm).
  • Gate-Value Scaling ($\alpha \propto 1/\text{sparsity}$) is critical; omission costs 0.02–0.05% ΔAUC.
  • Down-Matrix Small Init ($\times 0.01$ on FC_down) provides +0.03% ΔAUC and better convergence than Xavier initialization (see the sketch after this list).
  • “Pure-Model” Composition: Removing small, fragmented operators (e.g., DCN, LHUC, DHEN) at large scale (>500M) yields identical performance, with Model-FLOPs-Utilization rising from ~30% to ~60%.
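
A minimal sketch of the Down-Matrix Small Init, assuming down-projection Linear layers can be identified by a naming convention (hypothetical; the paper only specifies the ×0.01 scaling on FC_down):

```python
import torch
import torch.nn as nn

def init_down_small(module: nn.Module, scale: float = 0.01):
    """Xavier-init all Linear layers, then shrink down-projection weights by `scale`."""
    with torch.no_grad():
        for name, m in module.named_modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if name.endswith("down"):        # matching by name is an illustrative convention
                    m.weight.mul_(scale)         # Down-Matrix Small Init (×0.01)
```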

5. Deployment Guidance and Best Practices

TokenMixer-Large demonstrates several deployment best practices for industrial-scale recommendation backbones:

  • Prioritize semantic alignment in residual design via symmetric mixing/reverting blocks.
  • Combine interval residuals with lightweight auxiliary losses to stabilize deep models.
  • Utilize “first enlarge, then sparse” PerToken-MoE with gate scaling and shared experts for scalable, efficient parameterization.
  • Pursue a pure-model approach for maximizing hardware utilization as model size grows.
  • Leverage FP8 quantization, fused kernels, and token-level parallelization for efficient inference and training.

6. Limitations, Open Problems, and Future Directions

TokenMixer-Large’s Sparse-PerToken MoE experiences load imbalance at higher sparsity ratios (e.g., >1:8); more effective router losses may address this. Training extremely large models (>15B) is data-intensive, demanding substantially longer logging periods (weeks to months). Extending TokenMixer-Large to multimodal or sequential recommendation remains an open direction whose empirical outcomes are as yet unexplored (Jiang et al., 6 Feb 2026).
