
TokenMixer-Large: Scalable Recommender Backbone

Updated 9 February 2026
  • TokenMixer-Large is a scalable backbone for industrial recommender systems, leveraging symmetric mixing-and-reverting blocks and enhanced residual strategies.
  • It employs Sparse-PerToken MoE and hardware-aware operator fusion to achieve significant ΔAUC improvements and reduced FLOPs in large-scale evaluations.
  • Empirical results show that deeper, sparsely activated models enable efficient training and deployment, scaling up to 15 billion parameters.

TokenMixer-Large is a high-performance, highly scalable backbone for industrial ranking in large-scale recommender systems. It builds upon the initial TokenMixer (RankMixer) architecture, addressing design limitations in residual pathways, model depth, Mixture of Experts (MoE) sparsification, and scalability. TokenMixer-Large introduces a symmetric mixing-and-reverting block, enhanced residual and auxiliary loss strategies, Sparse-PerToken MoE with “Sparse Train–Sparse Infer”, and hardware-aware operator fusion, enabling deployment at scales up to 15 billion parameters with lower computational complexity, higher efficiency, and strong empirical results on both offline datasets and live traffic at ByteDance (Jiang et al., 6 Feb 2026).

1. Architectural Foundation and Innovations

1.1 Mixing-and-Reverting Operation

TokenMixer-Large modifies the original TokenMixer’s input mixing by implementing a symmetric two-stage block that ensures the input and output of each layer retain dimensions in $\mathbb{R}^{T\times D}$, supporting effective residual-connection alignment. The core stages are:

  • Mixing: The input $X \in \mathbb{R}^{T \times D}$ is split into $H$ groups, permuted, and concatenated: $H = \mathrm{split{+}permute}(X) \in \mathbb{R}^{H \times (TD/H)}$. A parameter-isolated SwiGLU nonlinearity $\mathrm{pSwiGLU}(H)$ is applied, followed by RMSNorm and an additive residual: $H' = \operatorname{RMSNorm}(\mathrm{pSwiGLU}(H) + H)$.
  • Reverting: $H'$ is reshaped back to the token layout, $X_{\mathrm{revert}} = \mathrm{reshape}(H') \in \mathbb{R}^{T \times D}$, followed by another pSwiGLU and a residual from the block input: $X_{\mathrm{next}} = \operatorname{RMSNorm}(\mathrm{pSwiGLU}(X_{\mathrm{revert}}) + X)$.

This design solves the semantic misalignment inherent to the original TokenMixer, where residuals could not meaningfully propagate unless the layer consistently preserved token counts and mixing structure.
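
A minimal PyTorch sketch of this two-stage block under the definitions above. The class names (`MixRevertBlock`, `PerGroupSwiGLU`), the split/permute ordering, and the expansion factor are illustrative assumptions rather than the paper’s exact implementation; `nn.RMSNorm` requires PyTorch ≥ 2.4.

```python
import torch
import torch.nn as nn

class PerGroupSwiGLU(nn.Module):
    """Parameter-isolated SwiGLU: each of the `groups` slices has its own weights."""
    def __init__(self, groups: int, dim: int, expansion: int = 2):
        super().__init__()
        hidden = expansion * dim
        self.w_gate = nn.Parameter(torch.randn(groups, dim, hidden) * dim ** -0.5)
        self.w_up   = nn.Parameter(torch.randn(groups, dim, hidden) * dim ** -0.5)
        self.w_down = nn.Parameter(torch.randn(groups, hidden, dim) * 0.01)  # small init on FC_down
        self.bias   = nn.Parameter(torch.zeros(groups, dim))

    def forward(self, x):                                     # x: [B, groups, dim]
        gate = torch.einsum("bgd,gdh->bgh", x, self.w_gate)
        up   = torch.einsum("bgd,gdh->bgh", x, self.w_up)
        h    = nn.functional.silu(gate) * up                  # Swish(gate) ⊙ up
        return torch.einsum("bgh,ghd->bgd", h, self.w_down) + self.bias

class MixRevertBlock(nn.Module):
    """Symmetric mixing-and-reverting block; input and output are both [B, T, D]."""
    def __init__(self, tokens: int, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        group_dim = tokens * dim // heads
        self.mix_ffn     = PerGroupSwiGLU(heads, group_dim)   # one SwiGLU per mixed group
        self.revert_ffn  = PerGroupSwiGLU(tokens, dim)        # one SwiGLU per token
        self.norm_mix    = nn.RMSNorm(group_dim)              # needs PyTorch >= 2.4
        self.norm_revert = nn.RMSNorm(dim)

    def forward(self, x):                                     # x: [B, T, D]
        b, t, d = x.shape
        # Mixing: split each token's features into `heads` chunks and regroup across tokens.
        h = x.reshape(b, t, self.heads, d // self.heads)
        h = h.permute(0, 2, 1, 3).reshape(b, self.heads, -1)  # [B, H, T*D/H]
        h = self.norm_mix(self.mix_ffn(h) + h)                # H' = RMSNorm(pSwiGLU(H) + H)
        # Reverting: restore the [B, T, D] layout, then residual from the block input X.
        x_rev = h.reshape(b, self.heads, t, d // self.heads)
        x_rev = x_rev.permute(0, 2, 1, 3).reshape(b, t, d)
        return self.norm_revert(self.revert_ffn(x_rev) + x)   # RMSNorm(pSwiGLU(X_revert) + X)
```

Because the output of every block stays in $\mathbb{R}^{T \times D}$, both the per-block residuals and the interval residuals of Section 1.2 remain dimensionally well defined.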

1.2 Inter-Layer Residuals and Auxiliary Loss

TokenMixer-Large supports training of deep stacks ($L \geq 12$) by augmenting per-block residuals with:

  • Interval Residuals: Adding $X_{l-k}$ to $X_l$ every $k$ layers (typically $k = 2$ or $3$), facilitating upward flow of low-level features.
  • Auxiliary Losses: Layer-wise auxiliary classification losses imposed at intermediate logits, i.e.

$$\mathcal{L} = \mathcal{L}_{\mathrm{main}}\bigl(f_L(X_L), y\bigr) + \lambda \sum_{i\in \mathcal{I}} \mathcal{L}_{\mathrm{aux}}\bigl(f_i(X_i), y\bigr),$$

where $\mathcal{I}$ indexes layers with auxiliary heads and $\lambda$ is a small constant (e.g., $0.1$). This mitigates gradient decay and encourages intermediate layers to learn predictive features.
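
A hedged sketch of how the interval residual and the layer-wise auxiliary losses could be wired together. The token pooling, head placement (`aux_every`), and $\lambda = 0.1$ are illustrative choices, not settings confirmed by the paper.

```python
import torch
import torch.nn as nn

class DeepStack(nn.Module):
    """L blocks with an interval residual every k layers and layer-wise auxiliary heads."""
    def __init__(self, blocks: nn.ModuleList, dim: int, k: int = 2, aux_every: int = 4):
        super().__init__()
        self.blocks, self.k = blocks, k
        self.heads = nn.ModuleDict({                 # lightweight classification heads
            str(i): nn.Linear(dim, 1)
            for i in range(len(blocks))
            if (i + 1) % aux_every == 0 or i == len(blocks) - 1
        })

    def forward(self, x):                            # x: [B, T, D]
        logits, history = {}, [x]
        for i, block in enumerate(self.blocks):
            x = block(x)
            if (i + 1) % self.k == 0:                # interval residual: add X_{l-k}
                x = x + history[-self.k]
            history.append(x)
            if str(i) in self.heads:                 # auxiliary (or main) logit at layer i
                logits[i] = self.heads[str(i)](x.mean(dim=1)).squeeze(-1)
        return logits

def total_loss(logits: dict, y: torch.Tensor, lam: float = 0.1):
    """L = L_main(f_L(X_L), y) + lambda * sum_i L_aux(f_i(X_i), y)."""
    bce = nn.functional.binary_cross_entropy_with_logits
    main = max(logits)                               # deepest head is the main head
    loss = bce(logits[main], y)
    for i, z in logits.items():
        if i != main:
            loss = loss + lam * bce(z, y)
    return loss
```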

1.3 PerToken SwiGLU

Each token $x_t$ receives its own parameter-isolated SwiGLU activation:

$$\mathrm{pSwiGLU}(x_t) = W_{\text{down}}^{t}\left(\mathrm{Swish}(W_{\text{gate}}^{t}x_t)\odot (W_{\text{up}}^{t}x_t)\right) + b^t,$$

where $W_{\text{up}}^{t}$ and $W_{\text{gate}}^{t}$ are fully connected layers of size $D \rightarrow nD$ and $W_{\text{down}}^{t}$ maps back $nD \rightarrow D$. This setup boosts representation heterogeneity at the token level.
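
The formula maps directly onto code; a minimal, self-contained shape check for a single token (dimensions and weight initialization are illustrative):

```python
import torch

def pswiglu_single_token(x_t, W_gate, W_up, W_down, b_t):
    """pSwiGLU(x_t) = W_down^t ( Swish(W_gate^t x_t) ⊙ (W_up^t x_t) ) + b^t."""
    hidden = torch.nn.functional.silu(W_gate @ x_t) * (W_up @ x_t)    # Swish == SiLU
    return W_down @ hidden + b_t

# Shape check for one token with D = 64 and expansion n = 2 (values are illustrative).
D, n = 64, 2
x_t = torch.randn(D)
W_gate, W_up = torch.randn(n * D, D), torch.randn(n * D, D)          # D -> nD
W_down, b_t = torch.randn(D, n * D), torch.zeros(D)                  # nD -> D
assert pswiglu_single_token(x_t, W_gate, W_up, W_down, b_t).shape == (D,)
```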

1.4 Sparse-PerToken MoE (“S-P MoE”)

Each token’s SwiGLU is replaced by a local MoE comprising $E$ expert SwiGLUs (each of size $nD/E$) and a shared expert:

$$y_t = \alpha\sum_{j\in \mathrm{TopK}(g(x_t))} g_j(x_t)\, \mathrm{Expert}_j(x_t) + \mathrm{SharedExpert}(x_t)$$

Only $k \ll E$ experts are active per token for both training and inference (“Sparse Train, Sparse Infer”). $g(x_t)$ is the router softmax, $\alpha$ compensates for reduced expert utilization, and the shared expert ensures robust convergence. FP8 quantization, fused MoEGroupedGemm kernels, and token-parallel sharding bolster both throughput and scalability.
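
A hedged reference implementation of this routing for the inputs arriving at one token position. The naive per-expert dispatch loop stands in for the fused MoEGroupedGemm kernels, and the expert width, shared-expert size, and $\alpha = E/k$ are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up   = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SparsePerTokenMoE(nn.Module):
    """Top-k routed expert SwiGLUs plus an always-on shared expert for one token position."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2, expert_hidden: int = 32):
        super().__init__()
        self.k = k
        self.alpha = n_experts / k                      # gate-value scaling, ~ 1 / sparsity
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim, expert_hidden) for _ in range(n_experts))
        self.shared = SwiGLUExpert(dim, expert_hidden)

    def forward(self, x):                               # x: [N, dim], a batch of one token slot
        gates = F.softmax(self.router(x), dim=-1)       # g(x_t)
        topv, topi = gates.topk(self.k, dim=-1)         # only k << E experts stay active
        out = self.shared(x)                            # SharedExpert(x_t), always applied
        for e, expert in enumerate(self.experts):       # naive dispatch; production uses
            sel = topi == e                             # fused grouped-GEMM kernels instead
            rows = sel.any(dim=-1).nonzero(as_tuple=True)[0]
            if rows.numel() == 0:
                continue                                # expert e received no tokens
            gate = (topv * sel)[rows].sum(dim=-1, keepdim=True)
            out = out.index_add(0, rows, self.alpha * gate * expert(x[rows]))
        return out
```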

2. Scaling Paradigm and Training Methodology

TokenMixer-Large expands model size by proportionally increasing embedding width ($D$), depth ($L$), expansion factor ($n$), and expert count ($E$). Achieved configurations include 15B parameters (Feed-Ads, $L \approx 32$, $D \approx 4096$, $n \approx 4$), 7B (E-Commerce), and 2–4B (Live Streaming).

Training regimes employ Adagrad ($\text{lr}_{\text{dense}} = 0.01$, $\text{lr}_{\text{sparse}} = 0.05$), bfloat16 training, and FP8 inference. Convergence time scales with model size: 30–90M-parameter models converge on roughly 14 days of data, while 500M–2B models require about 60 days of data.
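
A minimal sketch of the two-learning-rate Adagrad setup described above; the rule used to separate dense backbone parameters from sparse embedding parameters is a hypothetical naming convention:

```python
import torch

def build_optimizer(model: torch.nn.Module):
    """Adagrad with separate learning rates for dense backbone vs. sparse embedding parameters.

    Splitting by the substring "embedding" in the parameter name is an illustrative
    convention, not the paper's actual parameter grouping.
    """
    dense, sparse = [], []
    for name, p in model.named_parameters():
        (sparse if "embedding" in name else dense).append(p)
    return torch.optim.Adagrad([
        {"params": dense,  "lr": 0.01},   # lr_dense
        {"params": sparse, "lr": 0.05},   # lr_sparse
    ])
```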

Data and Hardware Infrastructure

  • Data: Online deployments used 400M daily Douyin E-Commerce samples (two years), 300M/day Ads, and 17B/day Live-Streaming. Input features comprise sparse IDs, numeric values, and varied user behavior sequences.
  • Hardware: Training utilized 64–256 A100 GPUs. Token-parallel sharding and custom FP8/Fused-MoE operators were essential for both dense and sparse model variants.

3. Empirical Performance and Evaluations

3.1 Offline Metrics

TokenMixer-Large achieves significant AUC improvements at various scales:

| Model Variant | ΔAUC (CTCVR, E-Commerce, ~500M) | FLOPs (T) |
|---|---|---|
| TokenMixer-Large 500M | +0.94% | 4.2 |
| RankMixer | +0.84% | — |
| AutoInt | +0.75% | — |
| Wukong | +0.76% | — |
| HiFormer | +0.44% | — |
| DCNv2 | +0.49% | — |

Scaling up:

  • 4B dense: +1.14% ΔAUC, 29.8T FLOPs
  • 7B dense: +1.20% ΔAUC, 49.0T FLOPs
  • 4.6B S-P MoE (1:2 sparsity): +1.14% ΔAUC, 15.1T FLOPs, 2.3B active parameters

Ablations (all ~500M params):

  • Removing mixing-and-reverting: –0.27% AUC
  • Removing standard residuals: –0.15%
  • Omitting interval residual plus aux-loss: –0.04%
  • Replacing pertoken SwiGLU with global SwiGLU: –0.21%
  • Switching to standard sparse MoE: –0.10%

3.2 Online A/B Testing

Live deployments replacing RankMixer with TokenMixer-Large yielded:

| Scenario | Baseline Size | TM-Large Size | ΔAUC | Business Metric Gain |
|---|---|---|---|---|
| Feed-Ads | 1B | 7B | +0.35% | ADSS +2.0% |
| E-Commerce | 150M | 4B | +0.51% | Order count +1.66%, GMV +2.98% |
| Live-Streaming | 500M | 2B | +0.70% | Total payment amount +1.40% |
| Douyin App Metrics | — | — | — | Active Days +0.29%, Sessions +1.08%, Likes +2.39%, Finishes +1.99%, Comments +0.79% |

All gains were statistically significant across user-activity segments.

4. Complexity, Efficiency, and Practical Implementation

4.1 FLOPs and Memory

At 500M parameters, TokenMixer-Large requires only 4.2T FLOPs per 2048-sample batch—substantially lower than dense MLPs (~125T) or DCNv2 (~126T), and competitive with Transformer-style models (AutoInt 138T, HiFormer 28.8T). S-P MoE achieves further FLOP reductions by ~50% through sparsity. GPU memory for a 7B model peaks at ~40GB (bfloat16, 4-way token parallelism).

FP8 quantization, fused MoEGroupedGemm kernels, and Token-Parallel sharding enable 1.7× inference speed-ups and efficient multi-GPU scaling.

4.2 Ablation Findings and Model Utilization

  • RMSNorm Pre-Norm is empirically superior (vs. Post-Norm or Sandwich-Norm).
  • Gate-Value Scaling ($\alpha \propto 1/\text{sparsity}$) is critical; omission costs 0.02–0.05% ΔAUC.
  • Down-Matrix Small Init ($\times 0.01$ on FC_down) provides +0.03% ΔAUC and better convergence than Xavier initialization (see the sketch after this list).
  • “Pure-Model” Composition: Removing small, fragmented operators (e.g., DCN, LHUC, DHEN) at large scale (>500M) yields identical performance, with Model-FLOPs-Utilization rising from ~30% to ~60%.
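
A minimal sketch of the Down-Matrix Small Init, assuming down-projection Linear layers can be identified by a naming convention (hypothetical; the paper only specifies the ×0.01 scaling on FC_down):

```python
import torch
import torch.nn as nn

def init_down_small(module: nn.Module, scale: float = 0.01):
    """Xavier-init all Linear layers, then shrink down-projection weights by `scale`."""
    with torch.no_grad():
        for name, m in module.named_modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if name.endswith("down"):        # matching by name is an illustrative convention
                    m.weight.mul_(scale)         # Down-Matrix Small Init (×0.01)
```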

5. Deployment Guidance and Best Practices

TokenMixer-Large demonstrates several deployment best practices for industrial-scale recommendation backbones:

  • Prioritize semantic alignment in residual design via symmetric mixing/reverting blocks.
  • Combine interval residuals with lightweight auxiliary losses to stabilize deep models.
  • Utilize “first enlarge, then sparse” PerToken-MoE with gate scaling and shared experts for scalable, efficient parameterization.
  • Pursue a pure-model approach for maximizing hardware utilization as model size grows.
  • Leverage FP8 quantization, fused kernels, and token-level parallelization for efficient inference and training.

6. Limitations, Open Problems, and Future Directions

TokenMixer-Large’s Sparse-PerToken MoE experiences load imbalance at higher sparsity ratios (e.g., >1:8); more effective router losses may address this. Training extremely large models (>15B) is data-intensive, demanding substantially longer logging periods (weeks to months). Extending TokenMixer-Large to multimodal or sequential recommendation remains an open direction whose empirical outcomes are as yet unexplored (Jiang et al., 6 Feb 2026).
