Achieving >1:8 Sparsity Without Performance Loss in Sparse-Pertoken MoE

Determine whether the Sparse-Pertoken Mixture-of-Experts (S-P MoE) component in the TokenMixer-Large architecture can achieve a sparsity ratio greater than 1:8 without performance loss.

Background

The paper introduces Sparse-Pertoken MoE to reduce training and inference costs while preserving performance. Empirically, the authors report a near-zero performance drop at a sparsity ratio of 1:2 and a slight decrease at 1:4, which led them to deploy the 1:2 setting online for the best ROI.
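
To make the mechanism concrete, the sketch below implements a generic per-token sparse MoE layer with SwiGLU experts in PyTorch. It is not the paper's S-P MoE design: the softmax top-k router, the expert shape, and the reading of a 1:k sparsity ratio as "each token activates num_experts / k experts" are assumptions made purely for illustration.

```python
# Minimal sketch of a per-token sparse MoE layer with SwiGLU experts.
# Hypothetical illustration only -- not the paper's S-P MoE design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """One expert: SwiGLU feed-forward block (gate, up, down projections)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class PerTokenSparseMoE(nn.Module):
    """Routes each token to num_experts / sparsity experts (assumed meaning of 1:k)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, sparsity: int):
        super().__init__()
        assert num_experts % sparsity == 0, "sparsity must divide num_experts"
        self.top_k = num_experts // sparsity  # 1:2 -> half of the experts per token
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(d_model, d_ff) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [num_tokens, d_model]
        probs = self.router(x).softmax(dim=-1)           # [num_tokens, num_experts]
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Naive loop for clarity; real kernels batch tokens per expert (grouped GEMMs).
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


moe = PerTokenSparseMoE(d_model=256, d_ff=1024, num_experts=8, sparsity=2)
y = moe(torch.randn(32, 256))  # 32 tokens, each routed to 4 of the 8 experts
```

A production implementation would replace the per-expert Python loop with grouped or batched GEMM kernels; the loop here only makes the per-token routing explicit.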

They note that increasing sparsity shrinks the GEMM component of the SwiGLU computation, suggesting that higher sparsity may require larger models or a higher GEMM ratio. The feasibility of pushing sparsity beyond 1:8 without performance degradation remains unresolved.
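
As a rough illustration of that observation, the snippet below estimates what fraction of a layer's per-token FLOPs the SwiGLU expert GEMMs account for as sparsity grows, relative to a fixed non-expert cost (modeled here as a generic attention block). All shapes and FLOP formulas are back-of-the-envelope assumptions, not numbers from the paper.

```python
# Back-of-the-envelope FLOP accounting (illustrative assumptions, not from the paper):
# only the SwiGLU expert GEMMs scale with the number of active experts, while the
# non-expert cost per token (modeled as a standard attention block) stays fixed.
def swiglu_gemm_flops(d_model: int, d_ff: int) -> int:
    # gate and up projections (d_model x d_ff each) plus down projection (d_ff x d_model)
    return 3 * d_model * d_ff


def attention_flops(d_model: int, seq_len: int) -> int:
    # QKV + output projections plus score/context GEMMs, per token
    return 4 * d_model * d_model + 2 * seq_len * d_model


d_model, d_ff, seq_len, num_experts = 1024, 4096, 512, 8
attn = attention_flops(d_model, seq_len)
for sparsity in (1, 2, 4, 8):
    active_experts = num_experts // sparsity
    gemm = active_experts * swiglu_gemm_flops(d_model, d_ff)
    share = gemm / (gemm + attn)
    print(f"1:{sparsity} sparsity -> SwiGLU GEMM share of per-token FLOPs ~ {share:.0%}")
```

Under these illustrative assumptions the GEMM share falls from roughly 95% at 1:1 to about 71% at 1:8, which captures the intuition that higher sparsity may call for larger models or a higher GEMM ratio to stay compute-bound.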

References

Whether we can achieve sparsity greater than 1:8 while maintaining performance without loss is still under exploration.

TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders (2602.06563 - Jiang et al., 6 Feb 2026), Appendix, Section "First Enlarge Then Sparse"