
RankMixer: Scaling Up Ranking Models in Industrial Recommenders (2507.15551v3)

Published 21 Jul 2025 in cs.IR

Abstract: Recent progress on LLMs has spurred interest in scaling up recommendation systems, yet two practical obstacles remain. First, training and serving costs for industrial recommenders must respect strict latency bounds and high QPS demands. Second, most human-designed feature-crossing modules in ranking models were inherited from the CPU era and fail to exploit modern GPUs, resulting in low Model Flops Utilization (MFU) and poor scalability. We introduce RankMixer, a hardware-aware model design built around a unified and scalable feature-interaction architecture. RankMixer retains the transformer's high parallelism while replacing quadratic self-attention with a multi-head token mixing module for higher efficiency. In addition, RankMixer models both distinct feature subspaces and cross-feature-space interactions with per-token FFNs. We further extend it to one billion parameters with a Sparse-MoE variant for higher ROI, adopting a dynamic routing strategy to address inadequate and imbalanced expert training. Experiments show RankMixer's superior scaling abilities on a trillion-scale production dataset. By replacing previously diverse handcrafted low-MFU modules with RankMixer, we boost the model MFU from 4.5% to 45% and scale our ranking model parameters by 100x while maintaining roughly the same inference latency. We verify RankMixer's universality with online A/B tests across two core application scenarios (Recommendation and Advertisement). Finally, we launch the 1B dense-parameter RankMixer for full-traffic serving without increasing the serving cost, improving user active days by 0.3% and total in-app usage duration by 1.08%.

Summary

  • The paper introduces RankMixer, an architecture that combines multi-head token mixing with per-token feed-forward networks to model feature interactions efficiently.
  • It employs a sparse Mixture-of-Experts framework with dynamic routing, achieving significant AUC gains and improved scalability at minimal computational cost.
  • Experimental evaluations demonstrate that RankMixer can increase parameter capacity by 100x while maintaining comparable inference latency, confirming its industrial applicability.

RankMixer: Scaling Up Ranking Models in Industrial Recommenders

Introduction

The field of recommendation systems has seen significant advancements, but scaling them to meet industrial demands poses challenges—especially regarding serving costs, latency, and utilization of modern hardware like GPUs. The "RankMixer" paper introduces a novel architecture aimed at overcoming these practical hurdles by leveraging efficient feature interactions and scaling mechanisms. RankMixer integrates a hardware-aware design philosophy to optimize Model Flops Utilization (MFU) and improve scalability.

RankMixer Architecture

RankMixer’s architecture is founded on two modules: Multi-head Token Mixing and Per-token Feed-Forward Networks (PFFNs).

Multi-head Token Mixing

This module enhances feature interactions efficiently. Tokens are split into several heads, letting information mix across different feature subspaces without costly self-attention (Figure 1).

Figure 1: The architecture of a RankMixer block showcasing Multi-head Token Mixing and SMoE based Per-token FFN.
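
As a concrete illustration, here is a minimal PyTorch sketch of such a parameter-free split-and-regroup step; the tensor shapes and the exact regrouping order are assumptions for exposition, not the paper's reference implementation.

```python
import torch

def multihead_token_mixing(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Split each token into `num_heads` sub-vectors and regroup sub-vector h
    of every token into a new token, mixing information across tokens
    without any attention weights."""
    b, t, d = x.shape
    assert d % num_heads == 0, "token dim must be divisible by num_heads"
    # (b, t, h, d/h) -> (b, h, t, d/h): group slices by head instead of by token
    x = x.view(b, t, num_heads, d // num_heads).transpose(1, 2)
    # concatenate every token's slice for the same head into one new token
    return x.reshape(b, num_heads, t * (d // num_heads))

# e.g. 8 feature tokens of dim 64 -> 4 mixed tokens of dim 128
mixed = multihead_token_mixing(torch.randn(2, 8, 64), num_heads=4)
```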

Per-token Feed-Forward Networks

PFFNs give each token its own parameters, which better accommodates the diversity of recommendation features. Unlike the shared FFNs of traditional models, per-token FFNs prevent dominant features from crowding out others and preserve modeling capacity.
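
A minimal sketch of the idea, assuming a per-token stack of weight matrices applied via batched einsum; the two-layer shape and initialization are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class PerTokenFFN(nn.Module):
    """One independent two-layer FFN per token position: no weight sharing,
    so each feature subspace keeps its own modeling capacity."""

    def __init__(self, num_tokens: int, dim: int, hidden: int):
        super().__init__()
        # a stack of per-token weight matrices, applied with a batched einsum
        self.w1 = nn.Parameter(torch.randn(num_tokens, dim, hidden) * dim ** -0.5)
        self.w2 = nn.Parameter(torch.randn(num_tokens, hidden, dim) * hidden ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); token t is processed only by its own weights
        h = torch.relu(torch.einsum('btd,tdh->bth', x, self.w1))
        return torch.einsum('bth,thd->btd', h, self.w2)
```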

Sparse Mixture-of-Experts (MoE)

RankMixer employs a Sparse-MoE variant to scale model capacity efficiently. Utilizing a dynamic routing strategy, it selectively activates subsets of experts per token, thus enlarging model capacity with minimal computational cost.
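
The sketch below illustrates the generic top-k routed Sparse-MoE pattern this builds on; the router design and expert widths here are assumptions, and the loop evaluates every expert densely purely for readability (a production kernel would gather only the routed tokens).

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Route each token to its top-k experts and gate-weight their outputs."""

    def __init__(self, dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = self.router(x).softmax(dim=-1)        # (batch, tokens, experts)
        topv, topi = gates.topk(self.k, dim=-1)       # keep top-k gates per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # gate weight is zero for tokens not routed to expert e
            w = torch.where(topi == e, topv, torch.zeros_like(topv)).sum(-1, keepdim=True)
            out = out + w * expert(x)                 # dense here for clarity only
        return out
```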

Dynamic Routing Techniques

Combining ReLU routing with a dense-training, sparse-inference scheme significantly mitigates expert imbalance and starvation, ensuring that all experts are adequately trained and utilized (Figure 2).

Figure 2: AUC performance of RankMixer variants under decreasing activation ratios, showing that accuracy is maintained.
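
One plausible reading of ReLU routing combined with dense-training, sparse-inference is sketched below; treat the training/eval split and layer shapes as assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class ReLURoutedMoE(nn.Module):
    """Experts gated by ReLU: an expert fires only where its gate is positive.
    Training runs every expert densely so none of them starves; at inference
    only experts with a positive gate for some token need to be evaluated."""

    def __init__(self, dim: int, num_experts: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.relu(self.gate(x))                  # (batch, tokens, experts)
        if self.training:
            # dense training: every expert sees every token
            return sum(g[..., e:e + 1] * ex(x) for e, ex in enumerate(self.experts))
        out = torch.zeros_like(x)
        for e, ex in enumerate(self.experts):         # sparse inference
            active = g[..., e] > 0                    # tokens routed to expert e
            if active.any():
                out[active] += g[..., e][active].unsqueeze(-1) * ex(x[active])
        return out
```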

Scaling Laws and Efficiency

RankMixer exhibits advantageous scaling properties, validated by measuring accuracy gains against parameter counts and computational cost.

Scaling Behavior

RankMixer demonstrates a steeper scaling curve than comparable models. Its design balances parameter growth against inference cost, optimizing latency without sacrificing model effectiveness (Figure 3).

Figure 3: Scaling laws relating AUC gain to parameters/FLOPs on a logarithmic scale.
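
For intuition, a log-linear scaling curve of this kind can be fitted as below; the parameter counts and AUC gains are made-up placeholders for the procedure, not the paper's measurements.

```python
import numpy as np

# Placeholder values chosen only to illustrate the fit; NOT the paper's data.
params = np.array([1e7, 1e8, 1e9])        # model sizes
auc_gain = np.array([0.10, 0.25, 0.40])   # hypothetical AUC gains (%)

# Fit auc_gain ~= a * log10(params) + b, i.e. a straight line on a log axis
a, b = np.polyfit(np.log10(params), auc_gain, deg=1)
print(f"AUC gain per 10x parameters: {a:.3f}")  # slope of the scaling curve
```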

Experimental Evaluation

RankMixer was evaluated extensively in production, with performance assessed across recommendation, advertising, and search applications.

Offline and Online Metrics

RankMixer showed substantial AUC gains in offline settings and clear user-engagement improvements in online tests, including a 0.3% increase in user active days and a 1.08% increase in total in-app usage duration.

Efficiency Measures

RankMixer raised Model Flops Utilization from 4.5% to 45%, allowing parameter counts to grow by 100x while keeping inference latency comparable to previous baselines (Figure 4).

Figure 4: Activated expert ratio demonstrating dynamic token-based activation within RankMixer.
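
For reference, MFU is simply achieved FLOP throughput divided by the hardware's peak; the model size and throughput in this sketch are illustrative numbers, not the paper's, while 312 TFLOP/s is a typical A100 BF16 peak.

```python
def mfu(flops_per_example: float, examples_per_sec: float, peak_flops: float) -> float:
    """MFU = achieved FLOP throughput / hardware peak FLOP throughput."""
    return flops_per_example * examples_per_sec / peak_flops

# Illustrative only: a 1 GFLOP-per-example model served at 100k examples/s
# on a 312 TFLOP/s GPU runs at roughly 32% MFU.
print(f"{mfu(1e9, 1e5, 312e12):.1%}")   # -> 32.1%
```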

Conclusion

RankMixer presents a promising solution to the challenges faced in scaling industrial recommendation systems. By focusing on hardware-aligned architecture and innovative scaling strategies, it resolves many inefficiencies prevalent in traditional models. Its integration across multiple use-cases demonstrates versatility and substantial performance improvements, paving the way for more efficient and effective recommendation systems tailored to modern hardware architectures.