
OneRec Technical Report (2506.13695v2)

Published 16 Jun 2025 in cs.IR

Abstract: Recommender systems have been widely used in various large-scale user-oriented platforms for many years. However, compared to the rapid developments in the AI community, recommendation systems have not achieved a breakthrough in recent years. For instance, they still rely on a multi-stage cascaded architecture rather than an end-to-end approach, leading to computational fragmentation and optimization inconsistencies, and hindering the effective application of key breakthrough technologies from the AI community in recommendation scenarios. To address these issues, we propose OneRec, which reshapes the recommendation system through an end-to-end generative approach and achieves promising results. Firstly, we have enhanced the computational FLOPs of the current recommendation model by 10 $\times$ and have identified the scaling laws for recommendations within certain boundaries. Secondly, reinforcement learning techniques, previously difficult to apply for optimizing recommendations, show significant potential in this framework. Lastly, through infrastructure optimizations, we have achieved 23.7% and 28.8% Model FLOPs Utilization (MFU) on flagship GPUs during training and inference, respectively, aligning closely with the LLM community. This architecture significantly reduces communication and storage overhead, resulting in operating expense that is only 10.6% of traditional recommendation pipelines. Deployed in Kuaishou/Kuaishou Lite APP, it handles 25% of total queries per second, enhancing overall App Stay Time by 0.54% and 1.24%, respectively. Additionally, we have observed significant increases in metrics such as 7-day Lifetime, which is a crucial indicator of recommendation experience. We also provide practical lessons and insights derived from developing, optimizing, and maintaining a production-scale recommendation system with significant real-world impact.


Summary

  • The paper presents an end-to-end generative recommendation system that unifies retrieval and ranking, overcoming cascaded architectures with improved MFU and reduced OPEX.
  • It employs collaborative-aware multimodal tokenization using QFormer and RQ-Kmeans to generate robust semantic IDs, enabling efficient handling of billion-scale item spaces.
  • The architecture integrates multi-scale user modeling, a Mixture-of-Experts decoder, and RL-based reward optimization to achieve significant gains in both offline and online performance.

OneRec: End-to-End Generative Recommendation System Architecture and Scaling

Motivation and Systemic Limitations of Cascaded Recommendation

The OneRec Technical Report addresses fundamental inefficiencies in traditional multi-stage recommender system architectures, which rely on cascaded retrieval, pre-ranking, and ranking modules. These legacy systems suffer from fragmented compute, low Model FLOPs Utilization (MFU), and optimization inconsistencies due to conflicting objectives and cross-stage modeling. The paper demonstrates that, in production environments such as Kuaishou, over half of serving resources are consumed by communication and storage rather than high-precision computation, with MFU values for ranking models at only 4.6% (training) and 11.2% (inference), far below the roughly 40% observed for LLMs on H100 GPUs (Figure 1).

Figure 1: Online performance, FLOPs, OPEX, and MFU comparison.

The architectural gap has also hindered the adoption of recent advances in scaling laws, RL, and multimodal modeling from the broader AI community. OneRec proposes a unified, end-to-end generative framework that integrates retrieval and ranking, enabling direct optimization for final objectives and substantially improving computational efficiency (Figure 2).

Figure 2: Comparison between a cascaded recommender system and the OneRec encoder-decoder architecture.

Tokenization: Collaborative-Aware Multimodal Semantic IDs

OneRec introduces a scalable tokenization pipeline for short videos, leveraging collaborative-aware multimodal representations. Unlike prior approaches that generate semantic IDs solely from content features, OneRec aligns multimodal content (caption, tag, ASR, OCR, cover image, sampled frames) with collaborative signals using QFormer compression and an item-to-item contrastive loss. The tokenization employs RQ-Kmeans for residual quantization, producing coarse-to-fine semantic IDs with improved reconstruction quality, codebook utilization, and token distribution entropy compared to RQ-VAE (Figure 3).

Figure 3: Tokenizer implementation: collaborative multimodal alignment and RQ-Kmeans tokenization.
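The item-to-item contrastive alignment can be pictured with a standard InfoNCE-style loss: each item's QFormer-compressed multimodal embedding is pulled toward the embedding of a collaboratively related item, with other items in the batch acting as negatives. The sketch below is illustrative only; the function name, temperature, and in-batch-negative scheme are assumptions rather than details from the report.

```python
# Hypothetical sketch of an item-to-item contrastive alignment loss (InfoNCE style).
# Each anchor item is paired with a collaboratively related item; other items in the
# batch act as negatives. Details differ from the actual OneRec loss.
import torch
import torch.nn.functional as F

def item_to_item_contrastive_loss(anchor_emb: torch.Tensor,
                                  positive_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """anchor_emb, positive_emb: [batch, dim] multimodal embeddings after QFormer compression."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature                      # [batch, batch] similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # diagonal pairs are the positives
    return F.cross_entropy(logits, targets)
```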

This approach enables knowledge transfer among similar items and robust generalization to new items, supporting billion-scale item spaces with a fixed vocabulary.
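The residual quantization step can be sketched as iterated K-means on quantization residuals, with each level emitting one token of the coarse-to-fine semantic ID. The codebook size and depth below are placeholders, not the values used in OneRec.

```python
# Minimal sketch of residual K-means quantization (RQ-Kmeans) for semantic IDs.
# Each level clusters the residual left by the previous level, yielding
# coarse-to-fine code indices per item. Codebook size and depth are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def rq_kmeans(item_embs: np.ndarray, num_levels: int = 3, codebook_size: int = 8192):
    residual = item_embs.copy()
    codebooks, codes = [], []
    for _ in range(num_levels):
        km = KMeans(n_clusters=codebook_size, n_init=10).fit(residual)
        idx = km.predict(residual)                        # semantic-ID token at this level
        codebooks.append(km.cluster_centers_)
        codes.append(idx)
        residual = residual - km.cluster_centers_[idx]    # quantization residual for next level
    # codes[:, l] is the level-l token of each item; a row is the item's semantic ID tuple
    return codebooks, np.stack(codes, axis=1)
```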

Encoder-Decoder Architecture and Multi-Scale User Modeling

The encoder integrates multi-scale user behavior via four pathways: static, short-term, positive-feedback, and lifelong. Lifelong sequences (up to 100,000 interactions) are hierarchically compressed using K-means and QFormer, enabling efficient modeling of ultra-long user histories. The encoder concatenates all pathway outputs and processes them through transformer layers with RMSNorm.
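The K-means stage of the lifelong-pathway compression can be sketched as clustering the history's item embeddings into a fixed number of centroids, which the QFormer (omitted here) would further compress into query tokens. The cluster count below is an assumption.

```python
# Hedged sketch of the K-means stage of lifelong-sequence compression: a user's
# ultra-long history is reduced to a fixed number of cluster centroids, which a
# QFormer (not shown) would further compress. Sizes are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def compress_lifelong_history(history_embs: np.ndarray, num_clusters: int = 128) -> np.ndarray:
    """history_embs: [seq_len, dim] embeddings of up to ~100k interactions."""
    k = min(num_clusters, len(history_embs))
    km = KMeans(n_clusters=k, n_init=10).fit(history_embs)
    return km.cluster_centers_                            # [k, dim] compressed representation
```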

The decoder adopts a point-wise generation paradigm, using semantic IDs as targets. It incorporates Mixture-of-Experts (MoE) feed-forward networks with top-$k$ routing and loss-free load balancing, scaling model capacity without gradient interference. Training uses cross-entropy loss for next-token prediction on semantic IDs (Figure 4).

Figure 4: Encoder-decoder architecture integrating multi-scale user features and MoE decoder.
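The MoE feed-forward block can be sketched as a router that sends each token to its top-$k$ experts and mixes their outputs by normalized gate weights. Expert count, dimensions, and $k$ below are placeholders, and the loss-free load-balancing mechanism is omitted.

```python
# Illustrative sketch of a Mixture-of-Experts feed-forward layer with top-k routing.
# Expert count, k, and layer sizes are assumptions; load balancing is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: [tokens, d_model]
        gate = F.softmax(self.router(x), dim=-1)           # routing probabilities per token
        weights, idx = gate.topk(self.top_k, dim=-1)       # each token picks its top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```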

Reinforcement Learning and Reward System Design

OneRec's reward system comprises three components: Preference Reward (P-Score), Format Reward, and Industrial Reward. The P-Score is a neural fusion of multiple engagement objectives (clicks, likes, comments, watch time), learned via multi-tower MLPs for personalized preference alignment. Format Reward regularizes the legality of generated semantic ID sequences, mitigating the squeezing effect observed when RL increases the probability of illegal outputs. Industrial Reward enables targeted optimization for business constraints, such as suppressing viral content farms (Figure 5).

Figure 5: Reward system: preference, format, and industrial alignment modules.
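The Preference Reward can be pictured as a multi-tower model: one tower per engagement objective, with a learned fusion head producing the scalar P-Score. The tower widths, objective list, and fusion form below are assumptions for illustration.

```python
# Hypothetical sketch of a multi-tower preference reward (P-Score): per-objective
# towers predict engagement signals, and a fusion head combines them into one
# scalar reward. Tower and fusion details are assumptions.
import torch
import torch.nn as nn

class PScoreModel(nn.Module):
    def __init__(self, d_in=512, objectives=("click", "like", "comment", "watch_time")):
        super().__init__()
        self.towers = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, 1))
            for name in objectives
        })
        self.fusion = nn.Linear(len(objectives), 1)        # learned fusion into a single P-Score

    def forward(self, user_item_feat):                     # user_item_feat: [batch, d_in]
        preds = torch.cat([tower(user_item_feat) for tower in self.towers.values()], dim=-1)
        return self.fusion(torch.sigmoid(preds)).squeeze(-1)   # scalar preference reward per pair
```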

The RL framework employs Early Clipped GRPO (ECPO), a modification of Group Relative Policy Optimization, to stabilize training by clipping large policy ratios for negative advantages and thereby preventing gradient explosion. RL and supervised fine-tuning are performed jointly, with RL samples generated via external inference services and rewards computed on the fly (Figure 6).

Figure 6: ECPO illustration: early clipping for negative advantages stabilizes policy updates.
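ECPO can be sketched as a GRPO/PPO-style clipped objective with an additional early clip that caps the importance ratio on negative-advantage samples, so that sequences with exploding ratios cannot blow up the gradient. The clipping thresholds below are hypothetical, not values from the paper.

```python
# Hedged sketch of an ECPO-style objective: standard PPO/GRPO clipping, plus an
# extra "early" clip that caps the importance ratio when the advantage is negative.
# eps and eps_neg are assumed hyperparameters, not values from the paper.
import torch

def ecpo_loss(logp_new, logp_old, advantages, eps=0.2, eps_neg=0.5):
    ratio = torch.exp(logp_new - logp_old)
    # Early clip: bound the ratio for negative-advantage samples before the PPO term,
    # preventing very large ratios from dominating the gradient.
    ratio = torch.where(advantages < 0, ratio.clamp(max=1.0 + eps_neg), ratio)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```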

Figure 7: Squeezing effect: RL can compress probability mass into illegal tokens without format regularization.

Training Infrastructure and Scaling Laws

OneRec is trained on 90 servers with 8 flagship GPUs each, using NVLink and RDMA for high-bandwidth communication. Embedding acceleration is achieved via GPU-based parameter servers and unified embedding tables. Training employs data parallelism, ZeRO-1, gradient accumulation, and mixed precision (BFloat16). Compilation optimizations for attention networks further reduce overhead.
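At the training-loop level, the BFloat16 mixed precision and gradient accumulation described above can be sketched as follows; ZeRO-1 optimizer-state sharding and the GPU parameter server for embeddings are outside this minimal sketch, and the `model(**batch)` interface returning a loss is an assumption.

```python
# Minimal sketch of BFloat16 mixed-precision training with gradient accumulation.
# ZeRO-1 sharding and custom embedding servers are intentionally omitted.
import torch

def train_steps(model, optimizer, data_loader, accum_steps=8):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(data_loader):
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(**batch) / accum_steps   # assumed: model returns next-token CE loss
        loss.backward()                           # accumulate gradients across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```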

The system achieves 23.7% MFU in training and 28.8% in inference, a 5.2$\times$ and 2.6$\times$ improvement over legacy models, with OPEX reduced to 10.6% of traditional pipelines.

Empirical Scaling: Model, Feature, Codebook, and Inference

Parameter scaling experiments show that larger models (up to 2.633B parameters) achieve lower loss and improved convergence. Feature scaling demonstrates that comprehensive feature engineering yields substantial improvements in all engagement metrics and preference scores (Figure 8).

Figure 8: Loss curves for different OneRec model sizes, showing scaling behavior.


Figure 9: Training loss and performance improvement with additional features.

Codebook scaling (from 8K to 32K) improves playtime and interaction metrics, while inference scaling (Pass@K from 8 to 512) yields consistent gains, with diminishing returns beyond K=512. Semantic identifier input representation matches sparse embedding performance at scale, with advantages in parameter efficiency, communication, and sequence capacity (Figure 10).

Figure 10: Training loss and performance comparison: semantic identifier vs. sparse embedding input.
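For reference, Pass@K in this setting can be read as: generate K candidates per request and count a hit if any candidate matches a ground-truth positive. A minimal, assumption-laden version:

```python
# Simple sketch of a Pass@K metric for the inference-scaling experiments: a request
# "passes" if any of its K generated items is among the ground-truth positives.
# Purely illustrative; the paper's exact evaluation protocol may differ.
def pass_at_k(generated_items, targets):
    """generated_items[i]: list of K candidate item IDs for request i; targets[i]: set of positives."""
    hits = sum(1 for cands, pos in zip(generated_items, targets) if any(c in pos for c in cands))
    return hits / max(len(targets), 1)
```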

RL Ablations: Sampling, Search Space, Strategy, and Reference Model

RL increases sampling efficiency, especially at low Pass@K, and expanding the search space (group size) improves performance up to a point. Beam search outperforms top-$k$/top-$p$ sampling due to the prefix tree structure of semantic IDs. On-policy reference models yield better offline reward evaluation, but online improvements are limited by reward definition.
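Because semantic IDs form a prefix tree, beam search can be constrained so that only legal continuations are expanded at each level. The sketch below assumes a `log_probs(prefix)` callback standing in for the decoder's per-step distribution; it is illustrative, not the production decoding code.

```python
# Hedged sketch of beam search constrained by a prefix trie of valid semantic IDs:
# only tokens that extend a legal prefix are scored, so every hypothesis
# corresponds to a real item prefix.
import math

def build_trie(valid_ids):                       # valid_ids: iterable of semantic-ID token tuples
    trie = {}
    for seq in valid_ids:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def trie_beam_search(log_probs, trie, beam_size=8):
    """log_probs(prefix) -> {token: logp} for the next level given a prefix (assumed interface)."""
    beams = [((), 0.0, trie)]
    while beams and beams[0][2]:                 # stop once prefixes reach leaf nodes
        candidates = []
        for prefix, score, node in beams:
            step = log_probs(prefix)
            for tok, child in node.items():      # expand only legal continuations
                candidates.append((prefix + (tok,), score + step.get(tok, -math.inf), child))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return [(prefix, score) for prefix, score, _ in beams]
```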

Format reward integration restores legality rates to >95% and improves online metrics (+0.13% App Stay Time, +0.30% Watch Time). Industrial Reward (SIR) reduces viral content exposure by 9.59% without degrading core metrics (Figure 11).

Figure 11: Format reward impact: sampling strategy and legality rates.

Tokenization and Representation Analysis

RQ-Kmeans outperforms RQ-VAE in reconstruction loss, codebook utilization, and entropy, supporting stable and generalizable tokenization. Qualitative analyses show that collaborative-aware multimodal representations retrieve videos with both semantic and behavioral relevance, overcoming limitations of unimodal approaches (Figure 12).

Figure 12: Top-ranked video retrieval using different representation types.

Figure 13: Coarse-to-fine semantic identifiers generated by RQ-Kmeans ($L_t=5$).

Online A/B Testing and Production Deployment

OneRec was deployed in Kuaishou and Kuaishou Lite, serving 25% of total QPS. In 5% traffic experimental groups, OneRec with reward model selection improved App Stay Time by +0.54% and +1.24%, and LT7 (7-day lifetime) by +0.05% and +0.08%, respectively. These gains are statistically significant at scale. In Local Life Service, OneRec achieved 21.01% GMV growth and >17% increases in order volume, buyer numbers, and new buyer acquisition, and now handles 100% of QPS in that scenario.

Inference is performed on NVIDIA L20 GPUs with TensorRT optimization, custom plugins, batching, and MPS, achieving a 5$\times$ throughput improvement and 28.8% MFU.

Implications, Limitations, and Future Directions

OneRec demonstrates that end-to-end generative architectures can surpass traditional multi-stage recommender systems in both effectiveness and efficiency, with strong scaling laws, RL integration, and robust tokenization. The system achieves high MFU and low OPEX, supporting large-scale production deployment.

However, inference stage scaling is not yet fully realized, and multimodal integration with LLMs/VLMs remains an open direction. The reward system design is still rudimentary, and further advances in reward modeling are expected to drive future improvements in recommendation quality and system consistency.

Conclusion

OneRec establishes a new paradigm for recommender systems, integrating retrieval and ranking in a unified generative framework with collaborative-aware multimodal tokenization, scalable encoder-decoder architecture, and RL-based reward alignment. The system achieves strong empirical results in both offline and online metrics, with significant improvements in computational efficiency and business impact. Future work should focus on enhancing inference reasoning, multimodal integration, and reward system sophistication to further advance the state of large-scale recommendation.
