OneRec: Unified Generative Recommender
- OneRec is a family of generative recommender systems that integrates item retrieval, ranking, and optimization into a single autoregressive Transformer model.
- It employs advanced tokenization, supervised and on-policy reinforcement learning, and quantization techniques to achieve high computational and operational efficiency.
- Deployed on digital content platforms, OneRec significantly improves user engagement and reduces operational costs compared to legacy multi-stage pipelines.
OneRec refers to a family of industrial-scale, end-to-end generative recommender systems that unify item retrieval, ranking, and optimization within a single large autoregressive Transformer architecture. Originally developed for high-frequency digital content platforms, OneRec and its successors restructure recommendation workflows by aligning design, optimization, and deployment paradigms with those of contemporary large language models (LLMs). This approach replaces the legacy multi-stage “recall–pre-rank–rank” pipeline with a monolithic sequence-generation model that is trained with both supervised and on-policy reinforcement learning and tightly co-optimized for modern hardware stacks (Zhou et al., 16 Jun 2025, Deng et al., 26 Feb 2025, Zhou et al., 28 Aug 2025).
1. Architectural Paradigm: End-to-End Generative Recommendation
OneRec models recast the recommendation problem as an autoregressive conditional generation task, mapping a user’s historical interaction sequence (augmented by multimodal, hierarchical user features) directly to a slate of recommended item “semantic IDs.” The core stack consists of:
- Tokenizer: Each item is represented as a compact sequence of quantized “itemic” or “semantic ID” tokens, generated using residual quantization (RQ-Kmeans) over multimodal item embeddings. This construction enables efficient mapping between real-world items and the model’s input/output streams (Zhou et al., 16 Jun 2025, Zhou et al., 31 Dec 2025).
- Transformer Backbone: Early versions employ a deep encoder–decoder Transformer with optional sparse Mixture-of-Experts (MoE) layers for parameter-efficient scaling. Later versions (OneRec-V2) adopt a Lazy Decoder-Only architecture, using a lightweight context processor and dispensing with full sequence encoding, to maximize computational efficiency at training and inference (Zhou et al., 28 Aug 2025).
- Reward Modeling and RL: The system is trained with a hybrid of supervised cross-entropy (next-token prediction) and on-policy RL (e.g., Early-Clipped GRPO, Gradient-Bounded Policy Optimization), aligning generation with industrial user feedback targets such as watch-time, app stay time, or multi-objective preferences (Deng et al., 26 Feb 2025, Zhou et al., 28 Aug 2025).
A high-level sketch:
    user history + profile → context processor → decoder-only Transformer
    ↓
    multi-stage RQ-Kmeans tokenizer ↔ {itemic token sequence}
    ↓
    next-item(s) prediction
2. Model Scaling Laws and Infrastructure Efficiency
OneRec exhibits scaling behavior analogous to that of LLMs: within the studied range, empirical loss follows power-law trends in both model size and data volume. As model and data scales increase (0.015B to 8B+ parameters, hundreds of billions of tokens), performance continues to improve, albeit with diminishing returns (Zhou et al., 16 Jun 2025, Zhou et al., 28 Aug 2025, Zhou et al., 31 Dec 2025).
- Compute Allocation: By design, OneRec is compute-intensive, using 10× more FLOPs per query than classical ranking models and reaching Model FLOPs Utilization (MFU) of 23.7% (training) and 28.8% (inference) on flagship GPUs, approaching LLM-class MFU. The Lazy Decoder-Only design in OneRec-V2 cuts compute by 94% relative to the encoder–decoder design at identical task loss (Zhou et al., 28 Aug 2025).
- Operational Efficiency: The unified architecture and tightly integrated inference routines cut communication, storage, and network I/O costs. Operating expenditure (OPEX) per QPS drops to roughly 10.6% of that of legacy cascaded recommenders (Zhou et al., 16 Jun 2025).
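As an illustration of what a power-law trend in model size means in practice, the sketch below fits loss = a · N^(-b) by ordinary least squares in log-log space. The data points are synthetic and chosen only to exercise the code; they are not measurements from the OneRec reports, and the pure power-law form (no irreducible-loss term) is a simplifying assumption.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic, illustrative points only (NOT values from the OneRec papers).
sizes = [0.015e9, 0.1e9, 1e9, 8e9]
losses = [5.0 * s ** -0.05 for s in sizes]
a, b = fit_power_law(sizes, losses)
```

A small exponent `b` is what "diminishing returns" looks like: each tenfold increase in parameters multiplies the loss by the same factor 10^(-b), so the absolute gain shrinks as the model grows.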
3. Policy Optimization and Real-World Preference Alignment
OneRec advances reinforcement learning in recommendation by introducing mechanisms to align model outputs directly with authentic user feedback signals:
- Reward Model (OneRec V1): A learned function predicts composite industrial engagement objectives (watch-time, likes, follows) for generated sessions; this guides Direct Preference Optimization (DPO) in the Iterative Preference Alignment loop (Deng et al., 26 Feb 2025).
- Real-World Feedback Alignment (OneRec-V2): Incorporates Duration-Aware Reward Shaping (bucket-based quantile normalization of playtime, emphasizing relative rather than absolute engagement) and Gradient-Bounded Policy Optimization (GBPO). GBPO adaptively bounds importance ratios during policy gradient updates, preventing instability from negative feedback dominance and yielding stable preference optimization directly from live user data (Zhou et al., 28 Aug 2025).
- Multi-Objective Trade-offs: By integrating preference shaping and adaptive ratio clipping, OneRec achieves stable and efficient learning of diverse objectives such as user retention, engagement, and content diversity, as shown in multi-metric A/B testing at scale.
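The bucket-based quantile normalization behind Duration-Aware Reward Shaping can be sketched as follows. This is a minimal illustration under stated assumptions: the bucket edges, the neutral fallback value, and the function names are hypothetical, and the exact shaping used in OneRec-V2 may differ.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical video-duration bucket edges, in seconds.
BUCKET_EDGES = [15, 60, 180]

def bucket_of(duration_s):
    """Index of the duration bucket a video falls into."""
    return bisect_right(BUCKET_EDGES, duration_s)

def build_reference(logs):
    """logs: iterable of (video_duration_s, playtime_s) pairs.
    Returns the sorted playtimes observed in each duration bucket."""
    per_bucket = defaultdict(list)
    for duration_s, playtime_s in logs:
        per_bucket[bucket_of(duration_s)].append(playtime_s)
    return {b: sorted(v) for b, v in per_bucket.items()}

def shaped_reward(duration_s, playtime_s, reference):
    """Empirical quantile (in [0, 1]) of the playtime within its bucket."""
    playtimes = reference.get(bucket_of(duration_s))
    if not playtimes:
        return 0.5  # assumption: neutral reward when a bucket has no data
    return bisect_right(playtimes, playtime_s) / len(playtimes)

logs = [(10, 2), (10, 5), (10, 9), (300, 20), (300, 100), (300, 250)]
reference = build_reference(logs)
short_video = shaped_reward(10, 9, reference)    # watched almost fully
long_video = shaped_reward(300, 20, reference)   # abandoned early
```

In this toy log, 9 seconds on a 10-second video scores higher than 20 seconds on a 300-second video, which is the "relative rather than absolute engagement" effect described above.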
4. Extensions: Reasoning, Foundation Models, and Optimization Innovations
The OneRec ecosystem has evolved to bridge gaps in interpretability, generalization, and optimization:
- OneRec-Think enhances the standard framework with explicit chain-of-thought (CoT) reasoning. This extension structures generation into rationales (natural-language explanations) and itemic tokens, promoting transparency, controllability, and robustness. Reasoning modules include Itemic Alignment (cross-modal alignment of item and text spaces), Reasoning Activation (CoT rationale distillation), and Reasoning Enhancement (Rollout-Beam reward design for multi-validity user preferences). A “Think-Ahead” architecture decouples heavy reasoning computation from low-latency serving (Liu et al., 13 Oct 2025).
- OpenOneRec and OneRec-Foundation introduce foundation-model scale co-pretraining (1.7B–8B) with expanded context length (up to 32K tokens), extensive cross-domain and instructional-task coverage, and joint optimization for both general text understanding and recommendation. OpenOneRec provides RecIF-Bench, a holistic multi-task evaluation suite, and demonstrates SOTA results across eight granular capabilities, including instruction following and explanation (Zhou et al., 31 Dec 2025).
- Policy Optimization Evolution (SAGE): SAGE replaces GBPO's static, symmetric update rules with a sequence-level, dynamically adaptive gradient manifold. Key innovations include geometric mean importance ratios, decoupled multi-objective advantage normalization, a Boost Factor for cold-start item updates, and an Entropy-Aware Penalty to sustain diversity and suppress "information cocoons," enhancing both stability and coverage in real recommendation environments (Xie et al., 29 Jan 2026).
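To show what a geometric mean importance ratio looks like at the sequence level, here is a generic sketch that compares it with the plain product of token ratios. It assumes per-token log-probabilities under the new and old policies are available; the function names are illustrative, and SAGE's full update rule (advantage normalization, Boost Factor, entropy penalty) is not reproduced here.

```python
import math

def product_ratio(new_logps, old_logps):
    """Product of per-token importance ratios pi_new / pi_old."""
    return math.exp(sum(n - o for n, o in zip(new_logps, old_logps)))

def geometric_mean_ratio(new_logps, old_logps):
    """Length-normalized ratio: exp(mean_t(log pi_new - log pi_old))."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

# Illustrative log-probabilities for a three-token semantic ID.
new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.2, -1.8, -0.9]
seq_ratio = geometric_mean_ratio(new_logps, old_logps)
raw_ratio = product_ratio(new_logps, old_logps)
```

Because the geometric mean is the T-th root of the product for a sequence of T tokens, it stays on the same scale as a single-token ratio regardless of sequence length, which makes a shared clipping or bounding range easier to apply.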
5. Quantization and System-Level Inference Optimization
OneRec-V2 exhibits weight and activation distributions closely matching LLMs, with tightly regulated value ranges (mean variance <0.1 for weights, mean absmax ≈2.0), making it amenable to aggressive FP8 post-training quantization (Su et al., 12 Mar 2026). The quantized inference pipeline includes:
- Per-channel offline quantization for weights and per-token dynamic scaling for activations.
- Fused quantization-aware kernels, custom TopK (radix-based), pipelined attention, and MoE execution exploiting hardware primitives.
- Production deployment achieves 49% latency reduction and 92% throughput increase without metric degradation, validated by extensive online A/B tests.
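The scaling scheme in the first bullet (offline per-channel scales for weights, dynamic per-token scales for activations) can be sketched in pure Python. Several things here are assumptions: the E4M3 format with a maximum magnitude of 448, the `fake_fp8` rounding helper (a crude emulation that ignores subnormals), and all function names. Production kernels operate on hardware FP8 types rather than Python floats.

```python
import math

FP8_E4M3_MAX = 448.0  # assumed format: largest finite E4M3 magnitude

def fake_fp8(x):
    """Crude E4M3 emulation: clamp, then keep 4 significant bits."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)

def _quantize_rows(rows):
    """One scale per row, chosen so the row's absmax maps to FP8_E4M3_MAX."""
    quantized, scales = [], []
    for row in rows:
        absmax = max(abs(v) for v in row)
        scale = absmax / FP8_E4M3_MAX if absmax > 0 else 1.0
        scales.append(scale)
        quantized.append([fake_fp8(v / scale) for v in row])
    return quantized, scales

def quantize_weights_per_channel(weights):
    """weights: one row per output channel. Run once, offline."""
    return _quantize_rows(weights)

def quantize_activations_per_token(activations):
    """activations: one row per token. Scales are recomputed per request."""
    return _quantize_rows(activations)

def quantized_matmul(activations, q_weights, w_scales):
    """Multiply in the quantized domain, then rescale each output element."""
    q_acts, a_scales = quantize_activations_per_token(activations)
    return [
        [
            sum(a * w for a, w in zip(a_row, w_row)) * a_scale * w_scale
            for w_row, w_scale in zip(q_weights, w_scales)
        ]
        for a_row, a_scale in zip(q_acts, a_scales)
    ]
```

Keeping one scale per output channel and one per token means the rescaling factors out of the inner product, so it can be applied once per output element after the low-precision accumulation.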
6. Empirical Results and Industrial Deployment
OneRec and its descendants have been deployed at full production scale on Kuaishou and Kuaishou Lite, serving over 400 million daily active users and handling a major fraction of total query volume (Zhou et al., 16 Jun 2025, Deng et al., 26 Feb 2025, Zhou et al., 28 Aug 2025). Quantitative outcomes include:
- App Stay Time improvement: +0.54% to +1.24% (V1), +0.467% to +0.741% (V2), with consistent lifts in watch time, retention (LT7), and all downstream user engagement metrics.
- Cost efficiency: OPEX reduced to 10.6% of previous pipelines.
- SOTA accuracy and diversity: RecLLM (SAGE-based) achieves significant gains in Recall@10, NDCG, entropy/diversity, and cold-start recommendation compared to prior baselines (Xie et al., 29 Jan 2026).
- OpenOneRec models outperform leading retrieval and generative baselines by an average 26.8% Recall@10 across 10 Amazon domains (Zhou et al., 31 Dec 2025).
| Generation | Architecture | Preference Alignment | Reported Improvement | SOTA Rec. Accuracy |
|---|---|---|---|---|
| OneRec | Encoder–Decoder | DPO + Reward Model | App Stay Time +0.54% / +1.24% | Yes |
| OneRec-V2 | Decoder-Only | Real Feedback + GBPO | App Stay Time +0.467% / +0.741% | Yes |
| OneRec-Think | Reasoning-augmented | Chain-of-Thought + GRPO | App Stay Time +0.159% | Yes |
| OneRec-Foundation | Foundation Model | GRPO, cross-domain SFT | Recall@10 +26.8% avg. | Yes |
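For readers unfamiliar with the offline metrics cited above, the following sketch shows one common way to compute Recall@K and NDCG@K with binary relevance. The function names are illustrative, and the evaluation protocols in the cited papers may differ in details such as candidate sets or tie handling.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant items that appear in the top-k ranked list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain over the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

Recall@K only counts whether relevant items made the cut, while NDCG@K also rewards placing them near the top, which is why the two are usually reported together.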
7. Challenges and Future Research
Despite demonstrable advances, several technical obstacles remain:
- Tokenizer Transferability: Semantic ID collisions (>30%) and maintenance overhead in cross-domain transfer motivate exploration of text-only or hybrid representations (Zhou et al., 31 Dec 2025, Xie et al., 29 Jan 2026).
- Generalization and Forgetting: Achieving simultaneous mastery of recommendation and general reasoning, while preventing catastrophic forgetting, is an open problem. Current data-mix ratios and retention strategies are heuristic, suggesting the need for formal optimization methods.
- Reward and RL Design: Sparse rewards (hit/miss) limit alignment; richer, hierarchically structured objectives (e.g., NDCG, long-horizon satisfaction) and systematic chain-of-thought reasoning mechanisms remain active areas for study (Liu et al., 13 Oct 2025, Zhou et al., 31 Dec 2025).
- Theoretical Scaling Laws: Observed deviations from Chinchilla-like scaling in recommendation domains indicate gaps in understanding of optimal model and data allocation, especially for multimodal and cross-task learning.
- System Scalability: While FP8 quantization closes much of the LLM–recommender efficiency gap, further advances in quantization, kernel fusion, and memory access are required to sustain scaling at ultra-high QPS in dynamic catalog settings (Su et al., 12 Mar 2026).
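The semantic ID collision issue raised under Tokenizer Transferability can be quantified with a simple count. The sketch below uses one plausible definition, the fraction of items that share their full semantic ID with at least one other item; this definition and the `collision_rate` name are assumptions, and the cited papers may measure collisions differently.

```python
from collections import Counter

def collision_rate(item_to_semantic_id):
    """Fraction of items whose semantic ID is shared with another item."""
    if not item_to_semantic_id:
        return 0.0
    counts = Counter(item_to_semantic_id.values())
    collided = sum(1 for sid in item_to_semantic_id.values() if counts[sid] > 1)
    return collided / len(item_to_semantic_id)

# Hypothetical catalog: two videos map to the same three-level code.
catalog = {
    "v1": (3, 1, 4),
    "v2": (3, 1, 4),
    "v3": (2, 7, 1),
    "v4": (0, 0, 5),
}
rate = collision_rate(catalog)
```

Tracking this number per domain gives a quick signal of how well a tokenizer fitted on one catalog separates the items of another.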
References:
- (Zhou et al., 16 Jun 2025) OneRec Technical Report
- (Deng et al., 26 Feb 2025) OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
- (Zhou et al., 28 Aug 2025) OneRec-V2 Technical Report
- (Liu et al., 13 Oct 2025) OneRec-Think: In-Text Reasoning for Generative Recommendation
- (Zhou et al., 31 Dec 2025) OpenOneRec Technical Report
- (Xie et al., 29 Jan 2026) SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation
- (Su et al., 12 Mar 2026) Quantized Inference for OneRec-V2