OneRec Architecture: Unified Generative Recommenders

Updated 4 September 2025
  • OneRec Architecture is a unified generative framework that merges retrieval, pre-ranking, and ranking using transformer-based models and semantic tokenization.
  • It reduces computational overhead by focusing on loss-driving tokens, achieving up to 94% computation reduction and efficient scaling from 0.1B to 8B parameters.
  • Advanced reinforcement learning and preference alignment techniques enable real-world optimization, enhancing user engagement and business metrics.

OneRec Architecture is a unified, end-to-end generative framework for large-scale recommender systems that integrates retrieval and ranking, leverages transformer-based foundations, and incorporates advanced reinforcement learning techniques. Its architectural design and deployment strategy mark a departure from traditional multi-stage cascaded recommendation pipelines, enabling significant computational efficiency gains, scalable modeling, and direct optimization for user preference alignment in production environments.

1. Unified End-to-End Generative Design

OneRec is constructed on an end-to-end architecture that subsumes retrieval, pre-ranking, and ranking into a single generative process. The core modeling principle is to represent user behavior and item identity hierarchically using “semantic IDs,” which are tokenized via coarse-to-fine quantization (e.g., RQ-Kmeans). Each video or item is mapped to a short sequence of tokens $\{s_m^1, s_m^2, \ldots, s_m^{L_t}\}$, while a user’s historical behavior is represented as a chronologically ordered token sequence. The original architecture employs an encoder–decoder transformer: the encoder processes multi-scale user context, and the decoder autoregressively generates the next slate of recommendations via next-token prediction. This joint modeling directly couples user and target item in the generative process, as formalized by the loss

$$\mathcal{L}_{\mathrm{NTP}} = -\sum_{j=0}^{L_t-1} \log P\left(s_m^{j+1} \mid \left[s_{\mathrm{[BOS]}}, s_m^1, \ldots, s_m^j\right]\right).$$
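As a minimal illustration, this objective is a standard cross-entropy over the target item's semantic ID tokens. The sketch below assumes PyTorch tensors with hypothetical shapes and is not the production implementation:

```python
import torch
import torch.nn.functional as F

def ntp_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Next-token-prediction loss over a target item's semantic ID tokens.

    logits:     (batch, L_t, vocab) -- decoder outputs at positions
                [BOS], s^1, ..., s^{L_t - 1}, each predicting the next token.
    target_ids: (batch, L_t)        -- ground-truth tokens s^1, ..., s^{L_t}.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and sequence dims
        target_ids.reshape(-1),
    )
```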

With the emergence of OneRec-V2, the architecture transitions to a “lazy decoder-only” model, eliminating the encoder and concentrating all computation on target generation. User context is processed by a Context Processor to produce static key–value pairs, which the decoder accesses via cross-attention. Each lazy decoder block consists of cross-attention (leveraging precomputed, static keys/values), causal self-attention, and a feedforward sub-module (optionally with Mixture-of-Experts to improve capacity without proportional FLOP increase). This design enables parameter scaling to 8B while maintaining efficient resource use and convergence properties.
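The following is a minimal sketch of one lazy decoder block under assumed dimensions and standard PyTorch modules; in the actual design the context key/value projections are precomputed once by the Context Processor, whereas `nn.MultiheadAttention` re-projects them here for simplicity:

```python
import torch
import torch.nn as nn

class LazyDecoderBlock(nn.Module):
    """Sketch of a lazy decoder block: cross-attention over static context
    representations, causal self-attention over target tokens, then a
    feed-forward sub-module (an MoE layer could replace the FFN to add
    capacity without a proportional FLOP increase)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, ctx, causal_mask):
        # Cross-attention: queries come from the target tokens, keys/values
        # from the static context produced by the Context Processor.
        x = x + self.cross_attn(self.norm1(x), ctx, ctx, need_weights=False)[0]
        # Causal self-attention among the target (generated) tokens only.
        h = self.norm2(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        # Position-wise feed-forward.
        return x + self.ffn(self.norm3(x))
```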

2. Computational Efficiency and Model Scaling

In the original encoder–decoder design, compute utilization was fragmented: up to 97.66% of the FLOPs were spent on encoding the user context, with less than 3% contributing to the loss-driving generation phase. The adoption of the lazy decoder-only principle in OneRec-V2 addresses this bottleneck by:

  • Focusing nearly 100% of computation on loss-contributing tokens, reducing overall computation by up to 94% and training resource requirements by 90%.
  • Introducing nearly static context representations, eliminating repeated key–value projections and aligning with Grouped Query Attention to further decrease memory overhead in cross-attention.
  • Empirically scaling from 0.1B to 8B parameters, with observed convergence loss decreasing consistently (e.g., from 3.57 at 0.1B to 3.19 at 8B).
  • Achieving Model FLOPs Utilization (MFU) of up to 62% in online settings, with production training MFU reported at 23.7% and inference MFU at 28.8% on flagship GPUs.
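For reference, the MFU figures above are ratios of achieved to peak hardware FLOP throughput. A back-of-the-envelope estimate, using the common rule of thumb of roughly 6 FLOPs per parameter per training token and entirely hypothetical throughput and hardware numbers, looks like this:

```python
def model_flops_utilization(tokens_per_sec: float,
                            params: float,
                            peak_flops_per_sec: float) -> float:
    """Rough training MFU estimate: achieved FLOP/s over peak FLOP/s, using
    the ~6 * params FLOPs-per-token approximation for forward + backward."""
    achieved = 6.0 * params * tokens_per_sec
    return achieved / peak_flops_per_sec

# Hypothetical example: an 8e9-parameter model processing 5,000 tokens/s per
# device on hardware with a 1e15 FLOP/s peak gives roughly 24% MFU.
print(f"MFU: {model_flops_utilization(5e3, 8e9, 1e15):.1%}")
```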

Scaling experiments reveal patterns reminiscent of LLM scaling laws: larger models trained on more tokens yield lower losses, although deviations from classic scaling behavior remain and are under further study.

3. Advanced Reinforcement Learning and Preference Optimization

Integrating reinforcement learning (RL) into the generation workflow allows OneRec to optimize directly for user engagement and business objectives. The RL strategy in OneRec encompasses the following innovations:

  • After supervised pretraining, OneRec applies on-policy RL using a reward system shaped from business targets (e.g., watch time, stay time, engagement probabilities).
  • A modified Group Relative Policy Optimization (GRPO) objective with early clipping (ECPO) constrains policy updates, stabilizing training and avoiding issues such as gradient explosion (a generic sketch follows this list):

$$\mathcal{J}_{\mathrm{ECPO}}(\theta) = \mathbb{E}_{u,\,\{o_i\}_{i=1}^{G}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min \left\{ \frac{\pi_\theta(o_i \mid u)}{\pi_{\theta_{\text{old}}'}(o_i \mid u)}\, A_i,\ \mathrm{clip}(\cdot)\, A_i \right\} \right],$$

where $A_i$ is the normalized advantage (driven by a learned “P-score” preference alignment reward) and $\mathrm{clip}(\cdot)$ enforces trust region constraints.

  • In OneRec-V2, the RL module leverages real-world user feedback by integrating Duration-Aware Reward Shaping (normalizing video interactions according to duration-bucketed empirical ranks) and Adaptive Ratio Clipping via a new algorithm, Gradient-Bounded Policy Optimization (GBPO). GBPO bounds gradient magnitudes using reference gradients from binary cross-entropy loss, preventing the discarding of negative-sample gradients while stabilizing updates.
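As hedged illustrations of the two ideas above, the sketch below shows (i) a percentile-style duration-aware reward for a single duration bucket and (ii) a generic clipped-surrogate policy objective. The exact ECPO/GBPO bounding rules and bucketing schemes used by OneRec differ; the names, shapes, and thresholds here are assumptions:

```python
import torch

def duration_aware_reward(watch_time: torch.Tensor,
                          bucket_watch_times: torch.Tensor) -> torch.Tensor:
    """Map raw watch times to their empirical rank (percentile) among
    historical watch times of videos in the same duration bucket, so long and
    short videos receive comparable rewards. Single-bucket simplification."""
    # Fraction of the bucket's empirical distribution each sample exceeds: (B,)
    return (watch_time.unsqueeze(1) > bucket_watch_times.unsqueeze(0)).float().mean(dim=1)

def clipped_policy_objective(logp_new: torch.Tensor,
                             logp_old: torch.Tensor,
                             advantages: torch.Tensor,
                             eps: float = 0.2) -> torch.Tensor:
    """Generic clipped-surrogate objective in the spirit of the ECPO formula
    above: the importance ratio is clipped to a trust region so a single
    sample cannot drive an arbitrarily large update."""
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.min(ratio * advantages, clipped * advantages).mean()
```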

4. Preference Alignment and Reward Modeling

To robustly align generated recommendations with actual user preferences, OneRec employs multi-stage preference alignment:

  • Iterative Preference Alignment (IPA): After standard next-token pretraining, the model generates multiple candidate sessions for a user history via beam search, scores them with a reward model trained on multi-objective business outcomes, and forms preference pairs (winner vs. loser). Only a limited sample (e.g., 1%) of sessions is used per epoch, reducing computational overhead.
  • Direct Preference Optimization (DPO): The DPO loss takes the log-ratio of likelihoods for winner and loser sessions under the updated and seed models and applies a scaled sigmoid, ensuring that positive/negative preferences are learned robustly even in the single-shot display setting characteristic of recommender systems (a minimal sketch of this loss follows the list).
  • The reward model itself combines user and item embeddings (e.g., via elementwise product $v \odot u$), processes them through self-attention fusion, and outputs multiple engagement indices (watch, like, follow) via multi-head towers. Training employs binary cross-entropy loss, facilitating sensitivity to operationally relevant signals.
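A minimal sketch of the DPO step described above, assuming the session log-likelihoods have already been summed per candidate session under the updated policy and the frozen seed model; `beta` and the tensor layout are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy: torch.Tensor, logp_l_policy: torch.Tensor,
             logp_w_seed: torch.Tensor, logp_l_seed: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization over (winner, loser) session pairs:
    a scaled sigmoid of the winner/loser log-likelihood ratios relative to
    the seed model, averaged over sampled preference pairs."""
    margin = (logp_w_policy - logp_w_seed) - (logp_l_policy - logp_l_seed)
    return -F.logsigmoid(beta * margin).mean()
```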

5. Real-World Deployment and System Infrastructure

OneRec is deployed in high-throughput environments such as the Kuaishou and Kuaishou Lite apps, where it routinely handles 25% of total query traffic and serves hundreds of millions of users. System-level engineering optimizations are critical:

  • Training and inference infrastructure utilizes 90 servers each with 8 state-of-the-art GPUs, interconnected by 400Gbps NVLink for intra-node and 400Gbps RDMA for inter-node traffic.
  • NVMe SSDs on each server accelerate checkpoints and embedding storage.
  • The SKAI framework provides cross-GPU unified embedding tables with GPU-local caching, addressing embedding lookup bottlenecks typical in large recommenders (a simplified caching sketch follows this list).
  • Data parallelism, gradient accumulation, ZeRO-1 memory optimizations for dense parameters, and mixed-precision (BFloat16) training further enhance MFU, while custom TensorRT plugins and optimized low-level operators (including Mixture-of-Experts and cross-attention) increase throughput.
  • Operating expense (OPEX) for OneRec is quantified at 10.6% of that of legacy cascaded systems, owing to reduced serialization/deserialization, lower communication cost, and improved compute utilization.
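To make the embedding-caching idea concrete, the sketch below shows a direct-mapped GPU-local cache in front of a larger host-resident table. This is a generic illustration of the caching pattern, not the SKAI framework's actual API; all names and sizes are assumptions:

```python
import torch

class CachedEmbedding(torch.nn.Module):
    """Direct-mapped GPU-local cache over a large embedding table kept on the
    host. Hits are served from GPU memory; misses are fetched from the full
    table and refresh the corresponding cache slots."""

    def __init__(self, full_table: torch.Tensor, cache_size: int, device: str = "cuda"):
        super().__init__()
        self.full_table = full_table                                  # (num_items, dim), on CPU
        self.cache = torch.zeros(cache_size, full_table.size(1), device=device)
        self.cache_ids = torch.full((cache_size,), -1, dtype=torch.long, device=device)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        slots = ids % self.cache_ids.numel()                          # direct-mapped slot per id
        hit = self.cache_ids[slots] == ids
        out = self.cache[slots].clone()
        if (~hit).any():
            miss_ids = ids[~hit]
            fetched = self.full_table[miss_ids.cpu()].to(out.device)  # pull misses from host
            out[~hit] = fetched
            self.cache[slots[~hit]] = fetched                         # refresh cache entries
            self.cache_ids[slots[~hit]] = miss_ids
        return out
```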

6. Empirical Impact and Performance Evaluation

OneRec and its V2 iteration are evaluated both offline and in large-scale online A/B deployments using comprehensive business and engagement metrics.

  • In controlled A/B experiments, OneRec achieves improvements such as +1.6% in total watch time (V1), with further increases in app stay time (+0.54% and +1.24%) and 7-day Lifetime (LT7), indicating tangible enhancement of the user retention experience (Zhou et al., 16 Jun 2025).
  • The session-wise generation strategy alone yields up to 1.78% gain in session watch time over pointwise baselines; iterative DPO further boosts this to 4–5% above the base generative model (Deng et al., 26 Feb 2025).
  • OneRec-V2, with the lazy decoder-only architecture and real-world RL techniques, reports additional improvements of +0.467% and +0.741% in App Stay Time while maintaining balance across multiple business and operational objectives (Zhou et al., 28 Aug 2025).
  • Model scalability is empirically validated: increasing parameter counts lead to monotonically improving convergence loss, with 1B–8B parameter models running with context lengths up to 3,000 tokens and low inference latency (36ms on L20 GPUs).

7. Practical Insights and Future Directions

The unified, generative approach brought by OneRec yields several practical lessons for recommender system design and operation:

  • Transitioning from cascaded to unified modeling eliminates “objective collision” between stages, enabling direct optimization for business metrics.
  • High compute utilization is achieved by focusing all resources on tokens driving loss during generation, transforming a fragmented resource allocation paradigm into a compute-intensive, LLM-aligned regimen.
  • RL integration with reward shaping, adaptive clipping, and real user feedback closes the historical gap between offline and online optimization.
  • Updatable item embedding tables facilitate hot-starting and catalog expansion in real-world settings.
  • Future research aims to elucidate precise scaling laws in the recommendation context (as data quantity, item diversity, and sequence length interact non-linearly with loss), to further improve reward modeling for long-term value, and to generalize stabilization strategies in reinforcement learning for diverse user and content distributions.
  • Infrastructure investment—encompassing high-throughput interconnects, low-overhead storage, and memory-efficient training—remains essential for productionizing large, generative recommenders at industry scale.

In summary, OneRec Architecture represents a paradigm shift toward end-to-end, generative, and self-optimizing recommendation models, characterized by high computational efficiency, scalable and unified modeling, reinforcement learning–based preference alignment, and validated real-world gains across core user and business metrics (Deng et al., 26 Feb 2025, Zhou et al., 16 Jun 2025, Zhou et al., 28 Aug 2025).

References (3)