Generative Recommenders: A New Paradigm
- Generative Recommenders are models that frame recommendation as a stochastic generative process using autoregressive and latent variable methods.
- They integrate sequential modeling, context-awareness, and transformer architectures to jointly capture user history and item interactions for personalization.
- Recent studies show that GRs surpass conventional methods in scalability and efficiency, achieving notable gains in metrics such as NDCG and CTR.
Generative recommenders (GRs) are a family of recommender system models that approach the recommendation problem as a stochastic generative process, typically leveraging autoregressive sequence modeling, latent variable methods, or LLMs to synthesize, rerank, or directly generate personalized item lists or even new items. This paradigm departs from traditional discriminative, retrieval-based methods by modeling the joint or conditional probability of items, contexts, and user actions, thus offering new capabilities in context-awareness, controllability, and integrated content creation. Recent research demonstrates that GRs can outperform point-wise and list-wise reranking models, enable controllable and inductive recommendations, close the performance gap with strong ID-based baselines, and scale to trillion-parameter models with favorable efficiency and scaling properties (Feng et al., 2021, Wang et al., 2023, Guo et al., 2023, Zhai et al., 27 Feb 2024, Senel et al., 7 Jun 2024, Ding et al., 3 Oct 2024, Wang et al., 10 Feb 2025, Xiao et al., 10 Feb 2025, Zhu et al., 30 Mar 2025, Zhang et al., 10 Apr 2025, Huang et al., 7 May 2025, Lee et al., 2 Jun 2025, Wang et al., 19 Jun 2025, Wang et al., 30 Jun 2025, Yang et al., 9 Jul 2025, Ma et al., 19 Jul 2025, Lepage et al., 12 Aug 2025).
1. Foundations and Paradigmatic Shifts
At the core of generative recommenders is the reformulation of recommendation as a generative modeling problem, where item sequences, complete item lists, or even new item content are stochastically produced rather than selected from a fixed corpus. Architectures such as sequential transducers and generative rerankers model user interaction histories and candidate items as sequences, trained with autoregressive objectives such as next-token prediction over tokenized representations (Zhai et al., 27 Feb 2024, Yang et al., 9 Jul 2025). This approach allows explicit modeling of context and mutual item influence, distinguishing GRs from greedy reranking or embedding-matching baselines (Feng et al., 2021, Wang et al., 10 Feb 2025, Huang et al., 7 May 2025, Lepage et al., 12 Aug 2025).
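As a concrete illustration, the following minimal PyTorch sketch trains a tiny causal transformer with a next-token objective over tokenized interaction histories; the vocabulary size, model dimensions, and random data are placeholders rather than any specific GR's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch: next-token prediction over tokenized user histories.
# Vocabulary size, dimensions, and the tiny model are illustrative only.
VOCAB, DIM = 50_000, 128

model = nn.TransformerEncoder(  # causal mask makes this a decoder-style stack
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
embed = nn.Embedding(VOCAB, DIM)
head = nn.Linear(DIM, VOCAB)

def autoregressive_loss(item_ids: torch.Tensor) -> torch.Tensor:
    """item_ids: (batch, seq_len) tokenized interaction history."""
    x, y = item_ids[:, :-1], item_ids[:, 1:]           # shift targets by one
    mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
    h = model(embed(x), mask=mask)                     # causal self-attention
    logits = head(h)                                   # (batch, seq-1, VOCAB)
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), y.reshape(-1)
    )

loss = autoregressive_loss(torch.randint(0, VOCAB, (8, 20)))
loss.backward()
```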
A further shift involves the movement from discriminative to generative architectures: whereas conventional deep learning recommenders focus on extracting and combining engineered features, GRs unify representation, temporal context, and content generation into a single modeling pipeline, often using LLMs or transformer-derived architectures as the backbone (Wang et al., 2023, Zhai et al., 27 Feb 2024, Yang et al., 9 Jul 2025).
2. Core Methodologies and Model Architectures
2.1 Sequence-based Generation and Context-wise Reranking
Many state-of-the-art GRs employ autoregressive sequence modeling to capture both position-wise and list-wise dependencies. For example, the GRN model (Feng et al., 2021) pairs a Bi-LSTM/self-attention evaluator (for context-aware interaction probability estimation) with a GRU/pointer-network generator, which sequentially reranks items by considering evolving user intent and mutual item effects in the final ranking list. The generator is optimized with policy gradient methods under evaluator “advantage rewards,” while the evaluator is trained via cross-entropy loss.
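The training signal for such a generator can be sketched as a REINFORCE-style update, where the evaluator's score minus a baseline serves as the advantage; the exact reward shaping in GRN may differ from this simplified form.

```python
import torch

# Hedged sketch of an evaluator-guided policy gradient. The tensor names
# stand in for GRN's components; the paper's advantage definition may differ.
def policy_gradient_loss(generator_log_probs: torch.Tensor,
                         evaluator_scores: torch.Tensor,
                         baseline: torch.Tensor) -> torch.Tensor:
    """REINFORCE with an advantage reward derived from the evaluator.

    generator_log_probs: (batch, list_len) log-prob of each chosen item
    evaluator_scores:    (batch, list_len) evaluator's interaction estimates
    baseline:            (batch, list_len) e.g. scores of a greedy ranking
    """
    advantage = (evaluator_scores - baseline).detach()  # reward signal only
    return -(generator_log_probs * advantage).sum(dim=1).mean()
```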
The NLGR framework (Wang et al., 10 Feb 2025) further improves this by introducing non-autoregressive sampling across neighbor lists in the combinatorial permutation space of candidate lists. This approach allows the generator to directly “jump” to neighbor lists for potential utility gains, overcoming limitations of sequential (autoregressive) or greedy reranking.
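A toy sketch of the neighbor-list idea follows, assuming single-swap neighbors and a black-box utility function standing in for NLGR's learned evaluator; the actual method samples neighbors non-autoregressively rather than enumerating them.

```python
import itertools

# Illustrative "jump" to a higher-utility neighbor list, the core move in
# the permutation space of candidate rankings. `utility` is a placeholder
# for a learned list-level evaluator.
def best_neighbor(ranking: list[int], utility) -> list[int]:
    neighbors = []
    for i, j in itertools.combinations(range(len(ranking)), 2):
        cand = ranking.copy()
        cand[i], cand[j] = cand[j], cand[i]   # one-swap neighbor
        neighbors.append(cand)
    # Stay put if no neighbor improves the estimated list utility.
    return max(neighbors + [ranking], key=utility)
```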
2.2 Generative Foundation Models and Scaling
Hierarchical Sequential Transduction Unit (HSTU) architectures (Zhai et al., 27 Feb 2024) sequentialize heterogeneous user-item features and actions, replacing feature extractors and specialized interaction networks with unified pointwise-gated attention blocks. These models scale up to 1.5 trillion parameters and exhibit power-law scaling with respect to compute, showing quality improvements analogous to those of GPT-3 and LLaMA-2 in language modeling. Efficiency techniques (e.g., relative attention bias, Stochastic Length, M-FALCON) allow deployment on billion-user platforms with up to 15× faster inference than FlashAttention-based transformers.
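A loose sketch of a pointwise-gated attention block in this spirit appears below; the SiLU gate, head count, and projection layout are assumptions rather than HSTU's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of pointwise-gated attention: the attention output is modulated
# elementwise by a learned gate computed from the input. Details such as
# the SiLU nonlinearity are assumptions, not the published architecture.
class GatedAttentionBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.heads = heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda z: z.view(b, t, self.heads, -1).transpose(1, 2)
        attn = F.scaled_dot_product_attention(
            split(q), split(k), split(v), is_causal=True
        )
        attn = attn.transpose(1, 2).reshape(b, t, d)
        return self.out(F.silu(self.gate(x)) * attn)  # pointwise gating

out = GatedAttentionBlock(128)(torch.randn(2, 10, 128))  # (2, 10, 128)
```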
RankGPT (GenRank) (Huang et al., 7 May 2025) introduces action-oriented sequence organization (decoding user actions conditioned on item context, with position-time biases and efficient embedding), achieving 95% training speed-ups and measurable A/B test improvements in industrial settings.
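The sequence organization can be illustrated schematically: items provide the context and an action token is decoded per item. The interleaving below is a simplified stand-in for GenRank's actual layout, which additionally injects position-time biases.

```python
# Simplified sketch of action-oriented sequence organization: each action
# token is predicted conditioned on the item token that precedes it.
def interleave(items: list[int], actions: list[int]) -> list[int]:
    """Returns [i1, a1, i2, a2, ...] for action-conditioned decoding."""
    seq = []
    for item, action in zip(items, actions):
        seq.extend([item, action])
    return seq
```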
2.3 Tokenization and Cross-modality Fusion
To mitigate the performance gap with discriminative models, new tokenization schemes such as COSETTE (Lepage et al., 12 Aug 2025) incorporate collaborative signals via contrastive learning into discrete item identifiers, optimized for both content reconstruction and co-occurrence relevance. Multi-granular and multimodal fusion approaches (e.g., GRAM (Lee et al., 2 Jun 2025), PRORec (Xiao et al., 10 Feb 2025), MGR-LF++ (Zhu et al., 30 Mar 2025)) align and fuse semantic, collaborative, and multi-modal (text/image) representations, mitigating the semantic domination and modality correspondence problems identified in naive fusion strategies.
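A hedged sketch of how a collaborative contrastive term can be blended with content reconstruction when learning item identifiers, in the spirit of COSETTE; the InfoNCE form, in-batch negatives, and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch: joint objective for item tokenization that rewards both content
# reconstruction and agreement between co-occurring items' codes.
def tokenizer_loss(decoded: torch.Tensor, content: torch.Tensor,
                   codes: torch.Tensor, co_codes: torch.Tensor,
                   temperature: float = 0.1, alpha: float = 0.5):
    recon = F.mse_loss(decoded, content)          # rebuild item content
    z = F.normalize(codes, dim=-1)                # (batch, dim)
    z_pos = F.normalize(co_codes, dim=-1)         # codes of co-occurring items
    logits = z @ z_pos.t() / temperature          # in-batch negatives
    labels = torch.arange(z.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, labels)
    return recon + alpha * contrastive
```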
3. Capabilities: Controllability, Inductive and Generative Abilities
GRs can enable fine-grained control and personalization through disentangled latent spaces, as shown in (Bhargav et al., 2021), where user feedback is mapped to specific dimensions (“knobs”) of a supervised β-VAE latent space, allowing dynamic preference control and predictable updates to generated item lists.
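Conceptually, a “knob” update shifts one disentangled latent dimension and re-decodes; the sketch below assumes a trained decoder and a known dimension-to-attribute mapping, both placeholders for the learned model.

```python
import torch

# Illustrative knob update on a disentangled latent vector. `decoder` and
# the dimension-to-attribute mapping are hypothetical stand-ins.
def turn_knob(z: torch.Tensor, dim: int, delta: float, decoder):
    """Shift one disentangled latent dimension and regenerate the list."""
    z_new = z.clone()
    z_new[dim] += delta           # e.g. "more popular" feedback -> +delta
    return decoder(z_new)         # decoded, updated recommendation list
```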
Recent advances further address the induction problem: classical autoregressive GRs are transductive and cannot generate unseen items. SpecGR (Ding et al., 3 Oct 2024) overcomes this by inserting an inductive “drafter” (using representation-based retrieval) that proposes both seen and new items, and a generative verifier that scores the candidate likelihoods. Guided beam prefix strategies align the drafter and verifier, supporting both novel and in-sample recommendations with reduced computational cost.
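The draft-then-verify loop can be sketched as follows, with `drafter` and `verifier_log_prob` as placeholders for SpecGR's retrieval-based drafter and generative scorer; the acceptance rule here is a plain likelihood threshold, a simplification of the paper's guided strategy.

```python
# Sketch of speculative recommendation: an inductive drafter proposes
# candidates (seen or unseen), and the generative model keeps those whose
# sequence likelihood clears a threshold. All names are assumptions.
def speculative_recommend(drafter, verifier_log_prob, history,
                          k: int = 50, threshold: float = -5.0):
    candidates = drafter(history, k)              # may include new items
    accepted = [c for c in candidates
                if verifier_log_prob(history, c) >= threshold]
    return sorted(accepted,
                  key=lambda c: verifier_log_prob(history, c),
                  reverse=True)
```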
GeneRec (Wang et al., 2023) and GEMRec (Guo et al., 2023) extend GRs beyond next-item prediction to retrieval, repurposing, and even content creation using LLM or diffusion-based AI editors and creators, guided by natural language instructions and fidelity checks (bias, safety, authentication).
4. Fine-tuning, Exposure Bias, and Model Optimization
Fine-tuning techniques for GRs must address exposure bias—the tendency of supervised objectives to train only on observed data, neglecting unobserved but potentially relevant trajectories. GFlowGR (Wang et al., 19 Jun 2025) applies Generative Flow Networks (GFlowNets) for multi-step trajectory exploration, fine-tuning LLM-based recommenders via adaptive trajectory sampling using collaborative signals and reward models that integrate collaborative scores, token similarity, and augmentation. This mitigates exposure bias and improves hit rate and NDCG relative to SFT or reinforcement learning-based alternatives.
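GFlowNet fine-tuning is typically driven by a trajectory-balance objective; the sketch below shows the standard form (with a deterministic backward policy, as is common for sequence generation), and whether GFlowGR uses exactly this variant is an assumption.

```python
import torch

# Standard trajectory-balance loss for GFlowNets: match the learned flow
# log_Z plus the forward log-probabilities against the terminal log-reward.
def trajectory_balance_loss(log_Z: torch.Tensor,
                            forward_log_probs: torch.Tensor,
                            log_reward: torch.Tensor) -> torch.Tensor:
    """forward_log_probs: (batch, steps) log P_F per generation step."""
    residual = log_Z + forward_log_probs.sum(dim=1) - log_reward
    return (residual ** 2).mean()
```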
Policy-gradient reinforcement learning (Feng et al., 2021) and curriculum-based training (Wang et al., 19 Jun 2025) have also been found crucial for optimizing contextually aware generators under evaluator guidance.
5. Efficiency, Scalability, and Industrial Deployment
Progress in modeling efficiency is evidenced by decoupled architectures (e.g., MARIUS (Lepage et al., 12 Aug 2025): sequence-level temporal transformer plus short-depth item decoder), lightweight sparse attention (GRACE (Ma et al., 19 Jul 2025)), and fast latent-index modeling.
Large-scale deployments (HSTU-based GRs (Zhai et al., 27 Feb 2024), RankGPT (Huang et al., 7 May 2025), NLGR (Wang et al., 10 Feb 2025)) demonstrate that generative architectures can achieve significant online A/B metric lifts (e.g., +12.4% NDCG, +3% CTR, +6% PV/IPV), with tractable latency (1.6 ms online cost, 0% timeout rates), and serve hundreds of millions to billions of users. These findings underscore the operational viability of the generative paradigm in high-throughput, production settings.
6. Evaluation, Applications, and Future Directions
Empirical evaluation consistently reports improvements in listwise NDCG, Recall@K, and click/conversion metrics, especially in scenarios where candidate lists evolve rapidly, mutual influences are significant, or content generation is required (Feng et al., 2021, Huang et al., 7 May 2025, Wang et al., 2023). Synthetic data generation frameworks (HYDRA (Mungari et al., 23 Jul 2024)) facilitate testing GRs under controllable long-tail and community structures.
Future research directions, as delineated in multiple surveys and proposal papers (Yang et al., 9 Jul 2025, Wang et al., 2023), encompass: scaling model size and sequence length further; developing unified “one model for all” architectures that integrate recall, ranking, search, and multi-scenario personalization; advanced reinforcement learning for online preference alignment; and improved fidelity and explainability through explicit token reasoning, multimodal alignment, and interpretable tokenization (e.g., chain-of-thought, chunk-level, progressive context fusion).
A plausible implication is that as LLM-based GRs become foundation models, the recommendation task will continue to converge with natural language generation and data-to-text modeling, supporting not only retrieval and ranking but also creative, adaptive content production in an end-to-end system.