GPTRec: Generative Sequential Recommendation

Updated 25 November 2025

The paper introduces GPTRec, a generative sequential recommendation framework that leverages autoregressive Transformer decoders for token-by-token slate generation.
It details innovative tokenization and decoding strategies—including SVD tokenisation, textualized metadata, and multi-sequence aggregation—to boost long-horizon recommendation accuracy.
The work further explores reinforcement learning fine-tuning and scalability innovations to optimize beyond-accuracy metrics and enable industrial-scale deployments.

Generative Sequential Recommendation with GPTRec refers to a paradigm in which future items for a user are produced via autoregressive generation with GPT-style (Transformer decoder-only) LLMs, treating recommendation as sequence continuation. Unlike conventional top-K scoring based on independent item ranking, GPTRec generates recommendation slates token-by-token or item-by-item, leveraging the modeling power and sequential conditioning of LLMs for both short- and long-horizon user-item interaction forecasting.

1. Autoregressive Formulation and Model Architecture

GPTRec approaches sequential recommendation by modeling the probability of the next K items conditioned on a user's interaction history $H=(x_1,\ldots,x_t)$ as a joint autoregressive distribution:

$p(x_{t+1},\ldots,x_{t+K}\mid H) = \prod_{k=1}^K p\bigl(x_{t+k}\mid H,\,x_{t+1:t+k-1}\bigr)$

The model backbone is a decoder-only Transformer trained with a next-token (next-item) cross-entropy objective. At each position, the model predicts the probability of each possible item, using causal self-attention to ensure that prediction at step $k$ only attends to previously generated items and the user history.
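The factorization above can be illustrated with a small sketch that scores a candidate slate by summing per-step log-probabilities. The function `toy_next_item_logits` is a hypothetical stand-in for the Transformer decoder, and all names here are illustrative assumptions rather than part of any GPTRec codebase.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_next_item_logits(context, num_items):
    # Hypothetical stand-in for the Transformer decoder: it simply favours
    # the item whose ID follows the last item in the context.
    favoured = (context[-1] + 1) % num_items
    return [2.0 if i == favoured else 0.0 for i in range(num_items)]

def slate_log_prob(history, slate, num_items, logits_fn=toy_next_item_logits):
    # log p(slate | H) = sum over k of log p(x_{t+k} | H, earlier slate items)
    context = list(history)
    total = 0.0
    for item in slate:
        probs = softmax(logits_fn(context, num_items))
        total += math.log(probs[item])
        context.append(item)  # later steps condition on this item
    return total

likely = slate_log_prob([0, 1], [2, 3], num_items=5)
unlikely = slate_log_prob([0, 1], [4, 0], num_items=5)
```

Under this toy model, the slate that continues the pattern in the history receives the higher joint log-probability, which is the quantity the next-item cross-entropy objective maximizes during training.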

Two main input encoding schemes are used:

Item ID tokens: Original models used item IDs as tokens, relying on embedding tables of size $|I|$ (number of items). To address scalability, generative tokenization strategies such as splitting items into sub-tokens (via SVD Tokenisation) or generating semantically rich textual IDs (using auxiliary models) are applied (Petrov et al., 2023, Tan et al., 2024).
Textualized item metadata: Advanced frameworks flatten all item attributes (title, brand, category, description) into human-language token sequences, aligning recommendation input spaces with LLM pretraining (Ngo et al., 2024, Li et al., 2024, Tan et al., 2024).

Some GPTRec variants inject personalization directly via user-ID embeddings (Zhang et al., 2024).

The overall architecture comprises stacked masked Transformer decoders (2–32 layers for small to large-scale settings), positional encodings (learned, ALiBi, or relative as in HSTU (Zhai et al., 2024)), and a classification head over the candidate item set or token vocabulary.

2. Decoding Strategies: Greedy, Beam, Sampling, and Aggregation

Inference in GPTRec differs from classical one-shot candidate ranking in that recommendations are produced via multi-step decoding. Principal strategies include:

Greedy decoding: Iteratively select the highest-probability item at each step, appending each output to the context. Well suited for mid-horizon prediction, matching or exceeding standard Top-K methods on Recall@K, NDCG@K (Volodkevich et al., 2024).
Beam search: Track $B$ top sequences, expanding each at every step, scoring by cumulative log-probability. For sequential recommendation, beam search does not always outperform greedy due to potential "beam collapse" on early choices (Volodkevich et al., 2024).
Temperature/top-k sampling: At each generation step, sample from the softmax distribution with temperature $\tau$, optionally restricting to the $k$ most probable outputs. A higher $\tau$ promotes diversity but may reduce precision. Generally, single-sequence temperature sampling is less effective than greedy decoding (Volodkevich et al., 2024).
Multi-sequence sampling with aggregation: Generate $S$ diverse candidate sequences (with moderately high $\tau$) and aggregate their evidence.
- Reciprocal Rank Aggregation (RRA) scores items by their inverse rank across multiple sequences.
- Relevance Aggregation (RA) pools softmax probabilities across all sequences and positions, yielding higher coverage and robustness at the expense of higher inference cost.
- Both RRA and RA substantially improve long-horizon recommendation quality (10–30% relative gain on NDCG@K) by combating error propagation and generating more diverse, stable slates (Volodkevich et al., 2024).
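The multi-sequence strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_next_probs` is a hypothetical model, RRA counts only the first occurrence of an item in each sequence, and RA is read here as summing the full per-step distributions over all samples.

```python
import random
from collections import defaultdict

def toy_next_probs(context):
    # Hypothetical model over 4 items, skewed toward the successor of the last item.
    favoured = (context[-1] + 1) % 4
    return {i: (0.55 if i == favoured else 0.15) for i in range(4)}

def sample_sequence(next_probs, history, length, temperature, top_k, rng):
    # Draw one continuation with temperature and top-k filtering.
    context, items, dists = list(history), [], []
    for _ in range(length):
        probs = next_probs(context)
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        # softmax(logits / tau) is proportional to p ** (1 / tau)
        weights = [p ** (1.0 / temperature) for _, p in top]
        item = rng.choices([i for i, _ in top], weights=weights, k=1)[0]
        items.append(item)
        dists.append(probs)
        context.append(item)
    return items, dists

def reciprocal_rank_aggregation(item_sequences, k):
    # Score each item by the sum of 1 / rank over the sampled sequences.
    scores = defaultdict(float)
    for seq in item_sequences:
        seen = set()
        for rank, item in enumerate(seq, start=1):
            if item not in seen:
                scores[item] += 1.0 / rank
                seen.add(item)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def relevance_aggregation(step_distributions, k):
    # Pool predicted probabilities across all sequences and positions.
    scores = defaultdict(float)
    for dists in step_distributions:
        for probs in dists:
            for item, p in probs.items():
                scores[item] += p
    return sorted(scores, key=scores.get, reverse=True)[:k]

def recommend(next_probs, history, k, num_samples, length, temperature, top_k, seed=0):
    rng = random.Random(seed)
    samples = [sample_sequence(next_probs, history, length, temperature, top_k, rng)
               for _ in range(num_samples)]
    rra = reciprocal_rank_aggregation([items for items, _ in samples], k)
    ra = relevance_aggregation([dists for _, dists in samples], k)
    return rra, ra
```

In this sketch RA needs the full distribution at every step of every sample, whereas RRA only needs the sampled item IDs, which is one way to see why RRA is the cheaper of the two.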

The table below summarizes these strategies:

Decoding Strategy	Output Diversity	Performance on Long Horizons	Inference Cost
Greedy	Low	Good, degrades slowly	Fast (baseline)
Beam Search	Moderate	Underperforms, beams collapse	Moderate
Temperature Sampling	High (single seq)	Underperforms	Moderate
RRA / RA (multi-sequence)	Highest	Best at K>5	1.5x (RA)
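For comparison, the two deterministic strategies in the table can be sketched as below. The model `toy_next_probs` is hypothetical, and excluding already-selected items from the slate is an assumption made here to avoid duplicates (it requires the slate length not to exceed the catalogue size).

```python
import math

def toy_next_probs(context):
    # Hypothetical model over 4 items, skewed toward the successor of the last item.
    favoured = (context[-1] + 1) % 4
    return {i: (0.55 if i == favoured else 0.15) for i in range(4)}

def greedy_decode(next_probs, history, length):
    # Pick the most probable unseen item at each step and append it to the context.
    context, slate = list(history), []
    for _ in range(length):
        probs = next_probs(context)
        item = max((i for i in probs if i not in slate), key=probs.get)
        slate.append(item)
        context.append(item)
    return slate

def beam_search(next_probs, history, length, beam_width):
    # Keep the beam_width partial slates with the highest cumulative log-probability.
    beams = [(0.0, [])]
    for _ in range(length):
        candidates = []
        for score, seq in beams:
            probs = next_probs(list(history) + seq)
            for item, p in probs.items():
                if item in seq or p <= 0.0:
                    continue
                candidates.append((score + math.log(p), seq + [item]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]

greedy_slate = greedy_decode(toy_next_probs, [0], 3)
beam_slate = beam_search(toy_next_probs, [0], 3, beam_width=2)
```

With a beam width of 1 the beam search reduces to greedy decoding, so the cost difference between the two rows in the table comes entirely from expanding and re-ranking the extra beams.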

3. Tokenization and Item Representation

A primary challenge for GPTRec is scaling to large catalogues. Several tokenization approaches are implemented:

SVD Tokenisation: Each item is mapped to a short sequence of sub-tokens by quantizing item embedding coordinates derived from truncated SVD of the user-item matrix. This scheme reduces embedding table size by 40% (e.g., with $t=4$ sub-tokens per item and $v=512$ values per sub-token, the table holds $t \cdot v = 2048$ sub-token embeddings instead of 3416 item embeddings) and preserves recommendation quality (Petrov et al., 2023).
Textual ID Learning (IDGenRec): Items are mapped to concise, semantically meaningful textual IDs using a sequence-to-sequence model trained with diverse beam search and embedding interpolation. This method guarantees uniqueness, exploits the LLM's vocabulary, and supports foundation model generalization across domains (Tan et al., 2024).
Full metadata serialization: Some large-scale and text-based frameworks use entire item metadata, flattened into a sequence of natural language tokens, aligning with LLM capabilities and enabling rich cross-domain transfer (Ngo et al., 2024, Li et al., 2024).
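The quantization step of SVD tokenisation described above can be sketched as follows, assuming the $t$-dimensional item embeddings from truncated SVD are already computed. The min-max normalization, the bucket rule, and the function names are illustrative assumptions; the published method may differ in details such as how collisions between items are avoided.

```python
def svd_tokenise(item_embeddings, v):
    # item_embeddings: list of t-dimensional vectors, one per item
    # (assumed to come from a truncated SVD of the user-item matrix).
    # Returns t sub-token IDs per item, each in the range [0, t * v).
    t = len(item_embeddings[0])
    lows = [min(e[d] for e in item_embeddings) for d in range(t)]
    highs = [max(e[d] for e in item_embeddings) for d in range(t)]
    tokens = []
    for emb in item_embeddings:
        item_tokens = []
        for d in range(t):
            span = (highs[d] - lows[d]) or 1.0
            unit = (emb[d] - lows[d]) / span      # normalize coordinate to [0, 1]
            bucket = min(int(unit * v), v - 1)    # quantize into v buckets
            item_tokens.append(d * v + bucket)    # offset: each dimension has its own ID range
        tokens.append(item_tokens)
    return tokens

def table_reduction(num_items, t, v):
    # Relative saving of a (t * v)-row sub-token table over a num_items-row item table.
    return 1.0 - (t * v) / num_items

tokens = svd_tokenise([[0.0, 1.0], [0.5, 0.0], [1.0, 0.5]], v=4)
saving = table_reduction(3416, t=4, v=512)
```

The saving grows with catalogue size, since the sub-token table has $t \cdot v$ rows regardless of how many items are encoded.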

4. Optimization and Training Paradigms

The training process centers on maximizing the likelihood of the next item conditioned on the preceding history, using conventional cross-entropy loss. Several refinements have been proposed:

Prompt-tuning and pseudo-interaction augmentation: Models such as RecGPT incorporate pre-training via auto-regression followed by fine-tuning with personalized prompts, in which generated "pseudo-items" serve as lightweight intermediate supervision (Zhang et al., 2024).
Contrastive objectives: CALRec introduces a mixed loss combining language modeling with InfoNCE contrastive alignment between user, item, and user–item embeddings, leveraging two-tower architectures and multi-domain fine-tuning for robust sequence understanding (Li et al., 2024).
Teacher-student distillation and reinforcement learning: A two-stage process is beneficial when optimizing for listwise or beyond-accuracy metrics. First, GPTRec mimics a strong Top-K model as a student; second, RL fine-tuning with PPO aligns slate generation to diversity, coverage, or popularity-bias reduction objectives (Petrov et al., 2024).
Large-scale streaming: In production, training is parallelized over very long sequences (up to 8k tokens per user), with stochastic sampling and custom attention mechanisms (e.g., HSTU for faster computation, memory fusion, sparse microbatching) (Zhai et al., 2024).

Early stopping is performed on validation NDCG@K, and hyperparameters (e.g., sampling temperature $\tau$, RRA/RA sample count $S$) are tuned per domain.
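To make the contrastive objective mentioned above concrete, here is a generic InfoNCE loss with in-batch negatives, combined with a language-modeling loss through a weight `alpha`. Both the weighting scheme and the default values are hypothetical; this is not CALRec's exact formulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(user_vecs, item_vecs, temperature=0.1):
    # item_vecs[i] is the positive for user_vecs[i]; all other items
    # in the batch act as negatives.
    losses = []
    for i, u in enumerate(user_vecs):
        logits = [dot(u, v) / temperature for v in item_vecs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_z - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)

def mixed_loss(lm_loss, contrastive_loss, alpha=0.5):
    # Hypothetical weighting between the two terms.
    return lm_loss + alpha * contrastive_loss

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each user embedding is most similar to its own item and large when the pairing is wrong, which is the alignment pressure the contrastive term adds on top of next-token prediction.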

5. Empirical Evaluation and Performance

Evaluations consistently use long-horizon protocols: for each user, the last $N=10$ events are withheld and predicted autoregressively from the preceding history.

Metrics: Recall@K, NDCG@K, HitRate@K (with breakdown by true item position in the target block), Precision@K/MAP@K. Beyond-accuracy metrics include intra-list diversity (ILD@K), coverage, and normalized popularity count (nPCOUNT@K) (Volodkevich et al., 2024, Petrov et al., 2024).
Baselines: BPR-MF, SASRec, BERT4Rec, GPT-2+Top-K scoring, and hybrid scoring+reranking.
Findings:
- Multi-sequence autoregressive (RRA/RA) decoding provides +10–30% NDCG@10 gain on long-horizon steps compared to standard Top-K scoring and basic autoregressive decoding.
- Greedy generation is robust for medium-range prediction, outperforming beam search.
- RL-trained GPTRec dominates re-ranking-based approaches in Pareto tradeoffs between accuracy and diversity or bias reduction (Petrov et al., 2024).
- Scaling to foundation models with textual IDs allows competitive zero-shot transfer across domains (Tan et al., 2024, Zhai et al., 2024).
- Inference cost increases linearly with the number of sampled sequences ($S$); practical deployments parallelize across users and stay within real-time constraints (40 ms vs. 25 ms per user for $S=30$) (Volodkevich et al., 2024).
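The long-horizon protocol and two of the accuracy metrics above can be sketched as follows. Binary relevance is assumed, and Recall@K is normalized by the number of held-out items; other normalizations exist, so treat the exact definitions as assumptions rather than the ones used in the cited papers.

```python
import math

def split_long_horizon(events, n=10):
    # Hold out the last n events as the target block; the rest is the history.
    return events[:-n], events[-n:]

def recall_at_k(recommended, relevant, k):
    # Fraction of held-out items that appear in the top-k recommendations.
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    # Binary-relevance NDCG: discounted gain of hits over the ideal ordering.
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

history, target = split_long_horizon(list(range(15)), n=10)
recall = recall_at_k([5, 1, 9, 2], [1, 2, 3], k=4)
ndcg = ndcg_at_k([5, 1, 9, 2], [1, 2, 3], k=4)
```

In this example two of the three relevant items are retrieved, but they sit at ranks 2 and 4, so NDCG is penalized more than recall.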

6. Extensions, Foundation Models, and Future Directions

Foundation models for recommendation: Pre-training GPT-style models on wide multi-domain "fusion" corpora, combined with semantically expressive tokenization (textual IDs, item metadata), yields universal recommenders, evidenced by state-of-the-art zero-shot performance on unseen datasets (Tan et al., 2024, Zhai et al., 2024).
Beyond-accuracy objectives: RL-finetuned GPTRec models can directly optimize non-accuracy metrics—especially diversity and popularity bias—by shaping generation rewards (e.g., intra-list cosine distance, calibrated slate popularity). This provides smoother Pareto frontiers versus post-hoc reranking (Petrov et al., 2024).
Interpretability and cold-start: Query-generation frameworks (GPT4Rec) allow the recommender to explain its slates in natural language and recommend cold-start items by generating relevant queries and retrieving by BM25 (Li et al., 2023).
Scalability innovations: Custom hierarchical layers (HSTU), efficient microbatching (M-FALCON), and stochastic length truncation enable trillion-parameter GPTRec deployments at web scale, optimizing both computation and model quality (NDCG@10 up to +65% over state-of-the-art) (Zhai et al., 2024).
Open challenges: Scaling sub-tokenization for million-item catalogs, integrated instruction-based control over diversity and bias trade-offs, shared user/action representations across modalities, and seamless handling of evolving item catalogs and cold-start scenarios.
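The reward shaping mentioned under beyond-accuracy objectives can be illustrated with a slate-level reward that mixes a hit rate with intra-list diversity measured by cosine distance. The linear mix and the `diversity_weight` parameter are hypothetical choices for illustration; a non-empty slate is assumed.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat zero vectors as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def intra_list_diversity(slate, item_vectors):
    # Mean pairwise cosine distance between the items of one slate.
    pairs = [(a, b) for i, a in enumerate(slate) for b in slate[i + 1:]]
    if not pairs:
        return 0.0
    return sum(cosine_distance(item_vectors[a], item_vectors[b])
               for a, b in pairs) / len(pairs)

def slate_reward(slate, relevant, item_vectors, diversity_weight=0.5):
    # Hypothetical scalar reward for RL fine-tuning: accuracy plus diversity.
    accuracy = len(set(slate) & set(relevant)) / len(slate)
    diversity = intra_list_diversity(slate, item_vectors)
    return (1.0 - diversity_weight) * accuracy + diversity_weight * diversity

vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
diverse = slate_reward([0, 1], relevant=[0], item_vectors=vectors)
redundant = slate_reward([0, 2], relevant=[0], item_vectors=vectors)
```

Both slates contain the same single hit, so the difference in reward comes only from the diversity term; moving `diversity_weight` traces out different points on the accuracy-diversity trade-off.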

7. Best Practices and Implementation Guidelines

Use a decoder-only Transformer (GPT-2 or variants) with next-item cross-entropy. Hidden size 64–256 and 2 layers suffice for mid-scale; scale appropriately for industrial settings.
For multi-sequence aggregation (RRA/RA), sample $S=20$–$50$ sequences at temperature $\tau=0.5$–$2.0$; $k=10$ suffices as a top-k filter in RRA.
Greedy decoding gives the fastest inference; RA offers the best long-horizon accuracy with a moderate compute trade-off; RRA is a cost-efficient alternative.
Textualized item representations or learned compact textual IDs should replace raw numeric IDs to leverage LLM pre-training and enable zero-shot generalization.
Tune list length $K$, beam width $B$, temperature, and aggregation parameters per application and dataset.
For beyond-accuracy or multi-objective optimization, use teacher-student pre-training followed by RL fine-tuning with policy-gradient (PPO), combining accuracy and secondary rewards in the slate objective.

GPTRec encapsulates the shift from static candidate scoring to generative, sequence-aware recommendation, integrating innovations in tokenization, aggregation, large-scale training, and explicit beyond-accuracy optimization. Empirical evidence highlights the superiority of multi-sequence autoregressive generation with aggregation—especially for long-range and diversified slate recommendation—while remaining computationally tractable for industrial deployment (Volodkevich et al., 2024, Petrov et al., 2023, Petrov et al., 2024, Zhai et al., 2024, Tan et al., 2024).