ReRe: Reinforced Preference Optimization for Recommendation
- The paper introduces a novel framework that recasts recommendation as a sequential decision-making problem by integrating reinforcement learning with explicit reward signals and preference modeling.
- It employs constrained beam search to generate valid and diverse candidate outputs, effectively implementing on-policy hard negative sampling for robust ranking supervision.
- Experimental results show that ReRe outperforms traditional and contemporary LLM-based recommender methods on metrics like HR@K and NDCG@K across multiple real-world datasets.
Reinforced Preference Optimization for Recommendation (ReRe) refers to a new paradigm for designing recommender systems—especially those leveraging LLMs—that explicitly integrates reinforcement learning (RL) techniques, verifiable reward signals, and preference modeling. In contrast to traditional discriminative recommenders, which generally optimize for likelihood or cross-entropy based on implicit user feedback, ReRe approaches recast recommendation as a generative, sequential decision-making problem, and seek to address fundamental challenges in negative sampling and reward alignment by uniting RL, preference optimization, and advanced candidate generation methods.
1. Foundations and Motivations
Recent advances in LLMs have catalyzed the shift of recommender systems from discriminative to generative models, enabling personalized item generation and comprehensive user behavior modeling (Tan et al., 14 Oct 2025). However, standard LLM-based recommenders encounter two major limitations:
- Negative modeling deficiency: Existing approaches either ignore explicit negatives or rely on random/off-policy negatives, yielding “easy negatives” that provide weak discriminative training signals. Because the valid output space is narrow, naive generation also tends to produce repetitive or invalid items, and the lack of “hard” negative supervision undermines ranking effectiveness.
- Reliance on implicit rewards: Classical methods typically optimize likelihood margins or surrogate objectives rather than optimizing with explicit, verifiable reward feedback. As a result, improved likelihood separation may not translate to actual ranking gains—an effect sometimes called "reward hacking."
ReRe aims to address these issues by harnessing RL with verifiable (explicit) rewards, on-policy hard negative sampling, and robust candidate generation methods (Tan et al., 14 Oct 2025).
2. Core Methodologies of ReRe
2.1 Generation Constraints and Candidate Sampling
ReRe employs constrained beam search as its fundamental sampling and generation strategy. Unlike naïve sampling, which can yield invalid or duplicate item titles, constrained beam search restricts the candidate output space to valid item titles by dynamically pruning the token set at each decoding step. This is implemented by maintaining a pre-computed hash-map of valid item title prefixes and applying a hard mask to logit values for tokens that do not correspond to any legal continuation. In effect, only valid recommendations can be produced.
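A minimal sketch of this prefix-constrained masking is shown below, assuming item titles have already been tokenized into integer id sequences. The trie-as-dictionary and the helper names `build_title_trie` / `mask_logits` are illustrative stand-ins for the pre-computed prefix hash-map described above, not the authors' code.

```python
import math
from collections import defaultdict

def build_title_trie(tokenized_titles):
    """Map every valid prefix (tuple of token ids) to its set of legal next tokens."""
    trie = defaultdict(set)
    for tokens in tokenized_titles:
        for i in range(len(tokens)):
            trie[tuple(tokens[:i])].add(tokens[i])
    return trie

def mask_logits(logits, prefix, trie):
    """Hard-mask logits so only tokens that extend a valid item title stay finite."""
    allowed = trie.get(tuple(prefix), set())
    return [score if tok in allowed else -math.inf for tok, score in enumerate(logits)]

# Toy example: a 6-token vocabulary and two tokenized item titles.
titles = [[1, 2, 3], [1, 4, 5]]
trie = build_title_trie(titles)
logits = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4]

print(mask_logits(logits, prefix=[], trie=trie))   # only token 1 survives
print(mask_logits(logits, prefix=[1], trie=trie))  # tokens 2 and 4 survive
```

Applied at every decoding step, this guarantees that each surviving beam hypothesis remains a prefix of some valid item title.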
Diversity among sampled candidates is inherently promoted by beam search, ensuring non-overlapping (hard) negatives for every training iteration. This directly exposes the model to high-quality, diverse negative examples—a crucial prerequisite for effective ranking supervision, particularly in generative models with a narrow vocabulary (Tan et al., 14 Oct 2025).
2.2 Reward Design and Preference Signals
ReRe adopts an explicit, verifiable reward structure. The primary rule-based reward assigns one to exact matches (generated item identical to the ground-truth target) and zero to all others. However, such a sparse binary signal provides only extremely coarse supervision for negatives. To ameliorate this, ReRe augments the rule-based reward with an auxiliary ranking reward: each negative candidate (non-target) is ranked according to its likelihood, and a finer-grained, rank-dependent term (a log-rank reward computed from the candidate's rank $r$) is added to the rule-based reward after group normalization. Thus, hard negatives (those ranked closer to the ground truth) are penalized more, providing token-level gradients for discriminative learning.
This dual reward structure ensures that recommendation models are trained with both verifiable ground-truth alignment and sensitivity to fine-grained differences among difficult negatives (Tan et al., 14 Oct 2025).
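As a rough illustration of this dual reward, the following Python sketch scores one group of beam-search candidates. It is not the authors' implementation: the exact log-rank shaping is not given above, so $-1/\log_2(r+1)$ is assumed purely for illustration, and `group_rewards` is a hypothetical helper.

```python
import math

def group_rewards(candidates, target):
    """Rule-based exact-match reward plus an assumed log-rank penalty for negatives.

    `candidates` is assumed to be sorted by model likelihood (index 0 = most likely),
    so the first non-target entry is the hardest negative.
    """
    rewards, rank = [], 0
    for cand in candidates:
        if cand == target:
            rewards.append(1.0)                          # verifiable rule-based reward
        else:
            rank += 1                                    # rank 1 = hardest negative
            rewards.append(-1.0 / math.log2(rank + 1))   # assumed log-rank shaping
    # Group normalization: rewards become relative advantages within the group.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

cands = ["Lego Castle", "Lego Spaceship", "Wooden Train", "Lego Castle XL"]
print(group_rewards(cands, target="Lego Spaceship"))
```

After the group normalization in the last step, the top-ranked (hardest) negative receives the most negative normalized reward, matching the "penalize hard negatives more" behavior described above.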
2.3 Reinforcement Learning with Verifiable Rewards (RLVR)
ReRe implements a reinforcement learning framework anchored in on-policy negative sampling and explicit feedback. Distinct from off-policy or fixed-reference negative selection, negatives are sampled on-policy (from the current model); as the agent evolves, so does the difficulty of the negative set. This is critical for aligning gradient signals with the actual recommendation challenge.
ReRe's optimization is based on Group Relative Policy Optimization (GRPO). Here, a group of $G$ (beam search–generated) candidate responses $y_1, \dots, y_G$ is produced for a user context $x$, rewards are assigned (via the rule-based and ranking components), and token-level advantages feed a KL-regularized objective of the form

$$
\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}\big(\{R_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{R_j\}_{j=1}^{G}\big)},
\qquad
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\hat{A}_{i,t}\,\log \pi_\theta\!\left(y_{i,t}\mid x,\, y_{i,<t}\right)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right],
$$

where $\hat{A}_{i,t}$ is the normalized advantage for token $t$ in candidate $y_i$, $R_i$ is the candidate's total (rule-based plus ranking) reward, $\beta$ is the KL-penalty weight, and $\pi_{\mathrm{ref}}$ is a reference (often SFT-initialized) policy. Thus, each token's contribution to the policy gradient is scaled by both its candidate's reward and that candidate's relative hardness within the group.
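A minimal PyTorch sketch of this objective for a single group of candidates is given below, assuming per-token log-probabilities have already been gathered. It uses the simplified (clip-free) surrogate written above and a crude sampled KL estimate; the function name `grpo_token_loss` is hypothetical.

```python
import torch

def grpo_token_loss(logp_policy, logp_ref, rewards, beta=0.04):
    """Simplified GRPO-style loss for one group of G candidates, T tokens each.

    logp_policy, logp_ref : (G, T) per-token log-probs under the current policy
                            and the frozen reference policy.
    rewards               : (G,) scalar reward per candidate (rule + ranking).
    """
    # Group-normalized advantage, broadcast to every token of its candidate.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1).expand_as(logp_policy)

    # Policy-gradient term: raise log-probs of high-advantage candidates' tokens.
    pg_loss = -(adv.detach() * logp_policy).mean()

    # Crude sampled estimate of KL(policy || reference) as the regularizer.
    kl = (logp_policy - logp_ref).mean()
    return pg_loss + beta * kl

# Toy shapes: 4 beam-search candidates, 6 tokens each.
logp_policy = torch.randn(4, 6, requires_grad=True)
logp_ref = torch.randn(4, 6)
rewards = torch.tensor([1.0, -1.0, -0.63, -0.5])
grpo_token_loss(logp_policy, logp_ref, rewards).backward()
```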
3. Experimental Evidence and Benchmarking
ReRe was validated on three real-world recommendation datasets: Amazon Toys, Amazon Industrial, and Yelp (Tan et al., 14 Oct 2025). Evaluation metrics were Hit Ratio (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K) at several cutoffs $K$ (both metrics are sketched in code at the end of this section). Empirical findings demonstrated that:
- ReRe outperforms both classical sequence recommenders (GRU4Rec, Caser, SASRec) and contemporary LLM-based preference optimization baselines (TIGER, BigRec, D³, S-DPO, SPRec), achieving higher HR and NDCG across all datasets and hyperparameter regimes.
- Performance gains were robust to backbone LLM selection, model scaling, and initialization strategy (whether base or SFT-initialized), underscoring the generality of the approach.
- Supplementing rule-based rewards with auxiliary ranking rewards was essential; dense semantic or collaborative rewards were found to be less aligned with actual ranking objectives due to reward hacking tendencies.
Beam search sampling, together with dynamic group-based constraints, preserved hard negative diversity throughout training—avoiding negative collapse and enabling continued policy improvement (Tan et al., 14 Oct 2025).
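For reference, here is a small sketch of the two reported metrics under the usual leave-one-out protocol (one ground-truth item per test interaction); it is a generic formulation, not tied to the authors' evaluation code.

```python
import math

def hr_at_k(ranked_items, target, k):
    """Hit Ratio@K: 1 if the ground-truth item appears in the top-K list."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@K with a single relevant item: 1 / log2(rank + 1) on a hit, else 0."""
    if target in ranked_items[:k]:
        rank = ranked_items.index(target) + 1  # 1-based rank of the target
        return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["item_7", "item_3", "item_9", "item_1", "item_5"]
print(hr_at_k(ranked, "item_9", k=5))    # 1.0
print(ndcg_at_k(ranked, "item_9", k=5))  # 0.5  (rank 3 -> 1 / log2(4))
```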
4. Design Space Analysis and Implementation Guidelines
The authors systematically investigated the design space for RLVR in ReRe, yielding several practical findings (Tan et al., 14 Oct 2025):
| Component | Key Choices and Outcomes | Impact on Performance |
|---|---|---|
| Candidate Generation | Constrained token-by-token beam search with dynamic masking | Maintains only valid, unique outputs |
| Negative Sampling | On-policy, beam search–generated with explicit constraints | Provides challenging, up-to-date negatives |
| Reward Modeling | Binary (exact match) + ranking reward (log-rank penalty) | Avoids reward hacking, enhances alignment |
| Optimization Algorithm | Group/token-level policy gradient (GRPO); supports DAPO/GSPO variants | Flexible; performance comparable across GRPO/DAPO/GSPO |
The use of on-policy, constrained beam search is particularly critical for practical deployment in generative recommenders due to the narrow valid output space.
5. Extensions, Generalization, and Open Directions
ReRe offers a framework and empirical blueprint for uniting RL, explicit reward modeling, and advanced negative sampling in generative recommendation. Suggested avenues for further research include:
- Increasing negative candidate diversity: Further scaling the number of candidate generations per training step to examine the upper limits of ranking signal enhancement from harder negatives (Tan et al., 14 Oct 2025).
- Domain adaptation: Investigating ReRe’s application in out-of-domain and low-resource settings (e.g., Yelp), where in-domain adaptation and efficient transfer from pretrained LLMs are primary concerns.
- Reward signal design: Exploring richer, dense or contextualized rewards (semantic/collaborative), while critically assessing risk of reward hacking and ensuring reward alignment with the principal objective.
- Broader applicability: Extension to cross-domain and cold-start recommendation scenarios, leveraging ReRe’s robust on-policy preference alignment and constrained token generation.
6. Broader Impact and Significance
ReRe marks a shift for LLM-based recommenders from likelihood-centric or supervised paradigms to preference-centric, reinforcement-based optimization. By tightly coupling candidate generation, reward feedback, and hard negative learning, ReRe establishes a scalable standard for future generative recommendation systems that require both high ranking accuracy and interpretable, verifiable feedback channels (Tan et al., 14 Oct 2025).
The overall approach is agnostic to underlying backbone LLMs, compatible with both pretrained and SFT-initialized models, and is generalizable to various RL-inspired policy optimization algorithms. The emphasis on explicit, interpretable reward models and on-policy negative learning offers a structural remedy to the key deficiencies identified in contemporary LLM-enhanced recommendation architectures.