Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

Published 7 Jun 2026 in cs.LG, cs.AI, and cs.IR | (2606.08480v1)

Abstract: Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging reward signals to guide policy improvement. However, its efficacy is critically contingent on the trustworthiness of the reward model for the samples it evaluates. In practice, production rankers, the widely adopted reward models, are trained on exposure-biased logs, leading to sample-dependent inaccuracies that violate this assumption. Our stratified analysis uncovers a consistent pattern: reward guidance is most beneficial when the policy exhibits uncertainty and the ranker can effectively discriminate the ground-truth item from rollout negatives. On other samples, the reward signal is either negligible or detrimental, highlighting the risk of uniform RL application. To address such an issue, we introduce AdaGRPO, a novel framework that treats reward-guided optimization as selective admission rather than uniform pressure. Training is anchored in supervised negative log-likelihood, while the GRPO objective is gated by a binary, per-sample clip determined by two rollout diagnostics: policy-side difficulty and reward discriminability. Instances failing either diagnostic default to pure supervision, ensuring stability and mitigating the amplification of noisy gradients. We validate AdaGRPO on a large-scale e-commerce dataset. At the best intermediate checkpoint, it elevates HR@10 from 11.01% to 12.18% while constraining hallucination below 0.22%, and maintains robustness at the final checkpoint (HR@10 11.63%, hallucination 0.27%), outperforming fixed NLL--GRPO mixtures across the retrieval--validity frontier. In production A/B tests, AdaGRPO achieves statistically significant gains in click-through rate and dwell time, confirming its practical utility.

Authors (6)

Summary

The paper presents AdaGRPO, which introduces a conditional RL loss gating mechanism that applies reward updates selectively, based on reward model reliability.
It demonstrates that stratified reward application improves HR@10 and controls hallucination, outperforming uniform GRPO methods in e-commerce recommendations.
Empirical evaluations in both offline and online settings validate its robustness in balancing retrieval accuracy with effective reward maximization.

Motivation and Problem Formulation

Generative Retrieval (GR) via autoregressive LLMs has emerged as a scalable alternative to embedding-based recall in industrial recommender systems, providing enhanced coverage for semantically similar and long-tail catalog items. However, the primary challenge for GR models is fine-grained discrimination, as these models lack dedicated per-item parameters and dense cross-feature input, making them inferior to the ranking stage in precision. Reinforcement learning (RL) using reward models (RM) has thus been proposed to align GR policies with high-value targets as scored by production rankers. Yet, RMs are trained on exposure-biased logs, rendering their reliability sample-dependent, especially for rarely exposed or fresh catalog items.

The central issue addressed in this paper is the mismatch between uniform application of reward-guided RL fine-tuning (particularly using the Group Relative Policy Optimization, GRPO, method) and the heterogeneity in RM trustworthiness across samples. Uniform RL pressure can lead to performance degradation by amplifying reward model noise or incentivizing the policy to move toward popular distractors when the RM is unreliable for certain regions of the item space.

Failure Modes of Uniform Reward Application

Systematic analysis reveals two major pathological behaviors resulting from unselective reward propagation:

On easy samples, where the policy already ranks the ground-truth highly, RM rewards cluster tightly and GRPO advantage collapses, with residual RM noise dominating the update. This often leads to over-optimization and reward hacking (Figure 1).
On hard samples, particularly those involving long-tail or out-of-distribution items, RM miscalibration causes it to favor popular distractors over ground truth, thus pushing the policy away from correct targets due to exposure bias.
Figure 1: Two failure modes of uniform reward application: dominated by reward model noise on easy samples, and exposure bias leading to incorrect updates on hard samples.
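
The easy-sample failure mode can be illustrated with the group-relative advantage commonly used in GRPO, which standardizes each rollout's reward within its group. The following is a minimal sketch, assuming mean/standard-deviation normalization (the summary does not state which variant the paper uses); the function name, the `eps` constant, and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (assumed GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical easy sample: the RM scores every rollout almost identically,
# so the only differences between rewards are tiny (noise-sized).
easy_rewards = [0.900, 0.901, 0.899, 0.900]
easy_advantages = group_relative_advantages(easy_rewards)

# Hypothetical informative sample: rewards are genuinely spread out.
spread_rewards = [0.10, 0.90, 0.30, 0.50]
spread_advantages = group_relative_advantages(spread_rewards)

print(easy_advantages)
print(spread_advantages)
```

In this sketch, reward differences of 0.001 on the easy sample are rescaled to advantages larger than 1 in magnitude, the same order as on the informative sample. Under this normalization, tightly clustered rewards therefore produce updates driven by whatever small differences remain, which may be mostly noise.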

Conditional Utility of Reward Models

The paper presents a stratified analysis using beam search to rank tokens by LLM and RM order, measuring the influence of RM on ground-truth rank. Key findings:

Aggregate RM influence is nearly zero or negative as beam width increases. As candidate set size grows, more out-of-distribution tokens are surfaced, reinforcing RM unreliability.
RM helps only conditionally: On hard samples, where the policy is uncertain and the RM discriminates cleanly, the RM's influence becomes substantial (e.g., +11.41 ranks on hard instances at $K=50$), but this effect is diluted in aggregate metrics.
Further conditioning on RM discriminability sharply increases per-sample RM influence but reduces coverage: Only the intersection of policy uncertainty and RM reliability justifies RL updates (e.g., an improvement of nearly 60 ranks at $K=128$, with coverage dropping to approximately 13%).
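
The rank-influence measurement described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: it assumes RM influence is the ground truth's rank under the policy ordering minus its rank under the RM ordering (positive means the RM moves the ground truth up), and that coverage is the fraction of samples left after filtering. All names and scores are hypothetical.

```python
def rank_of(target, scores):
    """1-based rank of `target` when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1


def rm_rank_gain(target, llm_scores, rm_scores):
    """Positive when re-ordering by RM score moves the ground truth up."""
    return rank_of(target, llm_scores) - rank_of(target, rm_scores)


def stratified_gain(samples, keep):
    """Mean RM rank gain and coverage over the samples selected by `keep`."""
    selected = [s for s in samples if keep(s)]
    if not selected:
        return 0.0, 0.0
    gains = [rm_rank_gain(s["target"], s["llm"], s["rm"]) for s in selected]
    return sum(gains) / len(gains), len(selected) / len(samples)


# Two hypothetical beam-search candidate sets with policy and RM scores.
samples = [
    {"target": "gt", "llm": {"a": -0.2, "b": -0.9, "gt": -1.5, "c": -2.0},
     "rm": {"a": 0.3, "b": 0.1, "gt": 0.8, "c": 0.2}},   # hard for the policy
    {"target": "gt", "llm": {"gt": -0.1, "a": -1.0, "b": -1.2, "c": -3.0},
     "rm": {"gt": 0.4, "a": 0.6, "b": 0.2, "c": 0.1}},   # easy for the policy
]

overall = stratified_gain(samples, keep=lambda s: True)
hard_only = stratified_gain(
    samples, keep=lambda s: rank_of(s["target"], s["llm"]) > 2
)
print(overall, hard_only)
```

In this toy data the aggregate gain is diluted by the easy sample, where the RM demotes the ground truth, while the hard stratum shows a clear gain at half the coverage. This mirrors the trade-off between per-sample influence and coverage reported above.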

AdaGRPO: Methodology

AdaGRPO reframes reward-guided optimization as selective admission rather than uniform pressure. Each training instance is anchored by supervised negative log-likelihood (NLL), but the GRPO term is gated via a binary clip determined by:

Policy-side difficulty ($f_1$): The instance is hard if the ground truth falls outside the top $\tau$ percentile of model rollouts.
Reward-side discriminability ($f_2$): RM reliably separates the ground truth (top $\tau$) from in-batch distractors (bottom $\rho$).

The GRPO loss is admitted only if both diagnostics are satisfied. The resulting sample-level clip ensures RL updates occur only where the reward signal is locally informative.
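
The gating logic can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: it assumes $f_1$ compares the ground truth's policy log-probability against the rollouts, that $f_2$ requires the ground truth to sit in the top $\tau$ fraction of RM scores and every in-batch distractor in the bottom $\rho$ fraction, and that the gated GRPO term is simply added to the NLL anchor. The function names, the default thresholds, and the `weight` parameter are hypothetical.

```python
def fraction_above(score, pool):
    """Fraction of `pool` scoring strictly higher than `score` (0.0 = `score` is on top)."""
    if not pool:
        return 0.0
    return sum(s > score for s in pool) / len(pool)


def admit_grpo(gt_logprob, rollout_logprobs, gt_reward, rollout_rewards,
               distractor_rewards, tau=0.2, rho=0.5):
    """Binary per-sample gate: True only if both diagnostics pass."""
    # f1 (policy-side difficulty): ground truth falls outside the policy's top tau.
    is_hard = fraction_above(gt_logprob, rollout_logprobs) > tau

    # f2 (reward-side discriminability): the RM puts the ground truth in its
    # top tau and every in-batch distractor in its bottom rho.
    pool = list(rollout_rewards) + list(distractor_rewards)
    gt_on_top = fraction_above(gt_reward, pool) <= tau
    distractors_low = all(
        fraction_above(d, pool) >= 1.0 - rho for d in distractor_rewards
    )
    return is_hard and gt_on_top and distractors_low


def gated_loss(nll, grpo, admitted, weight=1.0):
    """NLL anchor plus the GRPO term only when the sample is admitted."""
    return nll + (weight * grpo if admitted else 0.0)


# Hypothetical sample: the policy ranks the ground truth last, the RM ranks it first.
admitted = admit_grpo(
    gt_logprob=-5.0, rollout_logprobs=[-1.0, -2.0, -3.0, -4.0],
    gt_reward=0.9, rollout_rewards=[0.5, 0.6, 0.4, 0.7],
    distractor_rewards=[0.1, 0.2],
)
print(admitted, gated_loss(nll=2.0, grpo=0.5, admitted=admitted))
```

Because the gate is a hard 0/1 decision, a sample that fails either diagnostic contributes exactly the supervised NLL gradient, which matches the fallback behavior described in the abstract.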

Figure 2: Overview of AdaGRPO: two rank-based diagnostics gate the GRPO objective per instance; only trusted samples contribute RL updates.

Design properties include interpretable, binary per-instance admission, scalability in rollout group size, and robustness to reward model quality. The scheme leverages rollout statistics directly, requiring no auxiliary models or external heuristics.

Empirical Evaluation

Offline Results

AdaGRPO was evaluated on a 175K e-commerce user-item sequence dataset (one-week window). Baseline GRPO and hybrid GRPO+NLL models suffer increasing hallucination and retrieval degradation at late training checkpoints due to reward overoptimization. AdaGRPO, by contrast, achieves:

Best HR@10: 12.18% (versus baseline 11.01%), with hallucination <0.22%
Strong stability: Maintains HR@10 of 11.63% and HR@50 of 25.43% at the final checkpoint, hallucination controlled at 0.27%
Hybrid variant with reward-discriminability further reduces hallucination and boosts retrieval accuracy
Figure 3: Offline training dynamics: AdaGRPO lifts retrieval without sacrificing validity; standard GRPO collapses to reward overoptimization.

Difficulty-stratified analysis confirms that AdaGRPO’s gain is concentrated in intermediate-difficulty samples, with negligible or negative change for the easiest (already confident) and hardest (reward model unreliable) cases.
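
For reference, the two offline metrics reported above can be computed as in the sketch below. HR@K follows its standard definition (the share of test cases whose ground-truth item appears in the top K generated items). The hallucination rate is an assumption here: the summary does not define it, so this sketch takes it to be the share of generated items that do not exist in the catalog. All names and data are hypothetical.

```python
def hit_rate_at_k(ranked_lists, targets, k):
    """Share of test cases whose ground-truth item appears in the top-k list."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_lists, targets))
    return hits / len(targets)


def hallucination_rate(ranked_lists, catalog):
    """Share of generated items that are not valid catalog items (assumed definition)."""
    generated = [item for ranked in ranked_lists for item in ranked]
    invalid = sum(item not in catalog for item in generated)
    return invalid / len(generated)


# Hypothetical generations for two users; "x" is not a real catalog item.
catalog = {"a", "b", "c", "d", "e"}
ranked_lists = [["a", "b", "c"], ["d", "x", "e"]]
targets = ["b", "e"]

print(hit_rate_at_k(ranked_lists, targets, k=2))
print(hallucination_rate(ranked_lists, catalog))
```

Tracking both numbers per checkpoint is what allows the retrieval-validity trade-off to be compared across training runs.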

Online A/B Testing

Production A/B deployment validates AdaGRPO’s practical utility:

Statistically significant gains in effective Item Page Views (IPV), click-through rate (UCTR), and dwell time compared to control and GRPO+NLL baselines.
AdaGRPO expands exposed and clicked category diversity, indicating robust coverage rather than collapse to reward-favored items.
These results indicate that sample-level clipping improves real user engagement, not merely offline metrics.

Implications and Future Directions

AdaGRPO’s contributions lie in operationalizing conditional trust for reward models in generative recommendation RL alignment. Results suggest:

The main challenge in RL tuning of GR models is not reward design, but detection of local RM reliability.
Sample-wise adaptivity can substantially improve trade-offs between retrieval accuracy, validity, and reward maximization, preventing reward hacking and mode collapse observed in uniform RL protocols.
The methodology is robust and hyperparameter-lean, relying on rollout statistics and interpretable rank-based diagnostics.

Theoretical implications extend to RL alignment in sparse-feedback, exposure-biased domains beyond recommendation. Practically, scalable deployment requires further validation under long-term concept drift, larger/longer training windows, and more varied catalog freshness regimes. Future research should address dynamic tuning of clipping thresholds, generalize diagnostics for domains with less clear ground-truth structure, and explore the effect of batch composition on reward-side reliability checks.

Conclusion

AdaGRPO introduces a principled, sample-level trust-region mechanism for RL fine-tuning of generative recommenders, anchored on supervised NLL and gated GRPO reward admission. Empirical results demonstrate significant improvements in retrieval, minimized hallucination, and enhanced user engagement, thereby affirming that selective recognition of reward model reliability is key to robust RL-based alignment in generative recommendation systems (2606.08480).
