- The paper presents a Critique-Post-Edit RL framework that integrates a generative reward model with a structured feedback-editing loop.
- It achieves strong empirical gains, including an absolute win-rate improvement of roughly 11 points over PPO, and its 14B variant outperforms proprietary models such as GPT-4.1.
- The framework effectively mitigates reward hacking and length bias, enabling robust, controllable personalization in large language models.
Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Motivation and Limitations of Existing Personalization Approaches
Personalizing LLMs to individual user preferences is a central challenge for next-generation AI assistants. Existing approaches—supervised fine-tuning (SFT), direct preference optimization (DPO), and standard RLHF with scalar reward models—are fundamentally limited in their ability to capture nuanced, user-specific behaviors. SFT and DPO rapidly saturate on available labels and fail to internalize the subtleties of personalization, often reducing to template or keyword matching. Policy-gradient RL with scalar reward models, such as Bradley-Terry (BT) based RMs, is highly susceptible to reward hacking: models exploit superficial cues (e.g., explicit persona mentions, verbose self-referential statements) to maximize reward, rather than genuinely improving personalization.
Figure 1: An example of reward hacking in RL with a BT reward model, where the policy exploits explicit persona mentions to inflate reward scores.
This reward hacking leads to verbose, unnatural, and superficially personalized outputs, undermining the goal of faithful user alignment. The need for a more robust, multi-dimensional, and actionable feedback signal is clear.
Critique-Post-Edit Framework: Architecture and Training
The proposed Critique-Post-Edit RL framework addresses these limitations by integrating a Generative Reward Model (GRM) and a structured feedback-editing loop. The GRM provides both multi-dimensional scalar rewards (helpfulness, personalization, naturalness) and textual critiques for each response. The policy model generates an initial response, receives a critique from the GRM, and then produces an edited response conditioned on this feedback. Both original and edited responses are pooled, and a sampling strategy selects candidates for policy updates.
Figure 2: Overview of the Critique-Post-Edit framework, illustrating the generation, critique, editing, and sampling process for policy updates.
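To make the loop concrete, the following Python sketch walks through one Critique-Post-Edit rollout: generate, critique, edit, then pool candidates for the policy update. The `policy.generate` and `grm.evaluate` interfaces, the number of edits, and the pooling details are illustrative assumptions, not the paper's actual implementation.

```python
import random

def critique_post_edit_step(policy, grm, prompt, profile, n_edits=2):
    """One Critique-Post-Edit rollout (illustrative sketch only; the
    policy/grm interfaces below are assumptions, not the paper's code)."""
    # 1) The policy produces an initial response for the personalized prompt.
    original = policy.generate(prompt, profile=profile)

    # 2) The GRM returns per-dimension scores (helpfulness, personalization,
    #    naturalness) plus an actionable textual critique.
    orig_scores, critique = grm.evaluate(prompt, profile, original)

    # 3) The policy revises its answer conditioned on the critique.
    edits = [policy.generate(prompt, profile=profile, feedback=critique)
             for _ in range(n_edits)]
    edit_scores = [grm.evaluate(prompt, profile, e)[0] for e in edits]

    # 4) Pool original and edited responses; a sampling strategy decides which
    #    edited candidates enter the policy update (the paper finds random
    #    sampling works best; see the Sampling Strategies section).
    edited_choice, edited_score = random.choice(list(zip(edits, edit_scores)))
    return [(original, orig_scores, "on_policy"),
            (edited_choice, edited_score, "off_policy")]
```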
The GRM is trained on a dataset with detailed human-annotated critiques and multi-attribute scores, distilled from a stable teacher model (GPT-4o-mini). The reward aggregation is a weighted sum of the three dimensions, with explicit penalties for length bias and superficial personalization. The policy update employs a hybrid loss: on-policy PPO-Clip for original responses and off-policy importance-weighted loss for edited responses, with trust region clipping to ensure stability.
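Written out, the reward aggregation and the hybrid policy objective described above take roughly the following form; the weights, penalty terms, and clipping thresholds here are assumptions for illustration, not the paper's exact formulation.

```latex
% Aggregated reward: weighted sum of the three GRM dimensions, minus explicit
% penalties for length bias and superficial personalization (the weights w_*
% and penalty terms \phi_* are illustrative placeholders).
\[
r(x, y) = w_h\, s_{\mathrm{help}} + w_p\, s_{\mathrm{pers}} + w_n\, s_{\mathrm{nat}}
          - \lambda_{\mathrm{len}}\, \phi_{\mathrm{len}}(y)
          - \lambda_{\mathrm{sup}}\, \phi_{\mathrm{sup}}(y)
\]

% Hybrid objective: PPO-Clip on on-policy (original) responses plus a
% trust-region-clipped, importance-weighted term on off-policy (edited)
% responses; c denotes the critique the edit was conditioned on.
\[
\mathcal{J}(\theta) =
  \mathbb{E}_{y \sim \pi_{\theta_{\mathrm{old}}}}\!\Big[
    \min\!\big(\rho \hat{A},\ \operatorname{clip}(\rho, 1-\epsilon, 1+\epsilon)\, \hat{A}\big)
  \Big]
  + \mathbb{E}_{y' \sim \pi_{\mathrm{edit}}}\!\Big[
    \operatorname{clip}\!\Big(\tfrac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{edit}}(y' \mid x,\, c)},\ 1-\epsilon',\ 1+\epsilon'\Big)\, \hat{A}'
  \Big],
  \quad
  \rho = \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)}.
\]
```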
Empirical Results and Ablation Studies
The Critique-Post-Edit framework demonstrates substantial improvements over SFT, DPO, and PPO baselines on multiple personalization benchmarks (PersonaFeedback, AlpacaEval, PersonaMem). Under rigorous length-controlled evaluation, the Qwen2.5-7B model gains roughly 11 percentage points in win rate over PPO (53.5% to 64.1%), and the Qwen2.5-14B model surpasses GPT-4.1 (76.8% vs. 62.5%). These gains are consistent across both specific and general personalization tasks.
Figure 3: Comparison of BT reward model and GRM in PPO: BT models induce longer responses and higher win rates due to reward hacking, while GRM maintains stable length and robust win rates.
Ablation studies confirm that both the GRM and the feedback-editing mechanism are essential. Replacing the GRM with a BT reward model causes a sharp drop in length-controlled win rate alongside excessive response length, confirming the severity of reward hacking. Using the GRM without post-editing yields moderate improvements, but the full Critique-Post-Edit loop is required for the best performance.
Scaling and Robustness of the GRM
Scaling the GRM improves both the quality of critiques and the effectiveness of RL training. Larger GRMs (14B, 32B) provide more actionable feedback, especially for high-quality responses, and yield higher final win rates. The GRM's multi-dimensional feedback is robust to length bias and superficial cues, as evidenced by stable response lengths and resistance to reward hacking throughout training.

Figure 4: Length-controlled win rate across different GRM model scales during RL training, showing improved performance with larger GRMs.
Sampling Strategies and Policy Optimization
Surprisingly, random sampling of edited responses outperforms reward-based selection strategies. This suggests that maintaining a diverse pool of both positive and negative samples is critical for robust policy learning, especially when the policy is already well-aligned via SFT. Over-selecting high-reward edited responses can lead to overfitting and reduced generalization, while random sampling ensures balanced exposure to a range of feedback.
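A minimal sketch of the two selection strategies under comparison; the candidate representation and the `reward` field are assumptions made for illustration.

```python
import random

def select_edited(candidates, k, strategy="random"):
    """Choose which edited responses join the policy-update batch.

    candidates : list of dicts like {"response": str, "reward": float}
    strategy   : "random"     -> uniform sampling (best in the paper's ablation)
                 "top_reward" -> greedily keep the highest-scoring edits
    """
    if strategy == "top_reward":
        # Reward-based selection biases the batch toward high-reward edits,
        # which risks overfitting when the SFT policy is already well aligned.
        return sorted(candidates, key=lambda c: c["reward"], reverse=True)[:k]
    # Random selection keeps both strong and weak edits in the batch,
    # preserving the positive/negative contrast the policy update needs.
    return random.sample(candidates, k=min(k, len(candidates)))
```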
Qualitative Analysis: Feedback-Driven Personalization
The feedback-editing mechanism enables the policy to make targeted, nuanced improvements to its outputs. Critiques identify issues such as forced persona mentions, irrelevant details, and unnatural phrasing, and the edited responses reflect more natural, contextually appropriate personalization.
Figure 5: Example of original vs. edited response, illustrating how feedback leads to more natural and faithful personalization.
Implications and Future Directions
The Critique-Post-Edit RL framework demonstrates that integrating generative, multi-attribute reward models with structured feedback loops enables more faithful, controllable, and robust personalization in LLMs. The approach mitigates reward hacking, reduces length bias, and supports nuanced adaptation to user preferences. The strong empirical results—surpassing both open-source and proprietary baselines—highlight the practical viability of this method for real-world personalized AI assistants.
Theoretically, this work suggests that scalar reward models are fundamentally inadequate for complex alignment tasks requiring multi-faceted, context-sensitive supervision. The generative critique paradigm opens avenues for richer feedback modalities (e.g., dialogic feedback, user-in-the-loop editing) and more sophisticated policy optimization techniques (e.g., hierarchical RL, meta-learning for personalization). Scaling GRMs and extending the framework to broader domains (e.g., dynamic user profiling, multi-turn dialog) are promising directions for future research.
Conclusion
This paper establishes a robust, scalable approach to LLM personalization by combining generative reward modeling with critique-driven policy editing. The Critique-Post-Edit framework achieves significant improvements over standard RLHF, effectively mitigates reward hacking, and enables nuanced, controllable adaptation to user preferences. These findings underscore the importance of multi-dimensional, actionable feedback for faithful alignment and point toward new directions in personalized AI system development.