PAda-PPO: Preference-Adaptive RL Framework
- The paper introduces PAda-PPO, which combines personalized supervision (via CoPeR) and group-level clustering (via M²PC) to optimize dialogue satisfaction.
- It extends PPO with a sparse terminal reward and a diversity-aware KL penalty to align policies with both individual and group preferences.
- Experimental results on ESConv demonstrate improved satisfaction prediction, especially for underrepresented user groups.
The Preference-Adaptive Reinforcement Learning Framework (PAda-PPO) is a unified methodology for satisfaction estimation in dialogue systems that addresses subjective and group-specific user preferences by integrating individualized reasoning traces with unsupervised clustering of majority and minority user groups. Built atop Proximal Policy Optimization (PPO), PAda-PPO combines personalized supervision (via Chain-of-Personalized-Reasoning, CoPeR) and per-group regularization (via Majority-Minority Preference-Aware Clustering, M²PC), jointly optimizing for both individual- and group-level satisfaction alignment.
1. Reinforcement Learning Objective and Core Modifications
PAda-PPO extends classic PPO with two primary modifications to its RL training loop:
- A sparse terminal reward based on exact prediction of the user satisfaction label, and
- A diversity-aware Kullback-Leibler (KL) penalty that regularizes the policy against a group-specific reference model.
Let $\hat{y}$ denote the predicted satisfaction score, $y$ the gold label, and $\pi_\theta$ the current policy. The reward at the terminal timestep $T$ is given by

$$r_T = \mathbb{1}\!\left[\hat{y} = y\right] \qquad \text{(Eq. (7))}$$
At every timestep $t$, the diversity-aware KL penalty is

$$r_t^{\mathrm{KL}} = -\beta\, \mathrm{KL}\!\left(\pi_\theta(\cdot \mid s_t)\,\big\|\,\pi_{\mathrm{ref}}^{g}(\cdot \mid s_t)\right) \qquad \text{(Eq. (8))}$$

where $\pi_{\mathrm{ref}}^{g}$ is a group-conditioned reference model (majority or minority), with the group $g$ determined by perplexity routing via M²PC.
The per-step total reward combines the task reward, which is non-zero only at the terminal step, with the KL penalty:

$$r_t = \mathbb{1}[t = T]\, r_T + r_t^{\mathrm{KL}} \qquad \text{(Eq. (9))}$$
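As a concrete, hedged illustration of Eqs. (7)-(9), the sketch below computes per-step rewards for one generated trajectory. The helper name, the unit terminal reward, and the default `beta` are assumptions, and the KL term uses a common single-sample estimate rather than the full distributional KL.

```python
import torch

def per_step_rewards(policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     pred_label: int,
                     gold_label: int,
                     beta: float = 0.1) -> torch.Tensor:
    """Sketch of Eqs. (7)-(9): sparse terminal reward plus a per-token
    diversity-aware KL penalty against the routed (majority/minority)
    reference model. Inputs are log-probs of the generated tokens under
    the current policy and the group reference, both of shape [T]."""
    # Eq. (8): single-sample KL estimate as the log-prob gap of sampled tokens
    # (a common approximation in RLHF-style PPO; not necessarily the paper's).
    kl_penalty = -beta * (policy_logprobs - ref_logprobs)

    # Eq. (7): sparse terminal reward for an exact satisfaction-label match
    # (unit magnitude assumed; the original source fixes the actual scale).
    terminal_reward = 1.0 if pred_label == gold_label else 0.0

    # Eq. (9): the task reward is zero at every step except the last one.
    rewards = kl_penalty.clone()
    rewards[-1] = rewards[-1] + terminal_reward
    return rewards
```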
Generalized Advantage Estimation (GAE) is used to compute advantages with discount factor $\gamma$ and GAE parameter $\lambda$ (Eq. (10)):

$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
The value loss employs clipping around the previous value estimates (Eq. (11)):

$$\mathcal{L}_V = \mathbb{E}_t\!\left[\max\!\left(\big(V_\theta(s_t) - \hat{R}_t\big)^2,\ \big(\operatorname{clip}\!\big(V_\theta(s_t),\, V_{\mathrm{old}}(s_t) - \epsilon,\, V_{\mathrm{old}}(s_t) + \epsilon\big) - \hat{R}_t\big)^2\right)\right]$$

where $\hat{R}_t = \hat{A}_t + V_{\mathrm{old}}(s_t)$ is the return target.
The policy loss follows PPO's clipped surrogate objective (Eq. (13)):

$$\mathcal{L}_\pi = -\,\mathbb{E}_t\!\left[\min\!\left(\rho_t \hat{A}_t,\ \operatorname{clip}\!\big(\rho_t,\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$

with importance ratio $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$.
The total PAda-PPO loss combines the policy and value terms (Eq. (15)):

$$\mathcal{L}_{\mathrm{PAda\text{-}PPO}} = \mathcal{L}_\pi + c_v\, \mathcal{L}_V$$

where $c_v$ is the value-loss coefficient.
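The advantage and loss computations follow standard PPO; the sketch below reproduces Eqs. (10), (11), (13), and (15) under that assumption. `vf_coef` stands in for $c_v$, and the default values of `gamma`, `lam`, and `clip_eps` are typical PPO choices, not necessarily those used in the paper.

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Eq. (10): Generalized Advantage Estimation over one trajectory."""
    advantages = torch.zeros_like(rewards)
    last_adv, next_value = 0.0, 0.0  # no bootstrap past the terminal step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
        next_value = values[t]
    returns = advantages + values  # return targets \hat{R}_t
    return advantages, returns

def pada_ppo_loss(logprobs, old_logprobs, values, old_values,
                  advantages, returns, clip_eps=0.2, vf_coef=0.5):
    """Eqs. (11), (13), (15): clipped value loss, clipped surrogate, total."""
    # Eq. (13): clipped surrogate policy loss with importance ratio rho_t.
    ratio = torch.exp(logprobs - old_logprobs)
    policy_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages).mean()
    # Eq. (11): value loss clipped around the old value estimates.
    values_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    value_loss = torch.max((values - returns) ** 2,
                           (values_clipped - returns) ** 2).mean()
    # Eq. (15): total loss (the value coefficient is an assumed hyperparameter).
    return policy_loss + vf_coef * value_loss
```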
These design elements jointly enforce accuracy in preference satisfaction prediction and regularize policy updates with respect to preferences at both the individual and group levels.
2. Individual Preference Conditioning via Chain-of-Personalized-Reasoning (CoPeR)
CoPeR introduces personalized reasoning supervision into PAda-PPO in two phases:
- Supervised Fine-Tuning (SFT) Stage: The base model is conditioned on a user-specific Chain-of-Thought (CoPeR) prompt and trained to generate a personalized reasoning tuple together with the final satisfaction score $\hat{y}$. This prepares the network to encode explicit, user-specific reasoning traces.
- Reinforcement Learning (RL) Stage: The CoPeR prefix is prepended to the input of each state $s_t$. The output sequence must include the final satisfaction score $\hat{y}$, which is subject to the terminal reward. There is no explicit CoPeR supervision beyond this reward, but the model is incentivized to preserve the personalized reasoning chain learned during SFT.
PAda-PPO enforces this via the terminal reward of Eq. (7), which encourages the policy to leverage CoPeR-derived latent intent and preference alignment.
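To illustrate how a CoPeR prefix might be attached to each RL state and how the final score feeds the terminal reward, a minimal sketch follows. The prompt wording, field names, and the `Score:` output convention are hypothetical; the exact CoPeR template is defined in the original source.

```python
import re

def build_coper_input(coper_prefix: str, dialogue_state: str) -> str:
    """Prepend the user-specific CoPeR reasoning prefix to the RL state s_t.
    The prompt wording here is illustrative only."""
    return (
        f"{coper_prefix}\n"
        "Reason about this user's preferences step by step, "
        "then output the satisfaction score.\n"
        f"Dialogue so far:\n{dialogue_state}\n"
    )

def extract_score(generation: str):
    """Parse the final satisfaction score from the generated sequence.
    Assumes a hypothetical 'Score: <int>' convention; the terminal reward
    (Eq. (7)) compares this value against the gold label."""
    match = re.search(r"Score:\s*(\d+)", generation)
    return int(match.group(1)) if match else None
```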
3. Group-Level Preference Discovery Using Majority-Minority Preference-Aware Clustering (M²PC)
M²PC is an expectation-maximization (EM) algorithm that discovers two user clusters ("majority" and "minority") in an unsupervised manner, using LLM perplexity as the assignment signal. Two reference models, $\pi_{\mathrm{ref}}^{\mathrm{maj}}$ and $\pi_{\mathrm{ref}}^{\mathrm{min}}$, are fine-tuned on the respective clusters for use in KL regularization during RL training.
The procedure is as follows:
- E-step: For each user $u$ with dialogues $x_u$, assign $u$ to the cluster whose reference model yields the lowest perplexity:

  $$g(u) = \arg\min_{g \in \{\mathrm{maj},\, \mathrm{min}\}} \operatorname{PPL}_{\pi_{\mathrm{ref}}^{g}}\!\left(x_u\right)$$

- M-step: Fine-tune each reference model $\pi_{\mathrm{ref}}^{g}$ on the dialogues assigned to its cluster via the negative log-likelihood:

  $$\mathcal{L}_{g} = -\sum_{u\,:\,g(u)=g} \log \pi_{\mathrm{ref}}^{g}\!\left(x_u\right)$$
Upon completion, each reference model encodes group-specific linguistic and preference trends, which are subsequently used for per-sample KL regularization of PAda-PPO’s policy, with the group assignment determined at inference time by perplexity comparison (perplexity routing).
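A minimal sketch of the M²PC EM loop, assuming Hugging Face-style causal LMs where perplexity is derived from the model's own loss; the `finetune` callable is a placeholder for whatever training setup (e.g., LoRA) the original source uses.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of `text` under `model` (exp of the mean token NLL)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def m2pc(user_dialogues, ref_maj, ref_min, tokenizer, finetune, n_iters=10):
    """E-step: route each user to the reference model with lower perplexity.
    M-step: fine-tune each reference model on its assigned dialogues via NLL.
    `finetune(model, texts)` is a placeholder for the actual training loop."""
    for _ in range(n_iters):
        # E-step: perplexity routing of every user's dialogues.
        maj_texts, min_texts = [], []
        for text in user_dialogues:
            if perplexity(ref_maj, tokenizer, text) <= perplexity(ref_min, tokenizer, text):
                maj_texts.append(text)
            else:
                min_texts.append(text)
        # M-step: update each group-specific reference model on its cluster.
        finetune(ref_maj, maj_texts)
        finetune(ref_min, min_texts)
    return ref_maj, ref_min
```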
4. End-to-End Training Pipeline and Practical Details
The comprehensive PAda-PPO training procedure comprises three sequential stages:
- Supervised Fine-Tuning (with CoPeR):
- Model and data: as specified in the original source
- Hyperparameters: learning rate and epoch count as given in the original source; batch size $8$; LoRA ($r=16$, $\alpha=16$, dropout $=0.1$) (see the `peft` sketch after this list)
- Group-Specific Reference Model Construction (M²PC):
- Initialize: majority and minority reference models $\pi_{\mathrm{ref}}^{\mathrm{maj}}$ and $\pi_{\mathrm{ref}}^{\mathrm{min}}$
- EM iterations: 10 (batch size 2; learning rate as given in the original source)
- PAda-PPO RL Training:
- Initialize: policy $\pi_\theta$ and value head $V$
- PPO hyperparameters: batch size 2, epochs 5; discount $\gamma$, GAE parameter $\lambda$, clip range $\epsilon$, KL coefficient $\beta$, value-loss coefficient $c_v$, and learning rate as given in the original source
- Per batch:
- For each input, assign the group-specific reference model $\pi_{\mathrm{ref}}^{g}$ (majority or minority) by lower perplexity.
- Collect trajectories; apply the terminal reward and KL penalty as above.
- Update the policy and value head by minimizing the PAda-PPO loss (Eq. (15)).
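The LoRA settings listed for the SFT stage map directly onto a Hugging Face `peft` configuration; a minimal sketch, where the `task_type` and the base-model wiring are assumptions:

```python
from peft import LoraConfig, get_peft_model

# LoRA settings from the SFT stage above; task_type is an assumption about
# how the base causal LM is wrapped.
lora_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.1,
                         task_type="CAUSAL_LM")
# model = get_peft_model(base_model, lora_config)  # base model as in the original source
```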
End-to-end pseudocode is provided in the original source, with explicit algorithmic steps and hyperparameter values.
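Tying the stages together, the condensed sketch below mirrors the per-batch loop described above, reusing the helpers from the earlier sketches; `rollout`, the trajectory fields, and the default `beta` are assumptions, not the authors' released implementation.

```python
def pada_ppo_batch_step(policy, value_head, ref_maj, ref_min, tokenizer,
                        batch, optimizer, beta=0.1):
    """One PAda-PPO update: perplexity routing, rollout, reward shaping, PPO loss.
    `rollout` is an assumed helper returning the sampled text plus per-token
    log-probs and values (with 'old' copies recorded at sampling time)."""
    optimizer.zero_grad()
    for example in batch:
        prompt = build_coper_input(example["coper_prefix"], example["dialogue"])

        # Assign the group-specific reference model by lower perplexity (M²PC routing).
        ref = (ref_maj if perplexity(ref_maj, tokenizer, prompt)
               <= perplexity(ref_min, tokenizer, prompt) else ref_min)

        # Collect a trajectory under the current policy.
        traj = rollout(policy, value_head, ref, tokenizer, prompt)

        # Terminal reward and diversity-aware KL penalty (earlier sketch).
        rewards = per_step_rewards(traj.logprobs, traj.ref_logprobs,
                                   extract_score(traj.text),
                                   example["gold_label"], beta=beta)

        # GAE advantages and the clipped PPO losses (earlier sketches).
        advantages, returns = gae(rewards.detach(), traj.old_values)
        loss = pada_ppo_loss(traj.logprobs, traj.old_logprobs,
                             traj.values, traj.old_values,
                             advantages, returns)
        (loss / len(batch)).backward()

    optimizer.step()
```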
5. Experimental Protocol and Performance Metrics
Experiments are performed on the Emotional Support Conversation dataset (ESConv), consisting of 1300 conversations and 38k utterances, with an 8:1:1 split for train, validation, and test partitions, respectively.
- Reward scaling: a sparse terminal reward is granted for an exact satisfaction-label match (Eq. (7)); all intermediate task rewards are zero.
- Evaluation metrics (see the scikit-learn sketch after this list):
- $F1_{\mathrm{low}}$, $F1_{\mathrm{high}}$: class-wise F1 for "low"/"high" satisfaction
- Macro-$F1$: unweighted mean of $F1_{\mathrm{low}}$ and $F1_{\mathrm{high}}$
- Weighted-$F1$: mean of class-wise F1 weighted by class support
- System configuration: 4 × RTX A6000 GPUs, DeepSpeed ZeRO2 + Accelerate; all LoRA and mixed precision options as specified.
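Since evaluation reduces to class-wise and averaged F1 over the binary "low"/"high" satisfaction labels, the metrics can be reproduced with scikit-learn; the labels below are illustrative only.

```python
from sklearn.metrics import f1_score

# Illustrative gold and predicted satisfaction labels (0 = "low", 1 = "high").
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

f1_low, f1_high = f1_score(y_true, y_pred, average=None, labels=[0, 1])
macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class support

print(f"F1_low={f1_low:.3f}  F1_high={f1_high:.3f}  "
      f"Macro-F1={macro_f1:.3f}  Weighted-F1={weighted_f1:.3f}")
```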
The central experimental finding is that PAda-PPO demonstrates consistent improvements in user satisfaction estimation across these metrics, with particular gains for underrepresented (minority) user groups.
6. Integration and Theoretical Implications
The PAda-PPO framework consolidates individualized and group-level preference modeling within a single reinforcement learning paradigm. Through (a) interpretable, user-specific reasoning (via CoPeR) and (b) unsupervised group discovery with tailored regularization (via M²PC), the approach directly addresses limitations of uniform alignment in traditional dialogue system optimization.
A plausible implication is that frameworks jointly supervising for both individual and group variation can better preserve minority perspectives and nuanced user satisfaction, particularly in settings where subjective preference heterogeneity is critical. This suggests potential for broader adoption in alignment-sensitive generative modeling tasks where group fairness or personalization are explicit goals.