
PAda-PPO: Preference-Adaptive RL Framework

Updated 14 November 2025
  • The paper introduces PAda-PPO, which combines personalized supervision (via CoPeR) and group-level clustering (via M²PC) to optimize dialogue satisfaction.
  • It extends PPO with a sparse terminal reward and a diversity-aware KL penalty to align policies with both individual and group preferences.
  • Experimental results on ESConv demonstrate improved satisfaction prediction, especially for underrepresented user groups.

The Preference-Adaptive Reinforcement Learning Framework (PAda-PPO) is a unified methodology for satisfaction estimation in dialogue systems that addresses subjective and group-specific user preferences by integrating individualized reasoning traces with unsupervised clustering of majority and minority user groups. Built atop Proximal Policy Optimization (PPO), PAda-PPO combines personalized supervision (via Chain-of-Personalized-Reasoning, CoPeR) and per-group regularization (via Majority-Minority Preference-Aware Clustering, M²PC), jointly optimizing for both individual- and group-level satisfaction alignment.

1. Reinforcement Learning Objective and Core Modifications

PAda-PPO extends classic PPO with two primary modifications to its RL training loop:

  • A sparse terminal reward based on exact prediction of the user satisfaction label, and
  • A diversity-aware Kullback-Leibler (KL) penalty that regularizes the policy against a group-specific reference model.

Let $s$ denote the predicted satisfaction score, $y$ the gold label, and $G(\xi)$ the current policy. The reward at terminal timestep $T$ is given by:

$$r_T = \begin{cases} +1 & \text{if } s = y, \\ -1 & \text{otherwise,} \end{cases} \qquad r_t = 0 \quad (t < T)$$

(Eq.(7))

At every timestep $t$, the diversity-aware KL penalty is

$$\mathrm{KL}_t = D_{\mathrm{KL}}\bigl( G(\xi)(\cdot \mid s_t, m) \;\|\; G(\beta)_m(\cdot \mid s_t) \bigr)$$

(Eq.(8))

where $G(\beta)_m$ is a group-conditioned reference model (majority or minority), with $m$ determined by perplexity routing via M²PC.

The per-step total reward is

$$r_t^{\mathrm{total}} = r_t - \lambda_{\mathrm{KL}}\, \mathrm{KL}_t$$

(Eq.(9))
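
As a minimal sketch of Eqs. (7)–(9) (assuming PyTorch tensors and a per-token log-probability approximation of the KL term; names are illustrative, not the authors' code):

```python
import torch

def total_rewards(pred_label, gold_label, policy_logprobs, ref_logprobs, kl_coef=0.2):
    """Sparse terminal reward (Eq. 7) minus a diversity-aware KL penalty (Eqs. 8-9).

    policy_logprobs / ref_logprobs: 1-D tensors of per-token log-probabilities of
    the generated sequence under the policy G(xi) and the group-matched
    reference G(beta)_m selected by perplexity routing.
    """
    T = policy_logprobs.shape[0]
    rewards = torch.zeros(T)
    # Terminal reward: +1 for an exact satisfaction-label match, -1 otherwise.
    rewards[-1] = 1.0 if pred_label == gold_label else -1.0
    # Per-token KL estimate: log G(xi)(a_t | s_t, m) - log G(beta)_m(a_t | s_t).
    kl_t = policy_logprobs - ref_logprobs
    # r_t^total = r_t - lambda_KL * KL_t (Eq. 9).
    return rewards - kl_coef * kl_t
```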

Generalized Advantage Estimation (GAE) is used to compute advantages, with discount factor $\gamma$ and GAE parameter $\lambda$ (Eq.(10)):

$$\begin{aligned} \delta_t &= r_t^{\mathrm{total}} + \gamma\,V_{\phi_{\mathrm{old}}}(s_{t+1}) - V_{\phi_{\mathrm{old}}}(s_t) \\ \hat{A}_t &= \sum_{l=0}^{T-t} (\gamma\lambda)^l\,\delta_{t+l} \end{aligned}$$
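
A minimal plain-Python sketch of this backward recursion, assuming the value after the terminal step is treated as $0$:

```python
def gae_advantages(total_rewards, values, gamma=0.95, lam=1.0):
    """Backward GAE recursion over one trajectory (Eq. 10).

    total_rewards: list of r_t^total for t = 0..T
    values: list of V_{phi_old}(s_t) for t = 0..T
    """
    T = len(total_rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # bootstrap 0 after terminal step
        delta = total_rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```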

The value loss employs clipping:

$$\mathcal{L}_{\mathrm{value}}(\phi) = \tfrac{1}{2}\,\mathbb{E}_t\biggl[ \max\Bigl( \bigl(V_{\phi}(s_t)-\hat R_t\bigr)^2,\; \bigl(\mathrm{clip}\bigl(V_{\phi}(s_t) - V_{\phi_{\mathrm{old}}}(s_t), -\epsilon, \epsilon\bigr) - \hat R_t\bigr)^2 \Bigr) \biggr]$$

(Eq.(11))

The policy loss follows PPO’s clipped surrogate objective:

$$\mathcal{L}_{\mathrm{policy}}(\xi) = -\mathbb{E}_t\Bigl[ \min\Bigl( \rho_t(\xi)\,\hat{A}_t,\; \mathrm{clip}\bigl(\rho_t(\xi), 1-\epsilon, 1+\epsilon\bigr)\,\hat{A}_t \Bigr) \Bigr]$$

(Eq.(13))

with $\rho_t(\xi) = \dfrac{G(\xi)(a_t\mid s_t,m)}{G(\xi_{\mathrm{old}})(a_t\mid s_t,m)}$.

The total PAda-PPO loss is:

$$\mathcal{L}_{\mathrm{PPO}}(\xi, \phi) = \mathcal{L}_{\mathrm{policy}}(\xi) + c_{\mathrm{VF}}\,\mathcal{L}_{\mathrm{value}}(\phi)$$

(Eq.(15))
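
A compact PyTorch-style sketch of Eqs. (11), (13), and (15), written to follow the equations exactly as stated above (tensor names and shapes are assumptions):

```python
import torch

def pada_ppo_loss(logprobs, old_logprobs, advantages, values, old_values, returns,
                  clip_eps=0.2, c_vf=0.1):
    """All arguments are 1-D tensors over the timesteps of the collected trajectories."""
    # Clipped surrogate policy loss (Eq. 13), with rho_t = exp(log pi - log pi_old).
    ratio = torch.exp(logprobs - old_logprobs)
    policy_loss = -torch.mean(torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages))
    # Clipped value loss (Eq. 11), following the equation as written above.
    v_clipped = torch.clamp(values - old_values, -clip_eps, clip_eps)
    value_loss = 0.5 * torch.mean(torch.max((values - returns) ** 2,
                                            (v_clipped - returns) ** 2))
    # Total PAda-PPO loss (Eq. 15).
    return policy_loss + c_vf * value_loss
```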

These design elements jointly enforce accuracy in satisfaction prediction and regularize policy updates with respect to both individual- and group-level preferences.

2. Individual Preference Conditioning via Chain-of-Personalized-Reasoning (CoPeR)

CoPeR introduces personalized reasoning supervision into PAda-PPO in two phases:

  1. Supervised Fine-Tuning (SFT) Stage: The base model $G(\theta)$ is conditioned on a user-specific Chain-of-Thought prompt $r_{\mathrm{ucot}}$ and trained to generate a reasoning tuple $(r_{\mathrm{intent}}, r_{\mathrm{strategy}}, r_{\mathrm{match}}, r_{\mathrm{reason}})$, plus the final score $s$. This prepares the network to encode explicit, user-specific reasoning traces.
  2. Reinforcement Learning (RL) Stage: The CoPeR prefix is prepended to the input forming each initial state $s_0$, i.e.,

$$x = [c \,\|\, r_{\mathrm{ucot}}]$$

The output sequence must include the final score $s$, which is subject to the terminal reward. There is no explicit CoPeR supervision beyond this reward, but the model is incentivized to preserve the personalized reasoning chain learned during SFT.

PAda-PPO enforces this via the reward

$$R(x; \xi) = \mathbf{1}\{ s_\xi(x) = y \} \times (+1) - \bigl(1 - \mathbf{1}\{ s_\xi(x) = y \}\bigr),$$

which encourages the policy to leverage CoPeR-derived latent intent and preference alignment.
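
A hypothetical sketch of the CoPeR-conditioned input construction and the indicator reward (the names `context`, `ucot_prompt`, and `parse_score` are placeholders, not from the paper):

```python
def build_coper_input(context: str, ucot_prompt: str) -> str:
    # x = [c || r_ucot]: concatenate the dialogue context with the
    # user-specific Chain-of-Personalized-Reasoning prompt.
    return f"{context}\n{ucot_prompt}"

def terminal_reward(generated: str, gold_label: str, parse_score) -> float:
    # R(x; xi): +1 if the parsed satisfaction score equals the gold label, else -1.
    predicted = parse_score(generated)  # extract the final score s from the output
    return 1.0 if predicted == gold_label else -1.0
```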

3. Group-Level Preference Discovery Using Majority-Minority Preference-Aware Clustering (M²PC)

M²PC is an expectation-maximization (EM) algorithm that discovers two user clusters ("majority" and "minority") in an unsupervised manner, using LLM perplexity as the assignment signal. Two reference models $G(\beta)_{\mathrm{Major}}$ and $G(\beta)_{\mathrm{Minor}}$ are fine-tuned on the respective clusters for use in KL regularization during RL training.

The procedure is as follows:

  • E-step: For each user $i$, assign to the cluster $k$ with the lowest perplexity: $l_i^{(t)} = \arg\min_{k \in \{\mathrm{Major}, \mathrm{Minor}\}} \mathrm{PPL}\bigl(D_i; G(\beta)_k^{(t)}\bigr)$, where

$$\mathrm{PPL}(D_i; G) = \exp\Bigl( -\frac{1}{|D_i|} \sum_{w \in D_i} \log P_G(w) \Bigr)$$

  • M-step: Fine-tune each $G(\beta)_k$ on the dialogues assigned to cluster $k$ via the negative log-likelihood: $G(\beta)_k^{(t+1)} = \arg\min_{G} \sum_{i:\, l_i^{(t)} = k} \mathcal{L}_{\mathrm{NLL}}(G; D_i)$

Upon completion, each reference model encodes group-specific linguistic and preference trends, which are subsequently used for per-sample KL regularization of PAda-PPO’s policy, with the group assignment determined at inference time by perplexity comparison (perplexity routing).
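
A minimal sketch of this EM loop (the callables `perplexity_fn` and `finetune_fn` are assumed user-supplied helpers, not APIs from the paper):

```python
def m2pc(users, ref_models, perplexity_fn, finetune_fn, em_iters=10):
    """Alternate perplexity-based assignment (E-step) and NLL fine-tuning (M-step).

    users: dict mapping user id -> that user's dialogues D_i
    ref_models: {"Major": model, "Minor": model}, both initialised from G(theta)
    perplexity_fn(model, dialogues): returns PPL(D_i; G)
    finetune_fn(model, dialogues): returns the model fine-tuned with the NLL loss
    """
    assignments = {}
    for _ in range(em_iters):
        # E-step: route each user to the reference model with the lower perplexity.
        assignments = {
            uid: min(ref_models, key=lambda k: perplexity_fn(ref_models[k], dialogues))
            for uid, dialogues in users.items()
        }
        # M-step: fine-tune each reference on the dialogues assigned to it.
        for k in ref_models:
            cluster = [users[uid] for uid, label in assignments.items() if label == k]
            ref_models[k] = finetune_fn(ref_models[k], cluster)
    return ref_models, assignments
```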

4. End-to-End Training Pipeline and Practical Details

The comprehensive PAda-PPO training procedure comprises three sequential stages:

  1. Supervised Fine-Tuning (with CoPeR):
    • Model: $G(\theta)$
    • Data: $(c + r_{\mathrm{ucot}})_i \rightarrow (r_{\mathrm{coper}_i}, y_i)$
    • Hyperparameters: learning rate $1 \times 10^{-4}$, batch size 8, epochs $\leq 15$, LoRA ($r=16$, $\alpha=16$, dropout $=0.1$)
  2. Group-Specific Reference Model Construction (M²PC):
    • Initialize: $G(\beta)_{\mathrm{Major}} \leftarrow G(\theta)$, $G(\beta)_{\mathrm{Minor}} \leftarrow G(\theta)$
    • EM iterations: 10 (batch size 2, learning rate $1 \times 10^{-5}$)
  3. PAda-PPO RL Training:
    • Initialize: policy $G(\xi) \leftarrow G(\theta)$, value head $V_\phi$
    • PPO hyperparameters: $\lambda_{\mathrm{KL}}=0.2$, $\gamma=0.95$, $\lambda_{\mathrm{GAE}}=1$, $\epsilon=0.2$, $c_{\mathrm{VF}}=0.1$, batch size 2, epochs 5, learning rate $3 \times 10^{-7}$
    • Per batch:
      • For each input $c_i + r_{\mathrm{ucot}_i}$, assign $m_i$ (majority or minority) by the lower perplexity.
      • Collect trajectories; apply the terminal reward and KL penalty as above.
      • Update $(\xi, \phi)$ by minimizing the PPO loss.

End-to-end pseudocode is provided in the original source, with explicit algorithmic steps and hyperparameter values.
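
For reference, the stage-wise hyperparameters listed above can be collected into a plain config dictionary (the structure and key names below are illustrative, not the paper's):

```python
# Stage-wise hyperparameters as reported above; key names are my own.
PADA_PPO_CONFIG = {
    "sft": {  # Stage 1: supervised fine-tuning with CoPeR
        "lr": 1e-4, "batch_size": 8, "max_epochs": 15,
        "lora": {"r": 16, "alpha": 16, "dropout": 0.1},
    },
    "m2pc": {  # Stage 2: group-specific reference model construction
        "em_iterations": 10, "batch_size": 2, "lr": 1e-5,
    },
    "ppo": {  # Stage 3: PAda-PPO RL training
        "kl_coef": 0.2, "gamma": 0.95, "gae_lambda": 1.0, "clip_eps": 0.2,
        "c_vf": 0.1, "batch_size": 2, "epochs": 5, "lr": 3e-7,
    },
}
```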

5. Experimental Protocol and Performance Metrics

Experiments are performed on the Emotional Support Conversation dataset (ESConv), consisting of 1300 conversations and 38k utterances, with an 8:1:1 split for train, validation, and test partitions, respectively.

  • Reward scaling: Terminal reward $+1/-1$; all intermediate rewards zero.
  • Evaluation metrics (see the sketch after this list):
    • $F_1^{\mathrm{low}}$, $F_1^{\mathrm{high}}$: class-wise $F_1$ for "low"/"high" satisfaction
    • Macro-$F_1$ ($F_1^{\mathrm{m}}$): unweighted mean of the class-wise scores
    • Weighted-$F_1$ ($F_1^{\mathrm{w}}$): support-weighted mean
  • System configuration: 4 × RTX A6000 GPUs, DeepSpeed ZeRO2 + Accelerate; LoRA and mixed-precision options as specified in the original source.
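
A quick illustration of these metrics using scikit-learn (the toy labels below are made up):

```python
from sklearn.metrics import f1_score

# Toy satisfaction labels purely for illustration.
y_true = ["low", "high", "high", "low", "high", "high"]
y_pred = ["low", "high", "low",  "low", "high", "high"]

f1_low, f1_high = f1_score(y_true, y_pred, labels=["low", "high"], average=None)
f1_macro = f1_score(y_true, y_pred, average="macro")        # F1^m: unweighted mean
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # F1^w: support-weighted
```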

The central experimental finding is that PAda-PPO demonstrates consistent improvements in user satisfaction estimation across these metrics, with particular gains for underrepresented (minority) user groups.

6. Integration and Theoretical Implications

The PAda-PPO framework consolidates individualized and group-level preference modeling within a single reinforcement learning paradigm. Through (a) interpretable, user-specific reasoning (via CoPeR) and (b) unsupervised group discovery with tailored regularization (via M²PC), the approach directly addresses limitations of uniform alignment in traditional dialogue system optimization.

A plausible implication is that frameworks jointly supervising for both individual and group variation can better preserve minority perspectives and nuanced user satisfaction, particularly in settings where subjective preference heterogeneity is critical. This suggests potential for broader adoption in alignment-sensitive generative modeling tasks where group fairness or personalization is an explicit goal.
