
PAda-PPO: Preference-Adaptive RL Framework

Updated 14 November 2025
  • The paper introduces PAda-PPO, which combines personalized supervision (via CoPeR) and group-level clustering (via M²PC) to optimize dialogue satisfaction.
  • It extends PPO with a sparse terminal reward and a diversity-aware KL penalty to align policies with both individual and group preferences.
  • Experimental results on ESConv demonstrate improved satisfaction prediction, especially for underrepresented user groups.

The Preference-Adaptive Reinforcement Learning Framework (PAda-PPO) is a unified methodology for satisfaction estimation in dialogue systems that addresses subjective and group-specific user preferences by integrating individualized reasoning traces with unsupervised clustering of majority and minority user groups. Built atop Proximal Policy Optimization (PPO), PAda-PPO combines personalized supervision (via Chain-of-Personalized-Reasoning, CoPeR) and per-group regularization (via Majority-Minority Preference-Aware Clustering, M²PC), jointly optimizing for both individual- and group-level satisfaction alignment.

1. Reinforcement Learning Objective and Core Modifications

PAda-PPO extends classic PPO with two primary modifications to its RL training loop:

  • A sparse terminal reward based on exact prediction of the user satisfaction label, and
  • A diversity-aware Kullback-Leibler (KL) penalty that regularizes the policy against a group-specific reference model.

Let $s$ denote the predicted satisfaction score, $y$ the gold label, and $G(\xi)$ the current policy. The reward at terminal timestep $T$ is given by:

$$r_T = \begin{cases} +1 & \text{if } s = y, \\ -1 & \text{otherwise,} \end{cases} \qquad r_t = 0 \quad (t < T)$$

(Eq.(7))

At every timestep $t$, the diversity-aware KL penalty is

$$\mathrm{KL}_t = D_{\mathrm{KL}}\bigl( G(\xi)(\cdot \mid s_t, m) \;\|\; G(\beta)_m(\cdot \mid s_t) \bigr)$$

(Eq.(8))

where $G(\beta)_m$ is a group-conditioned reference model (majority or minority), with $m$ determined by perplexity routing via M²PC.

The per-step total reward is

$$r_t^{\mathrm{total}} = r_t - \lambda_{\mathrm{KL}}\, \mathrm{KL}_t$$

(Eq.(9))
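
As a minimal sketch of Eqs. (7)–(9) (assuming PyTorch tensors and a per-token log-probability approximation of the KL term; names are illustrative, not the authors' code):

```python
import torch

def total_rewards(pred_label, gold_label, policy_logprobs, ref_logprobs, kl_coef=0.2):
    """Sparse terminal reward (Eq. 7) minus a diversity-aware KL penalty (Eqs. 8-9).

    policy_logprobs / ref_logprobs: 1-D tensors of per-token log-probabilities of
    the generated sequence under the policy G(xi) and the group-matched
    reference G(beta)_m selected by perplexity routing.
    """
    T = policy_logprobs.shape[0]
    rewards = torch.zeros(T)
    # Terminal reward: +1 for an exact satisfaction-label match, -1 otherwise.
    rewards[-1] = 1.0 if pred_label == gold_label else -1.0
    # Per-token KL estimate: log G(xi)(a_t | s_t, m) - log G(beta)_m(a_t | s_t).
    kl_t = policy_logprobs - ref_logprobs
    # r_t^total = r_t - lambda_KL * KL_t (Eq. 9).
    return rewards - kl_coef * kl_t
```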

Generalized Advantage Estimation (GAE) is used to compute advantages, with discount factor $\gamma$ and GAE parameter $\lambda$ (Eq.(10)):

$$\begin{aligned} \delta_t &= r_t^{\mathrm{total}} + \gamma\,V_{\phi_{\mathrm{old}}}(s_{t+1}) - V_{\phi_{\mathrm{old}}}(s_t) \\ \hat{A}_t &= \sum_{l=0}^{T-t} (\gamma\lambda)^l\,\delta_{t+l} \end{aligned}$$
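
A minimal plain-Python sketch of this backward recursion, assuming the value after the terminal step is treated as $0$:

```python
def gae_advantages(total_rewards, values, gamma=0.95, lam=1.0):
    """Backward GAE recursion over one trajectory (Eq. 10).

    total_rewards: list of r_t^total for t = 0..T
    values: list of V_{phi_old}(s_t) for t = 0..T
    """
    T = len(total_rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # bootstrap 0 after terminal step
        delta = total_rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```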

The value loss employs clipping:

$$\mathcal{L}_{\mathrm{value}}(\phi) = \tfrac{1}{2}\,\mathbb{E}_t\biggl[ \max\Bigl( \bigl(V_{\phi}(s_t)-\hat R_t\bigr)^2,\; \bigl(\mathrm{clip}\bigl(V_{\phi}(s_t) - V_{\phi_{\mathrm{old}}}(s_t), -\epsilon, \epsilon\bigr) - \hat R_t\bigr)^2 \Bigr) \biggr]$$

(Eq.(11))

The policy loss follows PPO’s clipped surrogate objective:

$$\mathcal{L}_{\mathrm{policy}}(\xi) = -\mathbb{E}_t\Bigl[ \min\Bigl( \rho_t(\xi)\,\hat{A}_t,\; \mathrm{clip}\bigl(\rho_t(\xi), 1-\epsilon, 1+\epsilon\bigr)\,\hat{A}_t \Bigr) \Bigr]$$

(Eq.(13))

with $\rho_t(\xi) = \dfrac{G(\xi)(a_t\mid s_t,m)}{G(\xi_{\mathrm{old}})(a_t\mid s_t,m)}$.

The total PAda-PPO loss is:

$$\mathcal{L}_{\mathrm{PPO}}(\xi, \phi) = \mathcal{L}_{\mathrm{policy}}(\xi) + c_{\mathrm{VF}}\,\mathcal{L}_{\mathrm{value}}(\phi)$$

(Eq.(15))
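
A compact PyTorch-style sketch of Eqs. (11), (13), and (15), written to follow the equations exactly as stated above (tensor names and shapes are assumptions):

```python
import torch

def pada_ppo_loss(logprobs, old_logprobs, advantages, values, old_values, returns,
                  clip_eps=0.2, c_vf=0.1):
    """All arguments are 1-D tensors over the timesteps of the collected trajectories."""
    # Clipped surrogate policy loss (Eq. 13), with rho_t = exp(log pi - log pi_old).
    ratio = torch.exp(logprobs - old_logprobs)
    policy_loss = -torch.mean(torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages))
    # Clipped value loss (Eq. 11), following the equation as written above.
    v_clipped = torch.clamp(values - old_values, -clip_eps, clip_eps)
    value_loss = 0.5 * torch.mean(torch.max((values - returns) ** 2,
                                            (v_clipped - returns) ** 2))
    # Total PAda-PPO loss (Eq. 15).
    return policy_loss + c_vf * value_loss
```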

These design elements jointly enforce accuracy in satisfaction prediction and regularize policy updates with respect to both individual- and group-level preferences.

2. Individual Preference Conditioning via Chain-of-Personalized-Reasoning (CoPeR)

CoPeR introduces personalized reasoning supervision into PAda-PPO in two phases:

  1. Supervised Fine-Tuning (SFT) Stage: The base model $G(\theta)$ is conditioned on a user-specific Chain-of-Thought prompt $r_{\mathrm{ucot}}$ and trained to generate a reasoning tuple $(r_{\mathrm{intent}}, r_{\mathrm{strategy}}, r_{\mathrm{match}}, r_{\mathrm{reason}})$, plus the final score $s$. This prepares the network to encode explicit, user-specific reasoning traces.
  2. Reinforcement Learning (RL) Stage: The CoPeR prefix is prepended to the input forming each initial state $s_0$, i.e.,

$$x = [c \,\|\, r_{\mathrm{ucot}}]$$

The output sequence must include the final score $s$, which is subject to the terminal reward. There is no explicit CoPeR supervision beyond this reward, but the model is incentivized to preserve the personalized reasoning chain learned during SFT.

PAda-PPO enforces this via the reward

$$R(x; \xi) = \mathbf{1}\{ s_\xi(x) = y \} \times (+1) - \bigl(1 - \mathbf{1}\{ s_\xi(x) = y \}\bigr),$$

which encourages the policy to leverage CoPeR-derived latent intent and preference alignment.
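
A hypothetical sketch of the CoPeR-conditioned input construction and the indicator reward (the names `context`, `ucot_prompt`, and `parse_score` are placeholders, not from the paper):

```python
def build_coper_input(context: str, ucot_prompt: str) -> str:
    # x = [c || r_ucot]: concatenate the dialogue context with the
    # user-specific Chain-of-Personalized-Reasoning prompt.
    return f"{context}\n{ucot_prompt}"

def terminal_reward(generated: str, gold_label: str, parse_score) -> float:
    # R(x; xi): +1 if the parsed satisfaction score equals the gold label, else -1.
    predicted = parse_score(generated)  # extract the final score s from the output
    return 1.0 if predicted == gold_label else -1.0
```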

3. Group-Level Preference Discovery Using Majority-Minority Preference-Aware Clustering (M²PC)

M²PC is an expectation-maximization (EM) algorithm that discovers two user clusters ("majority" and "minority") in an unsupervised manner, using LLM perplexity as the assignment signal. Two reference models $G(\beta)_{\mathrm{Major}}$ and $G(\beta)_{\mathrm{Minor}}$ are fine-tuned on the respective clusters for use in KL regularization during RL training.

The procedure is as follows:

  • E-step: For each user $i$, assign to the cluster $k$ with the lowest perplexity: $l_i^{(t)} = \arg\min_{k \in \{\mathrm{Major}, \mathrm{Minor}\}} \mathrm{PPL}\bigl(D_i; G(\beta)_k^{(t)}\bigr)$, where

$$\mathrm{PPL}(D_i; G) = \exp\Bigl( -\frac{1}{|D_i|} \sum_{w \in D_i} \log P_G(w) \Bigr)$$

  • M-step: Fine-tune each $G(\beta)_k$ on the dialogues assigned to cluster $k$ via the negative log-likelihood: $G(\beta)_k^{(t+1)} = \arg\min_{G} \sum_{i:\, l_i^{(t)} = k} \mathcal{L}_{\mathrm{NLL}}(G; D_i)$

Upon completion, each reference model encodes group-specific linguistic and preference trends, which are subsequently used for per-sample KL regularization of PAda-PPO’s policy, with the group assignment determined at inference time by perplexity comparison (perplexity routing).
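
A minimal sketch of this EM loop (the callables `perplexity_fn` and `finetune_fn` are assumed user-supplied helpers, not APIs from the paper):

```python
def m2pc(users, ref_models, perplexity_fn, finetune_fn, em_iters=10):
    """Alternate perplexity-based assignment (E-step) and NLL fine-tuning (M-step).

    users: dict mapping user id -> that user's dialogues D_i
    ref_models: {"Major": model, "Minor": model}, both initialised from G(theta)
    perplexity_fn(model, dialogues): returns PPL(D_i; G)
    finetune_fn(model, dialogues): returns the model fine-tuned with the NLL loss
    """
    assignments = {}
    for _ in range(em_iters):
        # E-step: route each user to the reference model with the lower perplexity.
        assignments = {
            uid: min(ref_models, key=lambda k: perplexity_fn(ref_models[k], dialogues))
            for uid, dialogues in users.items()
        }
        # M-step: fine-tune each reference on the dialogues assigned to it.
        for k in ref_models:
            cluster = [users[uid] for uid, label in assignments.items() if label == k]
            ref_models[k] = finetune_fn(ref_models[k], cluster)
    return ref_models, assignments
```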

4. End-to-End Training Pipeline and Practical Details

The comprehensive PAda-PPO training procedure comprises three sequential stages:

  1. Supervised Fine-Tuning (with CoPeR):
    • Model: $G(\theta)$
    • Data: $(c + r_{\mathrm{ucot}})_i \rightarrow (r_{\mathrm{coper}_i}, y_i)$
    • Hyperparameters: learning rate $1 \times 10^{-4}$, batch size 8, epochs $\leq 15$, LoRA ($r=16$, $\alpha=16$, dropout $=0.1$)
  2. Group-Specific Reference Model Construction (M²PC):
    • Initialize: $G(\beta)_{\mathrm{Major}} \leftarrow G(\theta)$, $G(\beta)_{\mathrm{Minor}} \leftarrow G(\theta)$
    • EM iterations: 10 (batch size 2, learning rate $1 \times 10^{-5}$)
  3. PAda-PPO RL Training:
    • Initialize: policy $G(\xi) \leftarrow G(\theta)$, value head $V_\phi$
    • PPO hyperparameters: $\lambda_{\mathrm{KL}}=0.2$, $\gamma=0.95$, $\lambda_{\mathrm{GAE}}=1$, $\epsilon=0.2$, $c_{\mathrm{VF}}=0.1$, batch size 2, epochs 5, learning rate $3 \times 10^{-7}$
    • Per batch:
      • For each input $c_i + r_{\mathrm{ucot}_i}$, assign $m_i$ (majority or minority) by the lower perplexity.
      • Collect trajectories; apply the terminal reward and KL penalty as above.
      • Update $(\xi, \phi)$ by minimizing the PPO loss.

End-to-end pseudocode is provided in the original source, with explicit algorithmic steps and hyperparameter values.
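
For reference, the stage-wise hyperparameters listed above can be collected into a plain config dictionary (the structure and key names below are illustrative, not the paper's):

```python
# Stage-wise hyperparameters as reported above; key names are my own.
PADA_PPO_CONFIG = {
    "sft": {  # Stage 1: supervised fine-tuning with CoPeR
        "lr": 1e-4, "batch_size": 8, "max_epochs": 15,
        "lora": {"r": 16, "alpha": 16, "dropout": 0.1},
    },
    "m2pc": {  # Stage 2: group-specific reference model construction
        "em_iterations": 10, "batch_size": 2, "lr": 1e-5,
    },
    "ppo": {  # Stage 3: PAda-PPO RL training
        "kl_coef": 0.2, "gamma": 0.95, "gae_lambda": 1.0, "clip_eps": 0.2,
        "c_vf": 0.1, "batch_size": 2, "epochs": 5, "lr": 3e-7,
    },
}
```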

5. Experimental Protocol and Performance Metrics

Experiments are performed on the Emotional Support Conversation dataset (ESConv), consisting of 1300 conversations and 38k utterances, with an 8:1:1 split for train, validation, and test partitions, respectively.

  • Reward scaling: Terminal reward $+1/-1$; all intermediate rewards zero.
  • Evaluation metrics (see the sketch after this list):
    • $F_1^{\mathrm{low}}$, $F_1^{\mathrm{high}}$: class-wise $F_1$ for "low"/"high" satisfaction
    • Macro-$F_1$ ($F_1^{\mathrm{m}}$): unweighted mean of the class-wise scores
    • Weighted-$F_1$ ($F_1^{\mathrm{w}}$): support-weighted mean
  • System configuration: 4 × RTX A6000 GPUs, DeepSpeed ZeRO2 + Accelerate; LoRA and mixed-precision options as specified in the original source.
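
A quick illustration of these metrics using scikit-learn (the toy labels below are made up):

```python
from sklearn.metrics import f1_score

# Toy satisfaction labels purely for illustration.
y_true = ["low", "high", "high", "low", "high", "high"]
y_pred = ["low", "high", "low",  "low", "high", "high"]

f1_low, f1_high = f1_score(y_true, y_pred, labels=["low", "high"], average=None)
f1_macro = f1_score(y_true, y_pred, average="macro")        # F1^m: unweighted mean
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # F1^w: support-weighted
```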

The central experimental finding is that PAda-PPO demonstrates consistent improvements in user satisfaction estimation across these metrics, with particular gains for underrepresented (minority) user groups.

6. Integration and Theoretical Implications

The PAda-PPO framework consolidates individualized and group-level preference modeling within a single reinforcement learning paradigm. Through (a) interpretable, user-specific reasoning (via CoPeR) and (b) unsupervised group discovery with tailored regularization (via M²PC), the approach directly addresses limitations of uniform alignment in traditional dialogue system optimization.

A plausible implication is that frameworks jointly supervising for both individual and group variation can better preserve minority perspectives and nuanced user satisfaction, particularly in settings where subjective preference heterogeneity is critical. This suggests potential for broader adoption in alignment-sensitive generative modeling tasks where group fairness or personalization is an explicit goal.
