
Unbiased Chain Preference Grouping

Updated 8 January 2026
  • Unbiased chain preference grouping is a strategy within the CSO framework that balances emotional support strategy candidates to mitigate bias in multi-turn LLM dialogues.
  • It employs Monte Carlo Tree Search to systematically explore and group candidate strategies based on performance metrics, ensuring diverse and fair selection.
  • Empirical evaluations show improved macro-F1 scores and reduced bias metrics, resulting in more contextually appropriate and empathetic responses.

Unbiased chain preference grouping is an approach within the Chain-of-Strategy Optimization (CSO) framework that enables LLMs to select emotional-support strategies in multi-turn dialogue without systematically favoring particular classes of strategies. This methodology addresses preference bias during learning by constructing, at each dialogue turn, balanced sets of strategy-response pairs for fine-grained policy supervision. Core to the approach is the use of Monte Carlo Tree Search (MCTS) to explore and organize strategy candidates, ensuring coverage and diversity in both preferred and non-preferred examples, thereby facilitating empathy and adaptability in emotional support conversations (Zhao et al., 7 Mar 2025).

1. Problem Definition and Motivation

The primary goal is to learn a policy $\pi_\theta$ that ranks a set of emotional-support strategies $S = \{s_1, \dots, s_K\}$ at each dialogue turn $i$, such that strategy selection adapts to conversational context rather than exhibiting systematic bias (e.g., overusing "Emotional Validation" irrespective of need). In preference-supervised learning, for each turn $i$ under context $(x, H^{i-1})$, strategy pairs are extracted: a more-preferred strategy $s_w$ and one or more less-preferred strategies $s_l$. The learning process benefits from these fine-grained, turn-level comparisons.

Bias in preference ranking is quantified using the metric
$$B_i = \sum_{s \in S} \left| \hat{p}_\theta^i(s) - p_i^*(s) \right|,$$
where $\hat{p}_\theta^i(s)$ is the empirical model selection frequency and $p_i^*(s)$ is the ideal, context-dependent frequency for strategy $s$ at turn $i$. The average $B = (1/L) \sum_i B_i$ gives an overall measure across the dialogue.
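
As a concrete illustration, the following minimal Python sketch computes the per-turn bias $B_i$ and the dialogue-level average $B$ from observed strategy selections; the function names and input format (a list of selected strategies per turn plus an ideal distribution) are illustrative assumptions, not part of the CSO implementation.

from collections import Counter

def turn_bias(selected, ideal_dist, strategies):
    # B_i: L1 distance between the model's empirical selection
    # frequencies and the ideal, context-dependent frequencies.
    counts = Counter(selected)
    total = max(len(selected), 1)
    return sum(abs(counts[s] / total - ideal_dist.get(s, 0.0)) for s in strategies)

def dialogue_bias(per_turn_selected, per_turn_ideal, strategies):
    # B: average of B_i over the L turns of a dialogue.
    biases = [turn_bias(sel, ideal, strategies)
              for sel, ideal in zip(per_turn_selected, per_turn_ideal)]
    return sum(biases) / len(biases)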

The unbiased grouping objective at each turn is to construct, for every $i$, a balanced set of positives (high-quality, preferred) and negatives (low-quality, non-preferred) from candidate strategies. This ensures the model learns to discriminate effectively while avoiding skewed exposure that would amplify bias.

2. Mathematical & Algorithmic Foundations

Candidate strategies at each turn are explored using an MCTS process with the following PUCB value function:
$$\text{PUCB}(S) = Q(S) + c \cdot P(S) \frac{\sqrt{N(\text{parent}(S))}}{N(S)+1}$$
Here $P(S)$ is the normalized strategy-LLM score, $Q(S)$ is the average rollout reward, and $c$ modulates exploration.
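
As a minimal sketch (assuming each child node tracks its prior P, value Q, and visit count N, and that the parent's visit count is available; these attribute names are illustrative), selection by PUCB might look like:

import math

def pucb(q, prior, n_parent, n_child, c=1.0):
    # PUCB(S) = Q(S) + c * P(S) * sqrt(N(parent(S))) / (N(S) + 1)
    return q + c * prior * math.sqrt(n_parent) / (n_child + 1)

def select_child(parent, c=1.0):
    # Pick the child (candidate strategy) with the highest PUCB score.
    return max(parent.children,
               key=lambda ch: pucb(ch.Q, ch.P, parent.N, ch.N, c))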

The rollout reward combines several dimensions:
$$R = \frac{E + I + H + \alpha S}{10} + b$$
where $E$, $I$, $H$, and $S$ denote empathy, information, humanoid, and strategy scores, respectively; $\alpha$ and $b$ are constants for weighting and normalization.
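
A direct transcription of this reward, with placeholder defaults for the weighting constants (the actual values of $\alpha$ and $b$ are not specified here):

def rollout_reward(E, I, H, S, alpha=1.0, b=0.0):
    # R = (E + I + H + alpha*S) / 10 + b, where E, I, H, S are the
    # empathy, information, humanoid and strategy scores from the Reward LLM.
    return (E + I + H + alpha * S) / 10.0 + b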

Valid strategy-response paths $P$ are mined by requiring $Q(S_i) > \theta$ at all turns and that the terminal node is reached. Preference datasets $\mathcal{D}$ are then formed by pairing high-$Q$ siblings (preferred) with low-$Q$ siblings (non-preferred) at each layer:
$$\mathcal{D} = \bigcup_{P \in \text{valid}} \{(S_w, S_l) \mid S_w, S_l \in P,\ Q(S_w) > \theta,\ Q(S_l) < \theta\}$$
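
A sketch of this pair-mining step, under the assumption that each node in a valid path exposes its turn index, its Q-value, and its explored siblings (the attribute names are hypothetical):

def mine_preference_pairs(valid_paths, theta):
    # Pair high-Q siblings (preferred) with low-Q siblings (non-preferred)
    # at every layer of every valid path to build the dataset D.
    dataset = []
    for path in valid_paths:
        for node in path:
            candidates = [node] + list(node.siblings)
            high = [c for c in candidates if c.Q > theta]
            low = [c for c in candidates if c.Q < theta]
            dataset.extend((node.turn, s_w, s_l) for s_w in high for s_l in low)
    return dataset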

CSO employs a turn-level DPO (Direct Preference Optimization) loss:
$$r_w = \frac{\pi_\theta(S_w^i \mid x, H^{i-1})}{\pi_{\text{ref}}(S_w^i \mid x, H^{i-1})}, \quad r_l = \frac{\pi_\theta(S_l^i \mid x, H^{i-1})}{\pi_{\text{ref}}(S_l^i \mid x, H^{i-1})}$$

$$\mathcal{L}_i = -\log\sigma\left[\beta \cdot (\log r_w - \log r_l)\right]$$

The total objective aggregates over all pairs and turns:
$$\mathcal{L}_{\text{CSO}} = \mathbb{E}_{(x, H^{i-1}, S_w^i, S_l^i) \sim \mathcal{D}}\left[\mathcal{L}_i\right]$$
An additional regularizer can be introduced to penalize per-turn deviation from balanced strategy selection:
$$R_{\text{bias}} = \lambda \sum_{i=1}^{L} \beta_i, \quad \beta_i = \sum_s \left|\hat{p}_\theta^i(s) - \bar{p}(s)\right|$$
where $\bar{p}(s)$ is a uniform or gold-standard distribution.
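
A minimal PyTorch sketch of the turn-level loss, assuming the summed token log-probabilities of the preferred and non-preferred strategy-responses under the policy and the frozen reference model have already been computed (input names are illustrative):

import torch.nn.functional as F

def turn_level_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # L_i = -log sigmoid(beta * (log r_w - log r_l)),
    # with log r = log pi_theta - log pi_ref for each response.
    log_r_w = logp_w - ref_logp_w
    log_r_l = logp_l - ref_logp_l
    return -F.logsigmoid(beta * (log_r_w - log_r_l)).mean()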

3. Candidate Group Construction via MCTS

The MCTS framework steps are as follows:

  • Node Encoding: Each tree node represents a partial conversation; siblings at depth $i$ correspond to different strategy choices at turn $i$.
  • Candidate Generation: Expansion at each node uses the Strategy LLM to propose $K$ strategies; child nodes initialize with $Q=0$, $N=0$.
  • Rollouts: Simulated dialogues are generated where the Supporter LLM and Seeker LLM interact for a fixed number of steps or until an end token.
  • Scoring and Backpropagation: The Reward LLM scores completed rollouts; rewards are propagated up the tree to update Q-values.
  • Iterative Expansion: This process continues until the search budget is exhausted or sufficient terminal paths are constructed.

Valid paths with $Q(S_i) > \theta$ for all $i$ are extracted. At each layer, explored sibling nodes (i.e., alternative strategies for turn $i$) are available for grouping.
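
One possible node representation for this search, shown as a hedged sketch (the field names are illustrative and not taken from the CSO codebase):

from dataclasses import dataclass, field

@dataclass
class StrategyNode:
    # A partial conversation ending in one candidate strategy; siblings at
    # depth i are alternative strategies for turn i.
    strategy: str
    history: list
    P: float = 0.0        # normalized Strategy-LLM prior
    Q: float = 0.0        # average rollout reward
    N: int = 0            # visit count
    children: list = field(default_factory=list)

    def update(self, reward):
        # Backpropagation: fold one rollout reward into the running average Q.
        self.N += 1
        self.Q += (reward - self.Q) / self.N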

4. Unbiased Grouping and Data Balancing Protocols

At each dialogue turn, candidate strategies are grouped by thresholding their Q-values at $\theta$ to form high-Q (preferred) and low-Q (non-preferred) sets. To avoid situations where a strategy is never included in the negative set (which would risk overfitting toward that strategy), an explicit balancing procedure is implemented:

  • If any strategy $s$ does not appear in the low-Q group in the initial search, an artificial negative is generated by forcing $s$ as the strategy in a sibling path and labeling it non-preferred.
  • To prevent imbalance, the group sizes of preferred and non-preferred sets are capped and subsampled as necessary for approximate parity.

The resulting dataset ensures that all strategies are adequately represented in both positive and negative preference contexts, directly addressing potential bias amplification.
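
A sketch of this balancing step under the protocol described above; force_negative is a hypothetical callback that produces an artificial non-preferred sibling for a missing strategy:

import random

def balance_groups(high_q, low_q, all_strategies, force_negative, cap=None):
    # Ensure every strategy appears at least once as a negative, then
    # subsample both groups to approximate parity (optionally capped).
    covered = {node.strategy for node in low_q}
    for s in all_strategies:
        if s not in covered:
            low_q.append(force_negative(s))
    size = min(len(high_q), len(low_q))
    if cap is not None:
        size = min(size, cap)
    return random.sample(high_q, size), random.sample(low_q, size)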

5. Bias Quantification and Mitigation Mechanisms

Bias during training is monitored using the per-turn deviation metric $\beta_i = \sum_s |\hat{p}_\theta^i(s) - \bar{p}(s)|$, where $\hat{p}_\theta^i$ reflects empirical frequencies and $\bar{p}(s)$ is the target (e.g., uniform or gold) frequency. To constrain bias, a regularizer $R_{\text{bias}}$ (weighted by $\lambda$, e.g., $0.01$) can be included in the loss:
$$\text{total loss} = \mathcal{L}_{\text{pair}} + \lambda \cdot R_{\text{bias}}$$
This regularizer is computed on-the-fly via minibatch statistics.

A key consideration is that $\lambda$ should remain small to avoid overpowering the effect of the pairwise DPO loss. This mechanism allows the model to remain sensitive to context while maintaining broad coverage across strategy types.
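
A differentiable variant of this penalty, computed from minibatch statistics, could look like the following sketch (using the policy's predicted strategy distribution in place of hard selection counts, which is an assumption on our part):

import torch

def bias_regularizer(strategy_logits, target_dist=None):
    # Average the predicted strategy distribution over the batch and
    # penalize its L1 distance from the target (uniform by default).
    probs = torch.softmax(strategy_logits, dim=-1).mean(dim=0)
    if target_dist is None:
        target_dist = torch.full_like(probs, 1.0 / probs.numel())
    return (probs - target_dist).abs().sum()

# total_loss = pair_loss + lam * bias_regularizer(strategy_logits), with lam ≈ 0.01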

6. CSO Chain-of-Turns Training Workflow

The following high-level pseudocode captures the chain preference grouping and training cycle:

ConstructESCPro(seed_conversations, n_iter, θ):
  for each seed dialogue:
    initialize MCTS tree with root = first seeker utterance
    for t in 1..n_iter:
      S_sel = SelectPUCB(root)              # descend via PUCB
      if S_sel is unexpanded:
        ExpandNode(S_sel)                   # Strategy LLM proposes K child strategies
      R_sim = SimulateRollout(S_sel)        # Supporter/Seeker LLM simulation, scored by Reward LLM
      Backpropagate(S_sel, R_sim)           # update Q and N along the path
    extract valid paths P with Q > θ at every turn
    for each layer i in P:
      G_i_pos = {sibling | Q > θ}           # preferred candidates
      G_i_neg = {sibling | Q < θ}           # non-preferred candidates
      BalanceGroup(G_i_pos, G_i_neg)        # force missing negatives, subsample to parity
      add all (s_w, s_l) pairs from G_i_pos × G_i_neg to 𝒟
  return 𝒟

TrainCSO(𝒟, π_ref, epochs):
  initialize π_θ ← pretrained LLM
  for epoch in 1..epochs:
    for batch of (x, H, S_w, S_l) in 𝒟:
      compute ℒ_pair via turn-level DPO
      optionally compute R_bias over the batch
      total_loss = ℒ_pair + λ·R_bias
      update π_θ via gradient step
  return π_θ

Each strategy pair in $\mathcal{D}$ is turn-tagged, enabling the model to chain turn-level preferences through full dialogue trajectories.

7. Empirical Validation and Observed Impact

Empirical results on LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B demonstrate that CSO with unbiased chain preference grouping achieves:

  • A 3–5 point absolute increase in macro-F1 ($\mathcal{Q}$) for strategy accuracy over standard SFT.
  • A large reduction in preference bias $B$, e.g., from $\sim 2.5$ to $\sim 1.1$ (LLaMA-3.1-8B).
  • Consistent improvements in weighted-F1 ($\mathcal{Q}_W$) and ROUGE-L, with no observable loss in semantic response quality.
  • A "CSO-Random" ablation, wherein negatives are sampled randomly rather than via MCTS-based grouping, yields lower gains ($\sim 2$ F1 points behind CSO), attributing the performance benefit specifically to high-quality group construction.
  • Human evaluations for acceptability, effectiveness, sensitivity, and satisfaction consistently favor CSO over SFT, with win rates of 60–70% (Cohen's $\kappa \approx 0.6$).

These findings indicate that unbiased chain preference grouping yields LLMs capable of more context-appropriate, less biased emotional support, substantiating its role in improving both equity and quality in ESC strategy selection (Zhao et al., 7 Mar 2025).
