
Unbiased Chain Preference Grouping

Updated 8 January 2026
  • Unbiased chain preference grouping is a strategy within the CSO framework that balances emotional support strategy candidates to mitigate bias in multi-turn LLM dialogues.
  • It employs Monte Carlo Tree Search to systematically explore and group candidate strategies based on performance metrics, ensuring diverse and fair selection.
  • Empirical evaluations show improved macro-F1 scores and reduced bias metrics, resulting in more contextually appropriate and empathetic responses.

Unbiased chain preference grouping is an approach within the Chain-of-Strategy Optimization (CSO) framework that enables LLMs to select emotional-support strategies in multi-turn dialogue without systematically favoring particular classes of strategies. This methodology addresses preference bias during learning by constructing, at each dialogue turn, balanced sets of strategy-response pairs for fine-grained policy supervision. Core to the approach is the use of Monte Carlo Tree Search (MCTS) to explore and organize strategy candidates, ensuring coverage and diversity in both preferred and non-preferred examples, thereby facilitating empathy and adaptability in emotional support conversations (Zhao et al., 7 Mar 2025).

1. Problem Definition and Motivation

The primary goal is to learn a policy $\pi_\theta$ that ranks a set of emotional-support strategies $S = \{s_1, \dots, s_K\}$ at each dialogue turn $i$, such that strategy selection adapts to conversational context rather than exhibiting systematic bias (e.g., overusing "Emotional Validation" irrespective of need). In preference-supervised learning, for each turn $i$ under context $(x, H^{i-1})$, strategy pairs are extracted: a more-preferred strategy $s_w$ and one or more less-preferred strategies $s_l$. The learning process benefits from these fine-grained, turn-level comparisons.

Bias in preference ranking is quantified using the metric
$$B_i = \sum_{s \in S} \left| \hat{p}_\theta^i(s) - p_i^*(s) \right|,$$
where $\hat{p}_\theta^i(s)$ is the empirical model selection frequency and $p_i^*(s)$ is the ideal, context-dependent frequency for strategy $s$ at turn $i$. The average $B = (1/L) \sum_i B_i$ gives an overall measure across the dialogue.
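
As a concrete illustration, the following minimal Python sketch computes the per-turn bias $B_i$ and the dialogue-level average $B$ from observed strategy selections; the function names and input format (a list of selected strategies per turn plus an ideal distribution) are illustrative assumptions, not part of the CSO implementation.

from collections import Counter

def turn_bias(selected, ideal_dist, strategies):
    # B_i: L1 distance between the model's empirical selection
    # frequencies and the ideal, context-dependent frequencies.
    counts = Counter(selected)
    total = max(len(selected), 1)
    return sum(abs(counts[s] / total - ideal_dist.get(s, 0.0)) for s in strategies)

def dialogue_bias(per_turn_selected, per_turn_ideal, strategies):
    # B: average of B_i over the L turns of a dialogue.
    biases = [turn_bias(sel, ideal, strategies)
              for sel, ideal in zip(per_turn_selected, per_turn_ideal)]
    return sum(biases) / len(biases)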

The unbiased grouping objective at each turn is to construct, for every $i$, a balanced set of positives (high-quality, preferred) and negatives (low-quality, non-preferred) from candidate strategies. This ensures the model learns to discriminate effectively while avoiding skewed exposure that would amplify bias.

2. Mathematical & Algorithmic Foundations

Candidate strategies at each turn are explored using an MCTS process with the following PUCB value function:
$$\text{PUCB}(S) = Q(S) + c \cdot P(S) \frac{\sqrt{N(\text{parent}(S))}}{N(S)+1}$$
Here $P(S)$ is the normalized strategy-LLM score, $Q(S)$ is the average rollout reward, and $c$ modulates exploration.
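
As a minimal sketch (assuming each child node tracks its prior P, value Q, and visit count N, and that the parent's visit count is available; these attribute names are illustrative), selection by PUCB might look like:

import math

def pucb(q, prior, n_parent, n_child, c=1.0):
    # PUCB(S) = Q(S) + c * P(S) * sqrt(N(parent(S))) / (N(S) + 1)
    return q + c * prior * math.sqrt(n_parent) / (n_child + 1)

def select_child(parent, c=1.0):
    # Pick the child (candidate strategy) with the highest PUCB score.
    return max(parent.children,
               key=lambda ch: pucb(ch.Q, ch.P, parent.N, ch.N, c))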

The rollout reward combines several dimensions:
$$R = \frac{E + I + H + \alpha S}{10} + b$$
where $E$, $I$, $H$, and $S$ denote empathy, information, humanoid, and strategy scores, respectively; $\alpha$ and $b$ are constants for weighting and normalization.
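
A direct transcription of this reward, with placeholder defaults for the weighting constants (the actual values of $\alpha$ and $b$ are not specified here):

def rollout_reward(E, I, H, S, alpha=1.0, b=0.0):
    # R = (E + I + H + alpha*S) / 10 + b, where E, I, H, S are the
    # empathy, information, humanoid and strategy scores from the Reward LLM.
    return (E + I + H + alpha * S) / 10.0 + b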

Valid strategy-response paths $P$ are mined by requiring $Q(S_i) > \theta$ at all turns and that the terminal node is reached. Preference datasets $\mathcal{D}$ are then formed by pairing high-$Q$ siblings (preferred) with low-$Q$ siblings (non-preferred) at each layer:
$$\mathcal{D} = \bigcup_{P \in \text{valid}} \{(S_w, S_l) \mid S_w, S_l \in P,\ Q(S_w) > \theta,\ Q(S_l) < \theta\}$$
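
A sketch of this pair-mining step, under the assumption that each node in a valid path exposes its turn index, its Q-value, and its explored siblings (the attribute names are hypothetical):

def mine_preference_pairs(valid_paths, theta):
    # Pair high-Q siblings (preferred) with low-Q siblings (non-preferred)
    # at every layer of every valid path to build the dataset D.
    dataset = []
    for path in valid_paths:
        for node in path:
            candidates = [node] + list(node.siblings)
            high = [c for c in candidates if c.Q > theta]
            low = [c for c in candidates if c.Q < theta]
            dataset.extend((node.turn, s_w, s_l) for s_w in high for s_l in low)
    return dataset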

CSO employs a turn-level DPO (Direct Preference Optimization) loss:
$$r_w = \frac{\pi_\theta(S_w^i \mid x, H^{i-1})}{\pi_{\text{ref}}(S_w^i \mid x, H^{i-1})}, \quad r_l = \frac{\pi_\theta(S_l^i \mid x, H^{i-1})}{\pi_{\text{ref}}(S_l^i \mid x, H^{i-1})}$$

$$\mathcal{L}_i = -\log\sigma\left[\beta \cdot (\log r_w - \log r_l)\right]$$

The total objective aggregates over all pairs and turns:
$$\mathcal{L}_{\text{CSO}} = \mathbb{E}_{(x, H^{i-1}, S_w^i, S_l^i) \sim \mathcal{D}}\left[\mathcal{L}_i\right]$$
An additional regularizer can be introduced to penalize per-turn deviation from balanced strategy selection:
$$R_{\text{bias}} = \lambda \sum_{i=1}^{L} \beta_i, \quad \beta_i = \sum_s \left|\hat{p}_\theta^i(s) - \bar{p}(s)\right|$$
where $\bar{p}(s)$ is a uniform or gold-standard distribution.
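
A minimal PyTorch sketch of the turn-level loss, assuming the summed token log-probabilities of the preferred and non-preferred strategy-responses under the policy and the frozen reference model have already been computed (input names are illustrative):

import torch.nn.functional as F

def turn_level_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # L_i = -log sigmoid(beta * (log r_w - log r_l)),
    # with log r = log pi_theta - log pi_ref for each response.
    log_r_w = logp_w - ref_logp_w
    log_r_l = logp_l - ref_logp_l
    return -F.logsigmoid(beta * (log_r_w - log_r_l)).mean()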

3. Candidate Group Construction via MCTS

The MCTS framework steps are as follows:

  • Node Encoding: Each tree node represents a partial conversation; siblings at depth $i$ correspond to different strategy choices at turn $i$.
  • Candidate Generation: Expansion at each node uses the Strategy LLM to propose $K$ strategies; child nodes initialize with $Q=0$, $N=0$.
  • Rollouts: Simulated dialogues are generated where the Supporter LLM and Seeker LLM interact for a fixed number of steps or until an end token.
  • Scoring and Backpropagation: The Reward LLM scores completed rollouts; rewards are propagated up the tree to update Q-values.
  • Iterative Expansion: This process continues until the search budget is exhausted or sufficient terminal paths are constructed.

Valid paths with $Q(S_i) > \theta$ for all $i$ are extracted. At each layer, explored sibling nodes (i.e., alternative strategies for turn $i$) are available for grouping.
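
One possible node representation for this search, shown as a hedged sketch (the field names are illustrative and not taken from the CSO codebase):

from dataclasses import dataclass, field

@dataclass
class StrategyNode:
    # A partial conversation ending in one candidate strategy; siblings at
    # depth i are alternative strategies for turn i.
    strategy: str
    history: list
    P: float = 0.0        # normalized Strategy-LLM prior
    Q: float = 0.0        # average rollout reward
    N: int = 0            # visit count
    children: list = field(default_factory=list)

    def update(self, reward):
        # Backpropagation: fold one rollout reward into the running average Q.
        self.N += 1
        self.Q += (reward - self.Q) / self.N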

4. Unbiased Grouping and Data Balancing Protocols

At each dialogue turn, candidate strategies are grouped by thresholding their Q-values at $\theta$ to form high-Q (preferred) and low-Q (non-preferred) sets. To avoid situations where a strategy is never included in the negative set (which would risk overfitting toward that strategy), an explicit balancing procedure is implemented:

  • If any strategy $s$ does not appear in the low-Q group in the initial search, an artificial negative is generated by forcing $s$ as the strategy in a sibling path and labeling it non-preferred.
  • To prevent imbalance, the group sizes of preferred and non-preferred sets are capped and subsampled as necessary for approximate parity.

The resulting dataset ensures that all strategies are adequately represented in both positive and negative preference contexts, directly addressing potential bias amplification.
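
A sketch of this balancing step under the protocol described above; force_negative is a hypothetical callback that produces an artificial non-preferred sibling for a missing strategy:

import random

def balance_groups(high_q, low_q, all_strategies, force_negative, cap=None):
    # Ensure every strategy appears at least once as a negative, then
    # subsample both groups to approximate parity (optionally capped).
    covered = {node.strategy for node in low_q}
    for s in all_strategies:
        if s not in covered:
            low_q.append(force_negative(s))
    size = min(len(high_q), len(low_q))
    if cap is not None:
        size = min(size, cap)
    return random.sample(high_q, size), random.sample(low_q, size)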

5. Bias Quantification and Mitigation Mechanisms

Bias during training is monitored using the per-turn deviation metric $\beta_i = \sum_s |\hat{p}_\theta^i(s) - \bar{p}(s)|$, where $\hat{p}_\theta^i$ reflects empirical frequencies and $\bar{p}(s)$ is the target (e.g., uniform or gold) frequency. To constrain bias, a regularizer $R_{\text{bias}}$ (weighted by $\lambda$, e.g., $0.01$) can be included in the loss:
$$\text{total loss} = \mathcal{L}_{\text{pair}} + \lambda \cdot R_{\text{bias}}$$
This regularizer is computed on-the-fly via minibatch statistics.

A key consideration is that $\lambda$ should remain small to avoid overpowering the effect of the pairwise DPO loss. This mechanism allows the model to remain sensitive to context while maintaining broad coverage across strategy types.
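
A differentiable variant of this penalty, computed from minibatch statistics, could look like the following sketch (using the policy's predicted strategy distribution in place of hard selection counts, which is an assumption on our part):

import torch

def bias_regularizer(strategy_logits, target_dist=None):
    # Average the predicted strategy distribution over the batch and
    # penalize its L1 distance from the target (uniform by default).
    probs = torch.softmax(strategy_logits, dim=-1).mean(dim=0)
    if target_dist is None:
        target_dist = torch.full_like(probs, 1.0 / probs.numel())
    return (probs - target_dist).abs().sum()

# total_loss = pair_loss + lam * bias_regularizer(strategy_logits), with lam ≈ 0.01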

6. CSO Chain-of-Turns Training Workflow

The following high-level pseudocode captures the chain preference grouping and training cycle:

ConstructESCPro(seed_conversations, n_iter, θ):
  for each seed dialogue:
    initialize MCTS tree with root = first seeker utterance
    for t in 1..n_iter:
      S_sel = SelectPUCB(root)              # descend via PUCB
      if S_sel is unexpanded:
        ExpandNode(S_sel)                   # Strategy LLM proposes K child strategies
      R_sim = SimulateRollout(S_sel)        # Supporter/Seeker LLM simulation, scored by Reward LLM
      Backpropagate(S_sel, R_sim)           # update Q and N along the path
    extract valid paths P with Q > θ at every turn
    for each layer i in P:
      G_i_pos = {sibling | Q > θ}           # preferred candidates
      G_i_neg = {sibling | Q < θ}           # non-preferred candidates
      BalanceGroup(G_i_pos, G_i_neg)        # force missing negatives, subsample to parity
      add all (s_w, s_l) pairs from G_i_pos × G_i_neg to 𝒟
  return 𝒟

TrainCSO(𝒟, π_ref, epochs):
  initialize π_θ ← pretrained LLM
  for epoch in 1..epochs:
    for batch of (x, H, S_w, S_l) in 𝒟:
      compute ℒ_pair via turn-level DPO
      optionally compute R_bias over the batch
      total_loss = ℒ_pair + λ·R_bias
      update π_θ via gradient step
  return π_θ

Each strategy pair in $\mathcal{D}$ is turn-tagged, enabling the model to chain turn-level preferences through full dialogue trajectories.

7. Empirical Validation and Observed Impact

Empirical results on LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B demonstrate that CSO with unbiased chain preference grouping achieves:

  • A 3–5 point absolute increase in macro-F1 ($\mathcal{Q}$) for strategy accuracy over standard SFT.
  • A large reduction in preference bias $B$, e.g., from $\sim 2.5$ to $\sim 1.1$ (LLaMA-3.1-8B).
  • Consistent improvements in weighted-F1 ($\mathcal{Q}_W$) and ROUGE-L, with no observable loss in semantic response quality.
  • A "CSO-Random" ablation, wherein negatives are sampled randomly rather than via MCTS-based grouping, yields lower gains ($\sim 2$ F1 points behind CSO), attributing the performance benefit specifically to high-quality group construction.
  • Human evaluations for acceptability, effectiveness, sensitivity, and satisfaction consistently favor CSO over SFT, with win rates of 60–70% (Cohen's $\kappa \approx 0.6$).

These findings indicate that unbiased chain preference grouping yields LLMs capable of more context-appropriate, less biased emotional support, substantiating its role in improving both equity and quality in ESC strategy selection (Zhao et al., 7 Mar 2025).
