Unbiased Chain Preference Grouping
- Unbiased chain preference grouping is a strategy within the CSO framework that balances emotional support strategy candidates to mitigate bias in multi-turn LLM dialogues.
- It employs Monte Carlo Tree Search to systematically explore and group candidate strategies based on performance metrics, ensuring diverse and fair selection.
- Empirical evaluations show improved macro-F1 scores and reduced bias metrics, resulting in more contextually appropriate and empathetic responses.
Unbiased chain preference grouping is an approach within the Chain-of-Strategy Optimization (CSO) framework that enables LLMs to select emotional-support strategies in multi-turn dialogue without systematically favoring particular classes of strategies. This methodology addresses preference bias during learning by constructing, at each dialogue turn, balanced sets of strategy-response pairs for fine-grained policy supervision. Core to the approach is the use of Monte Carlo Tree Search (MCTS) to explore and organize strategy candidates, ensuring coverage and diversity in both preferred and non-preferred examples, thereby facilitating empathy and adaptability in emotional support conversations (Zhao et al., 7 Mar 2025).
1. Problem Definition and Motivation
The primary goal is to learn a policy $\pi_\theta$ that ranks a set of emotional-support strategies at each dialogue turn $t$, such that strategy selection adapts to the conversational context rather than exhibiting systematic bias (e.g., overusing "Emotional Validation" irrespective of need). In preference-supervised learning, for each turn $t$ under context $(x, H_t)$, strategy pairs are extracted: a more-preferred strategy $s_w$ and one or more less-preferred strategies $s_l$. The learning process benefits from these fine-grained, turn-level comparisons.
Bias in preference ranking is quantified using the per-turn deviation metric
$$B_t = \sum_{s \in \mathcal{S}} \bigl| \hat{p}_t(s) - p_t^*(s) \bigr|,$$
where $\hat{p}_t(s)$ is the empirical model selection frequency and $p_t^*(s)$ is the ideal, context-dependent frequency for strategy $s$ at turn $t$. The average $\bar{B} = \tfrac{1}{T} \sum_{t=1}^{T} B_t$ gives an overall measure across the dialogue.
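As a concrete illustration, the following is a minimal Python sketch of this deviation, assuming the absolute per-strategy deviations are summed per turn and then averaged; the function names and the example strategy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def turn_bias(selected_strategies, ideal_freq):
    """Per-turn deviation between empirical selection frequencies and the
    ideal, context-dependent distribution (absolute-deviation form assumed)."""
    counts = Counter(selected_strategies)
    total = sum(counts.values()) or 1
    return sum(abs(counts[s] / total - p) for s, p in ideal_freq.items())

def average_bias(selections_per_turn, ideal_freq_per_turn):
    """Average the per-turn deviations over the whole dialogue."""
    per_turn = [turn_bias(sel, ideal)
                for sel, ideal in zip(selections_per_turn, ideal_freq_per_turn)]
    return sum(per_turn) / len(per_turn)

# Example: two turns, three strategies, uniform ideal distribution assumed.
ideal = {"Question": 1 / 3, "Reflection": 1 / 3, "Emotional Validation": 1 / 3}
print(average_bias(
    [["Emotional Validation", "Emotional Validation", "Question"],
     ["Reflection", "Question", "Question"]],
    [ideal, ideal],
))
```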
The unbiased grouping objective is to construct, for every turn $t$, a balanced set of positives $G_t^{+}$ (high-quality, preferred) and negatives $G_t^{-}$ (low-quality, non-preferred) from the explored candidate strategies. This ensures the model learns to discriminate effectively while avoiding skewed exposure that would amplify bias.
2. Mathematical & Algorithmic Foundations
Candidate strategies at each turn are explored using an MCTS process with the following PUCB value function:
$$\mathrm{PUCB}(s_i) = Q(s_i) + c \cdot P(s_i) \cdot \frac{\sqrt{N_{\mathrm{parent}}}}{1 + N(s_i)}.$$
Here $P(s_i)$ is the normalized strategy-LLM score, $Q(s_i)$ is the average rollout reward, $N(\cdot)$ counts visits, and $c$ modulates exploration.
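A brief Python sketch of PUCB-based child selection under these definitions; the node fields (`prior`, `q_value`, `visits`) and the default exploration constant are assumptions consistent with standard PUCT-style search, not verbatim from the paper.

```python
import math
from dataclasses import dataclass, field

@dataclass
class StrategyNode:
    prior: float                       # P(s_i): normalized strategy-LLM score
    q_value: float = 0.0               # Q(s_i): average rollout reward
    visits: int = 0                    # N(s_i): visit count
    children: list = field(default_factory=list)

def select_child_pucb(parent: StrategyNode, c: float = 1.0) -> StrategyNode:
    """Select the child maximizing Q(s_i) + c * P(s_i) * sqrt(N_parent) / (1 + N(s_i))."""
    n_parent = max(1, sum(child.visits for child in parent.children))
    return max(
        parent.children,
        key=lambda ch: ch.q_value + c * ch.prior * math.sqrt(n_parent) / (1 + ch.visits),
    )
```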
The rollout reward combines several dimensions:
$$R = \frac{\alpha \left( r_{\mathrm{emp}} + r_{\mathrm{info}} + r_{\mathrm{hum}} + r_{\mathrm{str}} \right)}{Z},$$
where $r_{\mathrm{emp}}$, $r_{\mathrm{info}}$, $r_{\mathrm{hum}}$, and $r_{\mathrm{str}}$ denote empathy, information, humanoid, and strategy scores, respectively; $\alpha$ and $Z$ are constants for weighting and normalization.
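A small sketch of this combination, assuming a weighted-sum-with-normalization form; `alpha` and `z` are placeholders for the paper's weighting and normalization constants.

```python
def rollout_reward(r_emp, r_info, r_hum, r_str, alpha=1.0, z=4.0):
    """Combine the Reward LLM's empathy, information, humanoid, and strategy
    scores into one scalar (weighted-sum-with-normalization form assumed)."""
    return alpha * (r_emp + r_info + r_hum + r_str) / z
```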
Valid strategy-response paths are mined by requiring $Q > \theta$ at all turns and that the terminal node is reached. Preference datasets are then formed by pairing high-$Q$ siblings (preferred) with low-$Q$ siblings (non-preferred) at each layer:
$$\mathcal{D} = \bigl\{ (x, H_i, s_w^{i}, s_l^{i}) \;\big|\; Q(s_w^{i}) > \theta,\; Q(s_l^{i}) < \theta \bigr\}.$$
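A sketch of how per-layer preference pairs might be mined from an explored path, assuming each layer exposes its sibling candidates as dictionaries with a Q-value; the data structure and field names are illustrative, and the terminal-node check is omitted.

```python
def mine_preference_pairs(layers, theta):
    """Form turn-tagged preference pairs from one explored search path.

    layers: one list per turn of dicts like {"strategy": ..., "response": ..., "q": ...}
            covering the chosen node and its explored siblings.
    Pairs high-Q siblings (preferred) with low-Q siblings (non-preferred); the path
    contributes nothing past the first turn at which no candidate exceeds theta.
    """
    pairs = []
    for turn, siblings in enumerate(layers):
        high_q = [s for s in siblings if s["q"] > theta]
        low_q = [s for s in siblings if s["q"] < theta]
        if not high_q:          # validity condition Q > θ violated at this turn
            break
        for s_w in high_q:
            for s_l in low_q:
                pairs.append({"turn": turn, "chosen": s_w, "rejected": s_l})
    return pairs
```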
CSO employs a turn-level DPO (Direct Preference Optimization) loss:
$$\mathcal{L}_{\mathrm{pair}} = -\log \sigma\!\left( \beta \log \frac{\pi_\theta(s_w \mid x, H_t)}{\pi_{\mathrm{ref}}(s_w \mid x, H_t)} - \beta \log \frac{\pi_\theta(s_l \mid x, H_t)}{\pi_{\mathrm{ref}}(s_l \mid x, H_t)} \right).$$
The total objective aggregates over all pairs and turns:
$$\mathcal{L}_{\mathrm{CSO}} = \frac{1}{|\mathcal{D}|} \sum_{(x, H_t, s_w, s_l) \in \mathcal{D}} \mathcal{L}_{\mathrm{pair}}(x, H_t, s_w, s_l).$$
An additional regularizer can be introduced to penalize per-turn deviation from balanced strategy selection:
$$R_{\mathrm{bias}} = \sum_{t} \sum_{s} \bigl( \hat{p}_t(s) - p^*(s) \bigr)^2,$$
where $p^*$ is a uniform or gold-standard distribution.
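A minimal PyTorch sketch of the turn-level pairwise DPO loss under the reconstruction above. It operates on per-response summed log-probabilities; the presence of a frozen reference model follows the Section 6 pseudocode, while tensor shapes and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def turn_level_dpo_loss(policy_logp_w, policy_logp_l,
                        ref_logp_w, ref_logp_l, beta=0.1):
    """DPO pairwise loss over a batch of turn-tagged strategy-response pairs.

    Each argument is a tensor of shape (batch,) holding the summed token
    log-probability of the preferred (w) or non-preferred (l) response under
    the trainable policy or the frozen reference model.
    """
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Usage sketch: combine with the bias regularizer weighted by lambda, as in
# Section 6's pseudocode: total_loss = turn_level_dpo_loss(...) + lam * r_bias
```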
3. Candidate Group Construction via MCTS
The MCTS framework steps are as follows:
- Node Encoding: Each tree node represents a partial conversation; siblings at depth $i$ correspond to different strategy choices at turn $i$.
- Candidate Generation: Expansion at each node uses the Strategy LLM to propose candidate strategies; child nodes are initialized with $Q = 0$ and $N = 0$.
- Rollouts: Simulated dialogues are generated in which the Supporter LLM and Seeker LLM interact for a fixed number of steps or until an end-of-conversation token is produced.
- Scoring and Backpropagation: The Reward LLM scores completed rollouts; rewards are propagated up the tree to update Q-values.
- Iterative Expansion: This process continues until the search budget is exhausted or sufficient terminal paths are constructed.
Valid paths with $Q > \theta$ at every turn are extracted. At each layer $i$, explored sibling nodes (i.e., alternative strategies for turn $i$) are available for grouping.
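A sketch of the backpropagation step described above, assuming each node keeps a running-average Q-value over its visits and a link to its parent; the field names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    parent: Optional["TreeNode"] = None
    q_value: float = 0.0   # running average of rollout rewards backed up so far
    visits: int = 0

def backpropagate(leaf: TreeNode, reward: float) -> None:
    """Propagate one rollout reward from the simulated node back to the root,
    incrementing visit counts and updating average Q-values along the path."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.q_value += (reward - node.q_value) / node.visits
        node = node.parent
```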
4. Unbiased Grouping and Data Balancing Protocols
At each dialogue turn, candidate strategies are grouped by thresholding their Q-values at $\theta$ to form high-Q (preferred) and low-Q (non-preferred) sets. To avoid situations where a strategy is never included in the negative set, which risks overfitting toward that strategy, an explicit balancing procedure is implemented:
- If any strategy $s$ does not appear in the low-Q group after the initial search, an artificial negative is generated by forcing $s$ as the strategy in a sibling path and labeling it non-preferred.
- To prevent imbalance, the group sizes of preferred and non-preferred sets are capped and subsampled as necessary for approximate parity.
The resulting dataset ensures that all strategies are adequately represented in both positive and negative preference contexts, directly addressing potential bias amplification.
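A sketch of this balancing protocol, assuming the grouped candidates arrive as lists of dictionaries and that the forced sibling-path rollouts for missing strategies are produced elsewhere; the placeholder entries, cap value, and function names are illustrative assumptions.

```python
import random

def balance_groups(high_q, low_q, all_strategies, cap=8, seed=0):
    """Balance preferred / non-preferred groups for one dialogue turn.

    high_q, low_q: lists of dicts with at least a "strategy" key.
    Any strategy missing from the non-preferred side gets a placeholder entry,
    standing in for a forced sibling-path rollout labeled non-preferred; both
    groups are then capped and subsampled to approximately equal size.
    """
    rng = random.Random(seed)
    covered = {item["strategy"] for item in low_q}
    for s in all_strategies:
        if s not in covered:
            low_q = low_q + [{"strategy": s, "artificial": True}]
    k = min(cap, len(high_q), len(low_q))
    return rng.sample(high_q, k), rng.sample(low_q, k)
```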
5. Bias Quantification and Mitigation Mechanisms
Bias during training is monitored using the per-turn deviation metric $B_t$ defined above, where $\hat{p}_t(s)$ reflects empirical frequencies and $p^*(s)$ is the target (e.g., uniform or gold) frequency. To constrain bias, the regularizer $R_{\mathrm{bias}}$ (weighted by $\lambda$, e.g., $\lambda = 0.01$) can be included in the loss:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{pair}} + \lambda R_{\mathrm{bias}}.$$
This regularizer is computed on-the-fly from minibatch statistics.
A key consideration is that $\lambda$ should remain small to avoid overpowering the effect of the pairwise DPO loss. This mechanism allows the model to remain sensitive to context while maintaining broad coverage across strategy types.
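A PyTorch sketch of such an on-the-fly regularizer under the squared-deviation reconstruction from Section 2, estimated from minibatch statistics; using softmaxed strategy scores as a differentiable frequency surrogate is an assumption, since the paper's exact estimator is not reproduced here.

```python
import torch

def bias_regularizer(strategy_logits, target_freq):
    """Differentiable surrogate for the bias penalty, computed from minibatch statistics.

    strategy_logits: (batch, num_strategies) policy scores over candidate strategies.
    target_freq:     (num_strategies,) uniform or gold-standard distribution.
    """
    probs = torch.softmax(strategy_logits, dim=-1)   # per-example soft selection
    empirical = probs.mean(dim=0)                    # minibatch-level frequency estimate
    return torch.sum((empirical - target_freq) ** 2)

# Inside the training loop (cf. the pseudocode in Section 6):
#   total_loss = pair_loss + lam * bias_regularizer(strategy_logits, target_freq)
```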
6. CSO Chain-of-Turns Training Workflow
The following high-level pseudocode captures the chain preference grouping and training cycle:
```
ConstructESCPro(seed_conversations, n_iter, θ):
    for each seed dialogue:
        initialize MCTS tree with root = first seeker utterance
        for t in 1..n_iter:
            S_sel = SelectPUCB(root)
            if S_sel is unexpanded:
                ExpandNode(S_sel)
            R_sim = SimulateRollout(S_sel)
            Backpropagate(S_sel, R_sim)
        extract valid paths P where Q > θ at every turn
        for each layer i in P:
            let G⁺_i = {sibling | Q > θ}, G⁻_i = {sibling | Q < θ}
            BalanceGroup(G⁺_i, G⁻_i)
            add all (s_w, s_l) pairs to 𝒟
    return 𝒟

TrainCSO(𝒟, π_ref, epochs):
    initialize π_θ ← pretrained LLM
    for epoch in 1..epochs:
        for batch of (x, H, S_w, S_l) in 𝒟:
            compute ℒ_pair via DPO
            optionally compute R_bias over batch
            total_loss = ℒ_pair + λ·R_bias
            update π_θ via gradient step
    return π_θ
```
Each strategy pair in $\mathcal{D}$ is turn-tagged, enabling the model to chain turn-level preferences through full dialogue trajectories.
7. Empirical Validation and Observed Impact
Empirical results on LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B demonstrate that CSO with unbiased chain preference grouping achieves:
- A 3–5 point absolute increase in macro-F1 for strategy accuracy over standard SFT.
- A large reduction in the preference bias metric $B$, most notably for LLaMA-3.1-8B.
- Consistent improvements in weighted-F1 and ROUGE-L, with no observable loss in semantic response quality.
- A "CSO-Random" ablation, in which negatives are sampled randomly rather than via MCTS-based grouping, yields lower gains and trails full CSO in macro-F1, attributing the performance benefit specifically to high-quality group construction.
- Human evaluations of acceptability, effectiveness, sensitivity, and satisfaction consistently favor CSO over SFT, with win rates of 60–70% and inter-annotator agreement measured by Cohen's $\kappa$.
These findings indicate that unbiased chain preference grouping yields LLMs capable of more context-appropriate, less biased emotional support, substantiating its role in improving both equity and quality in ESC strategy selection (Zhao et al., 7 Mar 2025).