SHF-CAS: Conflict-Aware Sampling for LLM Alignment
- The paper introduces SHF-CAS, a targeted active learning strategy that quantifies conflicts between reward models and base policies for efficient human annotation.
- It employs scale-invariant metrics—PACS and Kendall-Tau—to accurately detect, rank, and select high-conflict prompt-completion pairs during LLM fine-tuning.
- Experimental results demonstrate that SHF-CAS improves alignment in safety and helpfulness benchmarks while significantly reducing annotation cost compared to traditional methods.
Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) is a targeted active learning strategy developed to resolve reward-model–policy misalignment in LLM fine-tuning. It formalizes the detection and remediation of "danger zones" where both a reward model and base policy display systematic uncertainty or error, leveraging principled conflict metrics to guide selective human annotation for optimal feedback efficiency (Liu et al., 10 Dec 2025).
1. Motivation and Overview
Reward-model-based fine-tuning pipelines, such as Reinforcement Learning from Human Feedback (RLHF), depend on a learned proxy reward model $\hat{r}$ to encode human preferences. However, $\hat{r}$ is inherently imperfect: it is susceptible to noise, coverage gaps, and bias, and over-optimizing the policy against such a proxy can induce undesirable behavior (e.g., reward hacking, overoptimization). Disagreements (conflicts) between $\hat{r}$ and the base policy $\pi_0$ can occur at the response level for given prompts, manifesting either as cases where $\hat{r}$ corrects previously underweighted behaviors (complementary knowledge) or as zones of shared ignorance that carry a high risk of misalignment.
SHF-CAS seeks to (i) quantify these conflicts, (ii) actively select high-conflict samples, and (iii) obtain focused human feedback for joint refinement of both $\hat{r}$ and $\pi_0$, improving alignment with minimal annotation cost (Liu et al., 10 Dec 2025).
2. Formal Conflict Metrics
Conflict between the base policy $\pi_0$ and the proxy reward $\hat{r}$ is operationalized using two complementary, scale-invariant metrics:
- Proxy-Policy Alignment Conflict Score (PACS): For each candidate completion $y$ to a prompt $x$, compute the normalized disagreement between the proxy reward and the policy log-likelihood:
$\mathrm{PACS}(x, y) = \left| \dfrac{\hat{r}(x, y) - \mu_{\hat{r}}(x)}{\sigma_{\hat{r}}(x)} - \dfrac{\log \pi_0(y \mid x) - \mu_{\pi_0}(x)}{\sigma_{\pi_0}(x)} \right|$
where the means and standard deviations are computed over the $N$ completions sampled from $\pi_0(\cdot \mid x)$.
- Kendall-Tau Statistic (K-T): For the set of $N$ completions sampled for prompt $x$, compare the rankings induced by descending $\hat{r}(x, \cdot)$ and descending $\log \pi_0(\cdot \mid x)$. The normalized Kendall-Tau statistic is
$\mathrm{K\text{-}T}(x) = \frac{C - D}{M}$
where $C$ and $D$ are the counts of concordant and discordant pairs respectively, with $M = \binom{N}{2}$ the total number of completion pairs.
PACS captures per-sample, pointwise conflict; K-T measures global rank disagreement across a prompt's candidate completions. Large PACS values and small K-T values signal actionable misalignment.
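To make the two metrics concrete, the following is a minimal sketch (not the authors' implementation) of how PACS and the K-T statistic could be computed from proxy-reward scores and base-policy log-probabilities for one prompt's completions; all function and variable names are illustrative.

```python
# Minimal sketch (not the authors' released code) of the two conflict metrics,
# assuming per-completion proxy-reward scores and base-policy log-probabilities
# are already available for one prompt.
import numpy as np
from itertools import combinations

def pacs_scores(rewards, logprobs, eps=1e-8):
    """|z-score of proxy reward - z-score of log pi_0(y|x)| per completion."""
    r = np.asarray(rewards, dtype=float)
    lp = np.asarray(logprobs, dtype=float)
    z_r = (r - r.mean()) / (r.std() + eps)
    z_p = (lp - lp.mean()) / (lp.std() + eps)
    return np.abs(z_r - z_p)

def kendall_tau(rewards, logprobs):
    """Normalized Kendall-Tau statistic (C - D) / M over all completion pairs."""
    n = len(rewards)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = np.sign(rewards[i] - rewards[j]) * np.sign(logprobs[i] - logprobs[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    m = n * (n - 1) / 2  # M = total number of pairs
    return (concordant - discordant) / m

# Toy usage: five completions for a single prompt.
r = [0.2, 1.5, -0.3, 0.9, 0.1]            # proxy-reward scores hat{r}(x, y_j)
lp = [-12.0, -30.5, -11.2, -15.0, -13.4]  # log pi_0(y_j | x)
print(pacs_scores(r, lp), kendall_tau(r, lp))
```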
3. SHF-CAS Algorithmic Procedure
SHF-CAS iteratively selects high-conflict prompt-completion pairs for human feedback as follows (summarized and lightly paraphrased for technical clarity):
Inputs:
- Base policy $\pi_0$
- Proxy reward $\hat{r}$
- Prompt pool $\mathcal{X}$
- Number of completions per prompt $N$
- Conflict thresholds $\tau$ (K-T) and $\delta$ (PACS)
- Human-feedback budget $B$
- Max iterations $T$
Iteration Steps:
- For each prompt $x \in \mathcal{X}$, sample $N$ completions $y_1, \dots, y_N \sim \pi_0(\cdot \mid x)$ and score each with $\hat{r}$.
- If $\mathrm{K\text{-}T}(x) \geq \tau$, exclude $x$ from the conflict set (proxy-policy agreement).
- For the remaining prompts, compute $\mathrm{PACS}(x, y_j)$ for each completion; if $\mathrm{PACS}(x, y_j) \geq \delta$, add the pair $(x, y_j)$ to the conflict set $\mathcal{C}$.
- If $|\mathcal{C}| > B$, retain the top-$B$ pairs by descending PACS.
- Obtain human feedback on $\mathcal{C}$ to form the labeled set $\mathcal{D}_{\mathrm{HF}}$.
- Update (fine-tune) $\hat{r}$ using $\mathcal{D}_{\mathrm{HF}}$; retrain $\pi_0$ via RL against the updated $\hat{r}$.
The loop terminates on feedback exhaustion or depletion of high-conflict regions.
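Assuming the metric helpers sketched in Section 2 and hypothetical interfaces `sample_completions`, `proxy_reward`, and `policy_logprob`, one selection round of the loop above could look roughly as follows; `tau_kt`, `delta_pacs`, and `budget` correspond to $\tau$, $\delta$, and $B$. This is an illustrative sketch, not the released implementation.

```python
# Hypothetical single selection round of SHF-CAS, reusing pacs_scores and
# kendall_tau from the sketch in Section 2. sample_completions, proxy_reward,
# and policy_logprob are assumed callables, not part of the paper.
def select_conflict_pairs(prompts, n_completions, tau_kt, delta_pacs, budget,
                          sample_completions, proxy_reward, policy_logprob):
    conflict_set = []  # entries are (pacs_value, prompt, completion)
    for x in prompts:
        ys = sample_completions(x, n_completions)        # y_1..y_N ~ pi_0(.|x)
        r = [proxy_reward(x, y) for y in ys]
        lp = [policy_logprob(x, y) for y in ys]
        if kendall_tau(r, lp) >= tau_kt:
            continue                                     # proxy and policy agree
        for y, p in zip(ys, pacs_scores(r, lp)):
            if p >= delta_pacs:
                conflict_set.append((p, x, y))
    conflict_set.sort(key=lambda t: t[0], reverse=True)  # most conflicted first
    return [(x, y) for _, x, y in conflict_set[:budget]] # respect budget B
```

The returned pairs would then be routed to annotators, after which $\hat{r}$ is refit and $\pi_0$ retrained as described in the next section.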
4. Integration of Human Feedback and Training Loop
After annotation, the training data for $\hat{r}$ is augmented with the new human-feedback examples (e.g., pairwise preference labels). Further optimization is performed via a Bradley-Terry (logistic) loss:
$\mathcal{L}(\hat{r}) = -\,\mathbb{E}_{(x, y^{+}, y^{-}) \sim \mathcal{D}_{\mathrm{HF}}}\left[\log \sigma\big(\hat{r}(x, y^{+}) - \hat{r}(x, y^{-})\big)\right]$
where $y^{+}$ is the preferred and $y^{-}$ the rejected completion. With the refined $\hat{r}$, policy fine-tuning proceeds using standard RL algorithms (such as PPO with a KL penalty to the reference policy $\pi_0$). SHF-CAS can be applied for multiple iterations to progressively sharpen alignment in high-conflict regions (Liu et al., 10 Dec 2025).
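As a rough illustration of this update, the sketch below assumes a PyTorch reward model that returns a scalar score for a (prompt, completion) pair and a dataset of preference triples gathered by SHF-CAS; `reward_model`, `d_hf`, and the function name are hypothetical, not the paper's code.

```python
# Illustrative Bradley-Terry update in PyTorch. reward_model is assumed to be an
# nn.Module returning a scalar tensor score for a (prompt, completion) pair, and
# d_hf an iterable of (prompt, preferred, rejected) triples from annotators.
import torch
import torch.nn.functional as F

def bradley_terry_step(reward_model, optimizer, d_hf):
    optimizer.zero_grad()
    losses = []
    for prompt, chosen, rejected in d_hf:
        margin = reward_model(prompt, chosen) - reward_model(prompt, rejected)
        # -log sigma(margin), written with softplus for numerical stability
        losses.append(F.softplus(-margin))
    loss = torch.stack(losses).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```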
5. Theoretical Properties and Methodological Significance
No explicit convergence or finite-sample optimality results are provided. However, theoretical remarks emphasize:
- High-conflict regions, as identified by PACS and K-T, mark loci of shared ignorance or disagreement, where model-only RLHF is most susceptible to misalignment.
- Both conflict metrics are scale-invariant and selectively highlight local (PACS) and global (K-T) misalignments.
- The method draws connections to advantage-weighted regression and disagreement-based active learning paradigms, but does not yield analytic sample-complexity bounds.
6. Experimental Protocol and Key Results
Benchmark Tasks:
- Safety alignment: PKU-SafeRLHF dataset (82K comparisons across 19 harm types; split: 73.9K train / 8.2K test). $\pi_0$: Pythia-6.9B SFT (7 categories). $\hat{r}$: Pythia-1B RFT (8 disjoint categories + 1 overlap).
- Helpfulness alignment: Anthropic HH-RLHF (161K train / 8.5K test). $\pi_0$: Pythia-6.9B trained on 30% of the dataset. $\hat{r}$: Pythia-1B trained on a different 30% split.
Baselines:
- PPO (RLHF against $\hat{r}$)
- RSO (Rejection Sampling + Oracle)
- Random annotation (matched sample count)
- Oracles: beaver-7b-unified-cost, RM-Mistral-7B, GPT-4o as automated judge
Results (excerpt):
| Model | Safety (lower=better) | PACS | K-T | Helpfulness (higher=better) | PACS | K-T |
|---|---|---|---|---|---|---|
| PPO | 3.92 | 1.11 | 0.042 | -2.32 | 1.85 | 0.27 |
| RSO | 2.84 | 1.14 | 0.037 | -1.02 | 1.62 | 0.34 |
| SHF-CAS (best δ) | -2.90 | 0.16 | 0.34 | +3.36 | 0.61 | 0.64 |
| Random | -1.46 | 1.41 | 0.017 | +1.03 | 0.98 | 0.46 |
SHF-CAS consistently outperforms both PPO and RSO. Higher conflict thresholds target only the most extreme disagreements, yielding greater alignment gains from fewer feedback examples. The method achieves superior performance at substantially reduced annotation cost (a small set of conflict-selected samples vs. tens of thousands of randomly sampled comparisons) (Liu et al., 10 Dec 2025).
7. Discussion, Limitations, and Future Directions
Ablation studies demonstrate robustness to $N$ (the number of completions per prompt), $\tau$, and $\delta$. Multiple SHF-CAS iterations yield continued, though diminishing, improvement. Reported limitations include:
- Dependence on sufficiently performant initial $\pi_0$ and $\hat{r}$ models
- Manual tuning of the thresholds $\tau$ and $\delta$ (with possible automation via cross-validation or percentile heuristics)
- Scalability concerns for large prompt pools or large base models
- Absence of theoretical finite-sample guarantees
Outlook for future work includes adaptive thresholding, incorporation of richer conflict metrics, joint multi-objective sampling across diverse alignment axes, and formal study of sample complexity and convergence properties (Liu et al., 10 Dec 2025).
Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) establishes a principled, metrics-driven framework for active LLM alignment, enabling strategic human intervention precisely where policy and proxy uncertainties coincide, and delivering empirically validated annotation efficiency and alignment gains.