
SHF-CAS: Conflict-Aware Sampling for LLM Alignment

Updated 11 December 2025
  • The paper introduces SHF-CAS, a targeted active learning strategy that quantifies conflicts between reward models and base policies for efficient human annotation.
  • It employs scale-invariant metrics—PACS and Kendall-Tau—to accurately detect, rank, and select high-conflict prompt-completion pairs during LLM fine-tuning.
  • Experimental results demonstrate that SHF-CAS improves alignment in safety and helpfulness benchmarks while significantly reducing annotation cost compared to traditional methods.

Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) is a targeted active learning strategy developed to resolve reward-model–policy misalignment in LLM fine-tuning. It formalizes the detection and remediation of "danger zones" where both a reward model and base policy display systematic uncertainty or error, leveraging principled conflict metrics to guide selective human annotation for optimal feedback efficiency (Liu et al., 10 Dec 2025).

1. Motivation and Overview

Reward-model-based fine-tuning pipelines, such as Reinforcement Learning from Human Feedback (RLHF), depend on proxies (learned reward models $r_{\mathrm{proxy}}$) to encode human preferences. However, $r_{\mathrm{proxy}}$ is inherently imperfect: it is susceptible to noise, coverage gaps, and bias. Over-optimizing the policy $\pi$ against such a proxy can induce undesirable behavior (e.g., reward hacking and overoptimization). Disagreements (conflicts) between $\pi_{\mathrm{base}}$ and $r_{\mathrm{proxy}}$ can occur at the response level for given prompts, manifesting either as cases where $r_{\mathrm{proxy}}$ corrects previously underweighted behaviors (complementary knowledge) or as zones of shared ignorance linked to a high risk of misalignment.

SHF-CAS seeks to (i) quantify these conflicts, (ii) actively select high-conflict samples, and (iii) obtain focused human feedback for joint refinement of both $r_{\mathrm{proxy}}$ and $\pi$, improving alignment at minimal annotation cost (Liu et al., 10 Dec 2025).

2. Formal Conflict Metrics

Conflict between the base policy and proxy reward is operationalized using two complementary, scale-invariant metrics:

  • PACS: for a prompt-completion pair $(x, y)$, the pointwise conflict between the standardized proxy reward and the standardized base-policy log-likelihood:

$\mathrm{PACS}(x,y) = \left| \frac{r_{\mathrm{proxy}}(x,y) - \mu_r^x}{\sigma_r^x} - \frac{\log \pi_{\mathrm{base}}(y \mid x) - \mu_\pi^x}{\sigma_\pi^x} \right|$

where the means and standard deviations are computed over the $N$ completions sampled from $\pi_{\mathrm{base}}(\cdot \mid x)$.

  • Kendall-Tau Distance (K-T): for a set of completions $\{y_i\}$, compare the rankings induced by descending $\pi_{\mathrm{base}}(y_i \mid x)$ and $r_{\mathrm{proxy}}(x, y_i)$. The normalized Kendall-Tau statistic is:

$\mathrm{K\text{-}T}(x) = \frac{C - D}{M}$

where $C$ and $D$ are the counts of concordant and discordant pairs, respectively, and $M = \frac{1}{2}N(N-1)$ is the total number of completion pairs.

PACS captures per-sample pointwise conflict; K-T measures global rank (dis)agreement across a prompt’s candidate completions. Large PACS values and small (or negative) K-T values signal actionable misalignment; a minimal computational sketch of both metrics follows.
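The sketch below computes both metrics for one prompt, assuming per-completion proxy-reward scores and base-policy log-likelihoods are already available; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def pacs_scores(proxy_rewards, policy_logprobs, eps=1e-8):
    """Per-completion PACS for one prompt: |z(reward) - z(log pi_base)|,
    with both signals standardized over the N completions of that prompt."""
    r = np.asarray(proxy_rewards, dtype=float)
    lp = np.asarray(policy_logprobs, dtype=float)
    z_r = (r - r.mean()) / (r.std() + eps)      # standardized proxy rewards
    z_lp = (lp - lp.mean()) / (lp.std() + eps)  # standardized policy log-probs
    return np.abs(z_r - z_lp)

def kendall_tau_agreement(proxy_rewards, policy_logprobs):
    """Normalized Kendall-Tau statistic (C - D) / M between the rankings
    induced by the proxy reward and by the base-policy log-likelihood."""
    r = np.asarray(proxy_rewards, dtype=float)
    lp = np.asarray(policy_logprobs, dtype=float)
    n = len(r)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(r[i] - r[j]) * np.sign(lp[i] - lp[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    m = n * (n - 1) / 2  # total number of completion pairs
    return (concordant - discordant) / m
```

In the absence of ties this statistic coincides with the standard Kendall rank correlation (e.g., as computed by scipy.stats.kendalltau).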

3. SHF-CAS Algorithmic Procedure

SHF-CAS iteratively selects high-conflict prompt-completion pairs for human feedback as follows (summarized and lightly paraphrased for technical clarity):

Inputs:

  • Base policy $\pi_{\mathrm{base}}$
  • Proxy reward $r_{\mathrm{proxy}}$
  • Prompt pool $\mathcal{D}$
  • Number of completions per prompt $N$
  • Conflict thresholds $\tau$ (K-T) and $\delta$ (PACS)
  • Human-feedback budget $H$
  • Maximum number of iterations $I$

Iteration Steps:

  1. For each $x \in \mathcal{D}$, sample $\{y_1, \ldots, y_N\} \sim \pi_{\mathrm{base}}(\cdot \mid x)$.
  2. If $\mathrm{K\text{-}T}(x) \ge \tau$, exclude $x$ from the conflict set (proxy and policy agree).
  3. For the remaining prompts, compute $\mu_{\mathrm{PACS}}^x = \frac{1}{N}\sum_{i=1}^N \mathrm{PACS}(x, y_i)$. If $\mu_{\mathrm{PACS}}^x \ge \delta$, add all $(x, y_i)$ pairs to the conflict set $C$.
  4. If $|C| > H$, retain the top-$H$ pairs by descending PACS.
  5. Obtain human feedback on $C$ to form $H_{\mathrm{feedback}}$.
  6. Fine-tune $r_{\mathrm{proxy}}$ on $H_{\mathrm{feedback}}$; retrain $\pi$ via RL against the updated $r_{\mathrm{proxy}}$.

The loop terminates when the human-feedback budget is exhausted or high-conflict regions are depleted. A schematic sketch of one selection round is given below.
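The sketch reuses the pacs_scores and kendall_tau_agreement helpers from Section 2; the sampling, scoring, and log-probability functions are assumed to exist in the surrounding pipeline, and their names are illustrative.

```python
def select_conflict_set(prompts, N, tau, delta, H,
                        sample_completions, score_proxy, logprob_base):
    """One SHF-CAS-style selection round: return up to H high-conflict
    (prompt, completion) pairs, ranked by descending PACS."""
    candidates = []  # list of (pacs_value, prompt, completion)
    for x in prompts:
        ys = sample_completions(x, N)              # y_1..y_N ~ pi_base(.|x)
        rewards = [score_proxy(x, y) for y in ys]  # r_proxy(x, y_i)
        logps = [logprob_base(x, y) for y in ys]   # log pi_base(y_i | x)
        if kendall_tau_agreement(rewards, logps) >= tau:
            continue                               # rankings agree; no conflict here
        pacs = pacs_scores(rewards, logps)
        if pacs.mean() >= delta:                   # prompt-level conflict check
            candidates.extend(zip(pacs, [x] * N, ys))
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [(x, y) for _, x, y in candidates[:H]]  # enforce the feedback budget H
```

The returned pairs would then be annotated to form $H_{\mathrm{feedback}}$ and consumed by steps 5-6 above.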

4. Integration of Human Feedback and Training Loop

After annotation, the training data of $r_{\mathrm{proxy}}$ is augmented with the new human feedback examples (e.g., pairwise preference labels). Further optimization is performed via a Bradley-Terry (logistic-loss) objective:

$\mathcal{L}(r; \mathcal{D} \cup H_{\mathrm{feedback}}) = -\mathbb{E}_{(x, y^+, y^-)}\left[\log \sigma\left(r(x, y^+) - r(x, y^-)\right)\right]$

With the refined $r_{\mathrm{proxy}}$, policy fine-tuning proceeds using RL algorithms such as PPO with a KL penalty toward $\pi_{\mathrm{base}}$. SHF-CAS can be applied for multiple iterations to iteratively sharpen alignment performance in high-conflict regions (Liu et al., 10 Dec 2025).
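A minimal PyTorch-style sketch of the Bradley-Terry loss above, assuming a reward model that returns one scalar score per (prompt, completion) batch element; the interface is illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, x, y_pos, y_neg):
    """-E[ log sigma(r(x, y+) - r(x, y-)) ] over a batch of preference triples."""
    r_pos = reward_model(x, y_pos)  # scalar reward for the preferred completion
    r_neg = reward_model(x, y_neg)  # scalar reward for the rejected completion
    # logsigmoid is numerically more stable than log(sigmoid(.))
    return -F.logsigmoid(r_pos - r_neg).mean()
```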

5. Theoretical Properties and Methodological Significance

No explicit convergence or finite-sample optimality results are provided. However, theoretical remarks emphasize:

  • High-conflict regions, as identified by PACS and K-T, mark loci of shared ignorance or disagreement, where model-only RLHF is most susceptible to misalignment.
  • Both conflict metrics are scale-invariant and selectively highlight local (PACS) and global (K-T) misalignments.
  • The method draws connections to advantage-weighted regression and disagreement-based active learning paradigms, but does not yield analytic sample-complexity bounds.

6. Experimental Protocol and Key Results

Benchmark Tasks:

  • Safety alignment: PKU-SafeRLHF dataset (82K comparisons across 19 harm types; 73.9K/8.2K train/test split). $\pi_{\mathrm{base}}$: Pythia-6.9B SFT (7 categories). $r_{\mathrm{proxy}}$: Pythia-1B RFT (8 disjoint + 1 overlapping categories).
  • Helpfulness alignment: Anthropic HH-RLHF (161K/8.5K train/test split). $\pi_{\mathrm{base}}$: Pythia-6.9B trained on 30% of the dataset. $r_{\mathrm{proxy}}$: Pythia-1B trained on a different 30% split.

Baselines:

  • PPO (RLHF on $r_{\mathrm{proxy}}$)
  • RSO (Rejection Sampling + Oracle)
  • Random annotation (matched sample count)
  • Oracles: beaver-7b-unified-cost, RM-Mistral-7B, GPT-4o as automated judge

Results (excerpt):

| Model | Safety (lower = better) | PACS | K-T | Helpfulness (higher = better) | PACS | K-T |
|---|---|---|---|---|---|---|
| PPO | 3.92 | 1.11 | 0.042 | -2.32 | 1.85 | 0.27 |
| RSO | 2.84 | 1.14 | 0.037 | -1.02 | 1.62 | 0.34 |
| SHF-CAS (best $\delta$) | -2.90 | 0.16 | 0.34 | +3.36 | 0.61 | 0.64 |
| Random | -1.46 | 1.41 | 0.017 | +1.03 | 0.98 | 0.46 |

SHF-CAS consistently outperforms both PPO and RSO. Higher conflict thresholds $\delta$ target more extreme disagreements, yielding greater alignment gains from fewer feedback examples. The method achieves superior performance at substantially reduced annotation cost: on the order of 1K conflict-selected samples versus tens of thousands of randomly sampled ones (Liu et al., 10 Dec 2025).

7. Discussion, Limitations, and Future Directions

Ablation studies demonstrate robustness to $N$ (the number of completions per prompt), $\tau$, and $\delta$. Multiple SHF-CAS iterations yield continued, though diminishing, improvement. Reported limitations include:

  • Dependence on a performant $\pi_{\mathrm{base}}$
  • Manual tuning of the thresholds $\tau$ and $\delta$ (with possible automation via cross-validation or percentiles; see the sketch after this list)
  • Scalability concerns for large $N$ or large base models
  • Absence of theoretical finite-sample guarantees
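As one hypothetical instance of the percentile-based automation mentioned above, $\tau$ and $\delta$ could be set from percentiles of the observed metric distributions over the prompt pool rather than tuned by hand; the scheme and percentile choices below are illustrative, not from the paper.

```python
import numpy as np

def percentile_thresholds(kt_values, mean_pacs_values, kt_pct=25, pacs_pct=75):
    """Pick tau as a low percentile of per-prompt K-T agreement and delta as a
    high percentile of per-prompt mean PACS, so that roughly the most conflicted
    quarter of prompts passes both filters (percentiles are illustrative)."""
    tau = float(np.percentile(kt_values, kt_pct))
    delta = float(np.percentile(mean_pacs_values, pacs_pct))
    return tau, delta
```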

Outlook for future work includes adaptive thresholding, incorporation of richer conflict metrics, joint multi-objective sampling across diverse alignment axes, and a formal study of sample-complexity and convergence properties (Liu et al., 10 Dec 2025).


Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) establishes a principled, metrics-driven framework for active LLM alignment, enabling strategic human intervention precisely where model and proxy uncertainty collude, and delivering empirically validated annotation efficiency and alignment gains.
