Proxy-Policy Alignment Conflict Score (PACS)
- PACS is a normalized metric that quantifies local misalignments between a base policy and a proxy reward model using per-prompt z-scores.
- It flags extreme disagreements, such as completions the reward model scores highly but the policy rarely produces (and vice versa), to direct focused human feedback.
- Integrated into the SHF-CAS framework, PACS improves sample efficiency and alignment performance while reducing annotation costs.
The Proxy-Policy Alignment Conflict Score (PACS) is a normalized, pointwise metric designed to identify and quantify the most extreme local misalignments between a base policy and a proxy reward model in reward-model-based LLM alignment. It serves as a core component of conflict-aware frameworks targeting misalignment, providing a principled tool for diagnosing, analyzing, and mitigating cases where the policy's learned behaviors substantially diverge from the preferences encoded in a biased or imperfect reward model (Liu et al., 10 Dec 2025).
1. Formal Definition
Consider a prompt $x$ and a sampled completion $y$. The base policy $\pi_\theta$ assigns a likelihood $\pi_\theta(y \mid x)$ to $y$ given $x$, while the proxy reward model $r_\phi$ outputs a scalar reward $r_\phi(x, y)$ for the pair. Raw comparisons such as $\log \pi_\theta(y \mid x) - r_\phi(x, y)$ are invalid due to incompatible scales and calibrations. PACS resolves this by normalizing both quantities via per-prompt $z$-scores:
Given $N$ completions $\{y_i\}_{i=1}^{N}$ sampled from $\pi_\theta(\cdot \mid x)$, let $\mu_\pi, \sigma_\pi$ and $\mu_r, \sigma_r$ denote the per-prompt empirical means and standard deviations of the policy log-probabilities $\log \pi_\theta(y_i \mid x)$ and the proxy rewards $r_\phi(x, y_i)$, respectively.

Define $z$-scores:

$$z^{\pi}_{i} = \frac{\log \pi_\theta(y_i \mid x) - \mu_\pi}{\sigma_\pi}, \qquad z^{r}_{i} = \frac{r_\phi(x, y_i) - \mu_r}{\sigma_r}.$$

The PACS metric for $(x, y_i)$ is:

$$\mathrm{PACS}(x, y_i) = \bigl| z^{r}_{i} - z^{\pi}_{i} \bigr|.$$
This normalization step makes PACS invariant to the absolute magnitudes of the policy and reward outputs, yielding a robust, interpretable conflict score across prompts.
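As a concrete illustration, the following minimal NumPy sketch computes PACS for a single prompt. It assumes the log-probabilities and proxy rewards of the $N$ sampled completions are already available as arrays, and takes the conflict score to be the absolute gap between the two $z$-scores, matching the definition above.

```python
import numpy as np

def pacs_scores(logprobs: np.ndarray, rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-prompt PACS: absolute gap between z-scored policy log-probabilities
    and z-scored proxy rewards for the N completions of a single prompt.

    logprobs: shape (N,), log pi_theta(y_i | x) for each sampled completion.
    rewards:  shape (N,), r_phi(x, y_i) for each sampled completion.
    """
    z_pi = (logprobs - logprobs.mean()) / (logprobs.std() + eps)
    z_r = (rewards - rewards.mean()) / (rewards.std() + eps)
    return np.abs(z_r - z_pi)

# Toy example: 8 completions for one prompt (synthetic numbers).
logprobs = np.array([-12.3, -8.1, -15.6, -9.4, -11.0, -7.8, -14.2, -10.5])
rewards = np.array([0.4, 1.2, 2.9, 0.8, 0.5, 1.0, -0.3, 0.6])
print(pacs_scores(logprobs, rewards))
```

In this toy example, the third completion pairs a high proxy reward with a low policy log-probability and therefore receives by far the largest PACS value.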
2. Intuition and Conflict Typology
PACS directly measures the degree to which the base policy's distribution and the proxy reward model's assessment for a given completion are locally discordant:
- High proxy reward & low model probability: The reward model values an answer that the base model would not naturally produce, signaling a potential area where the proxy is extrapolating beyond the model's actual competence.
- Low proxy reward & high model probability: The base model confidently produces responses the proxy deems undesirable, potentially reflecting biases or miscalibrations in policy learning.
These "outlier" pairs, with high PACS values, frequently reside in domains of shared ignorance—prompt-completion regions insufficiently covered or understood by both the policy and the reward model. Such regions are susceptible to persistent misalignment and benefit most from targeted human supervision rather than routine fine-tuning.
3. Normalization and Practical Computation
PACS's $z$-score normalization is computed per prompt:
- A small batch of completions (up to $16$ per prompt) is sampled from $\pi_\theta(\cdot \mid x)$ to estimate the empirical means and standard deviations.
- This ensures that PACS values reflect relative disagreement, independent of baseline scale differences, reward uncalibration, or skewed model confidences.
The practical procedure is robust and computationally efficient under realistic sampling budgets.
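A schematic of this per-prompt procedure is sketched below, building on the `pacs_scores` function above; `sample_completions`, `policy_logprob`, and `proxy_reward` are hypothetical helper functions standing in for whatever policy and reward-model interfaces are available in a given setup.

```python
import numpy as np

def prompt_pacs(prompt: str, n_samples: int = 16):
    """Sample completions for one prompt and score their policy-proxy conflict.

    sample_completions, policy_logprob, and proxy_reward are placeholders for
    the actual policy / reward-model calls; pacs_scores is defined earlier.
    """
    completions = sample_completions(prompt, n=n_samples)   # y_1..y_N ~ pi_theta(.|x)
    logprobs = np.array([policy_logprob(prompt, y) for y in completions])
    rewards = np.array([proxy_reward(prompt, y) for y in completions])
    return completions, logprobs, rewards, pacs_scores(logprobs, rewards)
```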
4. Relationship to the Global Kendall-Tau Distance
PACS is a pointwise metric, whereas the global Kendall-Tau (K-T) distance per prompt quantifies the rank correlation between the base policy and the proxy reward model across all completions:
$$\mathrm{K\text{-}T}(x) = \frac{C - D}{\binom{N}{2}},$$

where $C$ is the number of concordant pairs and $D$ is the number of discordant pairs, based on the two rankings (policy log-probability and proxy reward) of the $N$ sampled completions.
- PACS identifies and localizes the single completions responsible for significant conflicts.
- K-T Distance provides a global summary of ranking misalignment for a prompt.
SHF-CAS, the conflict-aware feedback algorithm, exploits this complementarity: it first selects prompts with high global conflict (low K-T) and then targets high-PACS completions for human-in-the-loop intervention.
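For reference, the per-prompt K-T statistic can be computed directly from the two score vectors; the sketch below uses `scipy.stats.kendalltau`, which performs the same concordant-versus-discordant pair comparison (its default tau-b variant additionally corrects for ties).

```python
from scipy.stats import kendalltau

def prompt_kendall_tau(logprobs, rewards) -> float:
    """Rank agreement between policy log-probabilities and proxy rewards for the
    N completions of one prompt: values near 1.0 mean the two rankings agree,
    values near or below zero indicate strong global policy-proxy conflict."""
    tau, _ = kendalltau(logprobs, rewards)
    return float(tau)
```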
5. Integration within the SHF-CAS Alignment Framework
The Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) algorithm employs PACS as a central mechanism for efficient human supervision. The workflow is as follows:
- Initialize the human feedback budget $B$; repeat for up to $T$ refinement rounds.
- For each prompt $x$ in dataset $\mathcal{D}$:
  a. Sample $N$ completions $\{y_i\}$ from $\pi_\theta(\cdot \mid x)$.
  b. Compute the per-prompt Kendall-Tau statistic $\mathrm{K\text{-}T}(x)$.
  c. If $\mathrm{K\text{-}T}(x)$ falls below a conflict threshold (significant conflict), then:
    - Compute $\mathrm{PACS}(x, y_i)$ for all $y_i$.
    - If $\mathrm{PACS}(x, y_i)$ exceeds a conflict threshold, add $(x, y_i)$ to the conflict pool $\mathcal{C}$.
- If $|\mathcal{C}| > B$, retain only the top-$B$ pairs by mean PACS.
- Obtain human feedback on $\mathcal{C}$.
- Refine $\pi_\theta$ and $r_\phi$ using the new data. Repeat as budget allows.
This method prioritizes supervision on completions that are most indicative of policy-proxy misalignment, optimizing human feedback efficiency.
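A condensed sketch of one such selection round is given below, reusing the `prompt_pacs` and `prompt_kendall_tau` helpers from the earlier sketches; the threshold values and argument names are illustrative placeholders rather than the paper's exact interface, and the human-annotation and refinement steps are omitted.

```python
def shf_cas_round(prompts, budget, kt_threshold=0.2, pacs_threshold=1.0, n_samples=16):
    """One conflict-aware selection round: return up to `budget` (prompt, completion)
    pairs exhibiting both global (low K-T) and local (high PACS) conflict."""
    conflict_pool = []  # candidate (prompt, completion, pacs) triples
    for prompt in prompts:
        completions, logprobs, rewards, pacs = prompt_pacs(prompt, n_samples)
        if prompt_kendall_tau(logprobs, rewards) < kt_threshold:   # global conflict gate
            for y, score in zip(completions, pacs):
                if score > pacs_threshold:                          # extreme local conflict
                    conflict_pool.append((prompt, y, float(score)))
    # If the pool exceeds the human-feedback budget, keep only the highest-PACS pairs.
    conflict_pool.sort(key=lambda t: t[2], reverse=True)
    return conflict_pool[:budget]
```

In the full loop, the returned pairs would be annotated by humans and used to refine both $\pi_\theta$ and $r_\phi$ before the next round.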
6. Key Empirical Results
Empirical evaluation on PKU-SafeRLHF (safety) and Anthropic HH-RLHF (helpfulness) demonstrates the impact of PACS-driven interventions:
| Dataset | Approach | GoldR | PACS | K-T |
|---|---|---|---|---|
| PKU-SafeRLHF | PPO | 3.92 | 1.11 | 0.042 |
| PKU-SafeRLHF | RSO | 2.84 | 1.14 | 0.037 |
| PKU-SafeRLHF | SHF-CAS | –2.90 | 0.16 | 0.34 |
| HH-RLHF | PPO | –2.32 | 1.85 | 0.27 |
| HH-RLHF | RSO | –1.02 | 1.62 | 0.34 |
| HH-RLHF | SHF-CAS | 3.36 | 0.61 | 0.64 |
- Raising the PACS threshold (targeting more extreme conflicts) increases alignment performance, even with fewer labeled examples.
- SHF-CAS systematically outperforms random selection of feedback targets and statistical rejection sampling, indicating sample efficiency is driven by conflict-based selection as opposed to data quantity alone.
- Post-refinement, the average PACS drops markedly, and K-T distance improves, showing that SHF-CAS using PACS can rapidly localize and mitigate residual misalignment.
This suggests that addressing high-PACS pairs constitutes an effective strategy for targeted supervision in alignment pipelines subject to annotation noise or reward model bias.
7. Significance and Implications
PACS provides a simple and reliable means of surfacing the most consequential local misalignments between policy behaviors and supervisory signals. By normalizing disagreements and integrating with conflict-aware selection schemes, PACS enables alignment frameworks to:
- Pinpoint and prioritize regions of uncertainty, bias, or shared ignorance for further inspection.
- Achieve higher alignment quality with reduced annotation costs.
- Offer a principled diagnostic perspective on the failure modes of reward-model-based RLHF and similar paradigms.
A plausible implication is that future alignment protocols can leverage PACS-like metrics for curriculum design, automated feedback routing, or transparency tools, thus facilitating more robust and interpretable LLM deployment under weak or noisy supervision (Liu et al., 10 Dec 2025).