
Proxy-Policy Alignment Conflict Score (PACS)

Updated 11 December 2025
  • PACS is a normalized metric that quantifies local misalignments between a base policy and a proxy reward model using per-prompt z-scores.
  • It identifies extreme disagreements—such as responses the reward model overvalues but the policy rarely produces—to direct focused human feedback.
  • Integrated into the SHF-CAS framework, PACS improves sample efficiency and alignment performance while reducing annotation costs.

The Proxy-Policy Alignment Conflict Score (PACS) is a normalized, pointwise metric designed to identify and quantify the most extreme local misalignments between a base policy and a proxy reward model in reward-model-based LLM alignment. It serves as a core component of conflict-aware frameworks targeting misalignment, providing a principled tool for diagnosing, analyzing, and mitigating cases where the policy's learned behaviors substantially diverge from the preferences encoded in a biased or imperfect reward model (Liu et al., 10 Dec 2025).

1. Formal Definition

Consider a prompt $x$ and a sampled completion $y$. The base policy $\pi_{\text{base}}(y|x)$ assigns a likelihood to $y$ given $x$, while the proxy reward model $r_{\text{proxy}}(x, y)$ outputs a scalar reward for the pair. Raw comparisons such as $|r_{\text{proxy}}(x, y) - \log \pi_{\text{base}}(y|x)|$ are invalid because the two quantities have incompatible scales and calibrations. PACS resolves this by normalizing both quantities via per-prompt $z$-scores:

Given $N$ completions $\{y_1, \ldots, y_N\}$ sampled from $\pi_{\text{base}}(\cdot|x)$,

  • $\mu_r^x = \text{mean}_i\, [r_{\text{proxy}}(x, y_i)]$
  • $\sigma_r^x = \text{std}_i\, [r_{\text{proxy}}(x, y_i)]$
  • $\mu_\pi^x = \text{mean}_i\, [\log \pi_{\text{base}}(y_i|x)]$
  • $\sigma_\pi^x = \text{std}_i\, [\log \pi_{\text{base}}(y_i|x)]$

Define $z$-scores:

  • $z_r(x, y) = [r_{\text{proxy}}(x, y) - \mu_r^x] / \sigma_r^x$
  • $z_\pi(x, y) = [\log \pi_{\text{base}}(y|x) - \mu_\pi^x] / \sigma_\pi^x$

The PACS metric for $(x, y)$ is:

$$\operatorname{PACS}(x, y) = |z_r(x, y) - z_\pi(x, y)|$$

This normalization step makes PACS invariant to the absolute magnitudes of the policy and reward outputs, yielding a robust, interpretable conflict score across prompts.
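
The following minimal sketch shows how PACS could be computed for one prompt from arrays of proxy rewards and policy log-probabilities. Function and variable names are illustrative, not from the source:

```python
import numpy as np

def pacs_scores(proxy_rewards, policy_logprobs):
    """Compute PACS for each of the N completions sampled for one prompt.

    proxy_rewards   : array of shape (N,), r_proxy(x, y_i)
    policy_logprobs : array of shape (N,), log pi_base(y_i | x)
    """
    r = np.asarray(proxy_rewards, dtype=float)
    lp = np.asarray(policy_logprobs, dtype=float)

    # Per-prompt z-score normalization of both quantities.
    z_r = (r - r.mean()) / r.std()
    z_pi = (lp - lp.mean()) / lp.std()

    # PACS(x, y_i) = |z_r - z_pi|: large values flag local policy-proxy conflicts.
    return np.abs(z_r - z_pi)

# Example with 8 synthetic completions for a single prompt.
rewards = [0.2, 1.3, -0.5, 0.9, 2.4, 0.1, -0.3, 0.7]
logps = [-12.0, -9.5, -11.8, -10.2, -25.0, -11.0, -12.5, -10.6]
print(pacs_scores(rewards, logps))  # the 5th completion stands out: high reward, low probability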

2. Intuition and Conflict Typology

PACS directly measures the degree to which the base policy's distribution and the proxy reward model's assessment for a given completion are locally discordant:

  • High proxy reward & low model probability: The reward model values an answer that the base model would not naturally produce, signaling a potential area where the proxy is extrapolating beyond the model's actual competence.
  • Low proxy reward & high model probability: The base model confidently produces responses the proxy deems undesirable, potentially reflecting biases or miscalibrations in policy learning.

These "outlier" (x,y)(x, y) pairs, with high PACS values, frequently reside in domains of shared ignorance—prompt-completion regions insufficiently covered or understood by both the policy and the reward model. Such regions are susceptible to persistent misalignment and benefit most from targeted human supervision rather than routine fine-tuning.

3. Normalization and Practical Computation

PACS's $z$-score normalization is computed per prompt:

  • $N \approx 8$–$16$ completions are sampled from $\pi_{\text{base}}(\cdot|x)$ to estimate the empirical means and standard deviations.
  • This ensures that PACS values reflect relative disagreement, independent of baseline scale differences, reward miscalibration, or skewed model confidences.

The practical procedure is robust and computationally efficient under realistic sampling budgets.
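
The sketch below illustrates this per-prompt procedure end to end. The `sample_completions`, `score_reward`, and `logprob` helpers are hypothetical stand-ins for the policy and reward model interfaces, and the small epsilon added to the standard deviations is a numerical safeguard not specified in the source:

```python
import numpy as np

N = 8  # 8-16 completions per prompt, per the recommended sampling budget

def pacs_for_prompt(prompt, sample_completions, score_reward, logprob, eps=1e-8):
    """Estimate per-prompt PACS values from N sampled completions.

    sample_completions(prompt, n) -> list of n completions   (hypothetical helper)
    score_reward(prompt, y)       -> scalar proxy reward      (hypothetical helper)
    logprob(prompt, y)            -> log pi_base(y | prompt)  (hypothetical helper)
    """
    ys = sample_completions(prompt, N)
    r = np.array([score_reward(prompt, y) for y in ys])
    lp = np.array([logprob(prompt, y) for y in ys])

    # eps keeps the division stable if all rewards or log-probs are nearly identical.
    z_r = (r - r.mean()) / (r.std() + eps)
    z_pi = (lp - lp.mean()) / (lp.std() + eps)
    return ys, np.abs(z_r - z_pi)
```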

4. Relationship to the Global Kendall-Tau Distance

PACS is a pointwise metric, whereas the global Kendall-Tau (K-T) distance per prompt quantifies the rank correlation between the base policy and the proxy reward model across all $N$ completions:

$$\text{K-T}(x) = \frac{C - D}{\tfrac{1}{2} N (N-1)}$$

where CC is the number of concordant pairs and DD is the number of discordant pairs, based on the two rankings (policy log-probability and proxy reward).

  • PACS identifies and localizes the individual completions responsible for significant conflicts.
  • K-T Distance provides a global summary of ranking misalignment for a prompt.

SHF-CAS, the conflict-aware feedback algorithm, exploits this complementarity: it first selects prompts with high global conflict (low K-T) and then targets high-PACS completions for human-in-the-loop intervention.
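
A minimal sketch of the per-prompt K-T computation, counting concordant and discordant pairs directly to match the formula above (ties are treated as neither concordant nor discordant, an assumption the source does not spell out):

```python
import numpy as np

def kendall_tau(proxy_rewards, policy_logprobs):
    """Per-prompt Kendall-Tau agreement between the reward and policy rankings."""
    r = np.asarray(proxy_rewards, dtype=float)
    lp = np.asarray(policy_logprobs, dtype=float)
    n = len(r)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(r[i] - r[j]) * np.sign(lp[i] - lp[j])
            if s > 0:
                concordant += 1   # both rankings order the pair the same way
            elif s < 0:
                discordant += 1   # the rankings disagree on this pair
    return (concordant - discordant) / (0.5 * n * (n - 1))
```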

5. Integration within the SHF-CAS Alignment Framework

The Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) algorithm employs PACS as a central mechanism for efficient human supervision. The workflow is as follows:

  1. Initialize the human feedback budget $H$; repeat for up to $I$ refinement rounds.
  2. For each prompt $x$ in dataset $D$:
    a. Sample $N$ completions $\{y_i\}$ from $\pi_{\text{base}}$.
    b. Compute the Kendall-Tau distance $\text{K-T}(x)$.
    c. If $\text{K-T}(x) < \tau$ (significant conflict), then:
      • Compute $\operatorname{PACS}(x, y_i)$ for all $y_i$.
      • If $\text{mean}_i\, \operatorname{PACS}(x, y_i) > \delta$, add $(x, y_i)$ to the conflict pool $C$.
  3. If $|C| > H$, retain only the top $H$ pairs by mean PACS.
  4. Obtain human feedback on $C$.
  5. Refine $r_{\text{proxy}}$ and $\pi_{\text{base}}$ using the new data. Repeat as budget allows.

This method prioritizes supervision on completions that are most indicative of policy-proxy misalignment, optimizing human feedback efficiency.
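
A schematic rendering of one round of this selection loop, building on the `pacs_scores` and `kendall_tau` sketches above; the human-feedback and refinement steps are left abstract, and the thresholds $\tau$ and $\delta$ are illustrative values rather than those from the source:

```python
def shf_cas_round(prompts, sample, reward, logprob, tau=0.2, delta=1.5, budget_H=100):
    """One SHF-CAS round: collect the conflict pool C for human annotation."""
    pool = []  # (mean PACS, prompt, completions) entries
    for x in prompts:
        ys = sample(x, 8)                              # step 2a: sample N completions
        r = [reward(x, y) for y in ys]
        lp = [logprob(x, y) for y in ys]
        if kendall_tau(r, lp) < tau:                   # steps 2b-c: global conflict check
            scores = pacs_scores(r, lp)
            if scores.mean() > delta:                  # step 2c: local conflict check
                pool.append((scores.mean(), x, ys))
    # step 3: keep only the top-H entries by mean PACS if over budget
    pool.sort(key=lambda item: item[0], reverse=True)
    return pool[:budget_H]
    # steps 4-5 (human feedback on the pool, refining r_proxy and pi_base) happen outside
```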

6. Key Empirical Results

Empirical evaluation on PKU-SafeRLHF (safety) and Anthropic HH-RLHF (helpfulness) demonstrates the impact of PACS-driven interventions:

Dataset        Approach                  Gold Reward   PACS   K-T
PKU-SafeRLHF   PPO                             3.92    1.11   0.042
               RSO                             2.84    1.14   0.037
               SHF-CAS ($\delta=1.6$)         –2.90    0.16   0.34
HH-RLHF        PPO                            –2.32    1.85   0.27
               RSO                            –1.02    1.62   0.34
               SHF-CAS ($\delta=1.5$)          3.36    0.61   0.64
  • Raising the PACS threshold $\delta$ (targeting more extreme conflicts) increases alignment performance, even with fewer labeled examples.
  • SHF-CAS systematically outperforms random selection of feedback targets and statistical rejection sampling, indicating that sample efficiency is driven by conflict-based selection rather than data quantity alone.
  • Post-refinement, the average PACS drops markedly, and K-T distance improves, showing that SHF-CAS using PACS can rapidly localize and mitigate residual misalignment.

This suggests that addressing high-PACS pairs constitutes an effective strategy for targeted supervision in alignment pipelines subject to annotation noise or reward model bias.

7. Significance and Implications

PACS provides a simple and reliable means of surfacing the most consequential local misalignments between policy behaviors and supervisory signals. By normalizing disagreements and integrating with conflict-aware selection schemes, PACS enables alignment frameworks to:

  • Pinpoint and prioritize regions of uncertainty, bias, or shared ignorance for further inspection.
  • Achieve higher alignment quality with reduced annotation costs.
  • Offer a principled diagnostic perspective on the failure modes of reward-model-based RLHF and similar paradigms.

A plausible implication is that future alignment protocols can leverage PACS-like metrics for curriculum design, automated feedback routing, or transparency tools, thus facilitating more robust and interpretable LLM deployment under weak or noisy supervision (Liu et al., 10 Dec 2025).

References

  • Liu et al. (10 Dec 2025).
