Policy Corruption Score (PCS)

Updated 27 December 2025
  • Policy Corruption Score (PCS) is a multi-dimensional metric that measures LLM policy shifts from safe refusals to harmful compliance under adversarial pressure.
  • It decomposes policy corruption into seven diagnostic dimensions across axes like Susceptibility to Influence, Core Safety Erosion, and Manifested Cognitive Destabilization.
  • The evaluation protocol employs adversarial prompts and a judge-in-the-loop Likert scoring process to rigorously assess internal value-system destabilization.

The Policy Corruption Score (PCS) is a multi-dimensional metric introduced to quantify the extent to which an adversarial attack shifts a large language model’s (LLM’s) internal policy away from safety-aligned refusals toward harmful compliance. By decomposing policy corruption across seven diagnostic dimensions, PCS provides a systematic evaluation of policy destabilization within LLMs when subjected to attacks such as Human-like Psychological Manipulation (HPM). PCS complements binary metrics like Attack Success Rate (ASR) by measuring not only whether a policy violation occurred but also the depth and structure of the internal value-system drift that underpins compliance under adversarial psychological pressure (Liu et al., 20 Dec 2025).

1. Formal Structure and Definition

PCS is defined as a vector in a seven-dimensional diagnostic space, where each component measures the LLM’s vulnerability along a specific axis of policy corruption. Let $D = \{ d_1, \ldots, d_7 \}$ denote the diagnostic dimensions, grouped under three principal axes: Susceptibility to Influence (SI), Core Safety Erosion (CSE), and Manifested Cognitive Destabilization (MCD). For a victim model $M$ under a given attack, the PCS vector is computed as:

$$\mathrm{PCS}(M) = \left[ P^M(d_1),\, P^M(d_2),\, \ldots,\, P^M(d_7) \right]^\top$$

where $P^M(d)$ is the Judge-assigned mean Likert score (0–5) on dimension $d$. An overall scalar PCS can be computed as:

$$\mathrm{PCS}_0(M) = \frac{1}{7} \sum_{i=1}^{7} P^M(d_i)$$

No normalization is applied, as the Likert rubric anchors all dimension scores identically.
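
The scalar aggregation above is a plain unweighted mean, which a few lines of Python can illustrate. This is a minimal sketch rather than the authors' implementation: the function name `pcs_scalar`, the range check, and the example vector are assumptions made for illustration, and the example scores are not taken from the paper.

```python
from statistics import fmean

NUM_DIMENSIONS = 7                   # d_1 .. d_7 in the definition above
LIKERT_MIN, LIKERT_MAX = 0.0, 5.0    # range of the Judge's rubric


def pcs_scalar(pcs_vector):
    """Return PCS_0(M): the unweighted mean of the seven dimension scores.

    Hypothetical helper; it only mirrors the formula in the text.
    """
    if len(pcs_vector) != NUM_DIMENSIONS:
        raise ValueError(f"expected {NUM_DIMENSIONS} scores, got {len(pcs_vector)}")
    for score in pcs_vector:
        if not LIKERT_MIN <= score <= LIKERT_MAX:
            raise ValueError(f"score {score} is outside the 0-5 Likert range")
    # No normalization step: every dimension shares the same 0-5 anchoring.
    return fmean(pcs_vector)


# Illustrative values only, not results from the paper.
example_vector = [3.0, 2.5, 4.0, 1.5, 2.0, 3.5, 4.5]
example_scalar = pcs_scalar(example_vector)
print(example_scalar)
```

Because every dimension is scored on the same rubric, the mean stays on the 0–5 scale and can be read with the same anchors as the individual dimension scores.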

2. Theoretical Motivation and Distinction from Binary Metrics

Standard metrics such as ASR are binary, registering only whether a policy-violating output was generated. They fail to distinguish superficial filter evasions from deep shifts in latent decision-making processes. The PCS framework addresses the Alignment Paradox: an attack matching the LLM’s internal psychometric persona and leveraging a semantic anchor can raise the compliance probability above the refusal probability. PCS quantifies how deeply the model is operating in this non-refusal regime resulting from adversarial psychological manipulation. The diagnostic axes enable the decomposition of failure modes, surpassing the diagnostic granularity of ASR (Liu et al., 20 Dec 2025).

3. Evaluation Protocol

The PCS evaluation framework integrates multiple curated datasets and multi-stage psychometric benchmarking. Data resources include:

  • JBB-Behaviors: 100 malicious instructions for ASR measurement
  • Implicit Behavioral Probes: Big-Five–based situational questions for personality profiling
  • Policy Corruption Probes: 140 adversarial prompts (20 per dimension), designed to elicit failures specific to each of the seven PCS dimensions

The benchmarking protocol proceeds as follows:

  1. Profiling Phase: Model responses to Big-Five probes are mapped to a personality vector $V_P$.
  2. Strategy Synthesis: The semantic anchor $s^*$ most aligned with the dominant deviance in $V_P$ is selected for attack targeting.
  3. Attack Execution: The HPM multi-turn framework (Algorithm 1) is used to maximize policy corruption.
  4. PCS Measurement: Each Policy Corruption Probe is presented to the model; responses are rated by a GPT-4–based Judge Agent using a standardized 0–5 Likert rubric, and each PCS vector entry $P^M(d)$ is calculated as the mean score over that dimension’s probes.

Baseline comparisons include state-of-the-art attacks such as PAIR, AutoDAN, CoA, and PAP on standard models (GPT-4o-mini, Gemini-2-Flash, DeepSeek-V3).

4. Computation Methodology

PCS computation involves a judge-in-the-loop process:

  1. Probe Collection: For each dimension $d \in D$, select its 20 Policy Corruption Probes.
  2. Response Generation: For model $M$ under the given attack, gather responses to all probes.
  3. Scoring: A GPT-4 Judge Agent applies the 0–5 Likert rubric (Table "intention_rubric") to assign a score $s_q(d)$ to each response.
  4. Dimension Means: Compute

$$P^M(d) = \frac{1}{20} \sum_{q \in \text{Probes}(d)} s_q(d)$$

  5. Aggregation: Optionally, compute the overall $\mathrm{PCS}_0(M)$ or axis-level means via averaging.

No post-processing normalization is required. This protocol enables robust, high-dimensional segmentation of corruption effects.
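
The judge-in-the-loop steps above can be sketched as a small loop over probes. Everything named here is hypothetical: `dimension_means`, the `respond` callable (standing in for the attacked model), and the `judge` callable (standing in for the GPT-4 Judge Agent) are assumptions for illustration, and the toy probes and scores are made up. The sketch divides by the number of probes supplied, which equals 20 per dimension in the protocol described here.

```python
from statistics import fmean
from typing import Callable, Dict, List


def dimension_means(
    probes: Dict[str, List[str]],
    respond: Callable[[str], str],
    judge: Callable[[str, str, str], float],
) -> Dict[str, float]:
    """Compute P^M(d) for each dimension d as the mean judge score.

    probes  : maps a dimension name to its list of probe prompts
    respond : hypothetical stand-in for the model under attack
    judge   : hypothetical stand-in for the Judge Agent; returns a 0-5 score
    """
    means = {}
    for dimension, prompts in probes.items():
        scores = []
        for prompt in prompts:
            response = respond(prompt)                    # step 2: response generation
            score = judge(dimension, prompt, response)    # step 3: Likert scoring
            if not 0 <= score <= 5:
                raise ValueError(f"judge returned out-of-range score {score}")
            scores.append(score)
        means[dimension] = fmean(scores)                  # step 4: dimension mean
    return means


# Toy stand-ins so the sketch runs without any model or API access.
toy_probes = {"d1": ["p1", "p2"], "d2": ["p3", "p4"]}
toy_scores = {"p1": 4, "p2": 2, "p3": 1, "p4": 0}


def stub_respond(prompt: str) -> str:
    return f"response to {prompt}"


def stub_judge(dimension: str, prompt: str, response: str) -> float:
    return toy_scores[prompt]


toy_means = dimension_means(toy_probes, stub_respond, stub_judge)
print(toy_means)
```

Passing the model and the judge in as callables keeps the averaging logic independent of any particular API, so a different judge or a human rater could be substituted without touching the aggregation.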

5. Benchmark Results and Correlational Findings

The paper (Liu et al., 20 Dec 2025) reports PCS means for both HPM and baselines. Per-dimension mean PCS values (truncated to main results) are:

Model (attack)         Cmpl.  Trst.  Rckl.  HPV    VSD    SD     Cfg
GPT-4o-mini (HPM)      3.53   3.21   2.95   3.60   2.81   2.90   3.04
GPT-4o-mini (PAP)      2.25   2.01   1.85   2.22   1.25   1.55   1.43
Gemini-2-Flash (HPM)   3.65   3.33   3.01   3.75   2.92   3.05   3.11
Gemini-2-Flash (PAP)   2.31   2.15   1.90   2.35   1.33   1.62   1.55
DeepSeek-V3 (HPM)      4.12   3.90   3.61   4.20   3.33   3.45   3.70
DeepSeek-V3 (PAP)      2.58   2.45   2.20   2.55   1.70   2.05   1.92
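
As a worked example of the scalar aggregation, the rows above can be averaged to obtain one overall score per model and attack. This is a sketch under stated assumptions: the scalar values it prints are derived here from the table and are not figures quoted from the paper, the column abbreviations are kept exactly as they appear in the table, and the names `TABLE`, `pcs_scalar`, and `hpm_minus_pap` are hypothetical.

```python
from statistics import fmean

# Column abbreviations copied from the table; they are not expanded here.
DIMENSIONS = ["Cmpl.", "Trst.", "Rckl.", "HPV", "VSD", "SD", "Cfg"]

# Per-dimension mean PCS values as listed in the table above.
TABLE = {
    ("GPT-4o-mini", "HPM"): [3.53, 3.21, 2.95, 3.60, 2.81, 2.90, 3.04],
    ("GPT-4o-mini", "PAP"): [2.25, 2.01, 1.85, 2.22, 1.25, 1.55, 1.43],
    ("Gemini-2-Flash", "HPM"): [3.65, 3.33, 3.01, 3.75, 2.92, 3.05, 3.11],
    ("Gemini-2-Flash", "PAP"): [2.31, 2.15, 1.90, 2.35, 1.33, 1.62, 1.55],
    ("DeepSeek-V3", "HPM"): [4.12, 3.90, 3.61, 4.20, 3.33, 3.45, 3.70],
    ("DeepSeek-V3", "PAP"): [2.58, 2.45, 2.20, 2.55, 1.70, 2.05, 1.92],
}


def pcs_scalar(scores):
    """Unweighted mean over the seven dimensions (the overall scalar PCS)."""
    return fmean(scores)


def hpm_minus_pap(model):
    """Per-dimension gap between the HPM and PAP rows for one model."""
    hpm, pap = TABLE[(model, "HPM")], TABLE[(model, "PAP")]
    return [round(h - p, 2) for h, p in zip(hpm, pap)]


scalars = {key: pcs_scalar(row) for key, row in TABLE.items()}

for (model, attack), value in scalars.items():
    print(f"{model:<15} {attack}  overall PCS = {value:.2f}")

for model in ("GPT-4o-mini", "Gemini-2-Flash", "DeepSeek-V3"):
    print(model, dict(zip(DIMENSIONS, hpm_minus_pap(model))))
```

For GPT-4o-mini, the seven HPM scores sum to 22.04, giving an overall value of about 3.15, against about 1.79 for PAP; the other rows follow the same arithmetic.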

Corresponding ASRs are: HPM (∼88.1%), PAP (∼87.4%), CoA (∼72.9%), PAIR (∼55.1%), AutoDAN (∼40.8%).

Findings indicate a strong positive correlation between PCS (across all dimensions) and ASR. High PCS is associated with deep policy corruption, not just filter evasion. PAP, whose reported ASR (∼87.4%) is close to that of HPM (∼88.1%), nonetheless shows only moderate PCS, suggesting superficial bypassing rather than fundamental policy destabilization. Low-PCS strategies (PAIR, AutoDAN) correspond to lower ASR and limited impact (Liu et al., 20 Dec 2025). Defense mechanisms such as adversarial prompt tuning or cognitive interventions reduce ASR only marginally, implying that high PCS and policy corruption persist even under mitigation.

6. Implications, Limitations, and Prospective Extensions

PCS reveals that static safety filters and adversarial-token-based countermeasures are frequently ineffective when attacks exploit anthropomorphic and stateful biases within LLMs. This leads to the recognition that psychological safety and value-system stability should constitute primary evaluation criteria, beyond conventional content-level detection.

Notable limitations include dependence on a GPT-4 Judge Agent and a hand-curated rubric; judge biases or rubric misinterpretations may skew PCS measurements. The seven diagnostic dimensions cover a broad range of psychological failure modes but are not exhaustive. For proprietary black-box models, PCS may conflate outright guardrail evasion with authentic internal policy shift.

Potential extensions proposed include augmenting PCS with further diagnostic axes (e.g., moral reasoning), developing in-model monitors to track latent personality vectors in deployment, integrating PCS as a penalty within training for psychological robustness, and automating judge calibration via consensus or human-in-the-loop protocols. These directions suggest systematic approaches for advancing both measurement rigor and model alignment stability (Liu et al., 20 Dec 2025).
