- The paper introduces SNCA, a three-phase framework to quantify LLM internal alignment with explicit self-stated safety policies.
- It evaluates four state-of-the-art models across 45 harm categories using more than 47,000 outputs to reveal significant gaps between declared rules and behavior.
- The study demonstrates an architecture-induced trade-off between policy articulability and behavioral consistency, challenging current LLM alignment methods.
Reflexive Auditing of LLM Safety: Evaluating Symbolic-Neural Policy Consistency
Introduction
This paper systematically examines the alignment of LLMs' explicit self-stated safety policies with their actual behavioral manifestations. It introduces the Symbolic-Neural Consistency Audit (SNCA), a structured framework that elicits a model's own policy statements, formalizes them as logic predicates (Absolute, Conditional, Adaptive), and audits for behavioral compliance across three safety benchmarks. The work is distinguished by its reflexive approach, treating models as both policy author and behavioral subject, and defining Symbolic-Neural Consistency Score (SNCS) as a quantitative measure of internal alignment.
The SNCA Framework
SNCA is a three-phase pipeline explicitly isolating policy elicitation from behavioral measurement:
- Policy Extraction and Typing: Each model, for each harm category, is queried using a highly structured template (five probing questions), demanding explicit, testable statements regarding refusal, compliance conditions, and sensitivity to framing. Policy statements are classified by an external LLM judge into one of four types: Absolute, Conditional, Adaptive, or Opaque.
- Behavioral Testing: Models are evaluated on thousands of benchmark prompts per category using a neutral, non-inductive system prompt. Behavioral outputs are deterministically classified (REFUSE, COMPLY, PARTIAL) via a hybrid keyword and LLM-judging protocol.
- Deterministic Consistency Scoring: Compliance of observed behavior with predicted policy is computed, with fine-grained violation taxonomy: Abs-Comply (absolute policy violated by observed compliance), Cond-Leak (conditional policy violated outside stated exceptions), and Frame-Mismatch (behavior disagrees with declared framing sensitivity).
Figure 1: Overview of the SNCA framework, illustrating the extraction and typing of self-stated policies and their comparison with behavior on harm benchmarks.
Experimental Setup and Evaluation Protocol
Experiments span four SOTA and open-weight models (GPT-4.1, DeepSeek-V3.1, Llama-3.3-70B-Instruct, o4-mini) across 45 granular harm categories. The behavioral space includes 47,496 LLM outputs on SORRY-Bench (harmful requests with mutations), XSTest (safe/unsafe distinction), and OR-Bench (boundary and false-positive scenarios), ensuring robust coverage of difficult and ambiguous safety regimes.
Policy extraction was repeated for stability, and all scoring was performed in a temperature-zero deterministic setting. Rules were independently classified by DeepSeek-V3.1 and validated via secondary judging, removing circularity and increasing annotation reliability.
Analysis of Policy Diversity and Cross-Model Agreement
The extracted policy landscape is highly heterogeneous across models:
- GPT-4.1 predominantly articulates Adaptive rules, highlighting nuanced, context-sensitive policies.
- DeepSeek and Llama overclaim Absolute policies for the majority of categories.
- o4-mini uniquely produces a notable fraction (29%) of Opaque (unclassifiable) responses, indicating an articulability ceiling even for highly capable models.
Cross-model policy agreement is very low (11% of categories), with even the closest model pairs agreeing on rule types for at most 71% of categories.
Figure 2: Self-reported rule type distribution and violation counts across harm categories by model.
Self-Consistency Results and Failure Modes
Self-consistency scores (SNCS) reveal large, architecture-driven gaps:
- o4-mini: SNCS = 0.800 (highest), but with 29% opacity rate.
- GPT-4.1: SNCS = 0.545, with a nuanced policy mix.
- DeepSeek: SNCS = 0.294
- Llama: SNCS = 0.245
Within non-reasoning models, high policy articulability comes at the cost of increased behavioral inconsistency. Notably, the vast majority of violations (over 80%) are Abs-Comply: models explicitly assert absolute refusal but in practice comply with harmful prompts, especially once paraphrasing and linguistic mutation is introduced.
Figure 3: SNCS by rule type and model, and correlation trend between frequency of Absolute rules and overall self-consistency.
Figure 4: SNCS distribution by rule type per model, indicating variance in Absolute self-reported rules and superior median performance for o4-mini.
Figure 5: Heatmap of SNCS for all categories and models, with opacity highlighted.
Implications are particularly acute in “grey zone” categories (e.g., religious proselytizing, risky financial advice) where explicit RLHF signal or societal consensus is limited—every model exhibits systematic mismatches between stated policy and revealed behavior.
Robustness Under Linguistic Mutation and Adversarial Framing
All models, regardless of architecture, show steep drops in refusal rates and self-consistency when harmful prompts are subjected to paraphrase and minor mutations. For example, GPT-4.1’s refusal rate on SORRY-Bench drops from 67% to 23% under mutation, DeepSeek from 66% to 14%. No model achieves category-level SNCS > 0.72 once mutations are included, indicating fragile encodings of safety boundaries and internal policies.
Trade-Off Between Articulability and Consistency
A central finding is the architecture-induced trade-off: reasoning-dominated models (e.g., o4-mini) are more behaviorally consistent when they provide articulated policies but become increasingly opaque in less explicit or ambiguous categories. Conversely, instruction-optimized models provide explicit rules for all categories, at the expense of widespread behavioral contradiction.
Implications and Future Directions
SNCA exposes the nontrivial gap between self-reported and internalized LLM policies, raising concerns for alignment auditing methodologies that rely solely on behavioral or external benchmarks. The low cross-model convergence in safety policy—despite nearly identical training and deployment regimes—highlights the lack of implicit “universal” alignment and the necessity of direct, reflexive consistency auditing.
Conclusion
This work operationalizes and rigorously quantifies the reflexive alignment of LLMs, revealing substantial, systematic, and architecture-driven gaps between explicit policies and actual output. The findings underscore the necessity of integrating reflexive consistency audits like SNCA as a core component of safety evaluation, complementing behavioral and adversarial benchmarks. These results challenge the adequacy of current alignment verification and motivate a more introspective paradigm, in which the boundaries claimed by models are actively reconciled with the behaviors they manifest.