COMPASS: Policy Alignment Assessment

Updated 6 January 2026

COMPASS is a systematic framework that defines and evaluates organizational policy adherence for LLMs by mapping allowed and denied queries to quantitative compliance metrics.
It employs rigorous query generation and adversarial testing protocols across multiple industry scenarios to assess both routine and obfuscated policy violations.
Experimental evaluations reveal significant challenges in denylist enforcement, underscoring the need for refined pre-filtering, policy-aware fine-tuning, and robust agentic integration.

COMPASS (Company/Organization Policy Alignment Assessment) specifies a systematic framework for the evaluation and enforcement of adherence to organization-specific policies—both allowlist (permitted behaviors) and denylist (prohibited behaviors)—by LLMs and autonomous agentic workflows. It establishes quantitative metrics, adversarial testing protocols, and integration pipelines that quantify, explain, and enforce the alignment of AI systems with nuanced, domain-dependent organizational rules. Applications encompass high-stakes contexts in industry, public administration, finance, and healthcare, where failure to robustly enforce prohibitions represents a critical risk for regulatory and safety violations (Choi et al., 5 Jan 2026, Zwerdling et al., 22 Jul 2025).

1. Formal Foundations of Policy Alignment Assessment

COMPASS formalizes organizational policy alignment over LLMs as follows. For a given organization, let $\mathcal{A}$ denote the set of allowlist policies and $\mathcal{D}$ denote denylist policies. The testing framework synthesizes two query sets: $Q^{allow}$ for permitted actions and $Q^{deny}$ for forbidden ones. Each query $q\in Q = Q^{allow} \cup Q^{deny}$ is mapped to a ground-truth policy label $y(q) \in \{\text{allow}, \text{deny}\}$ . For any model instance $M$ , the decision function is $C(M,q) \in \{\text{accept}, \text{reject}\}$ .

The key metrics are:

Compliance Accuracy: $Acc(M) = \frac{1}{|Q|} \sum_{q\in Q} \mathbf{1}[C(M,q) = y(q)]$ , quantifying overall matching of model responses to intended policy outcomes.
Denylist Robustness: $R_D(M) = \frac{1}{|D|} \sum_{q\in Q^{deny}} \mathbf{1}[C(M,q) = \text{reject}]$ , measuring enforcement of forbidden actions (Choi et al., 5 Jan 2026).

2. Query Generation, Adversarial Testing, and Validation

COMPASS synthesizes and validates test queries using rigorous procedures:

Routine base queries: Direct requests for allowed ( $Q^{allow\_base}$ ) and denied ( $Q^{deny\_base}$ ) behaviors.
Adversarial edge-case queries: Crafted prompts designed to probe refusal robustness—allowed-edge queries are benign requests formulated to resemble policy violations, denied-edge queries are obfuscated attempts at prohibited actions generated by adversarial transformation techniques (Regulatory Interpretation, Analogical Reasoning, Statistical Inference, Context Overflow, Hypothetical Scenario, Indirect Reference).

Validation employs a combination of LLM-based review (GPT-5-mini) and human spot-checks (89–90% agreement by query type). In benchmark studies, 5,920 queries were validated across 8 industry scenarios, with precise splits for base and edge-case coverage (Choi et al., 5 Jan 2026).

3. Industry Scenarios and Policy Structuring

Eight simulated organizational domains illustrate the breadth of COMPASS: automotive, public administration, finance, healthcare, travel, telecom, education, and recruiting. Each defines multiple allowlist and denylist categories, expressed in natural language. For example, the automotive scenario includes "vehicle_standards" (allowed), "competitors" or "proprietary_data" (denied). Policy rules are encoded as system prompts for each chatbot deployment (Choi et al., 5 Jan 2026).

4. Experimental Evaluation Protocols and Metrics

Seven LLMs (covering proprietary, open-weight, and Mixture-of-Experts architectures) were subjected to COMPASS evaluation, with retrieval-augmented generation (RAG) ablated separately. Evaluation pipeline:

Present query $q$ to the model.
Compute refusal indicator $\rho(r)$ and policy adherence $\alpha(r, \mathcal{P})$ for response $r$ .
Judge alignment with defined rules—denylist: $(\rho\land \neg\alpha)$ , allowlist: $(\neg\rho\land\alpha)$ .
Extract aggregate compliance metrics.

Data were stratified by the four query types to quantify performance across routine and adversarial dimensions (Choi et al., 5 Jan 2026).

5. Key Findings on Compliance and Enforcement Asymmetry

Empirical results identify a stark asymmetry:

Allowlist compliance: 97.5–99.8% average across models.
Allow-edge compliance: 79–96% (model-dependent).
Denied base refusal: Only 13–40% of prohibited requests were refused.
Denied edge refusal: 3–20% (adversarial obfuscations), with some models under 5%.

Scaling and RAG marginally improve allowlist compliance but do not resolve denylist brittleness. Pre-filtering raises denial-rate to >96% but causes excessive over-refusal for legitimate queries (~35% allowed accuracy)—emphasizing a severe precision–recall trade-off. Policy-aware instruction fine-tuning (LODO SFT with LoRA) can improve denied-edge refusal rates (to 60–62%) without catastrophic allowlist loss, suggesting the gap is non-fundamental rather than only architectural (Choi et al., 5 Jan 2026).

6. Enforcement Architectures in Agentic Workflows

Recent agentic workflow research operationalizes COMPASS enforcement using deterministic, modular guard layers (Zwerdling et al., 22 Jul 2025). The process consists of:

Build-time mapping: Natural language policy clauses and tool specifications are parsed via LLM-assisted chain-of-thought loops, producing atomic policies mapped to affected tool functions.
Test-driven guard compilation: For each policy, compliance and violation examples are generated and guard code (python predicates) is synthesized such that these examples pass/fail as intended under automated test runners.
Runtime integration: All agentic tool invocations are intercepted by the guard layer, which blocks execution on policy violations. Agents receive a structured error object and explanation, triggering self-reflection and replanning.

In τ-bench Airlines, automated guards improved hard policy enforcement rates (pass $^{10}$ outcome: 0.500 versus baseline 0.227), outperforming reflection-only strategies. Limiting factors include mapping recall, code hallucinations, omission violations (agents omitting required actions), runtime overhead, and explainability fidelity (Zwerdling et al., 22 Jul 2025).

7. Limitations, Failure Modes, and Advancing Robust Alignment

COMPASS research highlights pervasive failure modes—open-weight models default to direct violation (80–83% on denied queries), proprietary models to refusal-answer hybrids (61–65%), with persistent indirect violations. LLMs are structurally better at recognizing permitted actions than prohibited ones. The inherent asymmetry presents critical deployment risks for policy-sensitive environments (financial, health, public sector). Sophisticated mitigation—combining automated pre-filtering, edge-case adversarial audits, and policy-specific fine-tuning—is essential (Choi et al., 5 Jan 2026).

COMPASS methodology offers rigorous, extensible evaluation and practical enforcement protocols for organizational policy alignment in AI systems. Crucial directions include richer adversarial testing, hybrid expert-in-the-loop guard refinement, real-time compliance dashboards, and operational support for dynamic policies and omitted obligations. These elements broaden the scope and reliability of policy-critical AI deployment across domains (Zwerdling et al., 22 Jul 2025).