CALLM: Compliance Alignment for LLMs

Updated 6 March 2026
  • The paper introduces CALLM, a framework that rigorously aligns LLM outputs with legal standards through compliance verification and rule matching.
  • It employs methods such as auto-generated compliance benchmarks, reinforcement learning (e.g., GRPO), and dual-graph models to ensure rule adherence.
  • Empirical results show CALLMs outperform baseline LLMs by up to 11.9 percentage points on compliance tasks, enhancing transparency and accountability.

A Compliance Alignment LLM (CALLM) is a paradigm for aligning LLMs with rigorously specified external standards—most prominently, legal, regulatory, or rule-based compliance criteria. Unlike conventional alignment approaches based on heuristic harms or generalized social norms, CALLMs treat compliance regimes such as the EU AI Act, the GDPR, or domain-specific statutory frameworks as the definitive reference for aligning LLM behaviors. This evolution of safety alignment transforms the compliance process into a technically auditable, rule-grounded, and verifiable foundation for model deployment in high-stakes, regulated environments (Hu et al., 26 Sep 2025).

1. Formal Definition and Theoretical Rationale

In the context of LLMs, compliance alignment is the task of steering a model's conditional output distribution so that all significant outputs adhere to an externally defined set of prescriptive rules or norms $\mathcal{L}$. For each query and potential response, a CALLM is required to check compliance against a finite taxonomy derived from binding legal frameworks, e.g., mapping outputs to clauses or articles in regulations (Hu et al., 26 Sep 2025, Xu et al., 11 Nov 2025). The atomic compliance check is performed by a verifier function $V_{\text{comply}}: x \mapsto (c_\ell, v_\ell)$, which produces both a legal reasoning chain $c_\ell$ and a binary verdict $v_\ell \in \{0,1\}$, with compliance ($v_\ell = 1$) contingent upon correct rule application and citation.
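The verifier interface above can be sketched in code. The clause taxonomy, keyword predicates, and return type here are illustrative assumptions; a real CALLM verifier performs legal reasoning, not keyword matching.

```python
from dataclasses import dataclass

# Hypothetical rule taxonomy: clause id -> predicate over the output text.
# Stand-ins for the finite taxonomy L derived from binding legal frameworks.
RULES = {
    "GDPR-Art.6": lambda text: "lawful basis" in text.lower(),
    "AIAct-Art.5": lambda text: "subliminal technique" not in text.lower(),
}

@dataclass
class Verdict:
    reasoning: list   # c_l: chain of per-clause findings with citations
    complies: bool    # v_l in {0, 1}

def v_comply(output_text: str) -> Verdict:
    """Check an output against every clause; compliant only if all rules pass."""
    reasoning, ok = [], True
    for clause, rule in RULES.items():
        passed = rule(output_text)
        reasoning.append(f"{clause}: {'satisfied' if passed else 'violated'}")
        ok = ok and passed
    return Verdict(reasoning, ok)
```

The verdict pairs the binary outcome with a clause-by-clause rationale, mirroring the $(c_\ell, v_\ell)$ output contract.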

This design is motivated by the need for:

  • Rigor: Legal rules specify scope and exceptions, and link directly to real-world harms.
  • Systematic coverage: Regulatory texts encode broad mandates (e.g., privacy, prohibited practices, due process).
  • Accountability: Violations in compliance carry explicit penalties (fines, prohibitions), instantiating concrete enforcement (Hu et al., 26 Sep 2025).

2. Methodological Frameworks

CALLM implementations display architectural and algorithmic diversity, but share common design patterns:

2.1 Benchmark Construction and Scenario Synthesis

Benchmarks are constructed by representing legal frameworks as trees, with nodes as individual clauses and root-to-leaf paths as scenario seeds. This enables auto-generation of realistic compliance-check cases (annotated with parties, background, legal arguments, and jurisdiction) from models such as DeepSeek-V3.1. These cases are human-validated to ensure legal fidelity (>95% alignment and narrative coherence reported) (Hu et al., 26 Sep 2025).
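The tree-to-seed step can be sketched as follows. The nested-dict representation and the clause labels are illustrative assumptions, not the paper's data format.

```python
# A legal framework as a nested dict: nodes are clauses, leaves end a path.
# Each root-to-leaf path becomes a scenario seed for case generation.
FRAMEWORK = {
    "GDPR": {
        "Ch. II Principles": {
            "Art. 5 Data minimisation": {},
            "Art. 6 Lawfulness": {},
        },
        "Ch. III Rights": {
            "Art. 17 Erasure": {},
        },
    }
}

def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of clause labels."""
    for node, children in tree.items():
        path = prefix + (node,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path

seeds = list(leaf_paths(FRAMEWORK))
```

Each seed tuple would then be handed to a generator model to synthesize a full annotated case.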

2.2 Optimization Algorithms

A representative learning objective is Group Relative Policy Optimization (GRPO), which maximizes a group-normalized reward under a policy constraint:

$$J_{GRPO}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} L_{clip}(r_{i,t}, \hat{A}_{i,t}) - \beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref}) \right]$$

where the reward functional $R_\phi$ integrates both compliance-label correctness and strict output formatting, and $\hat{A}_{i,t}$ normalizes the advantage across rollout groups (Hu et al., 26 Sep 2025). Reinforcement learning strategies such as PPO are also standard (Lu et al., 2024); compliance-specific reward shaping is used for robust training.
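The two ingredients of this objective can be sketched numerically: group-normalized advantages and the per-token clipped surrogate. The clipping threshold and reward values are illustrative; the KL penalty term is omitted for brevity.

```python
import math

def group_advantages(rewards):
    """Normalize rewards within one rollout group: A_i = (r_i - mean) / std."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against zero-variance groups
    return [(r - mu) / std for r in rewards]

def clipped_term(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for one token: min(r*A, clip(r)*A)."""
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)
```

In a compliance setting the group rewards would come from $R_\phi$, combining the verifier's verdict with a format check.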

2.3 Structured Data Representations

GraphCompliance introduces dual-graph modeling: a Policy Graph encoding the hierarchical and cross-referential structure of regulations, and a Context Graph formalizing event-driven entities and relations (SAO tuples). Alignment proceeds by embedding policy and context nodes, constructing a similarity matrix with an explicit overlap reward, and delivering Compliance Unit (CU) plans for LLM-based judgment (Chung et al., 30 Oct 2025).
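A minimal sketch of the matching step, assuming each node carries an embedding plus a set of surface terms; the cosine metric, the term-overlap bonus, and its weight are illustrative choices, not the paper's exact scoring.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def similarity_matrix(policy_nodes, context_nodes, overlap_bonus=0.1):
    """S[i][j] = cosine(emb_i, emb_j) + bonus per shared surface term.

    Each node is (embedding, set_of_terms); the additive bonus plays the
    role of the explicit overlap reward between the two graphs.
    """
    S = []
    for p_emb, p_terms in policy_nodes:
        row = []
        for c_emb, c_terms in context_nodes:
            row.append(cosine(p_emb, c_emb)
                       + overlap_bonus * len(p_terms & c_terms))
        S.append(row)
    return S
```

High-scoring (policy node, context node) pairs would then be grouped into CU plans for the downstream LLM judgment.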

2.4 Rule-Matching and Automated Judging

Models such as the Compliance Alignment Judge (CA-Judge) realize compliance as a rule-matching problem. They evaluate step-by-step justifications for every output against an auditable rubric of statutory requirements, producing a dense alignment score and rationale (Xu et al., 11 Nov 2025).
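The judge's contract, a dense score plus rationale against a rubric, can be sketched as below. Substring matching stands in for the LLM judge's semantic rule matching; the rubric items are hypothetical.

```python
def judge(justification_steps, rubric):
    """Score a step-by-step justification against statutory rubric items.

    Returns (score in [0, 1], rationale). Naive substring lookup is a
    placeholder for semantic matching by an LLM judge.
    """
    rationale, hits = [], 0
    text = " ".join(justification_steps).lower()
    for item in rubric:
        matched = item.lower() in text
        hits += matched
        rationale.append(f"{item}: {'matched' if matched else 'missing'}")
    return hits / len(rubric), rationale
```

The dense score (rather than a single binary verdict) is what makes the judge usable as a training signal and as an audit artifact.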

3. Empirical Evaluation and Performance

CALLMs consistently outperform both general LLMs and prior guardrail systems in legal-compliance and domain-specific benchmarks:

| Benchmark (Task/Domain) | Base LLM (Accuracy/F1) | CALLM/Reasoner Variant | Delta | Source |
|---|---|---|---|---|
| EU AI Act (LLM safety) | 56.4% | 66.9% | +10.5 | (Hu et al., 26 Sep 2025) |
| GDPR (LLM safety) | 65.4% | 77.3% | +11.9 | (Hu et al., 26 Sep 2025) |
| Modern Slavery Acts (macro-F₁, 9 criteria) | 0.559 | 0.639 | +0.08 | (Xu et al., 11 Nov 2025) |
| GDPR Compliance Classification (micro-F₁) | 49.9 | up to 55.4 | +4.1–7.2 | (Chung et al., 30 Oct 2025) |

Performance gains are strongest in chapters or rule families with composite, multi-clause logic, such as data governance or prohibited AI practices. Structured outputs allow for direct human audit and increase model interpretability. In blind human studies, CALLM justifications were preferred over those of commercial LLMs in the majority of cases (Xu et al., 11 Nov 2025).

4. Extensions: Multilinguality, Rule Updates, and Model Merging

Addressing the need for cross-lingual consistency, the Multi-Lingual Consistency (MLC) loss augments monolingual alignment by enforcing rank-1 collinearity in internal representation space across translated prompts. This plug-and-play auxiliary objective enables CALLMs to extend safety alignment to high- and low-resource languages with near-zero variance in safety outcomes, achieving over 95% safety rate while preserving baseline model capability (Bu et al., 18 Feb 2026).
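A simplified view of the MLC idea: hidden states of a prompt and its translations should be collinear, i.e., stack to a rank-1 matrix. The sketch below measures the deviation as the mean of $1 - \cos^2$ over all pairs, which is zero exactly when every vector lies on one line; this pairwise form is an illustrative simplification of the paper's objective.

```python
import math

def collinearity_loss(reps):
    """Penalty encouraging representations of translated prompts to be
    collinear (rank-1 when stacked): mean of 1 - cos^2 over all pairs.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u)) + 1e-12
        nv = math.sqrt(sum(b * b for b in v)) + 1e-12
        return dot / (nu * nv)

    pairs = [(i, j) for i in range(len(reps)) for j in range(i + 1, len(reps))]
    return sum(1 - cos(reps[i], reps[j]) ** 2 for i, j in pairs) / len(pairs)
```

Squaring the cosine makes the loss invariant to sign, so anti-parallel vectors (still on one line) also incur zero penalty.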

Compliance alignment is also preserved during model merging with the AlignMerge framework. By defining a Fisher-Rao geometry around an alignment-safe anchor, a penalty is imposed on deviation along the alignment subspace (measured via the Alignment Quality Index, AQI). This restricts merged models to remain within an "alignment-safe tube," avoiding the loss of compliance previously observed in naive weight averaging (Roy et al., 18 Dec 2025).
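The "alignment-safe tube" can be sketched on flat weight vectors: average the models, then clip the component of the deviation from the anchor that lies along the alignment direction. The single unit direction and scalar radius below stand in for the AQI-derived Fisher-Rao geometry, which operates over a full subspace.

```python
import math

def merge_with_tube(anchor, models, basis, radius):
    """Average model weights, then pull the merged point back inside the
    tube: the deviation from the alignment-safe anchor may not exceed
    `radius` along the (unit) alignment direction `basis`.
    """
    n = len(models)
    avg = [sum(m[k] for m in models) / n for k in range(len(anchor))]
    delta = [a - b for a, b in zip(avg, anchor)]
    along = sum(d * u for d, u in zip(delta, basis))   # projection coefficient
    if abs(along) > radius:                            # outside the tube:
        excess = along - math.copysign(radius, along)  # distance past boundary
        avg = [a - excess * u for a, u in zip(avg, basis)]
    return avg
```

Movement orthogonal to the alignment direction is left untouched, so task-relevant weight changes from the merge survive.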

5. Practical Architectures and Domain Instantiations

A typical CALLM workflow comprises:

  • Framers: Extract regulations and generate instruction and scenario data via LLM or ontology-driven synthetic augmentation.
  • Instructors: Apply supervised fine-tuning and reinforcement learning, leveraging weighted aggregation of compliance predicates (feature-wise or rule-wise), with conflict resolution managed via learned or expert-assigned weights.
  • Auditors: Automated and human-in-the-loop adversarial evaluation, measuring compliance accuracy, reward statistics, and KL divergence to prevent overfitting and distributional drift (Achintalwar et al., 2024).
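The three-stage workflow can be sketched as a loop over pluggable callables. The stage functions, the report fields, and the toy stand-ins at the bottom are all illustrative assumptions.

```python
def run_callm_pipeline(regulations, model, frame, instruct, audit, rounds=3):
    """Iterate Framer -> Instructor -> Auditor until the audit passes.
    The three callables stand in for the stages described above.
    """
    data = frame(regulations)             # Framer: rules -> scenario data
    report = {}
    for _ in range(rounds):
        model = instruct(model, data)     # Instructor: SFT / RL update
        report = audit(model, data)       # Auditor: compliance evaluation
        if report["compliance_acc"] >= report["target"]:
            break
    return model, report

# Toy stand-ins: "training" just increments a counter the audit tracks.
toy_frame = lambda regs: list(regs)
toy_instruct = lambda m, d: m + 1
toy_audit = lambda m, d: {"compliance_acc": m / 3, "target": 0.9}

final_model, final_report = run_callm_pipeline(
    ["GDPR", "EU AI Act"], 0, toy_frame, toy_instruct, toy_audit)
```

Keeping the stages as separate callables mirrors the modularity claim: each can be swapped per domain without touching the loop.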

The ABC Align pipeline demonstrates that synthetic data generation, preference optimization, and post-training quantization can be modularly applied to instantiate CALLMs in any domain by specifying relevant rules and stakeholder objectives (Seneque et al., 2024). In high-stakes applied domains (health, finance, organizational policy), CALLMs offer a framework for scalable, transparent, and legally-grounded AI deployments.

6. Limitations, Failure Modes, and Directions

Known challenges include:

  • Legal Ambiguity: Indeterminate legal terms (e.g., "proportionality") can impair reliable rule-matching (Hu et al., 26 Sep 2025).
  • Jurisdiction Drift: Compliance taxonomies may not transfer across legal systems without additional engineering.
  • Evaluation Non-Identifiability: Finite behavioral testing cannot guarantee latent alignment due to normative indistinguishability under evaluation-aware policies; observed compliance provides only upper bounds on misalignment within the tested regime (Santos-Grueiro, 5 Feb 2026).
  • Overfitting to Format or Data Artifacts: Strict output templates may limit generalization and induce brittleness in downstream integrations (Hu et al., 26 Sep 2025).
  • Feedback and Auditing Reliance: High-quality key-rule rubrics and red-teaming workflows are essential to avoid model degeneration or unrecognized blind spots (Xu et al., 11 Nov 2025).

Further directions include multi-jurisdiction expansion, continual integration of regulatory updates, compositional reasoning over multiple rule-sources, adversarial detection of non-compliance, augmentation for non-English and multimodal alignment, and coupling with internal state interpretability or proof systems for enhanced verifiability (Seneque et al., 2024, Chung et al., 30 Oct 2025, Xu et al., 11 Nov 2025).

7. Significance and Future Prospects

CALLMs represent a formal shift in LLM alignment: from heuristic, taxonomy-limited safety to systematic, auditable, and jurisdictionally anchored compliance. The approach offers improvements not only in model safety but also in transparency, explainability, and legal accountability. By anchoring model outputs in binding normative regimes and producing step-wise, citation-grounded rationales, CALLMs facilitate practical deployment in regulated domains and set a foundation for the next generation of trustworthy AI systems (Hu et al., 26 Sep 2025, Xu et al., 11 Nov 2025, Roy et al., 18 Dec 2025).
