RegGuard: Dual Systems for Compliance & Security

Updated 1 February 2026

The first RegGuard paper introduces an AI-powered RAG assistant that uses HiSACC and ReLACE to automate pharmaceutical regulatory compliance with enhanced traceability and answer faithfulness.
A second, unrelated RegGuard paper presents a compiler-driven security mechanism that applies security-aware register allocation and MAC-based stack authentication to mitigate control- and data-oriented attacks.
Empirical evaluation of the RAG assistant reports a faithfulness test score of 0.9252 and the lowest over-retrieval penalty among the tested configurations, supporting its use in high-risk contexts.

RegGuard denotes two distinct advanced systems in computer science: (i) an AI-powered Retrieval-Augmented Generation (RAG) assistant for pharmaceutical regulatory compliance (Yang et al., 25 Jan 2026), and (ii) a compiler-driven register allocation and stack integrity mechanism for mitigating control- and data-oriented attacks in software security (Geden et al., 2021). Each system leverages domain-specific innovations to achieve robust, auditable automation in high-risk contexts.

1. RegGuard for Pharmaceutical Regulatory Compliance

RegGuard is an industrial-scale, enterprise-grade RAG assistant designed to automate the interpretation of heterogeneous regulatory texts and reconcile them against internal corporate policies in highly regulated sectors such as pharmaceuticals (Yang et al., 25 Jan 2026). Its architecture addresses key challenges endemic to regulatory compliance, including document heterogeneity, semantic fragmentation, and LLM hallucination risks.

The system integrates:

Secure, audited ingestion of documents (including PDFs, DOCX, XLSX, CSV, Google Workspace formats, and scanned images) from corporate repositories, normalized into text blocks with provenance-tracked metadata, using OAuth 2.0 and a local SQLite manifest for incremental updates.
Two novel algorithmic modules: HiSACC (Hierarchical Semantic Aggregation for Contextual Chunking) and ReLACE (Regulatory Listwise Adaptive Cross-Encoder), optimizing retrieval relevance and LLM answer faithfulness.

This design enables rapid, traceable assimilation of regulatory updates and provides compliance practitioners with verifiable, contextually anchored responses to complex regulatory questions.
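
The incremental-update role of the SQLite manifest can be illustrated with a minimal sketch. The table layout, the SHA-256 checksum, and the names `open_manifest` and `needs_ingest` are assumptions made for illustration; the source does not describe the actual schema.

```python
import hashlib
import sqlite3

def open_manifest(path=":memory:"):
    """Open (or create) a manifest table keyed by document ID (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manifest ("
        "doc_id TEXT PRIMARY KEY, checksum TEXT NOT NULL, ingested_at TEXT NOT NULL)"
    )
    return conn

def needs_ingest(conn, doc_id, content, now):
    """Return True (and record the new checksum) if the document is new or changed."""
    checksum = hashlib.sha256(content).hexdigest()
    row = conn.execute(
        "SELECT checksum FROM manifest WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row is not None and row[0] == checksum:
        return False  # unchanged since the last run: skip re-processing
    conn.execute(
        "INSERT OR REPLACE INTO manifest (doc_id, checksum, ingested_at) VALUES (?, ?, ?)",
        (doc_id, checksum, now),
    )
    conn.commit()
    return True
```

Calling `needs_ingest` once per fetched file means only new or modified documents reach the chunking and indexing stages, which is the behavior an incremental manifest is meant to provide.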

2. HiSACC: Hierarchical Semantic Aggregation for Contextual Chunking

Splitting long regulatory documents with recursive character splitting (RCS) causes semantic fragmentation: passages are cut at arbitrary boundaries, and related but non-contiguous content, such as definitions scattered between main text and appendices, is never brought back together. HiSACC addresses this with a two-stage hierarchical grouping:

Stage 1 (Local Aggregation): The document is decomposed into minimal semantic units $S=\{s_1, \dots, s_k\}$, embedded as $v_i = \phi(s_i)$. Adjacent units are greedily merged when their cosine similarity $M_{i,i+1}$ exceeds a threshold $\theta$, yielding coherent local groups.
Stage 2 (Long-Range Aggregation): Within a skip window of size $w$, an inter-group cosine similarity $\mathrm{Sim}(G_a, G_b)$ exceeding $\gamma$ triggers a merge of non-adjacent groups, provided the merged tokens fit under model constraints.

This process yields compact, semantically coherent segments for retrieval, shown to reduce top-K noise and improve passage selection rates. Empirical evaluation shows that HiSACC alone elevates faithfulness test (FT) and groundedness rate (GR) metrics over baseline RCS.
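
The two stages can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: representing a group by the centroid of its unit embeddings, measuring the skip window in groups, and the names `local_aggregate`, `long_range_aggregate`, `tokens`, and `max_tokens` are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(embeddings, group):
    """Mean embedding of the units in a group (assumed group representation)."""
    return [sum(col) / len(group) for col in zip(*(embeddings[i] for i in group))]

def local_aggregate(embeddings, theta):
    """Stage 1: greedily merge adjacent units whose similarity exceeds theta."""
    if not embeddings:
        return []
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) > theta:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def long_range_aggregate(groups, embeddings, tokens, gamma, w, max_tokens):
    """Stage 2: merge non-adjacent groups within w positions if similar and small enough."""
    merged = [list(g) for g in groups]
    a = 0
    while a < len(merged):
        b = a + 2  # skip the immediate neighbour, which stage 1 already considered
        while b < len(merged) and b - a <= w:
            similar = cosine(centroid(embeddings, merged[a]),
                             centroid(embeddings, merged[b])) > gamma
            fits = sum(tokens[i] for i in merged[a] + merged[b]) <= max_tokens
            if similar and fits:
                merged[a].extend(merged.pop(b))  # next candidate shifts into index b
            else:
                b += 1
        a += 1
    return merged
```

With four units where the first two and the last one point in nearly the same direction, stage 1 groups the first two, and stage 2 attaches the fourth to that group while leaving the unrelated third unit on its own, as long as the token budget allows it.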

3. ReLACE: Regulatory Listwise Adaptive Cross-Encoder Reranker

Standard dense retrieval uses independent query and passage embeddings, scoring by cosine similarity. However, regulatory language involves complex scope conditions and exceptions that require deeper interaction modeling. ReLACE, based on a Transformer cross-encoder (initialized from bge-reranker-base), jointly encodes the query and each candidate passage:

Input Construction: $[\mathrm{CLS}]\ q\ [\mathrm{SEP}]\ d_i\ [\mathrm{SEP}]$
Listwise Training: After dense retrieval, $K$ candidates are re-ranked using scores $s_i$, with the listwise cross-entropy loss:

$\mathcal{L}(q,\{d_i\}) = -\sum_{i=1}^K y_i\;\log P(d_i\mid q)$

where $P(d_i\mid q) = \frac{\exp(s_i)}{\sum_{j=1}^K \exp(s_j)}$ and the labels $y_i$ are the normalized ground-truth relevance values.
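
As a small numerical check of the loss above, the following sketch computes the softmax over candidate scores and the resulting listwise cross-entropy. Normalizing raw relevance grades so that they sum to one is an assumption about how the labels $y_i$ are obtained, and the function names are illustrative.

```python
import math

def listwise_softmax(scores):
    """P(d_i | q) = exp(s_i) / sum_j exp(s_j), computed with a max shift for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_cross_entropy(scores, relevance):
    """Listwise loss -sum_i y_i log P(d_i | q) with y_i = relevance_i / sum(relevance)."""
    total = sum(relevance)
    if total <= 0:
        raise ValueError("at least one candidate must have positive relevance")
    labels = [r / total for r in relevance]
    probs = listwise_softmax(scores)
    # Terms with y_i == 0 contribute nothing and are skipped.
    return -sum(y * math.log(p) for y, p in zip(labels, probs) if y > 0)
```

The loss is small when the highest scores fall on the relevant passages and grows when a relevant passage is scored below irrelevant ones, which is what pushes the reranker to order the whole candidate list correctly.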

Fine-tuning on a large in-house set of query–passage pairs, including hard negatives, adapts ReLACE to domain noise. Combined with HiSACC, this method achieves the highest answer faithfulness and lowest over-retrieval penalty in benchmarked settings.

4. System Evaluation and Compliance Impact

The system was evaluated using a dataset of approximately 600 compliance documents from a major pharmaceutical enterprise, with multilingual query–answer pairs. Relevant evaluation metrics included:

Answer Relevance, Context Relevance, Groundedness Rate, Answer-Source Match
File ID Match, Context Coverage, Over-Retrieval Penalty
Language Fluency and Faithfulness Test

At $K=15$, combining HiSACC and ReLACE yields a faithfulness test score of $0.9252$ (an improvement of $+0.036$ over RCS), a groundedness rate of $0.8453$, and the lowest over-retrieval penalty ($0.0054$). Statistical significance holds at $p<0.01$ (paired bootstrap). Qualitative assessments noted major reductions in unsupported statements and follow-up queries, indicating significant mitigation of LLM hallucination risk.
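
For readers unfamiliar with the paired bootstrap mentioned above, one common one-sided variant can be sketched as follows. The source does not specify the exact procedure, so the resampling scheme, the number of resamples, and the function name are assumptions.

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=2000, seed=0):
    """Estimate how often system A fails to beat system B when queries are resampled.

    scores_a[i] and scores_b[i] must be the two systems' scores on the same query.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each query's pair together.
        mean_diff = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if mean_diff <= 0:
            not_better += 1
    return not_better / n_resamples
```

Because each resample keeps a query's two scores together, the test accounts for per-query difficulty, which matters when the same question set is used to compare chunking and reranking configurations.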

5. Auditability, Traceability, and Security Features

RegGuard maintains strict auditability and provenance:

Metadata (file/chunk ID, checksum, timestamps, hierarchical history) persists through pipeline transformations and is never altered in place.
Top-K retrievals and queries are logged with full provenance, enabling evidentiary trace-back.
Role-based access controls, integrated with enterprise SSO, restrict sensitive document access.
Incremental indexing allows daily or sub-hourly ingestion of new or updated compliance materials without corpus-wide reprocessing.

These guarantees make RegGuard responsive to live compliance environments and suitable for domains with stringent traceability mandates.
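
The "never altered in place" property can be illustrated with immutable metadata records, where each transformation produces a new record that points back to its parent. The `ChunkMeta` fields and the `derive` helper are hypothetical names chosen for this sketch, not the system's actual data model.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ChunkMeta:
    """Provenance record; frozen so attributes cannot be reassigned after creation."""
    file_id: str
    chunk_id: str
    checksum: str
    created_at: str
    history: Tuple[str, ...] = ()

def derive(parent, chunk_id, text, step, now):
    """Create a new record for a transformed chunk, extending the parent's history."""
    return ChunkMeta(
        file_id=parent.file_id,
        chunk_id=chunk_id,
        checksum=hashlib.sha256(text.encode("utf-8")).hexdigest(),
        created_at=now,
        history=parent.history + (f"{step}:{parent.chunk_id}",),
    )
```

Since the parent record is left untouched, a logged retrieval that cites a chunk ID can be traced back through `history` to the originating file.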

6. Generalization, Modularization, and Future Work

The RAG framework of RegGuard, though tailored for pharmaceutical regulation, is modular and transferable to compliance-heavy sectors such as financial auditing, environmental law, and aerospace certification. The only requirement is retraining HiSACC and ReLACE on the new domain corpus.

Identified future directions include:

Integrating a Model Context Protocol for direct, live database access by LLMs.
Reinforcement Learning from Human Feedback (RLHF) to optimize retrieval and generative modules for user-specific reward signals.
Extension of HiSACC for multimodal semantic chunking (supporting tables, figures) and graph-based groupings.

7. RegGuard for Control- and Data-Oriented Attack Mitigation

A separate instantiation of RegGuard operates as a compiler-level register allocation and call-frame authentication mechanism to secure program execution against control- and data-oriented memory attacks (Geden et al., 2021). Key principles:

Threat Model: Adversary with arbitrary stack read/write but no register access.
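Security-Aware Register Allocation: Each variable receives a "security score" that reflects how critical it is (e.g., pointers, branch variables). Allocation is posed as a global optimization over which variables to keep in registers:

$\text{maximize } \sum_{i=1}^n s_i \cdot x_i \quad \text{s.t.}\ \sum_{i:\,\mathrm{live}(i,p)} x_i \leq R \text{ for every program point } p,\ x_i \in \{0,1\}$

where $s_i$ is the security score of variable $i$, $x_i = 1$ if variable $i$ is kept in a register, and $R$ is the number of available registers.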

Modified allocators ensure the highest-priority variables remain in registers.

Stack Frame Authentication: A per-process cryptographic key, held in a reserved register and never spilled, is used to compute a MAC over the saved register values of each call frame. Chaining these MACs across frames prevents tampering or replay across calls.
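
The register allocation objective in the list above can be approximated with a simple greedy heuristic: consider variables in decreasing score order and keep one in a register only if no program point in its live range is already full. This is an illustrative sketch, not the modified allocator from the paper, and greedy selection is not guaranteed to reach the optimum; the variable names and live ranges in the usage note are invented.

```python
def greedy_allocate(scores, live_ranges, num_regs):
    """Pick register-resident variables by descending security score.

    scores:      {variable: security score}
    live_ranges: {variable: set of program points where it is live}
    num_regs:    registers available at every program point (R)
    """
    pressure = {}        # program point -> registers already committed there
    in_register = set()
    for var in sorted(scores, key=lambda v: (-scores[v], v)):
        points = live_ranges[var]
        if all(pressure.get(p, 0) < num_regs for p in points):
            for p in points:
                pressure[p] = pressure.get(p, 0) + 1
            in_register.add(var)
        # Otherwise the variable is spilled to the (attacker-writable) stack.
    return in_register
```

For example, with two registers, a return pointer (score 10) live everywhere and a function pointer (score 8) live at points 1 and 2 are both kept in registers, while a low-score loop counter that overlaps them at point 2 is spilled.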

Implementation on ARM64 features an LLVM pass for score annotation, modified linear-scan register allocation, and prologue/epilogue instrumentation to compute, verify, and chain MACs using cryptographic primitives (HMAC-SHA256 or SipHash). Benchmarks on Apple M1 show mean overheads of $13\%$ (program-only, SHA256) and up to $33\%$ (libc instrumented). Limitations include lack of heap/global data protection and increased spill rates in low-register architectures.
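
The prologue/epilogue logic can be modeled in a few lines. This toy model is only a sketch under stated assumptions: the MAC of each frame covers the previous chain value plus the saved registers, the current chain head is assumed to stay in a register, and `ShadowedStack` and `frame_mac` are hypothetical names. HMAC-SHA256 is used because the text names it as one of the primitives.

```python
import hashlib
import hmac

def frame_mac(key, prev_mac, saved_regs):
    """MAC over the previous chain value and the saved 64-bit register values."""
    payload = prev_mac + b"".join(r.to_bytes(8, "little") for r in saved_regs)
    return hmac.new(key, payload, hashlib.sha256).digest()

class ShadowedStack:
    """Toy model: `key` and `head` stand in for register-resident state."""

    def __init__(self, key):
        self.key = key
        self.head = bytes(32)   # chain head, assumed to live in a register
        self.frames = []        # models attacker-writable stack memory

    def call(self, saved_regs):
        # Prologue: spill saved registers and the previous head, then advance the chain.
        self.frames.append({"regs": list(saved_regs), "prev": self.head})
        self.head = frame_mac(self.key, self.head, saved_regs)

    def ret(self):
        # Epilogue: recompute the MAC from stack contents and compare with the head.
        frame = self.frames.pop()
        expected = frame_mac(self.key, frame["prev"], frame["regs"])
        if not hmac.compare_digest(expected, self.head):
            raise RuntimeError("stack frame authentication failed")
        self.head = frame["prev"]
        return frame["regs"]
```

An adversary who overwrites a saved register or the stored previous MAC on the stack cannot produce a matching chain head without the key, so the tampering is detected when the function returns.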

Variant	Primary Domain	Core Approach
RegGuard (RAG)	Regulatory Compliance (pharmaceuticals and similar sectors)	Retrieval-augmented language generation, semantic chunking, reranking
RegGuard (Security)	Systems Security (software, OS)	Security-prioritized register allocation, MAC-authenticated stack frames

Both systems target auditable robustness in high-risk settings through domain-specific optimization: semantic retrieval and LLM traceability for regulatory compliance, and register and stack protection for system security. Neither requires new hardware or fundamental changes to program logic.


References

Yang et al. RegGuard: AI-Powered Retrieval-Enhanced Assistant for Pharmaceutical Regulatory Compliance (2026)

Geden et al. RegGuard: Leveraging CPU Registers for Mitigation of Control- and Data-Oriented Attacks (2021)
