- The paper presents REGREACT, a seven-stage, self-correcting multi-agent pipeline that achieves 94.1% structural F1 in extracting regulatory data.
- It employs iterative Observe-Diagnose-Repair loops and typed criterion graphs to ensure semantic correctness and complete reference inlining.
- The framework outperforms single-pass LLM methods with high classification accuracy, making outputs self-contained and reliable for compliance tasks.
Automated extraction of structured, machine-readable compliance data from regulatory documents remains an unsolved challenge. Existing LLM-based approaches suffer from structural hallucinations, propagation of early-stage errors, failure to enforce complex hierarchical and dependency constraints, and inability to resolve cross-document references, leading to incomplete, non-self-contained outputs. These limitations are particularly acute for legal corpora like the EU Taxonomy Delegated Acts, which encode compliance criteria with deeply nested logical architectures, implicit relationships, and extensive use of ambiguous references and footnotes.
The REGREACT framework is designed to systematically address these issues by decomposing the extraction process into a rigorously validated, multi-agent pipeline that enforces global structural and semantic correctness, leverages self-correction workflows, and guarantees full resolution and inlining of all references for downstream compliance automation.
Pipeline Architecture and Methodology
REGREACT comprises a seven-stage pipeline, each orchestrated by a dedicated specialized agent. These include: structural parsing, threshold extraction, content classification with logic inference, reference extraction, dependency resolution, footnote processing, and schema assembly. Key architectural features are as follows:
- Observe-Diagnose-Repair (ODR) Loops: Each stage runs an iterative ODR self-correction mechanism in which extracted outputs are compared against the source HTML to detect structural, semantic, completeness, and consistency issues. Diagnoses are informed by domain-specific issue taxonomies, and targeted repair prompts are issued until a confidence threshold is met or an iteration cap is reached; unresolved errors are escalated for human review.
- Typed Criterion Graphs: Relationships between extracted criterion nodes are encoded in a typed, cycle-free graph structure. Edges explicitly represent hierarchy, logical grouping, threshold inheritance, dependencies, references, and cross-reference corrections, facilitating global validation and reconciliation of local agent outputs.
- Shared Semantic Memory: Thresholds, cross-reference mappings, and activity metadata are persistently registered and passed across agents, ensuring cross-stage consistency.
- Criterion-Conditioned RAG Module: External and internal references (including those in footnotes) are all resolved inline via iterative, late-interaction retrieval and targeted summarization over EUR-Lex and related sources. Retrieved content is inserted directly into the schema, producing a self-contained JSON representation.
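The Observe-Diagnose-Repair loop from the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `run_odr`, `observe`, `diagnose`, and `repair`, the threshold of 0.9, and the cap of 3 iterations are all hypothetical, and the toy callables stand in for what would be LLM-backed agents.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ODRResult:
    output: dict
    confidence: float
    iterations: int
    escalated: bool  # True when unresolved issues should go to human review
    open_issues: List[str] = field(default_factory=list)


def run_odr(
    source_html: str,
    draft: dict,
    observe: Callable[[str, dict], float],
    diagnose: Callable[[str, dict], List[str]],
    repair: Callable[[str, dict, List[str]], dict],
    threshold: float = 0.9,  # hypothetical confidence threshold
    max_iters: int = 3,      # hypothetical iteration cap
) -> ODRResult:
    """Repair `draft` until the confidence threshold or the iteration cap is hit."""
    output = draft
    confidence = observe(source_html, output)
    iterations = 0
    while confidence < threshold and iterations < max_iters:
        issues = diagnose(source_html, output)
        output = repair(source_html, output, issues)
        iterations += 1
        confidence = observe(source_html, output)
    escalated = confidence < threshold
    open_issues = diagnose(source_html, output) if escalated else []
    return ODRResult(output, confidence, iterations, escalated, open_issues)


# Toy stand-ins: an extraction counts as complete once it has every required key.
REQUIRED = ("id", "text", "threshold")


def toy_observe(html: str, out: dict) -> float:
    return sum(key in out for key in REQUIRED) / len(REQUIRED)


def toy_diagnose(html: str, out: dict) -> List[str]:
    return [f"missing:{key}" for key in REQUIRED if key not in out]


def toy_repair(html: str, out: dict, issues: List[str]) -> dict:
    # Fixes one diagnosed issue per round, mimicking a targeted repair prompt.
    fixed = dict(out)
    if issues:
        fixed[issues[0].split(":", 1)[1]] = "<filled from source>"
    return fixed


result = run_odr("<p>...</p>", {"id": "c1"}, toy_observe, toy_diagnose, toy_repair)
print(result.iterations, result.escalated)  # 2 False
```

With `max_iters=1` the same toy run stops short of the threshold and reports the remaining issue as escalated, which mirrors the hand-off to human review described above.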
The methodology is specifically designed to handle the implicit and noisy structure of regulatory HTML, including unnumbered paragraphs, complex compliance logic expressions (AND/OR/N_OF_K), dynamic threshold propagation, and ambiguous citation schemes.
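To make the AND/OR/N_OF_K compliance logic concrete, the sketch below evaluates a small nested criterion tree. The `Criterion` class, its field names, and the `"LEAF"` marker are assumptions made for illustration; the paper's actual schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Criterion:
    node_id: str
    logic: str = "LEAF"  # one of "LEAF", "AND", "OR", "N_OF_K" (hypothetical labels)
    children: List["Criterion"] = field(default_factory=list)
    n_required: Optional[int] = None  # only meaningful for N_OF_K nodes


def is_satisfied(node: Criterion, facts: Dict[str, bool]) -> bool:
    """Recursively evaluate a criterion tree against per-leaf compliance facts."""
    if node.logic == "LEAF":
        return facts.get(node.node_id, False)
    results = [is_satisfied(child, facts) for child in node.children]
    if node.logic == "AND":
        return all(results)
    if node.logic == "OR":
        return any(results)
    if node.logic == "N_OF_K":
        if node.n_required is None:
            raise ValueError(f"{node.node_id}: N_OF_K node needs n_required")
        return sum(results) >= node.n_required
    raise ValueError(f"{node.node_id}: unknown logic operator {node.logic!r}")


# Root requires: a AND (b OR c) AND (at least 2 of d, e, f).
root = Criterion("root", "AND", [
    Criterion("a"),
    Criterion("alt", "OR", [Criterion("b"), Criterion("c")]),
    Criterion("pick", "N_OF_K",
              [Criterion("d"), Criterion("e"), Criterion("f")], n_required=2),
])

facts = {"a": True, "c": True, "d": True, "f": True}
print(is_satisfied(root, facts))                   # True
print(is_satisfied(root, {**facts, "f": False}))   # False: only 1 of d, e, f holds
```

Treating missing leaves as unsatisfied is a design choice of this sketch; a production checker might instead distinguish "not applicable" from "not met".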
Evaluation: Datasets and Metrics
Application to the EU Taxonomy Delegated Acts yields EU-TAXOSTRUCT, a large-scale structured dataset covering 242 activities with over 4,800 criterion nodes. Extraction quality is benchmarked against a manually annotated gold subset (n=100 activities) and compared against a GPT-4o single-pass baseline, representative of conventional prompt-based extraction using a much larger LLM.
Metrics include:
- Structural F1 (tree alignment, parent/child/sibling placement, schema completeness): REGREACT obtains 94.1% versus baseline 78.6%.
- Classification Accuracy (criterion category, applicability, logic): Category 98.6%, Applicability 97.2%, Evaluation Logic 93.4%.
- Semantic Equivalence (LLM-judged, scale 1–5): Threshold extraction 4.43, Reference normalization/inlining 4.77, Dependency inference 4.63, Footnote structuring 4.48.
- RAG Summary Quality (Faithfulness, Relevance, Completeness, Coverage): All mean Likert scores above 4.0, with faithfulness strongest at 4.61.
- Ablation Studies: Removing ODR or the graph module causes significant metric degradation, particularly for dependency handling and logical consistency.
These results are robust to dataset complexity, with Cohen's κ of 0.84–0.91 in annotator agreement and strong calibration between the framework's confidence estimates and empirical extraction fidelity.
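As a rough illustration of how a structural F1 score of the kind listed above could be computed, the sketch below scores predicted parent/child edges against gold edges. Reducing tree alignment to an edge-set comparison is an assumption of this example; the paper's metric also covers sibling placement and schema completeness, which are not modelled here. The node identifiers are made up.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (parent_id, child_id)


def structural_f1(predicted: Set[Edge], gold: Set[Edge]) -> float:
    """F1 over parent/child edges: harmonic mean of edge precision and recall."""
    if not predicted and not gold:
        return 1.0  # two empty trees agree trivially
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold_edges = {("root", "a"), ("root", "b"), ("a", "a1"), ("a", "a2")}
# The prediction misplaces node a2 under b instead of a.
pred_edges = {("root", "a"), ("root", "b"), ("a", "a1"), ("b", "a2")}

print(structural_f1(pred_edges, gold_edges))  # 0.75
```

A single misplaced child costs both precision and recall here, because the wrong edge is a false positive and the missing correct edge is a false negative.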
Analysis of Technical Advances
Hierarchical Structuring and Logic Extraction
REGREACT's structural parser, together with the logic inference agent, enables full reconstruction of arbitrary nesting and alternative compliance pathways, even in the presence of poor HTML markup and implicit conjunction/disjunction rules. This surpasses prior LLM-based and knowledge graph approaches, which typically generate flat or shallow structure and fail to model path-dependent requirements.
Self-Correction and Consistency Enforcement
ODR loops, grounded on explicit comparison with the original document, outperform unaided self-critique approaches in regulatory NLP. Errors due to propagation, cross-stage inconsistencies, or ambiguous references are systematically surfaced and addressed. Typed graph construction guarantees cycle-free, consistent hierarchies and allows enforcement of regulatory invariants such as evaluation participation rules, a capability not found in earlier agentic extraction pipelines.
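One way to picture the cycle-free guarantee of a typed criterion graph is to reject any edge that would close a loop at insertion time. The sketch below does that with a simple reachability check. The class name, the edge-type labels (derived from the relationship kinds named earlier), and the reject-on-insert policy are assumptions of this example rather than details confirmed by the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical labels for the edge kinds described in the text.
EDGE_TYPES = {
    "hierarchy", "logical_group", "threshold_inheritance",
    "dependency", "reference", "xref_correction",
}


class CriterionGraph:
    def __init__(self) -> None:
        self._out: Dict[str, List[Tuple[str, str]]] = {}

    def _reachable(self, start: str, target: str) -> bool:
        """Depth-first search: can `target` be reached from `start`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(dst for dst, _ in self._out.get(node, []))
        return False

    def add_edge(self, src: str, dst: str, edge_type: str) -> None:
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type!r}")
        # Adding src -> dst closes a cycle exactly when dst already reaches src.
        if src == dst or self._reachable(dst, src):
            raise ValueError(f"edge {src} -> {dst} would create a cycle")
        self._out.setdefault(src, []).append((dst, edge_type))

    def edges(self) -> List[Tuple[str, str, str]]:
        return [(s, d, t) for s, outs in self._out.items() for d, t in outs]


graph = CriterionGraph()
graph.add_edge("activity", "criterion_1", "hierarchy")
graph.add_edge("criterion_1", "criterion_2", "dependency")

try:
    graph.add_edge("criterion_2", "activity", "reference")
except ValueError as err:
    print(err)  # edge criterion_2 -> activity would create a cycle
```

Checking at insertion keeps every intermediate graph valid, at the cost of one traversal per added edge; a batch pipeline could instead run a single topological sort after all agents have contributed their edges.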
Self-Containedness via Criterion-Scoped RAG
By treating each criterion as a retrieval query and embedding resolved reference content inline, REGREACT ensures that downstream systems do not require access to original regulatory resources. This is a non-trivial advance over standard retrieval-augmented generation pipelines, which only augment transient model context and do not produce truly self-contained outputs. The iterative, criterion-focused retrieval plus strict summarization regime achieves high rates of faithfulness and relevance, as confirmed by independent LLM-based judges.
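The difference between augmenting transient model context and producing a self-contained record can be shown with a small sketch in which each reference on a criterion is resolved and its summary written back into the JSON. Everything here is hypothetical: the `inline_references` function, the field names, the `resolve` callable (a stand-in for the retrieval and summarization step), and the placeholder corpus, which does not contain real regulatory text.

```python
import json
from typing import Callable, Dict


def inline_references(criterion: dict, resolve: Callable[[str, str], str]) -> dict:
    """Return a copy of `criterion` with a resolved summary attached to each reference.

    `resolve(criterion_text, target)` is assumed to retrieve and summarize the
    referenced passage, conditioned on the criterion that cites it.
    """
    enriched = dict(criterion)
    enriched["references"] = [
        {**ref, "resolved_summary": resolve(criterion["text"], ref["target"])}
        for ref in criterion.get("references", [])
    ]
    return enriched


# Placeholder corpus standing in for a retrieval backend.
TOY_CORPUS: Dict[str, str] = {
    "DOC-A, Article 1": "<placeholder summary of DOC-A, Article 1>",
}


def toy_resolve(criterion_text: str, target: str) -> str:
    return TOY_CORPUS.get(target, "<unresolved: escalate for review>")


criterion = {
    "id": "c1",
    "text": "The activity complies with the requirement set out in DOC-A, Article 1.",
    "references": [{"target": "DOC-A, Article 1", "kind": "external"}],
}

enriched = inline_references(criterion, toy_resolve)
print(json.dumps(enriched, indent=2))
```

Because the summary lives in the record itself, a downstream consumer can work from the JSON alone; the original `criterion` dict is left unmodified so the unresolved form remains available for auditing.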
REGREACT demonstrates strong, quantifiable gains over a GPT-4o single-pass baseline on all dimensions. The improvement in structural F1 (+15.5 percentage points, from 78.6% to 94.1%) and in semantic equivalence for dependencies (+1.67 Likert points) indicates that pipeline decomposition, self-correction, and explicit graph enforcement contribute more to reliable regulatory extraction than model scale alone. The grounding of corrections in source evidence is especially effective for ambiguous or mis-referenced criteria, and the inlining of RAG enrichments improves output completeness.
Implications and Future Directions
Practical Impact
REGREACT's approach is highly relevant for automated compliance, regulatory intelligence, and AI auditing pipelines. The ability to produce fully structured, self-contained representations of complex, multi-layered regulatory criteria directly supports machine reasoning, compliance checking, and downstream explainability. Its agentic, iterative mechanisms and schema design provide a blueprint for expanding to other legal corpora and domains with comparable complexity.
Theoretical Perspectives
The framework exemplifies how task decomposition, grounding in explicit external evidence, and global structure enforcement are jointly required for robust information extraction from legal texts. In contrast to "prompt-hub" conventions underlying most LLM legal NLP, domain-specific agent specializations, cross-agent memory, and structural invariants are essential for correctness. The paradigm underlines the continued utility of classical IR, graph theory, and formal logic within modern LLM-centered workflows.
Prospects for AI Research
Extending REGREACT to multilingual corpora and to regulatory frameworks outside the EU is a natural next step. Addressing paywalled standard references, integrating advanced error-correction methods, and coupling extracted graphs to rule-execution or formal verification systems are promising research avenues. The explicit quality/confidence calibration could serve as a basis for active learning loops and further human-in-the-loop refinement.
Conclusion
REGREACT sets a new standard for structured regulatory information extraction, unifying multi-agent specialization, source-grounded self-correction, explicit graph validation, and criterion-scoped inlining of external reference content. Its superior accuracy, self-containedness, and interpretability over single-pass LLM baselines highlight the necessity of disciplined pipeline architectures for real-world legal AI. Future work will pursue broader domain generalization, enhanced correction for restricted-content references, and support for multilingual and continually evolving regulatory corpora.
Reference: "REGREACT: Self-Correcting Multi-Agent Pipelines for Structured Regulatory Information Extraction" (2604.12054)