Towards an Agentic LLM-based Approach to Requirement Formalization from Unstructured Specifications

Published 20 Apr 2026 in cs.SE | (2604.18228v1)

Abstract: Early-stage specifications of safety-critical systems are typically expressed in natural language, making it difficult to derive formal properties suitable for verification and needed to guarantee safety. While recent LLM-based approaches can generate formal artifacts from text, they mainly focus on syntactic correctness and do not ensure semantic alignment between informal requirements and formally verifiable properties. We propose an agentic methodology that automatically extracts verification-ready properties from unstructured specifications. The modular pipeline combines requirement extraction, compatibility filtering with respect to a target formalism, and translation into formal properties. Experimental results across three scenarios show that the pipeline generates syntactically and semantically aligned formal properties with a 77.8% accuracy. By explicitly accounting for modeling and verification constraints, the approach is a paving step towards exploiting AI to bridge the gap between informal descriptions and semantically meaningful formal verification.


Authors (4)

Summary

The paper introduces a modular, LLM-driven pipeline that extracts, classifies, and formally translates natural language requirements into verification-ready properties.
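It reports an 81.8% combined exact and partial match rate for extracted requirements and 77.8% semantic accuracy for translated properties, after filtering out unverifiable requirements and judging logical equivalence rather than exact string match.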
The agentic design offers explicit traceability and a domain-agnostic framework that enhances scalability and rigor in safety-critical systems.

Agentic LLM-Based Formalization of Requirements from Unstructured Specifications

Motivation and Background

The formalization of system requirements from unstructured, natural language specifications is a persistent challenge in the design and verification of safety-critical systems. Conventional model-driven engineering and formal verification approaches require precise, formal properties, but early-stage specifications are typically provided in ambiguous natural language. Prior automated and semi-automated methods utilize NLP pipelines, domain ontologies, or rule-based extraction, but these require substantial customization and often do not scale across domains. Recent adoption of LLM-based generative AI for requirements translation delivers greater flexibility, yet existing methods primarily focus on syntactic correctness, neglecting the essential semantic alignment of formal properties with the original specification intent.

This paper addresses the critical gap of semantic alignment by introducing an agentic LLM-based pipeline that operationalizes the requirement formalization cycle, centering on both syntactic and semantic validity. The methodology leverages multi-agent LLM roles and integrates formal verification tools within its workflow, providing explicit traceability and modularity for compliance-oriented requirement elicitation.

Agentic Pipeline Architecture

The pipeline comprises three distinct LLM-driven stages:

Requirement Extraction: The initial LLM parses an unstructured specification and extracts atomic requirements in normalized natural language, isolating functional and safety constraints. This stage employs standard requirements syntax and exports results in a machine-readable JSON schema, facilitating downstream processing.
Verifiability Classification: A specialized LLM judge then filters out requirements that cannot be expressed in the target formal model (here, Stochastic Hybrid Automata (SHA) checked with UPPAAL SMC), removing those that refer to unobservable conditions, qualitative intentions, or model-external phenomena. This gatekeeping step keeps inexpressible requirements from reaching the translation stage, so translation effort is spent only on requirements that can yield a valid query.
Formal Translation: The eligible requirements are transformed via LLM into formal properties compliant with the BNF grammar of the target query language (UPPAAL SMC, MITL). This stage programmatically checks syntactic correctness and, crucially, employs a secondary LLM judge to assess semantic equivalence with the original specification, considering logical commutativity, abstraction levels, and formal equivalences.

The agentic workflow is domain-agnostic but is instantiated here on human-robot interaction (HRI) scenarios, using the LIrAs DSL to generate SHA models amenable to UPPAAL verification.
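The three stages above can be sketched as a small Python skeleton. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Requirement` record, the `run_pipeline` function, the JSON fields `id` and `text`, and all `stub_*` functions are hypothetical names, and the LLM calls are replaced by injectable callables. The syntax check accepts only a small assumed subset of UPPAAL SMC query shapes (`Pr[<=BOUND](<> EXPR)` and `Pr[<=BOUND]([] EXPR)`), not the full BNF grammar the summary refers to.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Requirement:
    """One extracted requirement plus the trace left by every later stage."""
    req_id: str
    text: str
    verifiable: Optional[bool] = None   # set by the verifiability classifier
    query: Optional[str] = None         # set by the translator
    syntax_ok: Optional[bool] = None    # result of the programmatic check
    semantic_ok: Optional[bool] = None  # verdict of the second judge


# Assumed query shape: Pr[<=BOUND](<> EXPR) or Pr[<=BOUND]([] EXPR).
# This is only a small subset of UPPAAL SMC queries, not the full grammar.
_QUERY_RE = re.compile(r"^Pr\[<=\s*\d+\]\s*\((<>|\[\])\s*\S.*\)$")


def check_syntax(query: str) -> bool:
    """Shape check plus balanced parentheses; a stand-in for a real parser."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and _QUERY_RE.match(query.strip()) is not None


def run_pipeline(
    spec: str,
    extract: Callable[[str], str],        # spec -> JSON list of {"id", "text"}
    is_verifiable: Callable[[str], bool],
    translate: Callable[[str], str],
    syntax_check: Callable[[str], bool],
    judge: Callable[[str, str], bool],    # (requirement, query) -> equivalent?
) -> List[Requirement]:
    reqs = [Requirement(r["id"], r["text"]) for r in json.loads(extract(spec))]
    for req in reqs:
        req.verifiable = is_verifiable(req.text)
        if not req.verifiable:
            continue  # filtered out before translation
        req.query = translate(req.text)
        req.syntax_ok = syntax_check(req.query)
        if req.syntax_ok:
            req.semantic_ok = judge(req.text, req.query)
    return reqs


# Hypothetical stubs standing in for the LLM agents.
def stub_extract(spec: str) -> str:
    return json.dumps([
        {"id": "R1", "text": "The robot shall deliver the coffee within 100 time units."},
        {"id": "R2", "text": "The user shall feel comfortable near the robot."},
    ])


def stub_is_verifiable(text: str) -> bool:
    return "feel" not in text  # qualitative intention, so not verifiable


def stub_translate(text: str) -> str:
    return "Pr[<=100](<> robot.delivered)"  # identifier is made up


def stub_judge(text: str, query: str) -> bool:
    return "deliver" in text and "delivered" in query
```

Because every stage writes its verdict onto the same record, each requirement carries its own trace from extracted text to final query, which is one way to realize the stage-by-stage traceability the summary describes.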

Experimental Results and Numerical Performance

Quantitative evaluation across three cyber-physical scenarios (Coffee_Delivery, User_Guided_Transport, Factory_Pipeline) demonstrates strong pipeline efficacy:

Requirement Extraction: Combined exact and partial semantic matches reach $81.8\%$ of generated requirements. The LLM reliably captures core operational logic (navigation, synchronization), but recurrent deficiencies arise in rigorous formal prerequisites (logical boundaries, failure detection) and in occasional hallucination of plausible, but unstated, domain constraints.
Verifiability Classification: Overall filtering accuracy reaches $88.7\%$, with recall of $94.2\%$ for valid constraints. The LLM consistently discards unverifiable artifacts, minimizing false positives and improving pipeline robustness. This classification phase protects the translation stage from infeasible queries and compiler errors.
Formal Translation: Syntactic correctness of the generated UPPAAL queries is $95.8\%$. Exact textual matches with ground truth are $34.7\%$, but semantic evaluation (via LLM-based judge with explicit equivalence rules) elevates effective accuracy to $77.8\%$. The model demonstrates flexibility in recognizing and producing logically equivalent formulations, validating the importance of semantic evaluation beyond rigid string comparison.
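For readers who want to see how the metrics above relate to one another, the following sketch computes them from raw counts. It assumes that "verifiable" is the positive class for the classifier and that an exact textual match also counts as a semantic match; the summary does not spell out either convention. All counts and records are toy values chosen for illustration, not the paper's data.

```python
from dataclasses import dataclass
from typing import Dict, List


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all classification decisions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Share of truly verifiable requirements that the classifier kept."""
    return tp / (tp + fn)


@dataclass
class TranslationResult:
    syntax_ok: bool       # query passed the programmatic syntax check
    exact_match: bool     # query is textually identical to the ground truth
    semantic_match: bool  # judge deemed it equivalent to the ground truth


def translation_rates(results: List[TranslationResult]) -> Dict[str, float]:
    n = len(results)
    return {
        "syntactic": sum(r.syntax_ok for r in results) / n,
        "exact": sum(r.exact_match for r in results) / n,
        # An exact match is trivially a semantic match (assumption).
        "semantic": sum(r.exact_match or r.semantic_match for r in results) / n,
    }


# Toy classifier counts (illustrative only).
toy_accuracy = accuracy(tp=9, tn=8, fp=2, fn=1)  # 17 correct out of 20
toy_recall = recall(tp=9, fn=1)                  # 9 kept out of 10 valid

# Toy translation records (illustrative only).
toy_results = [
    TranslationResult(syntax_ok=True, exact_match=True, semantic_match=True),
    TranslationResult(syntax_ok=True, exact_match=False, semantic_match=True),
    TranslationResult(syntax_ok=True, exact_match=False, semantic_match=True),
    TranslationResult(syntax_ok=True, exact_match=False, semantic_match=False),
]
toy_rates = translation_rates(toy_results)
```

In the toy records the exact-match rate is well below the semantic rate, which mirrors the kind of gap reported above between textual and semantic accuracy.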

These results substantiate the claim that the agentic pipeline not only ensures structural integrity but also delivers high fidelity in semantic mapping from natural language specifications to formal verification artifacts.

Theoretical and Practical Implications

The modular, agentic LLM pipeline presented in the paper advances the requirements formalization paradigm by explicitly incorporating semantic evaluation alongside syntactic correctness. This approach allows for:

Traceable requirement-to-property mapping: Each pipeline stage is independently evaluable, enabling detailed analysis and refinement.
Domain transparency: Only minimal adjustments are required to adapt the methodology to different domains or formal verification frameworks.
Semantic pre-filtering: Early removal of unverifiable constraints improves computational efficiency and prevents propagation of errors.
Robust semantic evaluation: Integrating LLM-based judges capable of detecting logical equivalences increases query translation accuracy, revealing the inadequacy of strict syntactic tests for formal property generation.
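To illustrate why strict string comparison undercounts correct translations, the sketch below normalizes the operand order of `&&` and `||` before comparing two expressions. This is a deliberately simple rule-based stand-in covering commutativity only; the paper uses an LLM-based judge with explicit equivalence rules, which this code does not reproduce. The function names and the example identifiers are hypothetical.

```python
from typing import List


def _wraps(expr: str) -> bool:
    """True if the first '(' is closed only by the very last character."""
    depth = 0
    for i, ch in enumerate(expr):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i < len(expr) - 1:
                return False
    return depth == 0


def split_top_level(expr: str, op: str) -> List[str]:
    """Split expr on op, ignoring occurrences nested inside parentheses."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(expr):
        ch = expr[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and expr.startswith(op, i):
            parts.append(expr[start:i])
            i += len(op)
            start = i
            continue
        i += 1
    parts.append(expr[start:])
    return [p.strip() for p in parts]


def canonical(expr: str) -> str:
    """Sort operands of || and && so that operand order no longer matters."""
    expr = expr.strip()
    while expr.startswith("(") and expr.endswith(")") and _wraps(expr):
        expr = expr[1:-1].strip()  # drop redundant outer parentheses
    for op in ("||", "&&"):  # || binds more loosely, so split on it first
        parts = split_top_level(expr, op)
        if len(parts) > 1:
            return "(" + f" {op} ".join(sorted(canonical(p) for p in parts)) + ")"
    return " ".join(expr.split())  # atom: only normalize whitespace


def commutative_equal(a: str, b: str) -> bool:
    return canonical(a) == canonical(b)
```

For example, `commutative_equal("robot.x > 5 && human.done", "human.done && robot.x > 5")` holds even though the two strings differ, while `commutative_equal("a && b", "a || b")` does not. Equivalences beyond reordering, such as different abstraction levels, are out of reach for a normalizer like this.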

These developments have significant implications for practical requirements engineering in system verification, facilitating rapid, scalable transformation from informal descriptions to verification-ready properties. Theoretically, the pipeline reinforces the necessity of semantic rigor and provides a replicable modular architecture for further research in automated requirements formalization.

Future Directions

Further research will focus on intrinsic automatic metrics for intermediate outputs, enabling iterative refinement and mitigation of error propagation within the pipeline. Larger-scale empirical studies will incorporate more scenarios, models, and LLM architectures, exploring fine-tuning and enhanced prompt design to address recurring failure modes. Additionally, intrinsic semantic evaluation techniques may be devised to circumvent subjectivity in LLM-based judging and provide formal guarantees of alignment.

Conclusion

The agentic LLM-based methodology delineated in this paper delivers reliable extraction, classification, and formal translation of requirements from natural language specifications. Strong quantitative results validate the pipeline's syntactic and semantic performance, particularly in achieving $77.8\%$ semantically aligned formal query accuracy. This architecture offers a practical and extensible framework for bridging the gap between informal system descriptions and rigorous, verification-ready properties, with broad applicability across safety-critical domains and formal verification tools.