- The paper introduces a novel top-down, caller-driven specification synthesis to automate function-level verification in LLM-generated codebases.
- It extends classical Hoare logic by integrating natural language pre/postconditions, enabling scalable, compositional reasoning over large systems.
- The evaluation demonstrates scalability and bug discovery, reporting 522 new bugs across four systems of up to 143k lines of code.
FM-Agent: Automated Hoare-Style Reasoning for Large-Scale LLM-Generated Systems
Motivation and Problem Statement
The transition of LLMs from code completion to the autonomous generation of large codebases (100k+ LoC) has intensified the need for scalable, automated verification methodologies that can accommodate developer intent in the presence of buggy, machine-generated software. Traditional compositional reasoning, via Hoare logic, offers a principled technique to decompose system correctness proofs into independent subgoals for each function; its utility in scalable verification is well-established. However, existing formal tools are fundamentally bottlenecked by the requirement for accurate, formal human-authored specifications for every function—an infeasible burden for systems of this scale or for LLM-generated code, which developers did not author and may not fully understand. Furthermore, LLM-generated programs often contain subtle errors and may lack high-fidelity documentation, making intent recovery and accurate specification particularly challenging.
FM-Agent Framework
FM-Agent introduces a comprehensive, three-component pipeline to achieve automated compositional reasoning at unprecedented scale, leveraging LLM capabilities for specification mining, Hoare-style semantic entailment, and bug validation. The core technical contributions of FM-Agent are:
- Top-Down, Caller-Driven Specification Synthesis: Departing from implementation-centric spec mining, FM-Agent generates function-level specifications by systematically propagating caller expectations down the call graph. Specifications are derived from how callers use their callees and what behavior they expect from them, integrating usage context and, where available, auxiliary domain knowledge. This reduces the risk of inheriting spurious behaviors from buggy implementations. Figure 2 illustrates how a specification derived from a buggy implementation reproduces the bug, motivating the paradigm shift.

Figure 2: The specification generated based on the implementation, which is buggy. The bolded part is incorrect.
- Generalized Hoare-Style Reasoning over Natural Language Specs: FM-Agent extends classical Hoare logic inference by supporting natural language pre/postconditions, directly leveraging the dual fluency of LLMs in both code semantics and linguistic description. Specifications and reasoning steps remain in NL, yet the inference rules are parallel to the formal case, supporting both compositionality and hierarchical propagation of semantic entailment violations.
- Automated Test Case Generation and Bug Confirmation: Upon detecting a reasoning failure (i.e., postcondition entailment cannot be confirmed on at least one path), FM-Agent uses the LLM to instantiate concrete system-level test cases and executes them to confirm or discard the bug, facilitating root-cause localization and reducing the false-positive rate.
The overall workflow is summarized in Figure 1.
Figure 1: The workflow of FM-Agent.
Top-Down Specification Generation and System Scalability
Traditionally, bottom-up specification writing is favored for human-written formal specs, because it allows extensive reuse of lower-layer definitions. FM-Agent departs from this, motivated by the reality that LLM-generated implementations are neither trustworthy nor adequately documented, and that developers often lack the intent context necessary to drive bottom-up synthesis. FM-Agent's top-down paradigm traverses the call graph in partially ordered layers. For each function, specs are mined from the aggregate of all caller expectations, with domain knowledge provided per phase (coarse component) to sidestep context-window limitations. The combination rule for expected specs is essentially the disjunction of preconditions and the conjunction of postconditions across all caller contexts; the combination is LLM-mediated rather than syntactically compositional (Figure 5).
Figure 5: An example of the top-down paradigm for specification generation. Each directed edge from function F_i to function F_j indicates that F_i invokes F_j. The specification of F_j is generated based on the expected specification from all its callers.
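One way to read the combination rule, as a sketch rather than the paper's exact formulation: suppose F_j has callers C_1, ..., C_n, and caller C_i contributes an expected contract (P_i, Q_i). These symbols are notation introduced here for illustration only.

```latex
\[
\mathrm{Pre}(F_j) \;=\; \bigvee_{i=1}^{n} P_i ,
\qquad
\mathrm{Post}(F_j) \;=\; \bigwedge_{i=1}^{n} Q_i .
\]
```

Under this reading, the callee must accept every state that any caller may pass in and must deliver everything that every caller relies on. In FM-Agent the merge is carried out by the LLM over natural-language contracts, so the formula describes the intent of the rule, not a syntactic procedure.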
This design enables several major scalability properties:
- Layered concurrency: Specification mining for all functions in a given layer proceeds in parallel, enhancing throughput and accommodating massive codebases (up to 143k LoC, as demonstrated).
- Phase-level modularity: Large monolithic systems are segmented into phases (e.g., compiler components) for domain-knowledge scoping and further concurrency.
- Entry-function bias: Entry points, for which intent is most easily recoverable, drive the propagation of expectations downward.
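The layering and per-layer parallelism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not FM-Agent's implementation: the call graph, `expected_of`, and `mine_spec` are hypothetical stand-ins (the latter two replace LLM calls with string templates), and the call graph is assumed to be acyclic.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical call graph: caller -> list of callees (names are illustrative).
CALL_GRAPH = {
    "main": ["parse", "codegen"],
    "parse": ["lex"],
    "codegen": ["emit"],
    "lex": [],
    "emit": [],
}


def layers_top_down(graph):
    """Group functions into layers so that every caller precedes its callees.

    Assumes an acyclic call graph; recursion would need extra handling.
    """
    pending_callers = {f: 0 for f in graph}
    for callees in graph.values():
        for callee in callees:
            pending_callers[callee] += 1
    # Entry functions are those that nobody calls.
    layer = sorted(f for f, n in pending_callers.items() if n == 0)
    result = []
    while layer:
        result.append(layer)
        next_layer = []
        for caller in layer:
            for callee in graph[caller]:
                pending_callers[callee] -= 1
                if pending_callers[callee] == 0:
                    next_layer.append(callee)
        layer = sorted(next_layer)
    return result


def expected_of(caller, callee):
    """Stand-in for deriving what one caller expects of one callee."""
    return {"pre": f"args as passed by {caller}", "post": f"result as used by {caller}"}


def mine_spec(function, expectations):
    """Stand-in for the LLM call that merges all caller expectations."""
    return {
        "pre": " OR ".join(sorted(e["pre"] for e in expectations)) or "true",
        "post": " AND ".join(sorted(e["post"] for e in expectations)) or "true",
    }


def synthesize_specs(graph, entry_specs):
    """Process layers in order; functions within one layer run in parallel."""
    specs = dict(entry_specs)
    callers = {f: [] for f in graph}
    for caller, callees in graph.items():
        for callee in callees:
            callers[callee].append(caller)
    with ThreadPoolExecutor() as pool:
        for layer in layers_top_down(graph):
            todo = [f for f in layer if f not in specs]
            mined = pool.map(
                lambda f: mine_spec(f, [expected_of(c, f) for c in callers[f]]), todo
            )
            specs.update(zip(todo, mined))
    return specs


layers = layers_top_down(CALL_GRAPH)
specs = synthesize_specs(CALL_GRAPH, {"main": {"pre": "true", "post": "program compiled"}})
```

Layers are processed sequentially because a callee's specification depends on its callers already being handled, while functions within a layer are independent of one another and can be mined concurrently.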
Generalized Hoare Inference via LLMs
The code reasoner module takes as input, for each function, its pre/postconditions and the expected contracts of its transitive callees. Reasoning proceeds by path-sensitive NL entailment over grouped statements (for LLM efficiency), checking that on every path the final derived postcondition entails the specified one. Branch and loop rules are direct analogs of the classical formalism, with disjunctive and inductive invariant generation carried out in NL, again relying on LLM inference. Where specifications are precise enough, they can be translated to formal logic and handed to an SMT solver, improving the precision of the process.
Notably, although LLM hallucination and misalignment with formal contracts remain possible, FM-Agent closes the loop by concretizing suspected bugs through test-case synthesis, improving both empirical utility and debugging signal.
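For reference, the classical rules that the NL reasoning mirrors are the standard Hoare-logic rules for sequencing, branching, loops, and consequence. They are given here in their textbook form, not reproduced from the paper; P, Q, R are assertions, b is a guard, and I is a loop invariant.

```latex
\[
\frac{\{P\}\,S_1\,\{R\} \qquad \{R\}\,S_2\,\{Q\}}
     {\{P\}\,S_1;\,S_2\,\{Q\}}
\qquad
\frac{\{P \land b\}\,S_1\,\{Q\} \qquad \{P \land \lnot b\}\,S_2\,\{Q\}}
     {\{P\}\ \mathbf{if}\ b\ \mathbf{then}\ S_1\ \mathbf{else}\ S_2\ \{Q\}}
\]
\[
\frac{\{I \land b\}\,S\,\{I\}}
     {\{I\}\ \mathbf{while}\ b\ \mathbf{do}\ S\ \{I \land \lnot b\}}
\qquad
\frac{P \Rightarrow P' \qquad \{P'\}\,S\,\{Q'\} \qquad Q' \Rightarrow Q}
     {\{P\}\,S\,\{Q\}}
\]
```

In the NL setting, the assertions are natural-language statements and each implication or entailment is judged by the LLM instead of being discharged by a proof rule.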
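The confirm-or-discard step can be illustrated with a small sketch. Everything here is hypothetical: the `Suspect` record, the `confirm` helper, and the toy `buggy_abs` function are invented for illustration, whereas FM-Agent generates system-level tests with the LLM and runs them against the real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Suspect:
    """A reasoning failure plus the concrete test proposed for it (hypothetical shape)."""
    function: str
    test_input: int
    holds: Callable[[int, int], bool]  # postcondition as a check on (input, output)


def confirm(system: Callable[[int], int], suspects: List[Suspect]) -> List[str]:
    """Keep only suspects whose test actually violates the postcondition when run."""
    confirmed = []
    for s in suspects:
        try:
            output = system(s.test_input)
            violated = not s.holds(s.test_input, output)
        except Exception:
            violated = True  # in this sketch, a crash on a valid input counts as a failure
        if violated:
            confirmed.append(s.function)
    return confirmed


def buggy_abs(x: int) -> int:
    # Toy system under test: returns the input unchanged, so it is wrong for negatives.
    return x


suspects = [
    Suspect("abs/negative", -3, lambda x, y: y >= 0),  # genuine violation
    Suspect("abs/positive", 4, lambda x, y: y == 4),   # false alarm from reasoning
]
confirmed = confirm(buggy_abs, suspects)
```

Only the suspect whose test reproduces a violation survives, which is how executing tests filters out false positives raised by imprecise NL reasoning.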
Large-Scale Evaluation
The evaluation reports the following results:
- Scalability: FM-Agent processes four LLM-generated systems ranging from 11k to 143k LoC (~8,500 functions), with a total end-to-end cost of 2 days and 3.4B tokens.
- Bug Discovery: Across these mature codebases, 522 new bugs were found, including system-level failures such as compiler miscompilations, operating system memory corruptions, silent data errors in ML frameworks, and database query misexecutions. This is despite extensive standard pre-existing testing (unit, integration, differential, multi-agent review).
- Ablation Analysis: The top-down, caller-driven decomposition and Hoare-style reasoning are shown to provide substantial gains; a variant lacking these properties finds only one-sixth as many bugs in the largest system.
- Concurrency Efficiency: Concurrency in layer and phase-level processing is systematically quantified.
Theoretical and Practical Implications
Theoretically, FM-Agent demonstrates that scalable compositional reasoning can be automated for LLM-sized codebases when classical formal method limitations (manual spec burden, overreliance on implementations) are relaxed using large models and systematic context propagation from caller expectations.
Practically, FM-Agent fills a critical gap between lightweight testing and fully sound, but unscalable, formal verification. It guarantees neither soundness nor completeness (the latter is unattainable in general because of undecidability), but it enables practical, high-throughput, high-precision bug finding, triaging, and debugging even when authoritative specifications do not exist.
Future Directions
FM-Agent currently targets sequential programs but is amenable to integration with concurrency reasoning frameworks (e.g., rely-guarantee, separation logic). The approach points toward hybrid systems that combine LLM-derived NL specs with classical formal machinery, and suggests a pathway for specification mining, reasoning, and bug discovery in domains where intent is opaque and code provenance is machine-generated.
Conclusion
FM-Agent is the first automated, practical framework for performing function-level compositional reasoning at the scale demanded by contemporary LLM-generated codebases. Through a general, caller-driven, top-down specification mining process, natural language Hoare-style reasoning, and LLM-guided test case validation, the framework demonstrates both scalability and robustness in surfacing subtle errors in code previously validated by extensive conventional testing. The implications are significant for scalable software assurance and for bridging gaps between formal verification, specification mining, and LLM-based software development.