VeriSpecGen: Verification-Guided Spec Generation

Updated 4 July 2026

VeriSpecGen is a class of methods that automatically generate formal specifications from natural language, code, or RTL, treating them as dynamic, machine-checkable artifacts.
It employs iterative refinement using techniques like traceability maps, formal equivalence checking, and mutation-based diagnostics to ensure semantic faithfulness.
Reported evaluations show measurable gains from verifier feedback, for example 86.6% on the VERINA SpecGen task for the traceable-refinement framework, while also showing that verifier acceptance alone can overstate specification quality.

VeriSpecGen denotes a class of methods for verification-guided specification generation in which specifications are not treated as static textual artifacts, but as objects that must survive machine-checkable validation. The collected literature suggests two closely related usages of the term: a specific traceable-refinement framework for synthesizing intent-aligned Lean specifications from natural language, and a broader paradigm in which specifications are generated from code, RTL, or structured designs and then iteratively refined using formal equivalence checking, theorem proving, symbolic refutation, executable tests, or benchmark-driven adversarial evaluation (Ye et al., 12 Apr 2026). Across these usages, the central premise is stable: specification quality is judged by semantic faithfulness and by the ability of formal tools to expose omissions, ambiguities, and overconstraints that ordinary verifier pass rates can miss.

1. Conceptual scope and lineage

The broader VeriSpecGen idea addresses a longstanding gap between informal intent and formal verification. In cryptographic protocol analysis, MetaCP stores protocol designs in a structured PSV model and exports them to Tamarin, ProVerif, and a C++ implementation, explicitly targeting the interpretation and modeling stage rather than only the final verification stage (Metere et al., 2021). In C verification, AutoSpec synthesizes ACSL preconditions, postconditions, loop invariants, and assigns clauses, while a verifier filters and accumulates only those specifications that are satisfiable and adequate for a full proof (Wen et al., 2024). In RTL comprehension, SpecLoop generates natural-language specifications from RTL and then validates them by reconstructing RTL and checking bounded sequential equivalence against the original design (Chang et al., 3 Mar 2026).

A more explicit formalization appears in the Lean-based framework titled "Intent-aligned Formal Specification Synthesis via Traceable Refinement" (Ye et al., 12 Apr 2026). There, VeriSpecGen decomposes a natural-language description $P$ into atomic requirements $R=\{r_1,\dots,r_m\}$, drafts a specification $S=\langle pre(\vec{x}), post(\vec{x}, y)\rangle$, generates requirement-targeted tests, and uses traceability maps to repair only the clauses implicated by failing tests. This formulation makes requirement attribution and localized repair first-class.

Other systems adopt the same underlying objective with different formalisms. Verus-SpecGym defines faithfulness as equality between the intended relation and the formalized one, $R_{s_F}=R_{s_I}$, and further decomposes this into soundness and completeness of `pre_spec` and `post_spec` (Agarwal et al., 26 May 2026). VeriAct frames the task for JML as synthesizing preconditions $P$ and postconditions $Q$ that are both correct and complete, while showing that verifier acceptance alone is not sufficient evidence of either property (Misu et al., 31 Mar 2026). This suggests that VeriSpecGen is best understood not as a single toolchain, but as a family of verifier-coupled synthesis methods whose common purpose is to align formal specifications with intended semantics.

2. Formal structure of specification faithfulness

A recurring pattern in VeriSpecGen systems is the replacement of purely linguistic evaluation with explicit semantic obligations. In the traceable-refinement formulation, tests are partitioned into positive tests $T^+$, negative-input tests $T^{in-}$, and negative-output tests $T^{out-}$, with validation predicates

$\phi^+(S,t):=pre(\vec{x})\land post(\vec{x},y),$

$\phi^{in-}(S,t):=\neg pre(\vec{x}),$

$\phi^{out-}(S,t):=pre(\vec{x})\land\neg post(\vec{x},y),$

and a traceability map from each requirement $r_i$ to the specification clauses and tests derived from it, used to localize repair (Ye et al., 12 Apr 2026).
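The interplay between test classes and the traceability map can be illustrated with a small executable sketch. It is not the framework's implementation: the names `Spec`, `Case`, `passes`, `implicated_clauses`, and the integer-square-root example are all hypothetical, and the sketch assumes that a negative-input test expects `pre` to reject the input and a negative-output test expects `pre` to hold while `post` rejects the output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set


@dataclass(frozen=True)
class Spec:
    pre: Callable[[int], bool]         # precondition over the input
    post: Callable[[int, int], bool]   # postcondition over (input, output)


@dataclass(frozen=True)
class Case:
    kind: str          # "pos", "neg_in", or "neg_out" (assumed labels)
    x: int
    y: int
    requirement: str   # id of the atomic requirement this test targets


def passes(spec: Spec, c: Case) -> bool:
    """Evaluate the validation predicate that matches the test class."""
    if c.kind == "pos":       # valid input, correct output
        return spec.pre(c.x) and spec.post(c.x, c.y)
    if c.kind == "neg_in":    # invalid input must be rejected by pre
        return not spec.pre(c.x)
    if c.kind == "neg_out":   # wrong output must be rejected by post
        return spec.pre(c.x) and not spec.post(c.x, c.y)
    raise ValueError(f"unknown test kind: {c.kind}")


def implicated_clauses(spec: Spec, cases: Iterable[Case],
                       trace: Dict[str, Set[str]]) -> Set[str]:
    """Union of the clauses traced to requirements with failing tests."""
    clauses: Set[str] = set()
    for c in cases:
        if not passes(spec, c):
            clauses |= trace.get(c.requirement, set())
    return clauses


# Hypothetical requirements for an integer square root:
# r1: input is non-negative; r2: y*y <= x; r3: x < (y+1)^2.
TRACE = {"r1": {"pre"}, "r2": {"post.lower"}, "r3": {"post.upper"}}
CASES = [
    Case("pos", 10, 3, "r2"),
    Case("neg_in", -1, 0, "r1"),
    Case("neg_out", 10, 2, "r3"),   # 2 is too small to be isqrt(10)
]

draft = Spec(pre=lambda x: x >= 0, post=lambda x, y: y * y <= x)
repaired = Spec(pre=lambda x: x >= 0,
                post=lambda x, y: y * y <= x < (y + 1) ** 2)

print(implicated_clauses(draft, CASES, TRACE))     # {'post.upper'}
print(implicated_clauses(repaired, CASES, TRACE))  # set()
```

The draft passes its positive test yet accepts a wrong output, so only the clause traced to `r3` is flagged for repair; the precondition and lower bound are left untouched.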

SpecLoop instantiates the same idea for hardware equivalence. Given an RTL design $D$, it generates a specification $S_D$, reconstructs RTL $D'$ from $S_D$, and proves bounded sequential equivalence

$D \equiv_k D' \iff \forall\, i_1,\dots,i_k,\ \forall\, t\le k:\ out_D(t)=out_{D'}(t)$

with Yosys EQY at depth $k$ (Chang et al., 3 Mar 2026). Its miter-based error flag

$err(t) := \bigvee_{j}\big(out_{D,j}(t)\oplus out_{D',j}(t)\big)$

turns specification errors into concrete counterexample traces that can be fed back to refinement.
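A toy model makes the role of the depth bound and the counterexample trace concrete. The sketch below does not use Yosys or EQY; it assumes designs can be modeled as Python step functions over a one-bit input, and it enumerates input sequences exhaustively, which is only feasible for tiny examples. The names `bounded_equiv`, `original`, and `wrong_limit` are hypothetical.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

# A design is modeled as a step function:
# (state, input_bit) -> (next_state, output_bit)
Step = Callable[[int, int], Tuple[int, int]]


def bounded_equiv(a: Step, b: Step, depth: int,
                  init_a: int = 0, init_b: int = 0) -> Optional[List[int]]:
    """Return None if both designs agree on every input sequence of length
    <= depth; otherwise return a shortest input trace on which the XOR of
    the two outputs (the miter flag) becomes 1."""
    for n in range(1, depth + 1):
        for seq in product((0, 1), repeat=n):
            sa, sb = init_a, init_b
            for bit in seq:
                sa, out_a = a(sa, bit)
                sb, out_b = b(sb, bit)
                if out_a ^ out_b:          # miter flag raised
                    return list(seq)
    return None


def original(state: int, enable: int) -> Tuple[int, int]:
    """2-bit counter with enable; output is high when the count reaches 3."""
    nxt = (state + enable) % 4
    return nxt, int(nxt == 3)


def wrong_limit(state: int, enable: int) -> Tuple[int, int]:
    """Reconstruction from a spec that states the count limit as 2."""
    nxt = (state + enable) % 4
    return nxt, int(nxt == 2)


print(bounded_equiv(original, wrong_limit, depth=4))  # [1, 1]
print(bounded_equiv(original, wrong_limit, depth=1))  # None
print(bounded_equiv(original, original, depth=4))     # None
```

The second call shows why a bounded result is only as strong as its depth: with a bound of 1 the count-limit error is not reachable, so the check reports no mismatch even though the designs differ.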

SpecSyn introduces a different strength criterion for ACSL generation. Its verifier function $V(p,\Phi)$ returns the specifications in a candidate set $\Phi$ that are refuted on program $p$, and it measures semantic strength through the variant discriminative rate

$\mathrm{VDR}(\Phi)=\frac{|\{p'\in M(p) : V(p',\Phi)\neq\emptyset\}|}{|M(p)|},$

optimizing specifications subject to $V(p,\Phi)=\emptyset$ (Ma et al., 23 Apr 2026). Here, $M(p)$ is the set of semantically non-equivalent variants of $p$: strong specifications are those that both verify on the original program and refute many semantic-non-equivalent program variants.
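The idea of scoring a specification by how many program variants it refutes can be sketched in a few lines. This is an illustration under strong assumptions: the "verifier" is replaced by exhaustive execution over a small input domain rather than Frama-C/WP, the variants are hand-written rather than mutation-generated, and all names are hypothetical.

```python
from typing import Callable, Iterable, List

Program = Callable[[int], int]
Post = Callable[[int, int], bool]   # postcondition over (input, output)


def verifies(prog: Program, post: Post, domain: Iterable[int]) -> bool:
    """Stand-in for a verifier: the postcondition holds on every input."""
    return all(post(x, prog(x)) for x in domain)


def variant_discriminative_rate(post: Post, variants: List[Program],
                                domain: List[int]) -> float:
    """Fraction of non-equivalent variants that the postcondition refutes."""
    refuted = sum(1 for v in variants if not verifies(v, post, domain))
    return refuted / len(variants)


DOMAIN = list(range(-3, 4))
original_abs: Program = lambda x: abs(x)
VARIANTS: List[Program] = [
    lambda x: x,            # drops the sign flip
    lambda x: -x,           # always flips the sign
    lambda x: abs(x) + 1,   # off-by-one result
]

weak: Post = lambda x, y: y >= 0
strong: Post = lambda x, y: y >= 0 and (y == x or y == -x)

# Both postconditions hold on the original program...
assert verifies(original_abs, weak, DOMAIN)
assert verifies(original_abs, strong, DOMAIN)

# ...but they differ in how many variants they rule out.
vdr_weak = variant_discriminative_rate(weak, VARIANTS, DOMAIN)
vdr_strong = variant_discriminative_rate(strong, VARIANTS, DOMAIN)
print(round(vdr_weak, 3), vdr_strong)   # 0.667 1.0
```

The weak postcondition is verifier-accepted but lets the off-by-one variant through, which is the kind of gap a discrimination-based strength measure is meant to expose.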

VeriAct formalizes postcondition and precondition quality through symbolic harness checks rather than whole-program verification alone. It defines PostCorr, PostComp, PreCorr, and PreComp over curated valid pairs, invalid inputs, and output mutants, and uses the Meaningfully Verified Rate to separate verifier-accepted but semantically weak specifications from genuinely informative ones (Misu et al., 31 Mar 2026). Across these systems, specification faithfulness is operationalized not by prose quality but by discriminative semantic tests, equivalence obligations, or proof obligations.
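A simplified reading of the four quality measures can be computed directly on a toy example. The definitions below are assumptions made for illustration, not VeriAct's exact formulations: PreCorr is taken as the fraction of valid inputs accepted by the precondition, PreComp as the fraction of invalid inputs rejected, PostCorr as the fraction of valid input/output pairs accepted by the postcondition, and PostComp as the fraction of output mutants rejected. The checks run by plain execution rather than through a symbolic harness, and the integer-division example is hypothetical.

```python
from typing import Callable, List, Tuple

Inp = Tuple[int, int]                    # (a, b) for a // b
Pre = Callable[[Inp], bool]
Post = Callable[[Inp, int], bool]


def fraction(flags: List[bool]) -> float:
    return sum(flags) / len(flags)


def quality(pre: Pre, post: Post,
            valid_pairs: List[Tuple[Inp, int]],
            invalid_inputs: List[Inp],
            output_mutants: List[Tuple[Inp, int]]) -> dict:
    """Assumed, simplified versions of the four measures."""
    return {
        "PreCorr": fraction([pre(x) for x, _ in valid_pairs]),
        "PreComp": fraction([not pre(x) for x in invalid_inputs]),
        "PostCorr": fraction([post(x, y) for x, y in valid_pairs]),
        "PostComp": fraction([not post(x, y) for x, y in output_mutants]),
    }


# Intended contract for floor division: requires b != 0, ensures y == a // b.
VALID = [((6, 3), 2), ((7, -2), -4), ((-9, 3), -3), ((0, 5), 0)]
INVALID = [(1, 0), (0, 0)]
MUTANTS = [(x, y + 1) for x, y in VALID]   # each correct output shifted by one

# Candidate with an over-constrained pre and an under-constrained post.
cand_pre: Pre = lambda x: x[1] > 0
cand_post: Post = lambda x, y: abs(y) <= abs(x[0])

scores = quality(cand_pre, cand_post, VALID, INVALID, MUTANTS)
print(scores)
# {'PreCorr': 0.75, 'PreComp': 1.0, 'PostCorr': 1.0, 'PostComp': 0.25}
```

The candidate never contradicts a correct execution it admits, so a verifier could accept it, yet it rejects a legitimate negative divisor and accepts three of four wrong outputs.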

3. Verifier-in-the-loop architectures

The architectural core of VeriSpecGen is an iterative loop in which generation, validation, diagnosis, and repair are separated rather than conflated.

| System | Target artifact | Validation and repair signal |
| --- | --- | --- |
| VeriSpecGen (Ye et al., 12 Apr 2026) | Lean `pre`/`post` specs | Lean validation, traceability maps, adversarial tests |
| SpecLoop (Chang et al., 3 Mar 2026) | Natural-language specifications of RTL | EQY bounded sequential equivalence and counterexamples |
| SpecSyn (Ma et al., 23 Apr 2026) | ACSL contracts and loop invariants | Frama-C/WP plus mutation-based variant discrimination |
| VeriAct (Misu et al., 31 Mar 2026) | JML specifications | OpenJML plus Spec-Harness metrics |
| MSG (Fu et al., 29 Sep 2025) | Move Specification Language clauses | Move Prover, error routing, AST-mutation coverage |

SpecLoop is representative of reconstruction-based validation. Its controller orchestrates specification generation, RTL reconstruction, compilation, formal equivalence checking, and refinement. It distinguishes invalid original RTL, non-compilable reconstructions, functional mismatches, and inconclusive outcomes, and only counterexamples or compiler logs are fed back to the specification generator; formal-equivalence logs are deliberately hidden from the reconstructor so that the specification remains the sole behavioral source (Chang et al., 3 Mar 2026).
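The information-flow discipline described here can be sketched as a small controller. This is a hypothetical skeleton, not SpecLoop's code: the outcome names, the `CheckResult` fields, the callables `generate_spec` and `reconstruct_and_check`, and the choice to stop on inconclusive results are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional, Tuple


class Outcome(Enum):
    INVALID_ORIGINAL = auto()   # the source RTL itself is unusable
    NON_COMPILABLE = auto()     # reconstructed RTL fails to compile
    MISMATCH = auto()           # equivalence check found a counterexample
    INCONCLUSIVE = auto()       # bounded proof neither passed nor failed
    EQUIVALENT = auto()


@dataclass
class CheckResult:
    outcome: Outcome
    compiler_log: str = ""
    counterexample: str = ""


def spec_feedback(result: CheckResult) -> Optional[str]:
    """Only compiler logs and counterexamples reach the spec generator."""
    if result.outcome is Outcome.NON_COMPILABLE:
        return result.compiler_log
    if result.outcome is Outcome.MISMATCH:
        return result.counterexample
    return None


def run_loop(generate_spec: Callable[[Optional[str]], str],
             reconstruct_and_check: Callable[[str], CheckResult],
             max_rounds: int = 3) -> Tuple[str, Outcome, int]:
    """Alternate generation and checking until success or a terminal state."""
    feedback: Optional[str] = None
    spec, outcome = "", Outcome.INCONCLUSIVE
    for round_no in range(1, max_rounds + 1):
        spec = generate_spec(feedback)
        # The reconstructor is handed the specification text and nothing else.
        result = reconstruct_and_check(spec)
        outcome = result.outcome
        if outcome not in (Outcome.NON_COMPILABLE, Outcome.MISMATCH):
            return spec, outcome, round_no
        feedback = spec_feedback(result)
    return spec, outcome, max_rounds


# Stub components standing in for the LLM and the formal tools.
def stub_generate(feedback: Optional[str]) -> str:
    return "spec v1" if feedback is None else "spec v2 (reset is active-low)"


def stub_check(spec: str) -> CheckResult:
    if spec == "spec v1":
        return CheckResult(Outcome.MISMATCH, counterexample="rst=0 at cycle 1")
    return CheckResult(Outcome.EQUIVALENT)


final = run_loop(stub_generate, stub_check)
print(final[1].name, "after", final[2], "rounds")  # EQUIVALENT after 2 rounds
```

Keeping `reconstruct_and_check` a function of the specification text alone is what makes a passing equivalence check evidence about the specification rather than about leaked diagnostics.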

SpecSyn and AutoSpec are representative of decomposition-driven workflows. SpecSyn builds an AST, constructs a dependency graph, computes SCCs via Tarjan’s algorithm, generates specification sketches for each segment, and then iterates between ACSL synthesis, verifier repair, and mutation-based strengthening (Ma et al., 23 Apr 2026). AutoSpec similarly builds an extended call graph whose nodes include functions and loops, traverses it bottom-up, inserts >>> INFILL <<< placeholders, and validates every generated ACSL clause before it can influence later rounds (Wen et al., 2024).
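The decomposition step relies on a standard graph computation. The following sketch implements Tarjan's strongly connected components algorithm on a hypothetical call graph; the graph, the function names in it, and the use of a plain dictionary are illustrative assumptions, not the data structures of either tool. With edges pointing from caller to callee, Tarjan's algorithm emits components callees-first, which matches a bottom-up traversal.

```python
from typing import Dict, List, Set


def tarjan_scc(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return strongly connected components in reverse topological order
    of the condensation (callees before callers for caller->callee edges)."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    sccs: List[List[str]] = []
    counter = 0

    def visit(v: str) -> None:
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:             # tree edge: recurse first
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:            # back edge into the current SCC
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:             # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(sorted(component))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


# Hypothetical call graph: `eval` and `apply` are mutually recursive.
CALLS = {
    "main": ["parse", "eval"],
    "parse": ["next_token"],
    "eval": ["apply"],
    "apply": ["eval"],
    "next_token": [],
}

order = tarjan_scc(CALLS)
print(order)
# [['next_token'], ['parse'], ['apply', 'eval'], ['main']]
```

Mutually recursive functions land in one component and would need their specifications generated together, while every other function is visited only after everything it calls. The recursive formulation is adequate for small graphs; very deep call chains would need an iterative variant to stay within Python's recursion limit.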

MSG shows how the same design pattern adapts to a verifier-centric language. It splits Move specification synthesis into aborts_if, modifies, ensures, and loop-invariant agents, runs them over both inlined and non-inlined contexts, merges outputs with a spec-ensembler, and routes Move Prover diagnostics back to the relevant agent class (Fu et al., 29 Sep 2025). The neuro-symbolic framework for memory-aware C specifications adopts another variant: it generates separation-logic-style function specifications, validates them with QCP and Coq, and then constructs negated Coq examples so that Hammer can machine-check refutations of plausible but incorrect candidates (Zhang et al., 12 Mar 2026).

A distinct but related branch is data-centric generation. MetaCP centralizes protocol design in PSV, an XML data model constrained by a DTD, and uses plugins to export consistent Tamarin, ProVerif, and C++ artifacts from a single structured source (Metere et al., 2021). In this setting, VeriSpecGen is less an iterative repair loop than a typed interpretation pipeline whose correctness is strengthened by multi-target export and target-specific verification plugins.

4. Evaluation infrastructures and quantitative results

Empirical work on VeriSpecGen has expanded from individual tools to full benchmark ecosystems. The traceable-refinement VeriSpecGen framework achieves 86.6% on the VERINA SpecGen task using Claude Opus 4.5, improving over baselines by up to 31.8 points across different model families and scales (Ye et al., 12 Apr 2026). It also produces 343,827 SFT examples from refinement trajectories; training on these trajectories improves specification synthesis by 62-106% in relative terms and transfers gains to general reasoning abilities (Ye et al., 12 Apr 2026).

SpecLoop reports that verification-guided variants consistently outperform Single Round baselines in RR-Score. On VerilogEval, for example, Qwen3-Coder-30B improves from 0.722±0.016 in Single Round to 0.759±0.017 with Pass/Fail feedback and 0.795±0.021 with Full Diagnosis, while DeepSeek-V3.1 improves from 0.865±0.019 to 0.908±0.011 and 0.940±0.003 under the same sequence (Chang et al., 3 Mar 2026). On RTLLM, GLM-4.6 improves from 0.813±0.034 to 0.847±0.009 and 0.880±0.000 (Chang et al., 3 Mar 2026).
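The size of these gains is easier to compare when expressed as absolute and relative differences between Single Round and Full Diagnosis. The short calculation below uses only the mean RR-Scores quoted above and ignores the reported standard deviations; the helper name `gains` is hypothetical.

```python
from typing import Dict, Tuple

# Mean RR-Scores quoted in the text: (Single Round, Full Diagnosis).
RR_SCORES: Dict[str, Tuple[float, float]] = {
    "Qwen3-Coder-30B / VerilogEval": (0.722, 0.795),
    "DeepSeek-V3.1 / VerilogEval": (0.865, 0.940),
    "GLM-4.6 / RTLLM": (0.813, 0.880),
}


def gains(single: float, full: float) -> Tuple[float, float]:
    """Absolute gain in RR-Score and relative gain in percent."""
    absolute = round(full - single, 3)
    relative_pct = round(100 * (full - single) / single, 1)
    return absolute, relative_pct


summary = {name: gains(*scores) for name, scores in RR_SCORES.items()}
for name, (absolute, relative_pct) in summary.items():
    print(f"{name}: +{absolute:.3f} absolute, +{relative_pct}% relative")
# Qwen3-Coder-30B / VerilogEval: +0.073 absolute, +10.1% relative
# DeepSeek-V3.1 / VerilogEval: +0.075 absolute, +8.7% relative
# GLM-4.6 / RTLLM: +0.067 absolute, +8.2% relative
```

The absolute gains are similar across the three configurations, so the relative improvement is largest for the model with the lowest Single Round score.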

SpecSyn reports overall precision above 90% and recall above 75%; with GPT-5 it reaches 96.68% average precision and 75.91% recall, and it successfully handles 1071 out of 1365 target properties for open-source programs (Ma et al., 23 Apr 2026). VeriAct reports a +5% MVR improvement over the best prompt-optimized baselines on SpecGenBench and a +12% MVR improvement on FormalBench using GPT-4o with up to three refinement cycles (Misu et al., 31 Mar 2026). MSG generates fully verifiable Move specifications for 84% of functions (300/357) and produces 57% more verifiable clauses than conventional designs; in an all-in-one ablation, adding Move Prover feedback yields a 30% increase in generated verifiable specifications (Fu et al., 29 Sep 2025).

Benchmark work also shows how evaluation methodology changes conclusions. VeriScale expands Verina test suites by over 83× in VerinaPlus and provides a 14× VerinaLite variant, exposing substantial score drops on both SpecGen and CodeGen that were hidden by sparse original tests (Bai et al., 21 May 2026). VeriContest scales verifiable code-generation evaluation to 946 Rust/Verus tasks and reports that the strongest model reaches 92.18% on natural-language-to-code generation, 48.31% on specification generation, 13.95% on proof generation, and 5.29% end-to-end (Xie et al., 8 May 2026). Verus-SpecGym, focused specifically on autoformalization, reports Pass@1 of 0.778 for Gemini 3.1 Pro on 581 Verus specification-writing tasks, while other frontier models reach 0.511 to 0.578 and OSS models 0.215 to 0.255 (Agarwal et al., 26 May 2026).

5. Failure modes, controversies, and evaluation pitfalls

A major theme in the literature is that verifiability is not equivalent to faithfulness. VeriAct shows a large gap between Verification Rate and Meaningfully Verified Rate: Houdini reaches 86.7% VR on SpecGenBench but collapses to approximately 2% MVR, and on FormalBench it drops from 54.2% VR to 0% MVR (Misu et al., 31 Mar 2026). Its ChangeCase example shows a verifier-accepted specification whose precondition is over-constrained and whose postcondition is under-constrained, illustrating that acceptance by OpenJML can coexist with low PreCorr and PostComp (Misu et al., 31 Mar 2026).

Evaluation papers reinforce the same point with stronger adversarial infrastructure. VeriScale shows that positive tests alone overestimate SpecGen quality, and that adversarial unexpected outputs derived from specification gaming expose underspecification that random perturbations miss (Bai et al., 21 May 2026). Verus-SpecGym reports that LLM-as-a-judge evaluation missed 49 of 191 incorrect specs, or 25.7%, in compile-clean cases (Agarwal et al., 26 May 2026). VeriContest adds Post2Exe as a quality-assurance layer for postcondition completeness and reports that this process found and repaired 60 incomplete postconditions (Xie et al., 8 May 2026).

The failure modes themselves are consistent across domains. SpecLoop identifies reset polarity, count limits, handshakes, precedence rules, compile errors, and bounded-proof inconclusiveness as common failure sources (Chang et al., 3 Mar 2026). Verus-SpecGym emphasizes omitted input assumptions, incorrect outputs accepted by weak postconditions, and correct outputs rejected by overly complex or overly narrow postconditions (Agarwal et al., 26 May 2026). MSG highlights impure or undefined spec functions, non-linear arithmetic, and overfitting risks in Move specs (Fu et al., 29 Sep 2025). The neuro-symbolic C framework notes that passing many examples is not a proof of correctness, even though one successful refutation is definitive (Zhang et al., 12 Mar 2026). The broader controversy is therefore not whether specifications can be generated automatically, but how to measure when a generated specification is strong enough, complete enough, and faithful enough to justify downstream proof or synthesis.

6. Research directions and broader significance

Recent work suggests that VeriSpecGen is evolving along three axes: stronger feedback, stronger evaluation, and stronger training data. The first axis is visible in the shift from pass/fail verification to localized diagnostic feedback, counterexample traces, traceability maps, variant discrimination, and executable specification evaluation (Ye et al., 12 Apr 2026). The second axis is visible in benchmark design: VeriScale strengthens negative tests through adversarial implementations, VeriContest couples specification quality to both formal equivalence and negative-test completeness, and Verus-SpecGym uses official tests plus Codeforces hacks to evaluate soundness and completeness without relying on expert gold specs for every task (Bai et al., 21 May 2026).

The third axis is the use of refinement trajectories as supervision. The Lean-based VeriSpecGen framework distills 343,827 trajectory-derived examples and shows that process-level supervision improves not only specification synthesis but also broader reasoning and coding tasks (Ye et al., 12 Apr 2026). EvoSyn extends this logic from specification refinement to synthetic verifiable data construction: it jointly synthesizes problems, candidate solutions, and executable verification artifacts, and uses a consistency-based evaluator plus Zero-Variance Pruning to assemble non-trivial verifiable training instances for RLVR and distillation (Du et al., 20 Oct 2025). This suggests a plausible convergence between specification synthesis, benchmark construction, and verifiable-data generation.
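One way to read the pruning step is as a filter on how informative each synthesized instance is. The sketch below assumes that Zero-Variance Pruning discards instances whose candidate solutions all receive the same verifier score, since such instances cannot separate good solutions from bad ones; this interpretation, the record layout, and the function name are assumptions for illustration and are not taken from EvoSyn's implementation.

```python
from statistics import pvariance
from typing import Dict, List


def zero_variance_prune(instances: List[Dict]) -> List[Dict]:
    """Keep only instances whose candidate scores are not all identical."""
    kept = []
    for inst in instances:
        scores = inst["scores"]
        # A single candidate, or identical scores, carries no ranking signal.
        if len(scores) > 1 and pvariance(scores) > 0:
            kept.append(inst)
    return kept


# Hypothetical pool: each score is 1 if a candidate passed the instance's
# executable verification artifact and 0 otherwise.
POOL = [
    {"id": "p1", "scores": [1, 1, 1, 1]},   # every candidate passes
    {"id": "p2", "scores": [0, 0, 0, 0]},   # every candidate fails
    {"id": "p3", "scores": [1, 0, 1, 0]},   # candidates are separated
    {"id": "p4", "scores": [1]},            # only one candidate
]

kept_ids = [inst["id"] for inst in zero_variance_prune(POOL)]
print(kept_ids)  # ['p3']
```

Under this reading, an instance where everything passes is likely trivial and one where everything fails is likely unsolvable or has a faulty checker, so neither contributes a useful training signal.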

Open limitations are also stable across the literature. Many systems remain restricted by bounded reasoning, specific languages, or single-module settings. SpecLoop currently uses bounded sequential equivalence at a fixed depth and targets synchronous single-clock RTL (Chang et al., 3 Mar 2026). SpecSyn focuses on C and ACSL, with TCE serving as a lightweight equivalence heuristic (Ma et al., 23 Apr 2026). MetaCP currently supports correctness in ProVerif and executability in Tamarin, with secrecy and authentication planned rather than fully automated (Metere et al., 2021). VeriContest and Verus-SpecGym remain centered on competitive-programming tasks rather than repository-scale specifications (Xie et al., 8 May 2026). Accordingly, VeriSpecGen is best viewed as an emerging verification methodology rather than a solved problem: it has already established that specification synthesis can be materially improved by formal feedback, but it has also shown that measuring semantic adequacy remains at least as difficult as generating syntactically verifiable contracts.