VeriStruct: Structural Verification Paradigm
- VeriStruct is a structure-centered verification paradigm that externalizes latent assumptions into explicit objects like dependency graphs and score vectors.
- It unifies techniques across AI-assisted verification, structured reward modeling, transportable differentiable structure learning, and high-assurance compilation.
- The paradigm enhances granularity, efficiency, and correctness by grounding proofs, optimization, and execution in clearly defined structural representations.
VeriStruct denotes a structure-centered verification paradigm in which correctness is attached to an explicit internal representation—such as sub-question decompositions, adjacency matrices, bounded data-structure layouts, dependency graphs, or module-level abstractions—rather than to a single undifferentiated output check. In the recent literature, the name appears directly for AI-assisted verification of data-structure modules in Verus (Sun et al., 28 Oct 2025), and it is also used descriptively for multimodal reward modeling and EDA code generation frameworks that make structure explicit before optimization or execution (Zhang et al., 7 Aug 2025, Jayasuriya et al., 20 Apr 2026). A plausible synthesis is that VeriStruct is not one canonical algorithm, but a family of techniques that convert latent structural assumptions into explicit objects of verification, alignment, or compilation.
1. Terminological scope and family resemblance
The current literature uses “VeriStruct” in more than one technically distinct sense. In one usage, it is the explicit name of a Verus framework for AI-assisted verification of Rust data-structure modules (Sun et al., 28 Oct 2025). In another, StructVRM is described as a “VeriStruct”-style idea because it replaces whole-answer binary verification with structured, sub-question-level scoring that is explicit, learnable, and rewardable (Zhang et al., 7 Aug 2025). In a further usage, the phrase is applied to EDA code generation as verification before execution over an explicit structural dependency graph (Jayasuriya et al., 20 Apr 2026). Closely related work on Verilog generation is framed as a strong candidate for what a “VeriStruct” system would mean in hardware synthesis, namely a structure-aware retrieval-and-prompting pipeline grounded in circuit graphs (Zhao et al., 27 Sep 2025).
Across these usages, the recurring pattern is that a structural object serves as an execution contract, proof interface, or reward carrier. The object may be a score vector over subparts, a consensus adjacency matrix, an array-backed stobj representation, a dependency graph over design objects, or a mathematical view of a module. What changes from paper to paper is the operational role of that structure: in some cases it densifies reinforcement signals, in others it enforces transportability, enables verifying compilation, or localizes repair.
A common misconception is to treat VeriStruct as a single fixed framework. The literature instead suggests a broader methodological category: explicit structure is extracted or imposed, verified against domain constraints, and then propagated into optimization, execution, or proof generation.
2. Structured and verifiable rewards in multimodal reasoning
StructVRM addresses a failure mode of vision-language models (VLMs) on complex, multi-question reasoning tasks: conventional reward mechanisms produce a single binary score for an entire response, which is too coarse when partial correctness matters (Zhang et al., 7 Aug 2025). The paper motivates the method by observing that about 62.88% of questions are directly verifiable, whereas 37.12% are hard-to-verify or non-verifiable, including multi-part fill-in-the-blank, complex reasoning, and open-ended questions. In that regime, whole-answer binary scoring assigns the same zero reward to a completely failed solution and to one that solved most subparts correctly.
The architecture has two components: a multimodal policy model and a model-based verifier. The policy is Seed-StructVRM, trained first by supervised fine-tuning and then by PPO-based reinforcement learning. The verifier is trained on more than 200,000 annotated examples and outputs a structured score rather than a single scalar. Verifier training data are constructed by having multiple VLMs generate candidate answers and then using a strong internal LLM to grade them according to a strict rubric that judges each sub-question independently. The grader inspects only the final answer, not intermediate reasoning, in order to reduce hallucination and label noise.
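The verifier output is a score vector

$$\mathbf{s} = V(\hat{y}, y) = (s_1, s_2, \dots, s_k), \qquad s_j \in \{0, 1\},$$

where $\hat{y}$ is the prediction, $y$ is the reference answer, and each $s_j$ corresponds to the correctness of a sub-question or blank. Each element is binary, but the vector preserves the original decomposition. In downstream RL, the structured output is collapsed into a reward by averaging:

$$r = \frac{1}{k} \sum_{j=1}^{k} s_j,$$

and, for parsed verifier outputs, reward extraction is also described as the number of sub-questions judged correct divided by the total number of sub-questions.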
This mechanism operationalizes partial credit: a model that solves three out of four subparts receives a nonzero reward rather than the same zero reward as a fully incorrect response.
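The contrast between whole-answer and sub-question scoring can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `binary_reward` and `structured_reward` are hypothetical, and the score vector is assumed to arrive already parsed as a list of 0/1 values.

```python
from typing import Sequence


def binary_reward(scores: Sequence[int]) -> float:
    """Whole-answer reward: 1.0 only if every sub-question is correct."""
    return 1.0 if scores and all(s == 1 for s in scores) else 0.0


def structured_reward(scores: Sequence[int]) -> float:
    """Partial-credit reward: mean of the binary sub-question scores."""
    if not scores:
        raise ValueError("score vector must be non-empty")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("each sub-question score must be 0 or 1")
    return sum(scores) / len(scores)


mostly_right = [1, 1, 1, 0]  # three of four subparts solved
all_wrong = [0, 0, 0, 0]

print(binary_reward(mostly_right), structured_reward(mostly_right))  # 0.0 0.75
print(binary_reward(all_wrong), structured_reward(all_wrong))        # 0.0 0.0
```

Under binary scoring the two responses are indistinguishable, whereas the averaged score separates them, which is the denser signal the RL stage relies on.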
A defining feature of the verifier is that it accepts semantic and mathematical equivalence rather than relying on rigid string matching. The rubric awards 1 point if the final answer matches the reference answer, is mathematically equivalent, or is semantically equivalent, and 0 otherwise. This is especially important for open-ended multimodal STEM tasks, where equivalent algebraic forms, reordered expressions, or paraphrased verbal answers are common.
The training pipeline begins with supervised fine-tuning on over 50,000 high-quality multimodal problems with detailed chain-of-thought traces. These traces are generated from multiple internal models, filtered for quality, and selected to encourage long, structured reasoning. The RL stage uses rule-based rewards for cleanly verifiable tasks and the learned verifier for structured, open-ended, or weakly verifiable tasks. The method also uses different KL coefficients for general and verifiable prompts, with a small KL penalty for general prompts and zero for verifiable prompts. Data construction reinforces the same philosophy: “Choice → True/False” converts one multiple-choice item into several independently judgeable statements, aligning the task with sub-question scoring and discouraging shortcut exploitation.
Empirically, Seed-StructVRM is reported as state-of-the-art on 6 out of 12 public multimodal benchmarks: VLM2 Bench, Zerobench (sub), ScienceQA, CMMMU, MME Realworld-en, and RealworldQA. Reported scores include 69.8 on VLM2 Bench, 95.1 on ScienceQA, 73.2 on CMMMU, and 81.6 on RealworldQA. On the newly curated STEM-Bench, the model achieves 79.23 overall, with 72.11 in physics, 77.11 in chemistry, 81.56 in biology, and 86.15 in math, leading in physics, chemistry, and biology and placing second in math. The chemistry free-form question result, 41.22 versus 23.78 for a strong baseline, is particularly consistent with the claim that structured verification is beneficial for multi-part, open-ended scientific reasoning. Ablation results further show that removing StructVRM reduces STEM-Bench performance from 79.23 to 76.66, and removing RL reduces it to 75.47.
3. Transportable differentiable structure learning
In differentiable DAG discovery, the structurally analogous problem is not sub-question credit assignment but transportability across datasets from the same domain. D-Struct studies DAG structure learning over a domain $\mathcal{X}$, where one seeks a graph $G$ from observational data, and emphasizes that the number of possible DAGs grows super-exponentially; for 10 variables there are more than $4 \times 10^{18}$ possible DAGs (Berrevoets et al., 2022). The paper’s central criticism of NOTEARS is that it makes DAG learning differentiable but does not ensure that datasets from the same variable domain yield the same recovered structure.
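Transportability is defined over multiple datasets

$$\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_K$$

drawn over the same variable domain, with the requirement that the structures learned from them coincide:

$$\hat{G}_k = \hat{G}_l \quad \text{for all } k, l \in \{1, \dots, K\}.$$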
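This property matters in scenarios such as the paper’s hospital example, where marginal distributions may differ across sites although the underlying biological graph is presumed to be the same. NOTEARS optimizes

$$\min_{W \in \mathbb{R}^{d \times d}} \; \frac{1}{2n} \lVert X - XW \rVert_F^2 + \lambda \lVert W \rVert_1 \quad \text{subject to } h(W) = 0,$$

with smooth acyclicity constraint

$$h(W) = \operatorname{tr}\left(e^{W \circ W}\right) - d = 0,$$

where $X \in \mathbb{R}^{n \times d}$ is the data matrix, $W$ is the weighted adjacency matrix over $d$ variables, and $\circ$ denotes the elementwise product.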
Because each dataset is optimized independently, and because gradient-based training can converge to different local minima under different samples or initializations, the method provides no architectural mechanism that forces agreement across datasets.
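The smooth acyclicity function used by NOTEARS, $h(W) = \operatorname{tr}(e^{W \circ W}) - d$, is zero exactly when the weighted graph has no directed cycles. The sketch below evaluates it in pure Python with a truncated power series for the matrix exponential; the helper names and the truncation length `terms` are illustrative choices, not taken from any library.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w: Matrix, terms: int = 30) -> float:
    """h(W) = tr(exp(W∘W)) - d, with exp approximated by a truncated series."""
    d = len(w)
    m = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]  # W ∘ W
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    trace_sum = float(d)  # contribution of the identity term
    factorial = 1.0
    for k in range(1, terms + 1):
        power = matmul(power, m)      # (W∘W)^k
        factorial *= k
        trace_sum += sum(power[i][i] for i in range(d)) / factorial
    return trace_sum - d


chain = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]  # 1 -> 2 -> 3
cycle = [[0.0, 1.0], [1.0, 0.0]]                             # 1 <-> 2
print(acyclicity(chain))      # 0.0
print(acyclicity(cycle) > 0)  # True
```

For an acyclic graph every power of $W \circ W$ has zero trace, so the series collapses to zero; any cycle contributes a strictly positive term, which is what makes the constraint usable inside gradient-based optimization.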
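D-Struct is presented as the first transportable differentiable structure learner. It runs $K$ parallel DSFs

$$\mathrm{DSF}_k : \mathcal{D}_k \mapsto A_k, \qquad k = 1, \dots, K,$$

coupled by an explicit transportability regularizer. For each learner, there is a local DSF loss together with an MSE alignment term

$$\mathcal{L}_{\mathrm{MSE}}(A_k) = \frac{1}{d^2} \lVert A_k - \bar{A} \rVert_F^2, \qquad \bar{A} = \frac{1}{K} \sum_{l=1}^{K} A_l.$$

The full objective is

$$\mathcal{L} = \sum_{k=1}^{K} \left[ \mathcal{L}_{\mathrm{DSF}}(A_k, \mathcal{D}_k) + \alpha \, \mathcal{L}_{\mathrm{MSE}}(A_k) \right],$$

where $A_k$ denotes the adjacency matrix learned by the $k$-th DSF, $\bar{A}$ is the mean adjacency matrix, and $\alpha \geq 0$ weights the alignment term.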
The alignment term is zero only when all learned adjacency matrices agree, which is the formal mechanism by which transportability is encouraged.
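A minimal sketch of the alignment idea follows, assuming the alignment term is the mean squared difference between each learner's adjacency matrix and the average adjacency matrix. The function names are hypothetical, and plain Python lists stand in for tensors, so there is no gradient tracking here; in a real implementation the mean would be treated as a constant when differentiating.

```python
from typing import List

Matrix = List[List[float]]


def mean_adjacency(mats: List[Matrix]) -> Matrix:
    """Elementwise mean of the adjacency matrices from all parallel learners."""
    k, d = len(mats), len(mats[0])
    return [[sum(m[i][j] for m in mats) / k for j in range(d)] for i in range(d)]


def alignment_loss(a: Matrix, a_bar: Matrix) -> float:
    """Mean squared difference between one learner's matrix and the mean."""
    d = len(a)
    return sum((a[i][j] - a_bar[i][j]) ** 2
               for i in range(d) for j in range(d)) / (d * d)


a1 = [[0.0, 1.0], [0.0, 0.0]]  # learner 1 found the edge 1 -> 2
a2 = [[0.0, 0.0], [0.0, 0.0]]  # learner 2 found no edge
a_bar = mean_adjacency([a1, a2])
print([alignment_loss(a, a_bar) for a in (a1, a2)])  # [0.0625, 0.0625]

agree = mean_adjacency([a1, a1])
print(alignment_loss(a1, agree))  # 0.0
```

The penalty vanishes only when every learner matches the mean, which can only happen when all learners return the same matrix.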
A notable design constraint is that differentiability is preserved. D-Struct retains the smooth acyclicity constraint, uses the differentiable MSE alignment penalty between each learner's adjacency matrix and the mean adjacency matrix, and detaches the mean adjacency when computing gradients before backpropagating through each parallel DSF. The paper gives NOTEARS-MLP as a concrete implementation, trained with augmented-Lagrangian or dual-ascent optimization, and notes that prior knowledge about zero edges or known independencies can be enforced via bounds in L-BFGS-B.
The framework is also adapted to a single-dataset setting by constructing $K$ subsets $\mathcal{D}_1, \dots, \mathcal{D}_K$ intended to mimic distinct distributions. The subsampling routine sorts or reindexes the data so indices correlate with covariate values, defines $K$ one-dimensional beta distributions over indices, and samples indices using Bernoulli draws from normalized beta PDFs. This targeted subsampling is reported to outperform random splitting.
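The subsampling routine can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the paper's procedure: the beta shape parameters, the normalization of each PDF by its maximum, and the names `beta_pdf` and `beta_subsets` are all illustrative choices.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at x in (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def beta_subsets(rows: Sequence, key: Callable,
                 shapes: Sequence[Tuple[float, float]], seed: int = 0) -> List[list]:
    """Build one subset per (a, b) pair via Bernoulli draws over sorted indices."""
    rng = random.Random(seed)
    ordered = sorted(rows, key=key)      # indices now correlate with the covariate
    n = len(ordered)
    positions = [(i + 0.5) / n for i in range(n)]  # strictly inside (0, 1)
    subsets = []
    for a, b in shapes:
        density = [beta_pdf(x, a, b) for x in positions]
        peak = max(density)              # assumption: normalize so probabilities <= 1
        subsets.append([row for row, p in zip(ordered, density)
                        if rng.random() < p / peak])
    return subsets


data = list(range(1000))
low, high = beta_subsets(data, key=lambda r: r, shapes=[(2.0, 5.0), (5.0, 2.0)])
print(sum(low) / len(low) < sum(high) / len(high))  # True
```

Each subset over-represents a different region of the sorted data, so the subsets differ in their marginal distributions while sharing the same underlying variables.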
Evaluation is conducted on synthetic Erdős–Rényi (ER) and Scale-Free graphs using structural Hamming distance (SHD), FPR, TPR, and FDR. For transportability specifically, the paper measures SHD between structures learned from different subsets or datasets. D-Struct is reported to produce much more consistent graphs than vanilla NOTEARS, often achieving SHD = 0 between subgraphs in the ER transportability experiment. It also improves SHD, FPR, TPR, and FDR over NOTEARS-MLP and NOTEARS-SOB. Despite running multiple learners, it is reported to converge faster, by up to about 20×, with an average speedup of around 10× reported in the appendix. The paper is explicit, however, that this improves structure learning, not causal identifiability by itself.
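As a reference point for the transportability metric, the following sketch computes a structural Hamming distance between two adjacency matrices. It assumes the common convention in which a missing, extra, or reversed edge each counts once; the paper's exact variant may differ.

```python
from typing import List

Matrix = List[List[int]]


def shd(a: Matrix, b: Matrix) -> int:
    """Count vertex pairs whose edge status (absent, i->j, j->i) differs."""
    d = len(a)
    distance = 0
    for i in range(d):
        for j in range(i + 1, d):
            status_a = (a[i][j] != 0, a[j][i] != 0)
            status_b = (b[i][j] != 0, b[j][i] != 0)
            if status_a != status_b:
                distance += 1
    return distance


g1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # 1 -> 2 -> 3
g2 = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]  # 1 -> 2 <- 3 (one reversed edge)
print(shd(g1, g1), shd(g1, g2))  # 0 1
```

An SHD of 0 between graphs learned on different subsets is the outcome that the transportability experiments look for.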
4. Verifying compilation for high-assurance data structures
A much earlier, and more classically formal, line of work uses VeriStruct in the context of high-assurance systems design with ACL2. The problem is the mismatch between correctness proofs over high-level, unbounded, functional data structures and the bounded, array-based implementations required by autonomous and semi-autonomous systems under strict time and space budgets (Hardin et al., 2018). The paper emphasizes that autonomy algorithms such as route planning, pattern matching, and inference depend on graphs and algebraic data types, while accreditation-oriented design rules discourage dynamic memory allocation and require predictable finite-time execution.
The proposed solution is a verifying compilation technique that preserves the “natural” functional proof style while targeting more efficient implementations. ACL2 is used because it supports reasoning about aggregate data structures of arbitrary size, efficient execution of formal specifications, single-threaded objects (stobjs), automated inductive proving, termination checking, and a guard-based discipline. The central idea is to reason about a logical functional object while implementing it with destructive updates “under the hood,” thereby avoiding repeated copying and garbage generation.
Compilation proceeds by a source-to-source translation from a high-level language with algebraic datatypes and graph types into an array-based representation. The conceptual sequence is: prove properties of the high-level logical function, prove that the compiler preserves semantics during lowering, and then transport the correctness result to the compiled code. This organization is what allows proofs to remain recursive and datatype-oriented rather than being expressed directly over mutable arrays.
A key mechanism is decompilation into logic, inspired by Myreen. The paper gives a theorem schema connecting imperative execution to a pure footprint function $f$, so that if code execution from state $s$ reaches state $s'$, then the outputs in $s'$ equal $f$ applied to the inputs read from $s$, and the final state is the initial state updated at the designated outputs. The footprint function is automatically synthesized from the code, and the decompilation theorem is proved automatically, assuming termination. The same mechanism is used to turn a specification into a theorem obligation, so that properties can be reasoned about outside the operational semantics.
The target representation is a bounded, flat, array-based layout. For a graph with at most $N$ vertices and at most $M$ outgoing edges per vertex, the store is represented as

$$S = (V, D, E, W, h, t, n),$$

where $V$ is a vertex array whose length is fixed by $N$, $D$ is a per-vertex data array, $E$ is an edge array whose length is fixed by $N$ and $M$, $W$ stores edge weights or labels, $h$ and $t$ are head and tail indices, and $n$ is the number of non-zero vertices. A notable invariant is that 0 denotes null for both vertex and edge indices, and the zeroth entry in each array is explicitly allocated and proven invariant under mutation.
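To make the bounded layout concrete, here is a small Python sketch of a fixed-capacity graph store in which index 0 is reserved as null. The class name, field names, and the exact array sizes are assumptions for illustration; the paper's ACL2 stobj layout is not reproduced here.

```python
class BoundedGraph:
    """Fixed-capacity graph; all arrays are allocated once and never grow."""

    NULL = 0  # index 0 is reserved and never used as a real vertex

    def __init__(self, max_vertices: int, max_out_edges: int) -> None:
        self.n_max = max_vertices
        self.m_max = max_out_edges
        self.in_use = [False] * (max_vertices + 1)
        self.vertex_data = [0] * (max_vertices + 1)
        # One row of max_out_edges slots per vertex, row 0 stays untouched.
        self.edge_target = [self.NULL] * ((max_vertices + 1) * max_out_edges)
        self.edge_weight = [0] * ((max_vertices + 1) * max_out_edges)
        self.vertex_count = 0

    def _check(self, v: int) -> None:
        if not 1 <= v <= self.n_max:
            raise IndexError(f"vertex {v} outside 1..{self.n_max}")

    def add_vertex(self, v: int, data: int) -> None:
        self._check(v)
        if not self.in_use[v]:
            self.in_use[v] = True
            self.vertex_count += 1
        self.vertex_data[v] = data

    def add_edge(self, u: int, v: int, weight: int) -> None:
        self._check(u)
        self._check(v)
        base = u * self.m_max
        for slot in range(base, base + self.m_max):
            if self.edge_target[slot] == self.NULL:
                self.edge_target[slot] = v
                self.edge_weight[slot] = weight
                return
        raise OverflowError(f"vertex {u} already has {self.m_max} outgoing edges")

    def successors(self, u: int) -> list:
        self._check(u)
        base = u * self.m_max
        return [(self.edge_target[s], self.edge_weight[s])
                for s in range(base, base + self.m_max)
                if self.edge_target[s] != self.NULL]


g = BoundedGraph(max_vertices=3, max_out_edges=2)
for v in (1, 2, 3):
    g.add_vertex(v, data=v * 10)
g.add_edge(1, 2, weight=5)
g.add_edge(1, 3, weight=7)
print(g.successors(1))  # [(2, 5), (3, 7)]
```

Because capacity is fixed up front and updates are in place, memory use and per-operation cost are bounded, which is the property that accreditation-oriented design rules ask for.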
The paper gives concrete ACL2 realizations. A binary search tree is implemented as a stobj with arrays and counters, and the lookup routine getVal is recursive but augmented with a count parameter to satisfy ACL2’s termination discipline. A macro restores a convenient external interface. Graphs are also treated as first-class datatypes, introduced through a graph type declaration paired with a bounded declaration that fixes the maximum sizes.
A depth-first search example maintains a graph G, a binary search tree spanning_tree that serves both as a parent map and a visited-set test, and a fringe list of (vertex, predecessor) pairs.
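The depth-first search described above can be sketched in Python as follows. A dictionary stands in for the binary search tree (an assumption made for brevity), and `None` marks the root's missing predecessor; the function name is hypothetical.

```python
from typing import Dict, List, Optional


def dfs_spanning_tree(graph: Dict[int, List[int]], root: int) -> Dict[int, Optional[int]]:
    """Return a parent map; membership in the map doubles as the visited test."""
    spanning_tree: Dict[int, Optional[int]] = {}
    fringe = [(root, None)]  # stack of (vertex, predecessor) pairs
    while fringe:
        vertex, predecessor = fringe.pop()
        if vertex in spanning_tree:   # already visited via another path
            continue
        spanning_tree[vertex] = predecessor
        # Push successors in reverse so the first listed one is explored first.
        for successor in reversed(graph.get(vertex, [])):
            if successor not in spanning_tree:
                fringe.append((successor, vertex))
    return spanning_tree


G = {1: [2, 3], 2: [4], 3: [4], 4: []}
print(dfs_spanning_tree(G, 1))  # {1: None, 2: 1, 4: 2, 3: 1}
```

Using one structure for both the parent map and the visited set keeps the state small, which matters when every structure must fit a preallocated bound.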
The methodological significance is not merely that proofs exist, but that they attach to efficient bounded implementations suitable for code generation to mainstream languages, GPU execution, and hardware synthesis. The paper states that the array-based form scales to millions of vertices and can handle tens of edges per vertex, which situates VeriStruct in a high-assurance compilation lineage rather than only in contemporary LLM-assisted verification.
5. Pre-execution structural verification in EDA code generation
In LLM-based EDA automation, the central observation is that failures are usually structural rather than syntactic. A generated program may be valid Python yet still violate the design hierarchy, miss prerequisites, or invoke APIs on the wrong receiver types. The proposed VeriStruct framework represents each task as a structural dependency graph

$$G = (V, E),$$

where nodes represent typed design objects, intermediate targets, conditions, and actions, while edges encode acquisition relations, dependency relations, and valid traversal steps over the design hierarchy (Jayasuriya et al., 20 Apr 2026).
The graph acts as an explicit execution contract. It is built in two phases: structured LLM extraction proposes a candidate graph with typed acquisition paths and an action node, and deterministic validation checks the graph against a structured OpenROAD API schema. Validation determines whether nodes are real, missing-but-real, or hallucinated, and whether edges correspond to valid transitions. Violations are converted into structured feedback used to refine the graph.
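The deterministic validation phase can be pictured with a toy example. The schema below is invented for illustration and is not the real OpenROAD API schema; the type names, the allowed transitions, and the report format are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical schema: known object types and valid acquisition steps.
SCHEMA_TYPES = {"Design", "Block", "Inst", "Net"}
SCHEMA_EDGES = {("Design", "Block"), ("Block", "Inst"), ("Block", "Net")}


def validate_graph(nodes: Iterable[str],
                   edges: List[Tuple[str, str]]) -> Dict[str, list]:
    """Classify nodes and edges of a candidate graph against the schema."""
    declared = set(nodes)
    report: Dict[str, list] = {
        "hallucinated": [], "missing_but_real": [], "invalid_edges": []}
    for node in sorted(declared):
        if node not in SCHEMA_TYPES:
            report["hallucinated"].append(node)
    # Nodes used by an edge but never declared in the candidate graph.
    referenced = {end for edge in edges for end in edge}
    for node in sorted(referenced - declared):
        key = "missing_but_real" if node in SCHEMA_TYPES else "hallucinated"
        report[key].append(node)
    for edge in edges:
        if edge not in SCHEMA_EDGES:
            report["invalid_edges"].append(edge)
    return report


candidate_nodes = ["Design", "Block", "Gizmo"]
candidate_edges = [("Design", "Block"), ("Block", "Inst"), ("Design", "Net")]
print(validate_graph(candidate_nodes, candidate_edges))
```

The returned report is the kind of structured feedback that could be handed back to the extraction step to refine the graph before any code is generated.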
The synthesis framework has four stages: graph construction from the prompt; graph-conditioned retrieval and constrained generation; staged pre-execution verification with diagnosis-driven repair; and extension to multi-step execution with trajectory-level reflection. Retrieval is localized to graph edges or substructures rather than to the prompt as an undifferentiated text query. This fetches API usage patterns, code fragments, and evidence relevant to a specific dependency, which reduces irrelevant context and recovers missing intermediate operations.
Verification is organized as a four-layer pipeline with a failure-layer indicator $\ell_{\text{fail}}$ that records the first layer to reject the program; a designated value of $\ell_{\text{fail}}$ means that all layers pass. Layer 1 performs syntax checking via AST parsing. Layer 2 checks causal flow, including acquisition order, valid attribute access, and nullable returns. Layer 3 checks API alignment, including hallucinated calls and wrong receiver or object usage. Layer 4 is semantic and uses an LLM judge to evaluate completeness, control flow, and output validity; it is only applied after layers 1–3 succeed.
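A staged verifier of this shape can be sketched with Python's standard `ast` module. Only the syntax layer and a simplified API-alignment layer are implemented; the causal-flow layer is a placeholder, the LLM-judge layer is omitted, and the allow-list of method names is hypothetical. Returning `None` when every layer passes is also a choice made here, not the paper's encoding.

```python
import ast
from typing import Callable, List, Optional, Set, Tuple

Layer = Callable[[str], List[str]]


def check_syntax(code: str) -> List[str]:
    """Layer 1: the program must parse."""
    try:
        ast.parse(code)
        return []
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]


def check_causal_flow(code: str) -> List[str]:
    """Layer 2 placeholder: a real check would track acquisition order."""
    return []


def make_api_check(allowed_methods: Set[str]) -> Layer:
    """Layer 3 (simplified): flag method calls outside an allow-list."""
    def check(code: str) -> List[str]:
        tree = ast.parse(code)
        return [f"unknown method: {node.func.attr}"
                for node in ast.walk(tree)
                if isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr not in allowed_methods]
    return check


def first_failing_layer(code: str, layers: List[Layer]) -> Tuple[Optional[int], List[str]]:
    """Run layers in order and stop at the first one that reports issues."""
    for index, layer in enumerate(layers, start=1):
        issues = layer(code)
        if issues:
            return index, issues
    return None, []


layers = [check_syntax, check_causal_flow, make_api_check({"get_block"})]
print(first_failing_layer("b = design.get_block(", layers)[0])                # 1
print(first_failing_layer("b = design.get_block()\nb.teleport()", layers))
print(first_failing_layer("b = design.get_block()", layers))                  # (None, [])
```

Stopping at the first failing layer keeps diagnostics localized, and cheap deterministic checks run before any expensive semantic judgment.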
Repair is diagnosis-driven rather than tool-driven. The verifier returns a structured diagnostic report describing the failure stage, issues, and localized inconsistencies. The controller may choose regeneration with targeted hints, edge re-retrieval, graph re-extraction, or acceptance if minor issues remain but semantic validity is sufficient. A deterministic safeguard prevents looping on ineffective repairs. The design principle is that all of this occurs before calling the tool, so the framework requires exactly one OpenROAD execution per task.
The paper also introduces uncertainty scoring for verifier-pass programs, which combines code-level, trajectory-level, and verification-coverage components. Code-level uncertainty is derived from hallucination signals such as method hallucination, invalid imports, and enum hallucination; trajectory-level uncertainty aggregates convergence uncertainty, stagnation, and repair ineffectiveness; verification-coverage uncertainty measures how much of the program's API usage the verifier did not actually check.
This reflects the paper’s view that a verifier pass can be misleading if important calls were never actually checked.
The empirical results are framed against tool-in-the-loop debugging. On single-step tasks, pass rate improves from 73.0% for LLM+RAG and 76.0% for tool-in-the-loop to 82.5% for the full VeriStruct pipeline. Tool usage falls to 1.00 calls/task, compared with 1.77 for tool-in-the-loop and 3.54 for OpenROAD Agent; total tool calls are 120, 248, and 496, respectively, which the paper summarizes as a reduction of more than 2× relative to the tool-in-the-loop baseline. Latency is 34.8 s for VeriStruct versus 53.0 s and 70.0 s for the two baselines. On multi-step tasks, pass rate rises from 30.0% for LLM+RAG to 70.0%, and then to 84.0% with trajectory-level reflection. Uncertainty-aware filtering reduces verifier false positives from 20.0% to 6.7% and improves precision from 80.0% to 93.3%.
6. AI-assisted verification of data-structure modules in Verus
The most literal contemporary use of the name is “VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus” (Sun et al., 28 Oct 2025). The framework extends AI-assisted verification from single functions to full data-structure modules, taking as input Rust source code and a unit test suite and automatically generating the annotations needed by Verus: abstractions or views, type invariants, specifications, and proof code.
Its outer workflow is explicitly two-stage. VerifyModule(code, test) first invokes the generation stage GenAnnos, then the repair stage RepairAnnos. In generation, a planner analyzes the task and decides which annotation-generation modules to execute. The four available modules are M1: View module, M2: Type invariant module, M3: Specifications module, and M4: Proof blocks module. The typical sequence is
$$M_1 \rightarrow M_2 \rightarrow M_3 \rightarrow M_4,$$

but the planner may skip modules. The planner guidance rules are specific: invoke M1 if the data structure can be represented by a sequence, set, or map; invoke M2 when there are non-trivial relationships among fields; and if M2 is invoked, also execute M4 so type invariants are available in the proof context.
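The planner guidance can be written down as a small rule function. This is a sketch only: in the paper the planner is an LLM-driven component, whereas here the decisions are passed in as booleans, and treating the Specifications module as always selected and `needs_proofs` as a separate input are assumptions.

```python
from typing import List


def plan_modules(has_collection_view: bool,
                 has_field_relationships: bool,
                 needs_proofs: bool = False) -> List[str]:
    """Select annotation modules in the typical M1 -> M2 -> M3 -> M4 order."""
    plan = []
    if has_collection_view:          # representable as a sequence, set, or map
        plan.append("M1: View")
    if has_field_relationships:      # non-trivial relationships among fields
        plan.append("M2: Type invariant")
    plan.append("M3: Specifications")  # assumption: always generated
    # If M2 runs, M4 must run too so invariants reach the proof context.
    if has_field_relationships or needs_proofs:
        plan.append("M4: Proof blocks")
    return plan


print(plan_modules(True, True))
# ['M1: View', 'M2: Type invariant', 'M3: Specifications', 'M4: Proof blocks']
print(plan_modules(True, False))
# ['M1: View', 'M3: Specifications']
```

A ring buffer, for example, has both a sequence-like view and relationships among head, tail, and length, so all four modules would be selected.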
The View module constructs a mathematical abstraction suitable for reasoning in Verus. For the ring buffer example, the successful abstraction is a pair consisting of a Seq<T> of logical contents and the capacity, expressed through type V = (Seq<T>, usize). The paper explicitly contrasts this with an overly concrete but syntactically valid view such as (Seq<T>, nat, nat), which exposes implementation details like head and tail rather than abstracting them away. To address that failure mode, VeriStruct includes a view refinement step that asks the model to synthesize a more abstract, minimal, and logically coherent view.
The Type invariant module generates properties that must hold for all objects of the data structure. In the ring buffer case, the invariant ensures that head and tail are in bounds and that the ring length is positive. When type invariants are present, proofs may require use_type_invariant(&self); so those facts are available in the proof context. The Specifications module generates spec functions together with requires and ensures, and it explicitly instructs the model to use old(...) when referring to pre-state values. The Proof blocks module constructs proof blocks, loop invariants, and supporting lemma invocations, while also injecting type invariants when required.
Repair is central because LLMs often misunderstand Verus annotation syntax and verification-specific semantics. The paper gives a repair loop that repeatedly calls the verifier, inspects the resulting error, selects a repair module by pattern, and resamples until verification succeeds or the repair budget is exhausted. The repair modules cover misuse between specification and executable functions, inconsistencies with Rust mutability, precondition and postcondition violations, arithmetic overflow or underflow, logical-versus-Rust type mismatches, test assertion failures, and a fallback default module.
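The control flow of such a repair loop can be sketched as follows. The verifier and the repair modules are stand-in callables, matching errors by substring is an assumption, and the toy repair is deterministic whereas the paper's modules resample from an LLM. The name `is_full_spec` is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Verifier = Callable[[str], Optional[str]]   # returns an error message or None
Repair = Callable[[str, str], str]          # (code, error) -> new code


def repair_annotations(code: str, verify: Verifier,
                       repair_modules: List[Tuple[str, Repair]],
                       default_repair: Repair, budget: int) -> Tuple[str, bool]:
    """Verify, pick a repair module by error pattern, and retry within budget."""
    for _ in range(budget):
        error = verify(code)
        if error is None:
            return code, True
        repair = default_repair
        for pattern, module in repair_modules:
            if pattern in error:
                repair = module
                break
        code = repair(code, error)
    return code, verify(code) is None


def toy_verify(code: str) -> Optional[str]:
    """Stand-in verifier: rejects an executable call inside an annotation."""
    if "old(self).is_full()" in code:
        return "mode error: exec function called in spec"
    return None


def toy_mode_repair(code: str, error: str) -> str:
    """Stand-in mode repair: swap in a specification-only counterpart."""
    return code.replace("old(self).is_full()", "old(self).is_full_spec()")


source = "ensures !old(self).is_full() ==> self.len() == old(self).len() + 1"
fixed, ok = repair_annotations(source, toy_verify,
                               [("mode error", toy_mode_repair)],
                               default_repair=lambda c, e: c, budget=3)
print(ok)     # True
print(fixed)
```

Dispatching on the error category keeps each repair prompt narrow, and the budget guarantees that the loop terminates even when no module makes progress.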
A representative semantic error is calling an executable function inside an annotation. The paper’s example uses old(self).is_full() in an ensures clause, which Verus rejects because executable functions cannot be called from annotations. VeriStruct addresses this with a mode-repair module that may synthesize a specification-only counterpart of the function, which the annotation can then call in place of the executable one.
For failing test assertions, the framework goes beyond local syntax repair and performs a form of interprocedural analysis: it identifies the method immediately preceding the failing assertion and strengthens that method’s postcondition so the test becomes provable.
The system embeds syntax guidance directly into prompts, drawing on the Verus tutorial and standard library documentation. Prompts are structured into blocks covering objective, Verus guidelines, step-by-step instructions, examples, code, and tests. Sampling is used throughout: each module invocation draws several samples and keeps the one with the maximum number of verified functions, and repair runs under a fixed iteration budget.
Evaluation covers 11 Rust data-structure modules and 129 functions in total. The reported result is that VeriStruct solves 10 out of 11 modules and verifies 128 out of 129 functions, i.e. 99.2%. The baseline solves 4 modules and verifies 52 functions. The only unsolved benchmark is Node, although even there VeriStruct verifies 11 out of 12 functions. A notable qualitative observation is the Bitmap case, where the system discovers a simpler single-array abstraction rather than the human-written two-dimensional block-based abstraction, indicating that automated abstraction synthesis need not reproduce the original proof design to be correct.
7. Related extensions, misconceptions, and broader significance
A related extension appears in VeriGRAG, which targets LLM-based Verilog generation by explicitly modeling hardware structure rather than treating Verilog as plain text (Zhao et al., 27 Sep 2025). The framework extracts data-path graphs from Verilog using Yosys, encodes them with GINEConv, retrieves structurally relevant graphs via a multimodal retriever trained with knowledge distillation, and converts the retrieved graph embeddings into structure-aware soft prompts through VeriFormer. Training data come from OriGen and PyraNet, cleaned to 276,627 examples, and evaluation is conducted on VerilogEval and RTLLM. Reported results include, for VeriGRAG-Qwen2.5-Coder-14B, 82.9 pass@1 and 89.4 pass@5 on VerilogEval-Machine, 59.7 pass@1 and 68.6 pass@5 on VerilogEval-Human, and 62.2 pass@1 and 69.5 pass@5 on VerilogEval v2 Function. The paper’s own framing is that this is close to what a VeriStruct system would mean for hardware code generation: structure is represented as graphs, retrieved as graphs, and injected into generation as learned prompts.
Taken together, these lines of work indicate that VeriStruct should not be reduced to any one of three common but narrow interpretations. It is not only a formal-methods term, because in StructVRM the central object is a structured reward signal used in PPO rather than a proof artifact. It is not only a code-generation term, because in D-Struct the main concern is consensus over adjacency matrices across datasets, not synthesis. Nor is it equivalent to exact matching or brittle rule checking, because several of the systems are explicitly motivated by the failure of coarse binary or syntactic verification and instead accept semantic equivalence, mathematical equivalence, or graph-level consistency.
The broader significance of the family lies in a shared methodological move: hidden structure is externalized into an explicit, machine-checkable object, and that object mediates learning, inference, compilation, or proof. In multimodal reasoning, this yields denser rewards and better credit assignment. In differentiable DAG discovery, it restores transportability without abandoning differentiability. In ACL2-based high-assurance compilation, it links functional proofs to bounded array implementations. In EDA, it decouples correctness from repeated tool invocation by enforcing dependency contracts before execution. In Verus, it organizes module-level abstraction, invariant synthesis, and repair. This suggests a general research direction in which structural representations are not mere auxiliary metadata but first-class verification interfaces.