VeriMaAS: Verified Multi-Agent Orchestration
- VeriMaAS is a framework that coordinates LLM-based agents and formal verification modules to tackle complex queries and automate RTL code generation.
- It employs a Plan–Execute–Verify–Replan loop to decompose tasks, execute them in parallel, and adapt workflows based on verification feedback.
- Empirical results show significant improvements in answer completeness and synthesis correctness compared to single-agent and static pipeline methods.
VeriMaAS (Verified Multi-Agent Orchestration System) designates a class of multi-agent frameworks that coordinate both domain-specialized LLM agents and formal verification agents to solve complex queries or synthesis tasks. Distinct implementations include general complex-query orchestration via verification-driven iterative cycles and automated RTL code-generation with integrated EDA feedback. Across both domains, VeriMaAS applies orchestration-layer or formal-verification feedback as a principal coordination signal, enabling adaptive workflow composition, resource-aware execution, and increased assurance of result quality. The framework has demonstrated substantial improvements in answer completeness and synthesis correctness relative to canonical single-agent and pipeline baselines (Zhang et al., 12 Mar 2026, Bhattaram et al., 24 Sep 2025).
1. Architectural Foundations
At its core, VeriMaAS employs a Plan–Execute–Verify–Replan loop (PEVR) for orchestrating specialized LLM-based agents, optionally enhanced by domain-specific formal-verification agents. This iterative architecture comprises discrete phases:
- Plan: Decomposition of an input query into a structured set of sub-questions or tasks, commonly represented as a directed acyclic graph (DAG), or a linear cascade in code-generation settings.
- Execute: Parallel or sequential execution of agent-assigned sub-tasks, subject to explicit dependency constraints and resource budgets.
- Verify: Independent verification of outputs, leveraging either additional LLMs for qualitative/traceability assessment (complex querying) or formal hardware design tools such as Yosys and OpenSTA for concrete pass/fail, property, and PPA metrics (RTL synthesis).
- Replan: Adaptive workflow modification—adding coverage tasks, retrying failed nodes, or transitioning to more sophisticated reasoning agents—driven by verification-stage feedback.
- Synthesize (when applicable): Hierarchical aggregation of results into a final answer or candidate pool.
These stages are governed by orchestration-level stopping conditions, balancing answer quality, computational cost, and resource limits (Zhang et al., 12 Mar 2026, Bhattaram et al., 24 Sep 2025).
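The stages above can be sketched as a minimal control loop. This is an illustrative skeleton, not the framework's actual API: the `Task` structure and the injected `plan`/`execute`/`verify`/`replan`/`synthesize` callables are all hypothetical stand-ins for the corresponding VeriMaAS modules.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical minimal PEVR loop; all names are illustrative.
@dataclass
class Task:
    name: str
    result: Optional[str] = None
    verified: bool = False

def pevr_loop(query, plan, execute, verify, replan, synthesize, max_iters=3):
    tasks = plan(query)                      # Plan: decompose into sub-tasks
    for _ in range(max_iters):               # bounded by a stopping condition
        for t in tasks:
            if t.result is None:
                t.result = execute(t)        # Execute: run each pending task
        for t in tasks:
            t.verified = verify(t)           # Verify: independent check
        failed = [t for t in tasks if not t.verified]
        if not failed:
            break
        tasks += replan(failed)              # Replan: add coverage tasks
        for t in failed:
            t.result = None                  # force re-execution of failures
    return synthesize(tasks)                 # Synthesize: aggregate results
```

The loop terminates either when every task verifies or when the iteration bound (one of the configurable stop conditions) is reached.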
2. Orchestration Logic and Query Decomposition
VeriMaAS query planners formalize decomposition, assigning sub-questions with detailed annotations: agent type, priority, a context-enrichment flag, and explicit verification criteria (Zhang et al., 12 Mar 2026). Formally, the decomposition yields:
- A DAG $G = (V, E)$ where
- $V = \{q_1, \dots, q_n\}$: sub-questions
- $E \subseteq V \times V$: dependency edges, with $(q_i, q_j) \in E$ iff $q_j$ requires output from $q_i$
- A topological ordering $\sigma$ over $V$, maintaining strict dependency order (Zhang et al., 12 Mar 2026).
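A DAG decomposition of this kind can be represented directly with Python's standard-library `graphlib`. The sub-question graph below is an invented example; `static_order` produces a topological ordering that respects the dependency edges.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition into sub-questions q1..q4; each key maps a
# sub-question to the sub-questions whose outputs it requires.
deps = {
    "q1": [],            # independent sub-question
    "q2": [],            # independent sub-question
    "q3": ["q1", "q2"],  # needs outputs of q1 and q2
    "q4": ["q3"],        # needs output of q3
}
order = list(TopologicalSorter(deps).static_order())
# every sub-question appears after all of its dependencies
assert order.index("q3") > order.index("q1")
assert order.index("q4") > order.index("q3")
```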
In RTL design synthesis, a linear cascade of “reasoning operators” (e.g., IO, CoT, ReAct, Self-Refine, Debate) sequentially applies increasingly sophisticated code-generation strategies, while a controller agent selects the current operator and triggers transitions based on formal-verification failure rates (Bhattaram et al., 24 Sep 2025).
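The cascade-escalation logic can be sketched as follows. The operator names come from the source; the `next_operator` function, its threshold, and its signature are illustrative assumptions about how a controller might trigger transitions on failure rates.

```python
# Hypothetical controller step for the linear operator cascade: escalate to
# the next, more sophisticated operator when the formal-verification failure
# rate at the current stage exceeds a threshold. Threshold value is invented.
CASCADE = ["IO", "CoT", "ReAct", "Self-Refine", "Debate"]

def next_operator(current: str, failure_rate: float, threshold: float = 0.5) -> str:
    """Return the operator for the next stage; stay put if results suffice."""
    idx = CASCADE.index(current)
    if failure_rate > threshold and idx + 1 < len(CASCADE):
        return CASCADE[idx + 1]          # escalate to a stronger operator
    return current                       # keep the current operator

assert next_operator("IO", failure_rate=0.8) == "CoT"
assert next_operator("IO", failure_rate=0.2) == "IO"
assert next_operator("Debate", failure_rate=1.0) == "Debate"  # top of cascade
```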
3. Execution and Verification Modules
Dependency-Aware Parallel Execution
The executor launches, in parallel, all currently ready sub-questions (those with every dependency satisfied), up to a configurable concurrency limit (Zhang et al., 12 Mar 2026). Each sub-result is optionally context-enriched by its dependencies. In the RTL synthesis instantiation, LLM-based operators generate pools of candidates per stage, which are then subjected to batch formal verification.
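A dependency-aware executor of this shape can be built from `graphlib` and `concurrent.futures`. This is a sketch under assumptions: `run_dag` and `run_one` are hypothetical names, and each node receives its dependencies' results as context, mirroring the context-enrichment step.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical dependency-aware executor: run every currently ready node
# (all dependencies satisfied) in parallel, wave by wave, capped at a
# configurable concurrency limit.
def run_dag(deps: dict, run_one, max_workers: int = 4) -> dict:
    ts = TopologicalSorter(deps)
    ts.prepare()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while ts.is_active():
            ready = list(ts.get_ready())   # all dependency-free nodes
            # context-enrich each node with its dependencies' results
            futures = {
                n: pool.submit(run_one, n, {d: results[d] for d in deps[n]})
                for n in ready
            }
            for n, f in futures.items():
                results[n] = f.result()
                ts.done(n)                 # unlock downstream nodes
    return results
```

Nodes in the same wave run concurrently; a node only starts once `graphlib` reports it ready, which guarantees the strict dependency ordering.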
LLM-Based and Formal Verification
Verification agents assess results differently across applications:
- Complex Queries: An LLM-based verifier scores each output for completeness and source quality, flags missing aspects, detects contradictions, and issues a recommendation (accept/retry/escalate). The per-output completeness and quality signals are aggregated at the orchestration level to drive adaptive replanning (Zhang et al., 12 Mar 2026).
- RTL Synthesis: Formal-verification agents (Yosys, OpenSTA) return compile pass/fail flags, PPA metrics, and error logs. The controller aggregates stage-wise failure rates and escalates to more sophisticated reasoning operators when configured thresholds are exceeded (Bhattaram et al., 24 Sep 2025).
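The verifier-to-recommendation mapping for complex queries can be illustrated as below. The `Report` structure, score scale, and thresholds are all invented for illustration; only the accept/retry/escalate vocabulary comes from the source.

```python
from dataclasses import dataclass

# Hypothetical verification report: completeness and source-quality scores
# (assumed 1-5 scale) plus flagged missing aspects, mapped to an
# orchestration recommendation. Thresholds are illustrative.
@dataclass
class Report:
    completeness: float      # assumed 1-5 scale
    quality: float           # assumed 1-5 scale
    missing_aspects: list

def recommend(r: Report) -> str:
    if r.completeness >= 4 and r.quality >= 4 and not r.missing_aspects:
        return "accept"
    if r.completeness >= 3:
        return "retry"       # fill gaps with targeted sub-questions
    return "escalate"        # hand off to a stronger reasoning agent

assert recommend(Report(4.5, 4.2, [])) == "accept"
assert recommend(Report(3.2, 2.8, ["market share"])) == "retry"
assert recommend(Report(2.0, 2.0, ["pricing"])) == "escalate"
```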
4. Adaptive Replanning, Control, and Stopping Criteria
When verification signals uncover gaps (e.g., flagged missing aspects, negative recommendations, or high failure rates), VeriMaAS modifies the workflow:
- Complex Query Orchestration: The AdaptiveReplanner injects new sub-questions for flagged missing aspects, retries incomplete nodes, or expands the DAG as required. Previous partial results are preserved and integrated (Zhang et al., 12 Mar 2026).
- RTL Design Synthesis: The controller escalates the operator cascade to more sophisticated strategies or halts and returns the pooled candidates based on failure thresholds (Bhattaram et al., 24 Sep 2025).
Termination is triggered by configurable stop conditions, encompassing synthesis readiness, high-confidence output, diminishing quality improvements, token budget limits, or maximum iteration bounds (Zhang et al., 12 Mar 2026).
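A stop-condition check covering these criteria might look like the following. The function name, parameter names, and all threshold values are illustrative assumptions; the source lists the criteria but not concrete defaults.

```python
# Hypothetical stop-condition check mirroring the configurable termination
# criteria: high-confidence output, diminishing quality improvement, token
# budget exhaustion, or the maximum iteration bound. Thresholds invented.
def should_stop(quality: float, prev_quality: float, tokens_used: int,
                iteration: int, *, quality_target=4.0, min_gain=0.05,
                token_budget=1_000_000, max_iters=5) -> bool:
    return (quality >= quality_target               # high-confidence output
            or quality - prev_quality < min_gain    # diminishing returns
            or tokens_used >= token_budget          # budget exhausted
            or iteration >= max_iters)              # iteration bound

assert should_stop(4.5, 4.0, 0, 0) is True       # quality target reached
assert should_stop(3.5, 3.0, 0, 0) is False      # still improving, keep going
assert should_stop(3.5, 3.49, 0, 0) is True      # diminishing returns
```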
5. Quantitative Outcomes and Quality Analysis
Empirical evaluation spans two primary settings:
Complex Query Orchestration
- Benchmarks on 25 expert-curated market research queries (Performance Analysis, Competitive Intelligence, Financial Investigation, Strategic Assessment) (Zhang et al., 12 Mar 2026).
- Compared methods:
- Single-Agent: Monolithic reasoning and tool use
- Static Pipeline: Fixed agent sequence (RAG→Web→Financial→Analysis→Synthesis)
- VMAO (VeriMaAS): Dynamic DAG, verification-driven replanning, parallel execution
| Method | Completeness | Source Quality | Avg Tokens | Avg Time (s) |
|---|---|---|---|---|
| Single-Agent | 3.1 | 2.6 | 100K | 165 |
| Static Pipeline | 3.5 | 3.2 | 350K | 420 |
| VMAO (VeriMaAS) | 4.2 | 4.1 | 850K | 900 |
VMAO improves completeness by +35% and source quality by +58% over the Single-Agent baseline. Token usage is ×8.5 higher, a cost the authors argue is justified by the quality gains. On Strategic Assessment queries, completeness improves by +53%.
RTL Synthesis
- Gains (absolute pass@k improvement) over strong LLM baselines across multiple models (Bhattaram et al., 24 Sep 2025):
| Model | pass@1 Gain (pp) | pass@10 Gain (pp) |
|---|---|---|
| GPT-4o-mini | +2.45 | +1.98 |
| o4-mini | +0.24 | +0.29 |
| Qwen2.5-7B | +11.72 | +3.96 |
| Qwen2.5-14B | +6.35 | +1.65 |
| Qwen3-14B | +2.81 | +0.11 |
Average uplift: ≈6pp (pass@1), ≈2pp (pass@10). Token-cost overhead is ≤1.1× a single CoT pass, and below that of Self-Refine strategies (2–3×).
Multi-Objective Optimization
- Retuning cost terms to EDA-derived metrics (e.g., Yosys area) enables PPA-aware optimization:
- Area reduction: up to 28.8%
- Delay reduction: up to 21.4%
- pass@10 negligible change (≤1pp)
Supervision cost for controller tuning is an order of magnitude lower than full LLM fine-tuning (|D_tune| = 500 vs. |D_ft| ≈ 5,000–50,000).
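The retuned, PPA-aware cost term can be sketched as a weighted objective. This is a hypothetical formulation: the weights, reference normalizers, and failure penalty are invented, and only the idea of retuning cost to EDA-derived area/delay metrics comes from the source.

```python
# Hypothetical PPA-aware cost: weighted sum of normalized Yosys area and
# OpenSTA delay, with a penalty for candidates that fail verification.
# All constants are illustrative, not values from the framework.
def ppa_cost(area: float, delay: float, passed: bool, *,
             w_area=0.5, w_delay=0.5, fail_penalty=10.0,
             area_ref=100.0, delay_ref=10.0) -> float:
    cost = w_area * area / area_ref + w_delay * delay / delay_ref
    return cost if passed else cost + fail_penalty

# smaller, faster, passing designs are preferred
assert ppa_cost(80, 8, True) < ppa_cost(100, 10, True)
assert ppa_cost(80, 8, False) > ppa_cost(100, 10, True)
```

Swapping `w_area` and `w_delay` shifts the controller's preference along the area/timing trade-off noted in the limitations below.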
6. Systemic Limitations and Trade-offs
Documented limitations include:
- Dependence on open-source EDA (Yosys/OpenSTA, SkyWater 130 nm PDK). Behavior under commercial flows is unexamined.
- Controller policies are static cascades; non-cascaded, adaptive, or RL-based policies may further optimize cost and accuracy.
- Multi-objective control (area vs. timing vs. power/accuracy) is unresolved in RTL settings; trade-offs (e.g., power increases on area minimization) are observed.
- Tuning set size restricts generalization to more complex or larger-scale tasks.
No formal statistical significance testing is reported, though consistent improvements and small human–LLM score drift (±0.5 on <15% of items) are observed (Zhang et al., 12 Mar 2026, Bhattaram et al., 24 Sep 2025).
7. Application Scope and Future Directions
VeriMaAS has established efficacy in both complex expert querying (market research, financial analysis, intelligence, strategy) and hardware synthesis tasks (RTL generation). Its foundational principle—propagating formal or LLM-based verification feedback to inform multi-agent coordination—underpins empirical quality gains, especially in open-ended or correctness-critical tasks.
Future work may focus on adaptive controller policies (e.g., RL, tree search), integration with commercial EDA, and extension to multi-objective or full ASIC-scale design settings (Bhattaram et al., 24 Sep 2025). Expansion of training/tuning datasets is likely required for broader generalization.
VeriMaAS frameworks substantiate that explicit verification modules, orchestrator-driven replanning, and configurable stopping logic collectively provide a scalable route to high-completion, traceable, and formally correct outputs from multi-agent LLM systems. This yields superior empirical performance at the cost of increased but manageable compute and annotation overhead (Zhang et al., 12 Mar 2026, Bhattaram et al., 24 Sep 2025).