RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation

Published 27 Apr 2026 in cs.SE and cs.AI | (2604.24218v1)

Abstract: As the complexity of System-on-Chip (SoC) designs grows, the shift-left paradigm necessitates the rapid development of high-fidelity reference models (typically written in SystemC) for early architecture exploration and verification. While LLMs show promise in code generation, their application to hardware modeling faces unique challenges: (1) Rigid, static workflows fail to adapt to varying design complexity, causing inefficiency; (2) Context window overflow in multi-turn interactions leads to catastrophic forgetting of critical specifications; and (3) the Coupled Validation Failure problem--where generated Testbenches (TBs) incorrectly validate flawed models due to correlated hallucinations--severely undermines reliability. To address these limitations, we introduce RefEvo, a dynamic multi-agent framework designed for agile and reliable reference modeling. RefEvo features three key innovations: (1) A Dynamic Design Planner that autonomously decomposes design specifications and constructs tailored execution workflows based on semantic complexity; (2) A Co-Evolutionary Verification Mechanism, which employs a Dialectical Arbiter to simultaneously rectify the model and verification logic against the specification (Spec) oracle, effectively mitigating false positives; and (3) A Spec Anchoring Strategy for lossless context compression. Evaluated on a diverse benchmark of 20 hardware modules, RefEvo achieves a 95% pass rate, outperforming static baselines by a large margin. Furthermore, our context optimization reduces token consumption by an average of 71.04%, achieving absolute savings of over 70,000 tokens per session for complex designs while maintaining 100% specification recall.


Authors (4)

Summary

The paper introduces a multi-agent architecture combining dynamic planning, co-evolutionary verification, and spec anchoring to enhance SystemC model generation reliability.
It reports end-to-end pass rates of up to 95% and an average token reduction of 71.04% while maintaining 100% specification recall.
The approach bridges automated LLM code generation with rigorous hardware verification, reducing engineering overhead and inspiring future SoC validation research.

RefEvo: Agentic Design with Co-Evolutionary Verification for Agile SystemC Reference Model Generation

Motivation and Problem Statement

The increasing complexity of SoC development, coupled with industry demand for “shift-left” methodologies, requires rapid and reliable generation of high-fidelity reference models (notably in SystemC) for early-stage architectural exploration and verification. Conventional manual workflows are inefficient and error-prone, and while LLMs have shown promise in code generation, applying them to hardware modeling exposes three deficiencies: (i) rigid generation workflows that do not adapt to varying design complexity; (ii) context overflow, which causes catastrophic forgetting of critical constraints across extended iterative sessions; and (iii) the “Coupled Validation Failure” phenomenon, in which naïvely having an LLM generate both the design under test (DUT) and the testbench (TB) produces shared semantic flaws, so that the TB validates a flawed model and verification fidelity is compromised.

Figure 1: Even with optimized prompt engineering and structured generation workflows, state-of-the-art LLMs struggle to generate correct SystemC models, especially in the memory and control-logic domains.

RefEvo Architecture and Key Innovations

RefEvo introduces a hierarchical, multi-agent framework tailored to the demands of hardware modeling and verification. The logical architecture comprises three core innovations:

Dynamic Design Planner: Agent 1 acts as the central orchestrator, analyzing the specification’s semantic complexity (interface type, state space, concurrency) and leveraging legacy assets for decomposition and workflow selection.
Co-Evolutionary Verification Loop: The parallel generation/verification paradigm is realized through the Modeler (Agent 2) and Verifier (Agent 3), with a Dialectical Arbiter (Agent 4) overseeing verification against an anchored specification oracle. This enables rigorous cross-verification and targeted repair, breaking symmetry in TB/DUT hallucinations.
Spec Anchoring Context Management: To prevent catastrophic forgetting, the context window is partitioned into immutable specification, compressed summary, and dynamic workspace, ensuring full recall and token efficiency.
Figure 2: Logical architecture of RefEvo detailing the dynamic planning phase and the symbiotic, dialectical verification loop.
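
The planner's role can be illustrated with a minimal Python sketch. The text names the three complexity signals (interface type, state space, concurrency) but not how they are combined, so the feature encoding, weights, thresholds, and stage names below are hypothetical placeholders rather than RefEvo's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecFeatures:
    """Complexity signals named in the text; this encoding is an assumption."""
    interface_type: str        # e.g. "combinational", "handshake", "bus"
    state_bits: int            # rough size of the state space, in bits
    concurrent_processes: int  # number of concurrently active processes


# Hypothetical weight per interface type (not taken from the paper).
INTERFACE_WEIGHT = {"combinational": 0, "handshake": 1, "bus": 2}


def complexity_score(f: SpecFeatures) -> int:
    """Combine the three signals into one integer score (illustrative only)."""
    score = INTERFACE_WEIGHT.get(f.interface_type, 2)  # unknown: assume hardest
    score += 0 if f.state_bits == 0 else (1 if f.state_bits <= 16 else 2)
    score += 0 if f.concurrent_processes <= 1 else 2
    return score


def plan_workflow(f: SpecFeatures) -> list[str]:
    """Pick a list of workflow stages; the stage names are hypothetical."""
    score = complexity_score(f)
    if score <= 1:
        # Simple designs get a short, single-pass flow.
        return ["generate_model", "generate_tb", "simulate"]
    stages = ["decompose_spec", "generate_model", "generate_tb", "co_evolve"]
    if score >= 4:
        # Only the most complex designs consult legacy assets in this sketch.
        stages.insert(1, "reuse_legacy_assets")
    return stages


adder = SpecFeatures("combinational", 0, 1)
fifo = SpecFeatures("handshake", 64, 2)
print(plan_workflow(adder))
print(plan_workflow(fifo))
```

The point of the sketch is only that the workflow is selected per specification instead of being fixed in advance; any real planner would derive these signals from the specification text itself.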

Experimental Results and Analysis

End-to-End Success Rates

RefEvo markedly improves SystemC model generation success rates on a benchmark of 20 hardware modules, evaluated with five state-of-the-art LLMs (Gemini-2.5-Pro, GPT-5.1, GPT-4.1, Qwen3, Claude-Opus-4.1). RefEvo consistently outperforms the baseline approaches (Naive, Flow_Only, FixedTB) and reaches end-to-end pass rates of up to 95% (with Gemini-2.5-Pro and GPT-5.1), resolving previously intractable functional failures in complex modules (e.g., fpu_div, keccak).

Figure 3: End-to-end success rate across models and workflow variants; RefEvo outperforms all baselines.

Mechanism Effectiveness and Failure Modes

An ablation on RefEvo’s verification mechanism reveals two critical insights. First, the transition from static flows to iterative refinement eliminates compile errors but exposes latent functional mismatches. Second, RefEvo’s co-evolutionary loop further reduces functional failures by actively repairing TB logic, a capability absent in FixedTB-mode agents.

Figure 4: Failure distribution analysis highlights the elimination of compile errors and the reduction of functional errors in RefEvo via co-evolutionary verification.
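
To make the coupled-failure argument concrete, the following Python sketch stubs the agents with ordinary functions. The 4-bit saturating adder, the `SPEC_ORACLE` table, and the repair-by-substitution step are all assumptions chosen for illustration; in RefEvo the Modeler, Verifier, and Arbiter are LLM agents and the oracle is the anchored specification.

```python
# Hypothetical spec oracle: input/output pairs read directly from the
# specification of a 4-bit saturating adder (an invented example design).
SPEC_ORACLE = {(3, 4): 7, (15, 1): 15, (9, 9): 15}


def flawed_model(a, b):
    return (a + b) & 0xF   # bug: wraps around instead of saturating


def flawed_tb_expected(a, b):
    return (a + b) % 16    # same misconception, so the TB agrees with the bug


def fixed_model(a, b):
    return min(a + b, 15)


def fixed_tb_expected(a, b):
    return min(a + b, 15)


def tb_passes(model, tb_expected, stimuli):
    """Conventional check: the model is judged only against the testbench."""
    return all(model(a, b) == tb_expected(a, b) for a, b in stimuli)


def arbitrate(model, tb_expected):
    """Judge model and testbench separately against the spec oracle."""
    model_ok = all(model(*k) == v for k, v in SPEC_ORACLE.items())
    tb_ok = all(tb_expected(*k) == v for k, v in SPEC_ORACLE.items())
    return model_ok, tb_ok


def co_evolve(model, tb_expected, max_rounds=3):
    """Repair whichever side the arbiter blames; repairs are stubbed swaps."""
    for round_no in range(1, max_rounds + 1):
        model_ok, tb_ok = arbitrate(model, tb_expected)
        if model_ok and tb_ok:
            return model, tb_expected, round_no
        if not model_ok:
            model = fixed_model              # stands in for a Modeler repair
        if not tb_ok:
            tb_expected = fixed_tb_expected  # stands in for a Verifier repair
    raise RuntimeError("model and testbench did not converge on the spec")


stimuli = list(SPEC_ORACLE)
false_positive = tb_passes(flawed_model, flawed_tb_expected, stimuli)
model, tb, rounds = co_evolve(flawed_model, flawed_tb_expected)
print(false_positive, rounds, tb_passes(model, tb, stimuli))
```

Here the flawed testbench accepts the flawed model because both share the same wrap-around misconception, which is the false positive a fixed-TB flow cannot detect. Checking each side against the specification independently is what lets the loop blame and repair both.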

Methodological Robustness and Model-Agnostic Performance

Pass rates rise consistently across diverse LLMs, indicating that RefEvo's improvements are largely independent of base model capability. The framework therefore offers robust augmentation for LLM-aided hardware generation, particularly in high-complexity, high-fidelity scenarios.

Figure 5: Methodological robustness analysis; RefEvo reliably enhances generation quality across different LLMs.

Context Efficiency and Spec Recall

RefEvo’s spec anchoring strategy reduces token consumption by 71.04% on average. The effect is most pronounced for complex designs, where absolute savings exceed 73,900 tokens per session. This efficiency does not come at the expense of specification recall, which remains at 100%.

Figure 6: Token consumption comparison by design scale; Spec anchoring achieves substantial reduction, scaling with complexity.
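
The three-part context partition described in the architecture section (immutable specification, compressed summary, dynamic workspace) can be sketched in Python as follows. The class name, the word-count token proxy, the workspace budget, and the truncation-based summarizer are assumptions for illustration; the text does not describe RefEvo's actual compression procedure.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words (a real tokenizer differs)."""
    return len(text.split())


@dataclass
class AnchoredContext:
    spec: str                                           # anchor, never compressed
    summary: list[str] = field(default_factory=list)    # compressed history
    workspace: list[str] = field(default_factory=list)  # recent turns, verbatim
    workspace_budget: int = 40                          # hypothetical budget

    def add_turn(self, turn: str) -> None:
        self.workspace.append(turn)
        # Fold the oldest turns into the summary until the workspace fits.
        while (len(self.workspace) > 1 and
               sum(count_tokens(t) for t in self.workspace) > self.workspace_budget):
            oldest = self.workspace.pop(0)
            # Placeholder summarizer: keep only the first five words.
            self.summary.append(" ".join(oldest.split()[:5]))

    def render(self) -> str:
        """Assemble the prompt: spec first, then summary, then workspace."""
        return "\n".join([self.spec, *self.summary, *self.workspace])


spec = "SPEC: 4-bit adder; output saturates at 15; reset clears output."
ctx = AnchoredContext(spec)
turns = [f"turn {i}: " + "compile log line " * 10 for i in range(6)]
for t in turns:
    ctx.add_turn(t)

full_history = count_tokens("\n".join([spec, *turns]))
anchored = count_tokens(ctx.render())
print(full_history, anchored, spec in ctx.render())
```

Because the specification is held outside the compressible regions, it appears verbatim in every rendered prompt regardless of how many turns are folded into the summary, while the rest of the context stays bounded.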

Practical and Theoretical Implications

RefEvo addresses fundamental constraints in LLM-aided EDA: it not only automates high-level model and TB generation but also embeds verification rigor traditionally attainable only by skilled human practitioners. The co-evolutionary dialectic between the Modeler and the Verifier, mediated by the Arbiter, systematically resolves coupled validation failures, mitigating a class of false positives prevalent in naïve workflows. Spec anchoring provides lossless context management, facilitating scalable, long-turn agentic workflows.

Practically, RefEvo enables automated, agile golden model generation for SoC verification, drastically reducing engineering overhead. Theoretically, it advances agentic system design in domains requiring persistent truth propagation, symbiotic multi-agent repair, and dynamic task decomposition.

Future Directions

Potential future research directions include:

Extension to more expressive hardware modeling domains (e.g., TLM+, formal verification).
Incorporation of external toolchains and cross-modal asset reusability leveraging agentic orchestrators.
Scaling to multi-agent ecosystems with expanded dialectical arbitration for heterogeneous verification scenarios.
Integration with retrieval-augmented reasoning frameworks (e.g., ChipMind (Xing et al., 5 Dec 2025)) for even broader context support.

Conclusion

RefEvo systematically bridges the gap between pure LLM code generation and domain-constrained hardware verification. The hierarchical agentic framework, which combines dynamic planning, co-evolutionary verification, and spec anchoring context management, demonstrates high efficacy and robustness in SystemC reference model generation, addressing key challenges of semantic complexity, verification reliability, and context scalability. With pass rates of up to 95% and an average token reduction of 71.04%, RefEvo sets a practical and theoretical foundation for future, fully automated SoC verification workflows (2604.24218).
