ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Published 13 May 2026 in cs.MA, cs.AI, cs.AR, and cs.LG | (2605.12857v1)

Abstract: Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self-trained models address the deployment constraint but remain single-turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross-comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference-model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack-based inference workflow to prevent error propagation across turns, and a two-stage training pipeline that first trains each agent individually to saturate its code-generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data-generation framework that produces 64.4K high-quality reference model training samples. ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self-trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available in https://github.com/zhongkaiyu/ChipMATE.


Authors (15)

Summary

The paper introduces a dual-agent RL framework that uses a cross-verification loop between a Verilog-design and a Python-verification agent for enhanced RTL generation.
It employs a two-stage training pipeline combining single-agent SFT+RL and multi-agent RL (X-GRPO) with backtracking to ensure robust error correction.
The study demonstrates significant improvements in pass@1 metrics and provides a scalable, industry-aligned strategy for private RTL code deployment.

Motivation and Problem Statement

ChipMATE addresses persistent misalignments between existing LLM-based RTL code generation workflows and the operational realities of the semiconductor industry. Prevailing API-based agentic systems rely on golden testbenches for self-correction and closed, third-party APIs. These assumptions conflict with industry practice, in which testbenches are either unavailable at RTL authoring time or are fallible artifacts in their own right. Moreover, intellectual property policies and security requirements (e.g., air-gapping) preclude the use of external APIs and prohibit uploading proprietary RTL corpora for model customization. As a result, current systems miss opportunities to leverage vendors' high-quality, in-house RTL code for specialized model training.

Recent self-trained LLMs partially address deployment issues but typically operate as single-turn generators with no verification loop. This limits RTL accuracy: even expert engineers depend on iterative cross-verification against independently developed reference models, a workflow that single-shot methods do not capture. ChipMATE proposes a framework that reflects the industry's division of labor, pairing a Verilog-generating "designer" agent with a Python-based "verification" agent to support iterative, mutual refinement without reliance on golden testbenches.

Core Methodology

Cross-Verification Multi-Agent Workflow

ChipMATE's agentic workflow features a Verilog agent (design role) and a Python agent (verification role) that independently generate implementations from the same natural-language specification. Both agents sample multiple candidate solutions per turn, and the outputs are cross-verified on randomly generated test input vectors using an aligned semantic comparator. When mismatches occur, structured diagnostics (natural-language error feedback and selective code history) are injected into each agent's context. Crucially, each agent analyzes this feedback in isolation and modifies its code only when it judges its own implementation to be at fault; it never sees the peer's code directly.
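
The cross-verification step can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the two callables stand in for a simulated Verilog candidate and a Python reference candidate, and the names `cross_verify` and `feedback`, the 8-bit width, and the vector count are hypothetical rather than taken from the paper's code. The feedback string deliberately does not say which side is wrong, since each agent is described as judging its own code.

```python
import random

WIDTH = 8  # assumed bit width of the toy design


def design_under_test(a: int, b: int) -> int:
    # Stand-in for simulating a Verilog candidate: an 8-bit adder that wraps.
    return (a + b) & 0xFF


def reference_model(a: int, b: int) -> int:
    # Stand-in for a Python reference candidate; it forgets to wrap.
    return a + b


def cross_verify(design, reference, num_vectors=1000, seed=0):
    """Return (match_rate, mismatches) over random input vectors."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(num_vectors):
        a, b = rng.randrange(1 << WIDTH), rng.randrange(1 << WIDTH)
        d_out, r_out = design(a, b), reference(a, b)
        if d_out != r_out:
            mismatches.append((a, b, d_out, r_out))
    return 1.0 - len(mismatches) / num_vectors, mismatches


def feedback(mismatch) -> str:
    # Neutral diagnostic: reports the disagreement without blaming either side.
    a, b, d_out, r_out = mismatch
    return f"For inputs a={a}, b={b}: Verilog output {d_out}, Python output {r_out}."


rate, mismatches = cross_verify(design_under_test, reference_model)
print(f"match rate: {rate:.3f}")
print(feedback(mismatches[0]))
```

In this toy case the two models disagree exactly when the sum overflows 8 bits, so the match rate lands strictly between 0 and 1 and the diagnostics point at overflow inputs.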

The workflow incorporates a backtracking mechanism: corrections that fail to improve the match rate are automatically reverted, preventing errors from propagating and compounding across iterations. As a result, the match rate between the two agents never decreases from one turn to the next, so multi-turn interaction ends with a pair that agrees at least as well as the one it started with.
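
A minimal sketch of the backtracking rule, under stated assumptions: candidates are modeled as plain Python callables, the "agents" are scripted proposal functions, and all names (`refine_with_backtracking`, `scripted_designer`, `passive_reference`) are hypothetical. The only behavior taken from the text is that a revision is kept only if it improves the match rate and is reverted otherwise.

```python
from typing import Callable, List

Candidate = Callable[[int], int]


def match_rate(design: Candidate, reference: Candidate, vectors: List[int]) -> float:
    # Fraction of test vectors on which the two candidates agree.
    return sum(design(x) == reference(x) for x in vectors) / len(vectors)


def refine_with_backtracking(design, reference, propose_design, propose_reference,
                             vectors, max_turns=3):
    best = match_rate(design, reference, vectors)
    history = [best]
    for turn in range(max_turns):
        if best == 1.0:
            break  # full agreement, nothing left to fix
        new_design = propose_design(turn, design)
        new_reference = propose_reference(turn, reference)
        new_rate = match_rate(new_design, new_reference, vectors)
        if new_rate > best:
            # Accept the revision only when it strictly improves agreement.
            design, reference, best = new_design, new_reference, new_rate
        # Otherwise backtrack: the previous pair is kept unchanged.
        history.append(best)
    return design, reference, history


def scripted_designer(turn, current):
    # Turn 0 proposes a worse revision, turn 1 proposes the masked fix.
    revisions = [lambda x: x * 3, lambda x: (x * 2) & 0xF]
    return revisions[turn] if turn < len(revisions) else current


def passive_reference(turn, current):
    # The reference agent judges itself correct and proposes no change.
    return current


vectors = list(range(16))
_, _, history = refine_with_backtracking(
    lambda x: x * 2,            # initial design: forgets the 4-bit mask
    lambda x: (x * 2) & 0xF,    # initial reference
    scripted_designer, passive_reference, vectors)
print(history)  # [0.5, 0.5, 1.0]
```

The bad revision at turn 0 is discarded, so the recorded match rate stays at 0.5 instead of dropping, and the fix at turn 1 is accepted.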

Two-Stage Training Pipeline

ChipMATE's agent optimization pipeline is explicitly staged to decouple individual skill acquisition from collaborative protocol learning.

Stage 1: Single-Agent SFT + RL: Each agent is first trained independently using SFT (on domain-specific corpora for Verilog and synthetic data for Python reference models), followed by RL with group-relative policy optimization (GRPO). RL training problems are curated so that each retains a mix of successes and failures (post-SFT pass@10 ∈ [0.1, 0.9]); GRPO computes advantages relative to the group of samples drawn for a problem, so a problem that is always solved or never solved yields no learning signal.
Stage 2: Multi-Agent RL (X-GRPO): Joint training employs a novel extension of GRPO (termed X-GRPO), inspired by tree search (Tree-of-Thought/AT-GRPO), to retain meaningful variance for group-advantage computation in multi-turn, multi-agent trajectories. Each agent samples $K$ candidates; their outputs form a $K \times K$ Cartesian set, and the jointly best-scoring pair in terms of alignment and code correctness is propagated as the prefix for subsequent turns. The RL reward hierarchy incentivizes local improvements, correct bug fixes, and agent agreement, weighted to preserve individual correctness as the dominant signal.
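
The two mechanics in this pipeline can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's implementation: the advantage uses the commonly cited GRPO normalization (reward minus group mean, divided by group standard deviation), the pair score uses agreement only (the described score also accounts for code correctness), and the names `group_relative_advantages`, `keep_for_rl`, and `best_pair` are hypothetical.

```python
from itertools import product
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: each sample is scored relative to its own group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_for_rl(pass_at_10, low=0.1, high=0.9):
    # Curation filter from the text: keep problems with mixed outcomes.
    return low <= pass_at_10 <= high


def agreement(design, reference, vectors):
    return sum(design(x) == reference(x) for x in vectors) / len(vectors)


def best_pair(verilog_candidates, python_candidates, vectors):
    # Score all K x K combinations and return the indices of the best pair.
    pairs = product(range(len(verilog_candidates)), range(len(python_candidates)))
    return max(pairs, key=lambda ij: agreement(
        verilog_candidates[ij[0]], python_candidates[ij[1]], vectors))


# A group with identical rewards carries no signal; a mixed group does.
print(group_relative_advantages([1, 1, 1, 1]))
print(group_relative_advantages([1, 0, 0, 1]))

# K = 3 toy candidates per agent for a 3-bit incrementer.
verilog_k = [lambda x: x + 1, lambda x: (x + 1) & 7, lambda x: x]
python_k = [lambda x: (x + 1) % 8, lambda x: x + 2, lambda x: x - 1]
print(best_pair(verilog_k, python_k, list(range(8))))  # (1, 0)
```

The all-ones group produces zero advantages, which is why problems outside the pass@10 window are filtered out before RL. In the toy Cartesian set, only the pair (1, 0) agrees on every vector, so it would be carried forward as the prefix for the next turn.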

Reference Model Data Generation

A central bottleneck—absence of large-scale reference model datasets for Python behavioral simulation—is overcome via a hybrid synthetic framework:

Agentic LLM Distillation is used to produce verified Python reference models from Verilog designs using a frontier LLM in conjunction with automatic cross-verification and iterative correction. Despite high computational cost and modest yield, this produces ~25k chain-of-thought annotated samples.
IR-Based Deterministic Conversion parses Verilog netlists, lowers them to a normalized IR, and generates functionally equivalent Python, augmented a posteriori with pseudo-chain-of-thought explanations via LLM prompting. This contributes a further ~36k verified samples efficiently.
Category-Specific Augmentation targets systematic failure modes (e.g., FSMs, multi-cycle blocks, bit arithmetic) by supplementing the dataset with task-focused examples to ensure distributional coverage, processed via the IR conversion pipeline.
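
To make the idea of deterministic IR-to-Python conversion concrete, here is a toy sketch. It is not the paper's pipeline: the nested-tuple IR, the operator set, and the function names `emit_expr` and `emit_python` are all invented for illustration, and the Verilog parsing and lowering stages are skipped entirely. It only shows how a normalized expression tree can be turned mechanically into a Python function with the same fixed-width semantics.

```python
def emit_expr(node, width):
    """Translate a toy IR node into a Python expression string."""
    mask = (1 << width) - 1
    op = node[0]
    if op == "input":
        return node[1]
    if op == "const":
        return str(node[1] & mask)
    if op == "not":
        return f"(~{emit_expr(node[1], width)} & {mask})"
    binary = {"and": "&", "or": "|", "xor": "^", "add": "+"}
    if op in binary:
        lhs = emit_expr(node[1], width)
        rhs = emit_expr(node[2], width)
        # Mask every result so values stay within the declared bit width.
        return f"(({lhs} {binary[op]} {rhs}) & {mask})"
    raise ValueError(f"unsupported IR op: {op}")


def emit_python(name, inputs, expr, width):
    """Emit source code for a Python reference function."""
    args = ", ".join(inputs)
    return f"def {name}({args}):\n    return {emit_expr(expr, width)}\n"


# Toy IR for the 4-bit expression (a + b) ^ ~a.
ir = ("xor", ("add", ("input", "a"), ("input", "b")), ("not", ("input", "a")))
source = emit_python("ref_model", ["a", "b"], ir, width=4)
print(source)

namespace = {}
exec(source, namespace)  # compile the generated reference model
ref_model = namespace["ref_model"]
print(ref_model(3, 14))  # 13
```

Because the translation is rule-based, every emitted model is equivalent to its IR by construction, which is consistent with the text's point that this route yields verified samples cheaply compared with LLM distillation.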

Experimental Validation

ChipMATE is instantiated on Qwen3.5 models (4B/9B), using open-source infrastructure for SFT and RL. Evaluation spans four RTL generation and simulation benchmarks: VerilogEval v2, RTLLM v2, ChipBench-SC, and CVDP cid03.

Strong Results

Verilog Generation: ChipMATE-Agents-4B/9B outperform all prior self-trained open-source models by significant margins (up to 13.6 percentage points of absolute pass@1 improvement over QiMeng-CodeV-R1). ChipMATE-Agents-9B also surpasses much larger API-based models (such as DeepSeek and GPT-4o) on most major benchmarks; the largest proprietary models (GPT-5.5, Claude Opus) outperform it only on select tasks.
Reference Model Generation: On Python behavioral modeling, ChipMATE achieves new state-of-the-art performance, with the 9B Python agent exceeding its own Verilog agent’s pass@1 by 5.4–15.2% across all benchmarks. Notably, the strong Verilog results in the cross-verification workflow closely track the Python agent’s upper bound, validating the collaborative protocol.
Self-Correction: The single-agent-to-agentic workflow transition in ChipMATE substantially increases pass@1, with the gap between pass@5 and pass@1 narrowing significantly—indicating effective in-inference error correction, highly desirable for industrial deployment.
Ablation and Workflow Tuning: The introduction of backtracking is essential; omitting it leads to accuracy degradation from compounding errors. Optimal workflow parameters (sampling budget and turn limit) are empirically determined (Best-of-3, T=3).
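
For readers unfamiliar with the pass@k metric used throughout these results, the sketch below implements the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The summary does not state which estimator the authors use, so treat this as the conventional definition rather than a description of their evaluation script; the sample counts in the example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generations with c correct, passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any draw of k hits a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 generations, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
print(f"pass@1 = {p1:.3f}, pass@5 = {p5:.3f}, gap = {p5 - p1:.3f}")
```

A large gap between pass@5 and pass@1 means correct answers exist among the samples but the first attempt often misses them; a narrowing gap, as reported for the agentic workflow, indicates that single attempts are becoming as reliable as the best of several.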

Implications and Broader Impact

ChipMATE empirically demonstrates that agentic, self-trained multi-agent systems can outperform both single-agent and large, closed-API-based models on sophisticated code generation and simulation tasks relevant to hardware design. By structurally mirroring the real-world division of design and verification in industry, and by enabling private, local deployment and fine-tuning, ChipMATE provides a viable path for semiconductor companies to exploit internal RTL assets without compromising IP security.

The work sets a precedent for LLM workflows that establish correctness not from privileged oracular annotation, but from peer cross-verification and iterative refinement, opening a robust design pattern applicable across other domains with similar dual-role or co-verification requirements.

Practically, open-sourcing both the code and the multi-billion-parameter models facilitates adoption, reproducibility, and benchmarking. Theoretically, this research extends RL for LLMs to coordinated, asynchronous multi-agent learning environments with cross-language code-generation objectives.

Future Directions

Several logical extensions arise:

Generalization to Other HDL/Verification Pairs: Extending the framework to VHDL or SystemVerilog, and pairing with SystemC/C++ agents, would both test generality and expand industrial relevance.
Scaling Agent Count and Specialization: Introducing more specialized agents (e.g., for property checking, formal verification, or timing analysis) could further parallel industrial design flows.
Active Learning with Human-in-the-Loop: Combining human designers/verifiers as expert agents in the multi-agent loop could drive further sample efficiency and model robustness, and enhance trust for deployment in safety/security-critical flows.

Conclusion

ChipMATE sets forth a practical and theoretically sound paradigm for LLM-assisted RTL code generation and verification, overcoming longstanding barriers in real-world chip design pipelines. The dual-agent RL protocol, underpinned by a backtracking cross-verification mechanism and robust synthetic data generation, delivers state-of-the-art, verifiable performance in both Verilog generation and cycle-accurate Python simulation, demonstrating the viability of self-trained, collaborative LLM agent frameworks for industrial code synthesis and beyond.
