AutoVeriFix: Python-Assisted RTL Correction
- AutoVeriFix is a two-stage framework that uses Python-driven reference models to define precise circuit behavior and guide Verilog correction.
- It employs automated simulation and coverage-driven testbenches to iteratively detect, diagnose, and repair syntactic and functional errors.
- Empirical results show pass@10 rates up to 90.2% with false positive rates below 9%, outperforming existing baseline models.
AutoVeriFix is a Python-assisted, two-stage framework designed to improve the functional correctness of Verilog code automatically generated by LLMs. The approach addresses common deficiencies in standard LLM generation, particularly the prevalence of functional errors stemming from limited high-quality Verilog training data. AutoVeriFix exploits the strength of LLMs in Python code synthesis to define precise circuit behavior and then iteratively detects, diagnoses, and corrects errors in the Verilog implementation using automated simulation-based feedback. This systematic methodology yields significant improvements over existing techniques on functional-correctness benchmarks, positioning it as a robust solution for code validation and repair in hardware synthesis contexts (Tan et al., 10 Sep 2025).
1. Two-Stage Python-Assisted Workflow
AutoVeriFix operates in two principal stages:
- Stage 1: An LLM is tasked with generating a high-level Python reference model from hardware specifications. Python, due to extensive LLM exposure and a rich corpus, serves as an effective intermediate language. These models are nearly functionally correct and act as behavioral oracles.
- Stage 2: A separate LLM generates Verilog RTL code for the same hardware specification. The correctness of this RTL code is assessed by automated simulation against the outputs of the Python model.
Discrepancies uncovered in simulation—whether due to syntactic faults or functional mismatches—are returned as structured feedback to the LLM, which then revises and regenerates the Verilog code. This loop continues until the Verilog passes all tests derived from the reference model, improving both syntactic integrity and semantic fidelity.
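The loop structure can be summarized in a short, illustrative sketch. The helper callables (`gen_model`, `gen_rtl`, `simulate`, `refine`) are hypothetical stand-ins for the LLM prompts and simulator invocations the paper describes, not interfaces it defines:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SimReport:
    passed: bool
    feedback: str  # compiler diagnostics or expected-vs-actual mismatches

def autoverifix_loop(
    spec: str,
    gen_model: Callable[[str], str],            # Stage 1: spec -> Python reference model
    gen_rtl: Callable[[str], str],              # Stage 2: spec -> candidate Verilog RTL
    simulate: Callable[[str, str], SimReport],  # (rtl, model) -> simulation result
    refine: Callable[[str, str, str], str],     # (spec, rtl, feedback) -> revised RTL
    max_iters: int = 5,
) -> str:
    """Generate-simulate-compare-feedback-regenerate loop (illustrative only)."""
    model = gen_model(spec)        # behavioral oracle written in Python
    rtl = gen_rtl(spec)
    for _ in range(max_iters):
        report = simulate(rtl, model)
        if report.passed:          # every output matched the reference model
            return rtl
        rtl = refine(spec, rtl, report.feedback)
    return rtl                     # best effort once the iteration budget is spent
```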
2. Role of Python Reference Models
The Python reference model underpins the framework:
- Generated from hardware descriptions, it encodes the circuit's intended state transitions, combinational logic, and output behavior.
- The model is executed with a set of autogenerated test inputs (“testbench”), simulating all relevant states and transitions.
- Its outputs function as the “golden reference”: every Verilog output generated must match the model's output for identical inputs.
The abundance of Python training data makes it feasible for LLMs to produce functionally accurate models, which in turn enables highly targeted testing and error localization for the more challenging Verilog synthesis step.
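For illustration only (this example is not drawn from the paper), a reference model for a simple 4-bit up-counter with synchronous reset and enable could be written as follows; its per-cycle outputs serve as the golden values that the Verilog simulation must reproduce:

```python
class Counter4Ref:
    """Cycle-accurate Python reference model for a 4-bit up-counter
    with synchronous reset and enable (illustrative example)."""

    def __init__(self) -> None:
        self.count = 0  # architectural state

    def step(self, rst: int, en: int) -> int:
        """Advance one clock cycle and return the registered count output."""
        if rst:
            self.count = 0
        elif en:
            self.count = (self.count + 1) & 0xF  # wrap at 4 bits
        return self.count

# Generate golden outputs for a stimulus sequence; the same stimuli are later
# replayed against the Verilog design and compared cycle by cycle.
ref = Counter4Ref()
stimuli = [(1, 0), (0, 1), (0, 1), (0, 0), (0, 1)]  # (rst, en) per cycle
golden = [ref.step(rst, en) for rst, en in stimuli]
assert golden == [0, 1, 2, 2, 3]
```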
3. Automated Test Generation and Coverage-Driven Refinement
Testing is integral to bridging Python models and Verilog implementations:
- Initial test inputs stimulate the Python model, exercising known behavioral pathways.
- Coverage metrics (both line and branch coverage) are computed after execution; a threshold (typically set at 85%) ensures sufficiently exhaustive testing. If coverage is inadequate, the model is provided with fine-grained feedback (such as specific untested branches), prompting the LLM to adjust or expand the test dataset.
- This iterative process continues until the testbench meets coverage requirements, at which point it is translated into Verilog-compatible stimuli for simulation.
Testbenches generated in this manner rigorously probe both the breadth and depth of circuit functionality, minimizing false positives and ensuring that both routine and subtle errors are caught.
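A minimal sketch of how such a coverage gate could be implemented with the standard coverage.py package is shown below; the package choice, the helper `expand_tests` (standing in for the LLM call that adds stimuli for uncovered branches), and the exact threshold handling are assumptions rather than details specified by the paper:

```python
import coverage

COVERAGE_THRESHOLD = 85.0  # percent, matching the threshold described above

def coverage_of(testbench_fn, ref_module_path: str) -> float:
    """Run the testbench against the Python reference model and return the
    combined line+branch coverage percentage for the model's source file."""
    cov = coverage.Coverage(branch=True, include=[ref_module_path])
    cov.start()
    testbench_fn()  # exercises the reference model
    cov.stop()
    # Prints uncovered lines/branches (the fine-grained feedback) and
    # returns the total coverage percentage.
    return cov.report(show_missing=True)

def refine_tests(testbench_fn, ref_module_path: str, expand_tests, max_rounds: int = 10):
    """Expand the test set until coverage meets the threshold.
    `expand_tests` is a hypothetical helper wrapping the LLM call that adds
    stimuli for the uncovered branches reported above."""
    for _ in range(max_rounds):
        if coverage_of(testbench_fn, ref_module_path) >= COVERAGE_THRESHOLD:
            return testbench_fn
        testbench_fn = expand_tests(testbench_fn)
    return testbench_fn
```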
4. Iterative Error Detection and Correction Loop
Error correction proceeds in two distinct but interlinked phases:
- Syntax Debugging: Generated Verilog code undergoes compilation and simulation. If syntax errors are encountered (e.g., missing identifiers, undeclared wires), detailed simulation/compilation messages are returned to the LLM. The code is regenerated iteratively until it is free of syntax errors.
- Functional Debugging: Once the RTL compiles cleanly, it is validated by simulation against the Python-derived testbench. Output mismatches are logged, specifying both the expected and the actual behavior, and this detailed feedback is used iteratively to refine the Verilog code, directly targeting faulty logic or state-transition implementations.
This generate-simulate-compare-feedback-regenerate loop minimizes both syntactic and semantic faults in the final code. It exploits the LLM's capacity for prompt-driven correction and uses simulation as an automated oracle for correctness.
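A hedged sketch of the compile/simulate/compare step is shown below, using the open-source Icarus Verilog toolchain (`iverilog`/`vvp`) as one possible simulator; the paper does not prescribe a particular tool, and the testbench is assumed to print one output value per cycle via `$display`:

```python
import subprocess
from pathlib import Path

def compile_and_simulate(rtl_file: str, tb_file: str, golden: list[str]) -> tuple[bool, str]:
    """Return (passed, feedback). Feedback is either compiler diagnostics
    (syntax-debugging phase) or expected-vs-actual mismatches (functional phase)."""
    out = Path("sim.out")

    # Syntax phase: compile RTL + testbench with Icarus Verilog.
    comp = subprocess.run(
        ["iverilog", "-o", str(out), rtl_file, tb_file],
        capture_output=True, text=True,
    )
    if comp.returncode != 0:
        return False, f"Compilation failed:\n{comp.stderr}"

    # Functional phase: run the simulation; the testbench prints one output
    # value per line, which is diffed against the Python golden trace.
    sim = subprocess.run(["vvp", str(out)], capture_output=True, text=True)
    actual = sim.stdout.strip().splitlines()

    mismatches = [
        f"cycle {i}: expected {e}, got {a}"
        for i, (e, a) in enumerate(zip(golden, actual))
        if e != a
    ]
    if mismatches or len(actual) != len(golden):
        return False, "Output mismatches:\n" + "\n".join(mismatches)
    return True, ""
```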
5. Empirical Performance and Evaluation Metrics
AutoVeriFix’s effectiveness is quantified via standard metrics:
- Functional Correctness (pass@k): the probability that at least one of $k$ sampled completions is functionally correct, estimated as $\text{pass@}k = \mathbb{E}\left[1 - \binom{n-c}{k}/\binom{n}{k}\right]$, where $n$ is the number of generated samples, $c$ the number of correct samples, and $k$ the evaluation batch size (a computational sketch follows this list).
- False Positive Rate (FPR): the fraction of functionally incorrect Verilog implementations that nevertheless pass the generated testbench, $\mathrm{FPR} = FP/(FP + TN)$, where incorrect designs are treated as negatives.
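The pass@k metric can be computed with the standard unbiased estimator used across code-generation benchmarks; the sketch below assumes the conventional formulation rather than any AutoVeriFix-specific variant:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of which are correct) is functionally correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a failing batch of size k
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 generations, 5 correct, evaluated at k = 10.
print(round(pass_at_k(20, 5, 10), 3))
```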
Experimental results on VerilogEval-human, VerilogEval-machine, and RTLLM v1.1/v2.0 show pass@10 rates of 84.6–90.2% (GPT-4) and FPRs below 9% (GPT-4) and 12% (GPT-3.5). AutoVeriFix consistently outperforms existing baseline models (e.g., OriGen, RTLCoder) and generic LLMs on these metrics.
| Benchmark | pass@10 (GPT-4) | FPR (GPT-4) |
|---|---|---|
| VerilogEval-human | 84.6% | <9% |
| VerilogEval-machine | 90.2% | <9% |
| RTLLM v2.0 | 83.5% | <9% |
These results indicate both a high rate of correct synthesis and little overfitting to incomplete testbenches.
6. Technical Challenges and Future Directions
AutoVeriFix highlights several important limitations and avenues for research:
- Testbench Completeness: The feedback-driven iterative generation improves coverage, but full functional validation may still be elusive. Further research on coverage-driven test synthesis or concolic execution may close remaining gaps.
- Training Data Scarcity: The lack of high-quality Verilog (RTL) data for LLM training inhibits direct synthesis; Python-assisted methods partly compensate but scaling to complex hardware may require dataset expansion or domain adaptation.
- Semantic Bridging: Although Python models serve as reliable behavioral oracles, the semantic gap between Python and synthesizable Verilog represents a source of potential mismatches. Hybrid approaches or intermediate representations could mitigate this.
- Efficiency of Iteration: The error correction loop’s prompt design and simulation methodology could benefit from optimization to reduce superfluous iterations, especially on large designs.
- Complex and Large-Scale Designs: Extending the technique to larger, multi-module RTL designs is a plausible future direction; hierarchical approaches or integration with hardware verification platforms should be explored.
7. Contextual Positioning within Automated Program Repair
AutoVeriFix represents the migration of automated program repair techniques, originally proposed for software (e.g., AutoFix-E2 (Pei et al., 2011)), into the hardware synthesis domain. By employing automated oracles, feedback-driven correction, and testbench-based validation, it aligns with principles from dynamic program repair while adapting them to the distinct challenges of RTL code generation, where formal specifications are scarce and syntactic diversity is limited.
As LLM-assisted code synthesis expands into hardware description languages, the AutoVeriFix methodology demonstrates the practical benefits of Python-guided modeling, comprehensive simulation-based validation, and iterative error-correction loops. Its empirical superiority over prior techniques marks a notable advance in the automation of functional hardware verification.
AutoVeriFix advances the state-of-the-art in automated correction of LLM-generated Verilog, integrating Python-based behavioral modeling, coverage-optimized testbench generation, and iterative diagnosis and repair. Its performance on rigorous benchmarks and low false positive rates, coupled with a robust methodological framework, provide a strong foundation for future work in automated hardware verification and code synthesis.