ComplexVCoder: LLM-Driven RTL Synthesis
- ComplexVCoder is an LLM-driven framework that systematically generates complex, hierarchical RTL (Verilog) code from concise natural language descriptions using a two-stage process.
- It employs a General Intermediate Representation (GIR) to structure modules, ports, and connections, guiding deterministic rule-based alignment and retrieval-augmented generation (RAG).
- Benchmark results on the ComplexVDB suite demonstrate significant improvements in functional correctness and efficiency over existing state-of-the-art models.
ComplexVCoder is an open-source, LLM-driven framework for the systematic generation of complex register-transfer level (RTL) code, specifically targeting Verilog HDL. Designed to address persistent challenges in auto-generating large-scale, multi-level RTL modules from concise natural language (NL) descriptions, ComplexVCoder introduces a two-stage mechanism centered on a structured intermediate representation, coupled with rule-based alignment and domain-specific retrieval-augmented generation (RAG). The framework is accompanied by a real-world benchmark suite (ComplexVDB) and demonstrates significant performance advantages over state-of-the-art (SOTA) alternatives on intricate Verilog synthesis tasks (Zuo et al., 29 Apr 2025).
1. Motivation and Framework Overview
Large-scale RTL designs typically span hundreds of lines and feature deep module hierarchies and dense interconnections. Prior LLM-based frameworks predominantly target single-module, low-complexity scenarios and therefore tend to fail on multi-level designs, where hierarchical consistency, signal connectivity, and correct port wiring must all hold. Direct NL-to-Verilog approaches offer limited structural guidance and often produce code that is syntactically valid yet functionally incorrect.
ComplexVCoder addresses these limitations via a two-stage pipeline that introduces structural scaffolding and domain knowledge during generation:
- NL → General Intermediate Representation (GIR): An instruction-tuned lightweight LLM translates concise natural language into a structured, hierarchical, JSON-like template—encoding modules, parameters, ports, instances, and connections.
- GIR → Verilog Code: The intermediate representation is converted into an enriched prompt via deterministic rule-based alignment, then augmented with retrieved relevant code from a large Verilog snippet database (RAG), before a second LLM outputs the final multi-level Verilog design.
This design aims to bridge the gap between open-source, lightweight LLMs and the performance of large, proprietary LLMs, offering systematic generation and improved correctness on complex RTL code (Zuo et al., 29 Apr 2025).
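The two-stage flow can be sketched as a thin orchestration function. This is only an illustrative skeleton: `generate_rtl`, `nl_to_gir`, `align`, `retrieve`, and `gir_to_verilog` are hypothetical names standing in for the first-stage LLM, the rule-based aligner, the retriever, and the second-stage LLM. None of them come from the ComplexVCoder release.

```python
from typing import Callable, Dict, List


def generate_rtl(
    description: str,
    nl_to_gir: Callable[[str], Dict],       # stage 1: NL -> GIR (hypothetical stand-in)
    align: Callable[[Dict], str],           # deterministic GIR -> prompt text
    retrieve: Callable[[Dict], List[str]],  # RAG lookup of reference snippets
    gir_to_verilog: Callable[[str], str],   # stage 2: enriched prompt -> Verilog
) -> str:
    """Run the NL -> GIR -> Verilog flow with injected components."""
    gir = nl_to_gir(description)
    prompt = align(gir)
    snippets = retrieve(gir)
    if snippets:
        # Retrieved code is appended to the aligned prompt before generation.
        prompt += "\n\nReference snippets:\n" + "\n\n".join(snippets)
    return gir_to_verilog(prompt)
```

Passing the components in as callables keeps the sketch independent of any particular model or retriever, so each stage can be swapped or stubbed for testing.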
2. General Intermediate Representation (GIR) and Two-Stage Generation
The General Intermediate Representation (GIR) serves as the backbone of the framework’s two-stage synthesis process. GIR is a hierarchical, human- and machine-readable structure with the following fields: "modules", "parameters", "ports", "instances", and "connections".
Example GIR Fragment:
    {
      "modules": [
        {
          "name": "adder_16bit",
          "function": "Performs 16-bit addition with carry in/out.",
          "parameters": [{"name": "WIDTH", "value": 16}],
          "ports": [
            {"name": "a", "dir": "input", "width": 16},
            {"name": "b", "dir": "input", "width": 16},
            {"name": "sum", "dir": "output", "width": 16}
          ],
          "instances": [],
          "connections": []
        }
      ]
    }
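Because the GIR is plain JSON-like data, its structure can be checked mechanically before it is handed to the second stage. The following is a minimal sketch of such a check. The function `validate_gir` and the required-key sets are assumptions inferred only from the example fragment; the actual GIR schema may contain more or different fields.

```python
import json

# Keys observed in the example fragment; the real GIR schema may differ.
MODULE_KEYS = {"name", "function", "parameters", "ports", "instances", "connections"}
PORT_KEYS = {"name", "dir", "width"}


def validate_gir(gir: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for i, module in enumerate(gir.get("modules", [])):
        label = module.get("name", f"modules[{i}]")
        missing = MODULE_KEYS - set(module)
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        for port in module.get("ports", []):
            if PORT_KEYS - set(port):
                problems.append(f"{label}: incomplete port {port}")
            elif port["dir"] not in ("input", "output", "inout"):
                problems.append(f"{label}: bad direction on port {port['name']}")
    if not gir.get("modules"):
        problems.append("no modules defined")
    return problems


gir = json.loads("""
{"modules": [{"name": "adder_16bit",
  "function": "Performs 16-bit addition with carry in/out.",
  "parameters": [{"name": "WIDTH", "value": 16}],
  "ports": [{"name": "a", "dir": "input", "width": 16},
            {"name": "b", "dir": "input", "width": 16},
            {"name": "sum", "dir": "output", "width": 16}],
  "instances": [], "connections": []}]}
""")
print(validate_gir(gir))  # prints [] for the example fragment
```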
NL → GIR Mapping: The first-stage LLM is instruction-tuned on 4,000 high-quality NL–GIR pairs by maximizing the token-level likelihood of the target GIR, i.e., by minimizing the cross-entropy loss.
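The objective itself is not reproduced in this text. A standard token-level cross-entropy formulation consistent with the description would be the following, where $x$ (the NL description), $y_{1:T}$ (the tokens of the target GIR), $\theta$ (the model parameters), and $\mathcal{D}$ (the set of NL–GIR pairs) are assumed notation rather than the paper's own:

```latex
\mathcal{L}(\theta) = -\sum_{(x,\,y) \in \mathcal{D}} \; \sum_{t=1}^{T} \log P_{\theta}\left(y_t \mid y_{<t},\, x\right)
```

Minimizing $\mathcal{L}(\theta)$ is equivalent to maximizing the likelihood of each GIR token given the description and the preceding tokens.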
GIR → Verilog Mapping: A rule-based translator transforms GIR into descriptive prompts, augmented with RAG (Section 3), facilitating hierarchical, multi-module Verilog emission by the second-stage LLM.
3. Rule-Based Alignment and Retrieval-Augmented Generation
Rule-Based Alignment: Deterministic transformations convert GIR fields into explicit, structured prompt fragments, guiding the model to elaborate functional, control, and datapath logic. The process follows this pseudocode:
    Procedure build_prompt_from_GIR(ir):
        prompt ← ""
        for each module M in ir["modules"]:
            prompt += "Define module " + M["name"] + " with " + len(M["parameters"]) + " parameters, " + count_inputs(M) + " inputs and " + count_outputs(M) + " outputs. "
            for each port P in M["ports"]:
                prompt += "Port " + P["name"] + " is an " + P["dir"] + " of width " + P["width"] + ". "
            for each inst in M["instances"]:
                prompt += "Instantiate submodule " + inst["module"] + " as " + inst["alias"] + ". "
        return prompt
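A runnable rendering of this procedure, applied to the `adder_16bit` fragment from Section 2, could look as follows. This is a sketch rather than the released implementation: the helper `count_ports` replaces the pseudocode's `count_inputs` and `count_outputs`, and sentences are joined with single spaces instead of carrying trailing spaces.

```python
def count_ports(module: dict, direction: str) -> int:
    """Count the ports of a module that have the given direction."""
    return sum(1 for p in module["ports"] if p["dir"] == direction)


def build_prompt_from_gir(ir: dict) -> str:
    """Turn a GIR dictionary into a deterministic natural-language prompt."""
    parts = []
    for m in ir["modules"]:
        parts.append(
            f"Define module {m['name']} with {len(m['parameters'])} parameters, "
            f"{count_ports(m, 'input')} inputs and {count_ports(m, 'output')} outputs."
        )
        for p in m["ports"]:
            parts.append(f"Port {p['name']} is an {p['dir']} of width {p['width']}.")
        for inst in m["instances"]:
            parts.append(f"Instantiate submodule {inst['module']} as {inst['alias']}.")
    return " ".join(parts)


example = {
    "modules": [{
        "name": "adder_16bit",
        "function": "Performs 16-bit addition with carry in/out.",
        "parameters": [{"name": "WIDTH", "value": 16}],
        "ports": [
            {"name": "a", "dir": "input", "width": 16},
            {"name": "b", "dir": "input", "width": 16},
            {"name": "sum", "dir": "output", "width": 16},
        ],
        "instances": [],
        "connections": [],
    }]
}
print(build_prompt_from_gir(example))
```

For the example, the prompt begins with "Define module adder_16bit with 1 parameters, 2 inputs and 1 outputs." followed by one sentence per port. Because the mapping is purely rule-based, the same GIR always yields the same prompt.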
Retrieval-Augmented Generation (RAG): The RAG component builds a database of 12,500+ well-formed single-module Verilog codes, each paired with a concise semantic descriptor. During inference:
- The GIR’s "function" fields are embedded using a pretrained encoder.
- The top-3 most relevant code snippets are retrieved by cosine similarity between the query embedding and the stored descriptor embeddings.
- Optionally, a cross-encoder reranks candidates.
- Retrieved snippets are appended to the generative prompt for the final LLM code synthesis.
This retraining-free approach infuses domain-specific knowledge and canonical patterns, addressing domain errors without additional retriever fine-tuning.
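The retrieval step reduces to ranking stored descriptor embeddings by cosine similarity, $\mathrm{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$, and keeping the best three. A minimal sketch in pure Python follows. The function names, the snippet identifiers, and the three-dimensional toy vectors are invented for illustration; a real system would obtain embeddings from the pretrained encoder and would likely use a vector index.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(name, cosine(query, emb)) for name, emb in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings, invented for illustration only.
corpus = [
    ("adder_16bit", [0.9, 0.1, 0.0]),
    ("mux_4to1", [0.0, 1.0, 0.0]),
    ("counter_4bit", [0.1, 0.0, 0.9]),
    ("alu_8bit", [0.7, 0.3, 0.1]),
]
query = [1.0, 0.0, 0.0]  # stands in for the embedding of a GIR "function" field
for name, score in top_k(query, corpus):
    print(f"{score:.3f}  {name}")
```

An optional cross-encoder reranker would slot in after `top_k`, rescoring only the short candidate list.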
4. Dataset and Benchmarking: ComplexVDB
ComplexVCoder is benchmarked using the ComplexVDB suite, which includes 55 real-world, structurally diverse Verilog designs (average 151 lines, 3 hierarchy levels). Each case encompasses:
- A concise NL description,
- A hand-crafted testbench,
- An end-to-end verification pipeline (Icarus/Yosys + Pyverilog).
Comparison to prior datasets:
| Benchmark | Size | Avg. Lines | Avg. Hierarchies |
|---|---|---|---|
| VerilogEval | 156 | 22 | 1 |
| RTLLM | 50 | 73 | 1 |
| ComplexVDB | 55 | 151 | 3 |
Evaluation relies on pass@k (the probability that at least one of k generated samples passes both syntax and functional correctness checks), complemented by syntax-only and function-only pass rates for finer granularity.
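The text does not state how pass@k is estimated. A common choice in code-generation benchmarks is the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, computed per problem from $n$ generated samples of which $c$ pass, then averaged over problems. The sketch below assumes that estimator; whether ComplexVCoder uses exactly this procedure is an assumption, and the function names are illustrative.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With this estimator, a problem with 10 samples and 3 correct ones contributes $1 - \binom{7}{5}/\binom{10}{5} = 1 - 21/252 \approx 0.917$ to pass@5.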
5. Empirical Results and Component Analysis
On the ComplexVDB benchmark, ComplexVCoder exhibits substantial improvements over SOTA models, particularly in terms of functional correctness. Selected results:
| Model | #Params | pass@1 | pass@5 |
|---|---|---|---|
| Deepseek-V3 | 671B | 47.27% | 63.64% |
| GPT-3.5 | N/A | 34.55% | 58.18% |
| GPT-4o | N/A | 45.45% | 69.09% |
| CodeV (7B) | 7B | 20.00% | 29.09% |
| RTLCoder (6.7B) | 6.7B | 16.27% | 21.43% |
| ComplexVCoder (Qwen2.5-32B) | 32B | 36.36% | 58.18% |
| ComplexVCoder (Deepseek-V3) | 671B | 52.73% | 65.45% |
ComplexVCoder (Qwen2.5-32B) achieves a pass@1 of 36.36%, slightly above GPT-3.5 (34.55%), and matches it at pass@5 (58.18% for both). Against CodeV and RTLCoder, improvements of 14.6%–22.2% in functional correctness are observed.
On Problem-Set-VeriGen, ComplexVCoder (Qwen2.5-32B) attains pass@5 = 82.35%, exceeding ChipGPT (64.7%) and VeriGen (58.8%), and second only to GPT-4o (88.2%).
Ablation results (ComplexVDB, Qwen2.5-32B):
| System Variant | pass@1 | pass@5 |
|---|---|---|
| NL→Verilog (vanilla) | 25.45% | 50.91% |
| + Two-Stage Generation (TSG) | 36.36% | 58.18% |
| TSG, no Instruction Tuning | 32.72% | 54.55% |
| TSG, no Rule-Based Alignment | 29.63% | 50.91% |
| TSG, no RAG | 27.27% | 49.08% |
Each core component—GIR instruction tuning, rule-based alignment, and RAG—demonstrably contributes to performance gains.
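One way to read the ablation table is as percentage-point drops relative to the full two-stage system. The short calculation below uses only the numbers from the table; treating the difference from the full system as a component's contribution is an interpretation made here, not an analysis taken from the source.

```python
# Values copied from the ablation table (ComplexVDB, Qwen2.5-32B), in percent.
full = {"pass@1": 36.36, "pass@5": 58.18}
ablations = {
    "no Instruction Tuning": {"pass@1": 32.72, "pass@5": 54.55},
    "no Rule-Based Alignment": {"pass@1": 29.63, "pass@5": 50.91},
    "no RAG": {"pass@1": 27.27, "pass@5": 49.08},
}


def drops(full_scores: dict, variants: dict) -> dict:
    """Percentage-point drop of each variant relative to the full system."""
    return {
        name: {metric: round(full_scores[metric] - scores[metric], 2) for metric in full_scores}
        for name, scores in variants.items()
    }


for name, delta in drops(full, ablations).items():
    print(f"{name}: -{delta['pass@1']:.2f} pp pass@1, -{delta['pass@5']:.2f} pp pass@5")
```

On these numbers, removing RAG costs the most at pass@1 (about 9.1 points), followed by rule-based alignment (about 6.7) and instruction tuning (about 3.6).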
6. Limitations and Prospective Directions
ComplexVCoder’s methodology emphasizes open, reproducible research by releasing all code, the ComplexVDB dataset, testbenches, and trained weights. Principal conclusions include:
- Two-stage, GIR-based pipelines enhance the synthesis of large, hierarchical RTL designs from concise NL, yielding higher correctness and structural integrity.
- Rule-based alignment and RAG augment LLM guidance without necessitating large-scale retriever retraining.
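- Even comparatively lightweight open-source LLMs (32B parameters in the reported experiments) can approach proprietary systems when embedded in this pipeline: the Qwen2.5-32B configuration matches GPT-3.5 on ComplexVDB, although it still trails GPT-4o and Deepseek-V3.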
Future research directions involve extending the GIR to support SystemVerilog features (interfaces, assertions, clock domains), incorporating formal properties, automating GIR refinement via iterative feedback from failed simulations, and enriching the retrieval corpus with dynamic hardware profiling (latency/power) for multi-objective code synthesis (Zuo et al., 29 Apr 2025).