RTLCoder: LLM for Verilog RTL

Updated 12 March 2026

RTLCoder is an open-source large language model framework that translates natural language instructions into Verilog RTL code with domain-specific fine-tuning.
Its architecture uses standard decoder-only transformers fine-tuned on curated datasets and enhanced via GPTQ 4-bit quantization for efficient local deployment.
Benchmark results demonstrate state-of-the-art performance in both syntax and functional correctness with minimal latency increases on consumer-grade hardware.

RTLCoder is an open-source LLM-based framework for Register Transfer Level (RTL) code generation, specifically targeting Verilog design from natural language descriptions. It is characterized by its domain specialization, compact parameter count (∼7B), suitability for local deployment, and state-of-the-art performance among open-source models on standard code generation and functional correctness benchmarks. RTLCoder’s modular architecture, dataset curation methodologies, and training objectives have influenced a broad spectrum of research in LLM-assisted hardware design.

1. Technical Architecture and Model Design

RTLCoder leverages standard decoder-only transformer architectures, specifically fine-tuning models such as Mistral-7B-v0.1 and DeepSeek-Coder-6.7b. No structural modifications (e.g., adapters, novel attention mechanisms) are introduced; model specialization is achieved entirely through domain-specific fine-tuning and training objectives.

Parameter count: 6.7–7B (Mistral-7B and DeepSeek-Coder-6.7B backbones).
Transformer configuration: 32 layers, hidden size 4096, 32 attention heads per layer (both backbones).
Deployment: Quantized post-training to 4-bit (GPTQ), reducing model size from ~14 GB (FP16) to ~4 GB.
Efficiency: Quantization adds <5% per-token latency on CPU and <3% on GPU, enabling local inference on consumer-grade hardware (e.g., laptops with 8 GB of VRAM).
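The size figures above follow from simple arithmetic on bits per weight. The sketch below is a back-of-the-envelope check, not a measurement: the helper name `weight_footprint_gb` is hypothetical, and the gap between the raw 3.5 GB result and the reported ~4 GB is assumed here to come from quantization metadata (scales, zero points) and any tensors kept at higher precision.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # ~7B parameters, as stated for the Mistral-7B backbone

fp16_gb = weight_footprint_gb(params, 16)  # 14.0 GB
int4_gb = weight_footprint_gb(params, 4)   # 3.5 GB of raw 4-bit weights

# Reduction for raw weights only, ignoring quantization metadata.
raw_reduction = 1 - int4_gb / fp16_gb      # 0.75

# Reduction implied by the reported ~14 GB -> ~4 GB figures.
reported_reduction = 1 - 4 / 14            # about 0.71

print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
print(f"raw reduction: {raw_reduction:.0%}, reported: {reported_reduction:.0%}")
```

The reported footprint therefore corresponds to a reduction of roughly 70%, slightly below the 75% ideal for a pure 16-bit to 4-bit conversion.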

RTLCoder supports a fully open-source installation and usage pipeline, allowing model, dataset, and inference code to be deployed locally, facilitating industry use in privacy-sensitive environments (Liu et al., 2023).

2. Training Pipeline and Data Curation

RTLCoder’s performance is driven by the design and scale of its supervised training corpus and a code-quality–aware training objective.

Data generation: An automated three-phase GPT-3.5 pipeline produces ~27,000 instruction–reference code pairs.
- Phase 1: Extraction of hundreds of RTL block keywords (e.g., “Wallace tree multiplier”).
- Phase 2: Natural language instruction generation via random sampling and rule-based mutation.
- Phase 3: Reference code generation, synthesizing 5 candidate Verilog implementations per instruction, filtered for syntactic validity via Pyverilog.
Deduplication: Samples with ROUGE-L > 0.5 similarity to test/benchmark cases are removed to prevent contamination.
Context length: 2048 tokens per sample (instruction + code).
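The deduplication step can be illustrated with a small pure-Python ROUGE-L filter. This is a sketch under stated assumptions rather than the authors' script: it uses whitespace tokenization and the F1 form of ROUGE-L (the source does not say which variant or tokenizer was used), and the function names `rouge_l_f1` and `decontaminate` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (assumed variant)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def decontaminate(train_samples, benchmark_samples, threshold=0.5):
    """Drop training samples whose ROUGE-L to any benchmark item exceeds threshold."""
    return [
        s for s in train_samples
        if all(rouge_l_f1(s, b) <= threshold for b in benchmark_samples)
    ]


benchmark = ["Implement a 4-bit ripple carry adder with carry out"]
train = [
    "Implement a 4-bit ripple carry adder with carry in",  # near-duplicate
    "Design a FIFO buffer with depth 16",                  # unrelated
]
kept = decontaminate(train, benchmark)
print(kept)
```

In this toy example the near-duplicate instruction shares 8 of 9 tokens in order with the benchmark item and is removed, while the unrelated instruction scores 0.25 and is kept.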

Training objective:

The loss is the sum of the usual maximum-likelihood loss and a code-quality ranking term. Given K candidates with normalized log-probabilities, candidates with higher external quality scores (z) are trained to outscore lower-quality candidates by a margin λ:

$\text{loss} = \text{loss}_{\text{MLE}} + \text{loss}_{\text{compare}}$

$\text{loss}_{\text{compare}} = \sum_{k, \tau: z_k < z_\tau} \max(s_k-s_\tau+\lambda, 0)$

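where $s_k$ is the softmax-normalized average token log-probability of candidate $k$, and the sum runs over all candidate pairs in which candidate $k$ has a lower quality score than candidate $\tau$.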

A batch size of 256 sequences is used, distributed over 4×RTX 4090 GPUs (Liu et al., 2023).
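The ranking term can be made concrete with a few lines of plain Python. The sketch below only evaluates $\text{loss}_{\text{compare}}$ for one instruction given per-candidate average log-probabilities and quality scores; it is not the training code, the MLE term is omitted, and the function name `compare_loss` and the margin value 0.1 are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compare_loss(avg_logprobs, quality, margin):
    """Sum of max(s_k - s_tau + margin, 0) over pairs with quality[k] < quality[tau]."""
    s = softmax(avg_logprobs)
    loss = 0.0
    for k in range(len(s)):
        for tau in range(len(s)):
            if quality[k] < quality[tau]:
                loss += max(s[k] - s[tau] + margin, 0.0)
    return loss


quality = [0.3, 0.9, 0.6]  # external quality scores z for K = 3 candidates

# Model prefers the lowest-quality candidate: the term is positive.
misranked = compare_loss([-0.2, -0.9, -0.5], quality, margin=0.1)

# Model ranking agrees with quality by more than the margin: the term is zero.
well_ranked = compare_loss([-3.0, -0.1, -1.5], quality, margin=0.1)

print(misranked, well_ranked)
```

The hinge form means correctly ordered pairs that are already separated by at least λ contribute nothing, so the gradient concentrates on misranked or barely separated candidates.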

3. Quantization, Deployment, and Privacy

RTLCoder applies post-training GPTQ 4-bit quantization over all parameters. Each tensor row is linearly quantized, mapping weights to 4 bits via per-row scale and zero point, resulting in:

$q = \text{clip}(\text{round}(w/\text{scale}) + \text{zero\_point}, 0, 15)$
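As a minimal sketch of the linear mapping in this formula, the code below quantizes one weight row to 4-bit codes and reconstructs it. It illustrates only the per-row scale and zero-point arithmetic; the full GPTQ procedure additionally adjusts weights to compensate for rounding error using second-order information, which is not shown. The min/max rule for choosing the scale and the function names are assumptions.

```python
def quantize_row(weights, n_bits=4):
    """Asymmetric per-row quantization: q = clip(round(w / scale) + zero_point, 0, qmax)."""
    qmax = 2 ** n_bits - 1  # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:
        scale = 1.0  # guard for a constant row
    zero_point = round(-w_min / scale)
    q = [min(max(round(w / scale) + zero_point, 0), qmax) for w in weights]
    return q, scale, zero_point


def dequantize_row(q, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


row = [-0.8, -0.1, 0.0, 0.3, 0.7]
codes, scale, zero_point = quantize_row(row)
restored = dequantize_row(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(row, restored))

print(codes, scale, zero_point)
print(max_error)
```

When no value is clipped, the reconstruction error of each weight is bounded by half the quantization step, which is why rows with a narrow value range lose little precision.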

Footprint reduction: ≈70% reduction compared to FP16.
Runtime efficiency: <5% latency increase per token on CPU, <3% on GPU.
Privacy: All computation is local; no code or data is transmitted externally. All datasets and code are open-source, enabling complete auditability and retraining on private corpora.

The full deployment workflow is encapsulated by a pip package, checkpoint download, and simple API for prompting:

from rtlcoder import RTLModel
model = RTLModel.load_local("rtlcoder-7b-4bit")
code = model.generate("Design a pipelined 4-bit Booth multiplier")

4. Benchmarking and Empirical Performance

RTLCoder is evaluated on several public, open-source benchmarks, including VerilogEval [Liu et al., ICCAD’23] and RTLLM v1.1 [ASP-DAC’24], using pass@k (fraction of tasks for which at least one of k model outputs passes all correctness checks), syntax checking, and simulation-based functional validation.
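In practice pass@k is usually estimated from n ≥ k samples per task rather than from exactly k samples. The sketch below assumes the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where c of the n samples pass; the source does not state the exact sampling setup, and the function names and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Mean per-task pass@k over a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task outcomes: (samples generated, samples passing the testbench).
results = [(10, 3), (10, 0), (10, 10)]

p1 = benchmark_pass_at_k(results, k=1)
p5 = benchmark_pass_at_k(results, k=5)
print(f"pass@1 = {p1:.3f}, pass@5 = {p5:.3f}")
```

For k = 1 the estimator reduces to the fraction of passing samples per task, so pass@1 is simply the mean per-task success rate.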

Model	VerilogEval-Machine pass@1 (%)	VerilogEval-Machine pass@5 (%)	RTLLM functional correctness (%)
GPT-3.5	46.7	69.1	33.0
GPT-4	60.0	70.6	50.0
RTLCoder-Mistral (7B)	62.5	72.2	48.3
RTLCoder-DeepSeek (6.7B)	61.2	76.5	40.0

Key findings:

RTLCoder-Mistral exceeds GPT-4 in pass@1 on VerilogEval-Machine by 2.5 percentage points (62.5% vs. 60.0%) and comes close to GPT-4 on RTLLM functional correctness (48.3% vs. 50.0%).
4-bit quantization costs only 1–2 percentage points (absolute) of benchmark performance while yielding substantial memory savings.

5. Datasets, Verification, and Quality Control

RTLCoder was developed concurrently with major open-source RTL datasets and verification workflows:

RTLCoder-Data (see OpenLLM-RTL; Liu et al., 19 Mar 2025): 80K instruction–code pairs, with a 7K subset formally verified using SystemVerilog Assertions and JasperGold.
RTLLM 2.0: 50 hand-crafted benchmarks with design specs, testbenches, and golden RTL for functional and PPA evaluation.
AssertEval: Benchmarks for SystemVerilog assertions, enabling functional property-checking against LLM output.
Datasets are diverse: compression ratio CR=4.21–4.32, extracted from hundreds of logic block types, supporting both instruction→code and spec→assertion pipelines.

Saturation studies indicate direct fine-tuning on larger datasets (50K+ samples) improves pass@k without plateauing at 80K, while smaller, formally-verified subsets (e.g., 7K) reach near-maximal accuracy per training epoch (Liu et al., 19 Mar 2025).

6. Comparison to Successors and Extensions

Subsequent research has extended RTLCoder’s methodological base:

RTLRepoCoder (Wu et al., 11 Apr 2025):
- Trains on real multi-file GitHub repos, extends context length to 10,240 tokens, and introduces retrieval-augmented generation (RAG) for repository-level code completion.
- Achieves 84.3% Edit Similarity, 55.8% Exact Match on the RTL-Repo benchmark, surpassing GPT-4 and RTLCoder.
VeriCoder (Wei et al., 22 Apr 2025):
- Focuses on functional correctness by constructing a 125K-sample dataset through LLM-driven unit test generation and simulation, yielding up to a 71.7% relative improvement on VerilogEval pass@1 over RTLCoder.
DeepRTL2 (Liu et al., 28 May 2025):
- Unifies code generation and embedding-based tasks (semantic search, equivalence, PPA prediction), applying curriculum learning, GRIT multi-task training, and hard negatives.
Spec2RTL-Agent (Yu et al., 16 Jun 2025):
- Goes beyond single-turn LLM prompting, orchestrating multi-agent decomposition, iterative code refinement (pseudocode→Python→C++→RTL via HLS), and agent-based reflection for complex, multi-page natural-language specifications.

RTLCoder’s signature pipeline (synthetic dataset curation, a quality-aware objective, open-source efficiency, and privacy) has become a canonical reference point in LLM-driven EDA research. Ongoing work explores larger verified datasets, integration with formal methods, and architectural improvements to context handling and cross-file reasoning.

7. Limitations and Open Research Questions

Limitations acknowledged in the RTLCoder literature and by subsequent work include:

Context window constraint: 2048-token window limits cross-file or repository-level guidance; later frameworks expanded this (e.g., RTLRepoCoder’s 10,240 tokens).
Synthetic data bias: Dataset generated via LLMs may not capture the full diversity or complexity of real-world code.
Functional correctness gaps: Early RTLCoder training data was validated only syntactically, not via simulation or assertion-based property checking, so semantic bugs could go undetected.
Absence of advanced verification: Lacks in-loop formal equivalence or PPA optimization during generation.

Future research directions, reflected in sequels, target larger and functionally validated training sets, sophisticated retrieval for repository-scale code, multi-agent planning and reflection, and formal co-verification workflows during generation (Wu et al., 11 Apr 2025, Wei et al., 22 Apr 2025, Liu et al., 19 Mar 2025, Liu et al., 28 May 2025, Yu et al., 16 Jun 2025).