NL-to-RTL Datasets: Scale, Synthesis & Benchmarks
- Natural language-to-RTL datasets are structured resources that pair NL specifications with RTL code, testbenches, and annotations to benchmark hardware synthesis models.
- They are constructed via manual authoring, LLM-driven generation, and verification-based filtering, methodologies intended to ensure both syntactic correctness and functional validity of the included hardware designs.
- Rigorous benchmarking protocols measuring syntactic correctness, functional accuracy, and design quality reveal that scale and data diversity directly enhance model performance.
Natural language-to-RTL datasets are structured resources pairing natural-language specifications with register-transfer level (RTL) hardware code—typically Verilog or SystemVerilog—and, increasingly, with testbenches and functional annotations. These datasets enable the training, benchmarking, and evaluation of LLM-based systems for automatic hardware code generation and verification. The rapid evolution of corpus size, annotation methodology, and benchmarking frameworks reflects the emerging demand for robust, large-scale resources suitable for supervised learning and agentic code generation flows.
1. Corpus Taxonomy and Scale
Datasets for natural language-to-RTL (NL→RTL) tasks span training sources (for supervised fine-tuning), held-out benchmarks (for evaluation), and hybrid frameworks that blend data generation with verification. The major open resources released since 2023 include:
| Name/Version | Size | Composition |
|---|---|---|
| RTLCoder-Data (raw) | 80,000 | NL instructions, Verilog code |
| RTLCoder-Data (verified) | 7,000 | Functionally verified NL-Verilog |
| VeriCoder | 125,777 | NL spec, Verilog, self-checking testbench |
| RTLLM 2.0 | 50 | Hand-crafted NL, Verilog, testbench |
| DeepCircuitX | >4,000 repos | Repo/module/block NL, code, PPA |
| VerilogEval (ICCAD/2023) | 156 | NL prompt, full RTL, testbench |
| RTLLM (ASP-DAC/2024) | 22–30 | NL spec, Verilog, testbench |
Training datasets (e.g., RTLCoder-Data, VeriCoder, DeepCircuitX) provide tens to hundreds of thousands of NL-code pairs, with schemas expanding to include functional validation and hierarchical annotations. Benchmarks (RTLLM, VerilogEval) favor hand-crafted examples with rigorous pass/fail testbenches.
A plausible implication is that the scale and diversity of training corpora directly condition the upper bounds on LLM quality and generalization for NL→RTL tasks.
2. Data Construction Methodologies
Dataset compilation spans manual authoring, automated LLM-based generation, and hybrid pipelines with formal or testbench-based verification:
- Manual Specification: Hand-written NL descriptions, RTL, and testbenches (RTLLM, RTLLM 2.0).
- LLM-driven Synthesis: Use of GPT-3.5, GPT-4o-mini, or comparable models for prompt extension, code synthesis, mutation, and automated NL generation (RTLCoder-Data, VeriCoder).
- Verification-Driven Filtering: Automated simulation with Icarus Verilog/VCS; JasperGold-based formal property checking; assertion-based pruning (VeriCoder, OpenLLM-RTL, RTLCoder-Data verified subset).
- Functional Validation Pipeline (VeriCoder; Wei et al., 22 Apr 2025): given the corpus, each example is processed for up to a fixed number of repair attempts:
  1. Generate a unit test for the design via an LLM prompt ("GenTestTpl").
  2. Simulate the design against the generated testbench; on failure, collect the error log.
  3. Refine the testbench using the error feedback ("RefineTpl").
  4. Accept the example only if the refined testbench passes in simulation.
  - Result: each dataset entry contains an NL spec, syntactically correct RTL, and a testbench that passes under simulation.
Such pipelines enable the construction of functionally verified datasets at scale, but require sophisticated LLM prompting and high-throughput simulation environments. A plausible implication is that datasets without functional validation tend to produce models with high syntactic but lower functional correctness.
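A minimal sketch of such a generate-simulate-refine loop, assuming a hypothetical `llm.generate` call, placeholder prompt templates standing in for "GenTestTpl"/"RefineTpl", and Icarus Verilog (`iverilog`/`vvp`) as the simulator; the "ALL TESTS PASSED" pass marker is likewise an assumption, not the exact VeriCoder convention:

```python
import os
import subprocess
import tempfile

# Placeholder prompt templates (hypothetical stand-ins for "GenTestTpl"/"RefineTpl").
GEN_TEST_TPL = "Write a self-checking Verilog testbench for this spec and design:\n{spec}\n{design}"
REFINE_TPL = "The testbench failed with these errors:\n{errors}\nFix the testbench for:\n{spec}\n{design}"

def simulate(design: str, testbench: str):
    """Compile design + testbench with Icarus Verilog, run it, and return (passed, log)."""
    with tempfile.TemporaryDirectory() as tmp:
        d_path = os.path.join(tmp, "design.v")
        t_path = os.path.join(tmp, "tb.v")
        out = os.path.join(tmp, "sim.out")
        with open(d_path, "w") as f:
            f.write(design)
        with open(t_path, "w") as f:
            f.write(testbench)
        compile_ = subprocess.run(["iverilog", "-o", out, d_path, t_path],
                                  capture_output=True, text=True)
        if compile_.returncode != 0:
            return False, compile_.stderr
        run = subprocess.run(["vvp", out], capture_output=True, text=True, timeout=60)
        log = run.stdout + run.stderr
        # Assumed convention: the self-checking testbench prints this marker on success.
        return run.returncode == 0 and "ALL TESTS PASSED" in log, log

def validate_example(spec: str, design: str, llm, max_attempts: int = 3):
    """Generate-simulate-refine loop; returns a validated entry or None if it never passes."""
    testbench = llm.generate(GEN_TEST_TPL.format(spec=spec, design=design))
    for _ in range(max_attempts):
        passed, log = simulate(design, testbench)
        if passed:
            return {"spec": spec, "design": design, "test": testbench}
        testbench = llm.generate(REFINE_TPL.format(spec=spec, design=design, errors=log))
    return None  # discard examples that never pass simulation
```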
3. Annotation Schemas and Data Formats
Recent corpora emphasize rich schema that support multi-task learning and interpretable benchmarking:
- Flat NL→RTL Pairs: a simple JSONL row with `id`, `instruction` (NL), `code` (Verilog), `keywords`, and `category`. Example (RTLCoder-Data):
  `{ "id": 123, "instruction": "...", "code": "...", "keywords": [...], "category": "arithmetic" }`
- Validated Examples with Testbenches (VeriCoder): each example is `{ "id": ..., "spec": "...", "design": "...", "test": "..." }`, where `test` is a self-checking Verilog testbench.
- Hierarchical Annotations (DeepCircuitX; Li et al., 25 Feb 2025): a repository-level `repo_annotation`; per module, `module_code` and `module_annotation`, plus a list of blocks `{ block_type, block_code, block_annotation }`.
- Benchmark Directories (RTLLM, VerilogEval): `design_description.txt`, `testbench.v`, and `designer_RTL.v`, plus supporting synthesis/simulation scripts, organized in a separate directory per module.
Functional testbenches are typically written in Verilog, using `$display` or assertion statements to encode pass/fail criteria. A plausible implication is that data formats supporting hierarchical context (repo → module → block) facilitate both code understanding and hybrid generation tasks.
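A brief loading sketch for the flat and validated JSONL schemas above; the field names follow the example rows, while the file names and the completeness check are assumptions rather than part of any released tooling:

```python
import json

# Required fields per schema, following the example rows shown above.
FLAT_KEYS = {"id", "instruction", "code"}          # RTLCoder-Data style
VALIDATED_KEYS = {"id", "spec", "design", "test"}  # VeriCoder style

def load_jsonl(path, required):
    """Load a JSONL dataset, keeping only rows that contain all required fields."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            if required.issubset(row.keys()):
                rows.append(row)
    return rows

# Hypothetical file names; build (instruction, code) pairs for supervised fine-tuning.
flat = load_jsonl("rtlcoder_data.jsonl", FLAT_KEYS)
pairs = [(row["instruction"], row["code"]) for row in flat]
validated = load_jsonl("vericoder_data.jsonl", VALIDATED_KEYS)
```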
4. Benchmarks, Metrics, and Evaluation Protocols
Benchmarking protocols for NL→RTL models employ stringent validation:
- Syntax Accuracy: Parsing and synthesis of generated RTL without error (Synopsys Design Compiler, VCS).
- Functional Accuracy: Simulation of RTL against a golden testbench; must pass all test vectors.
- Pass@k: given $n$ candidate generations of which $c$ pass the testbench, $\text{pass@}k = 1 - \binom{n-c}{k}/\binom{n}{k}$, averaged over benchmark problems (a short computation sketch follows this list).
- Design Quality (PPA): Measurement of post-synthesized area, power, and worst negative slack (WNS); normalized against reference designs.
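A minimal computation sketch for the pass@k estimator above; the per-problem success counts are hypothetical:

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generations of which c pass the testbench, is functionally correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem success counts out of n = 20 generations per problem.
successes = [0, 3, 20, 1, 7]
pass_at_1 = sum(pass_at_k(20, c, 1) for c in successes) / len(successes)
```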
Modern evaluation suites (RTLLM, VerilogEval) also classify failure modes as syntax errors, simulation timeouts, reset mismatches, etc.—providing actionable diagnostics and insights into systematic model weaknesses.
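As one way to automate such failure labeling, a small log classifier; the regular-expression patterns are illustrative assumptions, not the exact rules used by RTLLM or VerilogEval:

```python
import re

# Illustrative patterns only; real suites derive categories from their own tool logs.
FAILURE_RULES = [
    ("syntax_error",   re.compile(r"syntax error|parse error", re.IGNORECASE)),
    ("sim_timeout",    re.compile(r"timeout|did not finish", re.IGNORECASE)),
    ("reset_mismatch", re.compile(r"reset value|wrong value after reset", re.IGNORECASE)),
    ("value_mismatch", re.compile(r"mismatch|expected .* got", re.IGNORECASE)),
]

def classify_failure(log: str) -> str:
    """Map a compile/simulation log onto a coarse failure category."""
    for label, pattern in FAILURE_RULES:
        if pattern.search(log):
            return label
    return "unknown"
```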
Human-in-the-loop augmentation is applied where testbenches are missing (e.g., the PEFA-AI authors manually wrote testbenches for RTLLM 1.1).
A plausible implication is that benchmarks lacking functional verification or failure classification may yield inflated pass rates, masking deeper code defects.
5. Impact of Data Quality, Coverage, and Verification
Empirical studies demonstrate substantial gains in functional correctness for models trained on functionally validated corpora:
- The VeriCoder dataset (functionally validated, ~125k examples) yields pass@1 of 55.7% on VerilogEval, versus 35.9% for the syntax-only OriGen dataset (Wei et al., 22 Apr 2025).
- Ablation shows that validated data improves functional correctness by ~3.5 percentage points over unvalidated sets.
- Scaling dataset size (e.g., RTLCoder-Data: 5k→80k) monotonically increases pass rates, with no observed saturation (OpenLLM-RTL; Liu et al., 19 Mar 2025).
- Filtering for prompt–benchmark overlap (Rouge-L < 0.3) minimizes leakage into evaluation benchmarks and supports reliable generalization estimates (see the sketch after this list).
- Verification-based curation (JasperGold SVA) reduces false positives but may discard correct samples if assertion synthesis fails.
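A sketch of the Rouge-L overlap filter described above, using a plain longest-common-subsequence F-measure over whitespace tokens; actual pipelines may tokenize or score slightly differently:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(candidate: str, reference: str) -> float:
    """Rouge-L F-measure over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def keep_training_example(instruction, benchmark_prompts, threshold=0.3):
    """Drop a training instruction that overlaps too much with any benchmark prompt."""
    return all(rouge_l_f(instruction, p) < threshold for p in benchmark_prompts)
```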
A plausible implication is that functional correctness validation—either via simulation-based testbenches or formal property checking—is essential for training models suited to deployment in real EDA flows.
6. Best Practices and Open Challenges
The literature synthesizes several best practices for future dataset development:
- Use clear, concise NL specifications, including explicit port lists and well-specified timing/state-machine behavior.
- Ensure coverage across combinational logic, sequential circuits, parameterization, and complex control (FSMs, pipelines).
- Automate failure labeling and error analysis (via tool-driven logs and classifier scripts).
- Prefer functionally validated training samples and low-overlap with benchmarks to prevent data leakage.
- Adopt hierarchical annotation for repository-level datasets to enable multi-granular training (DeepCircuitX).
Challenges persist: true functional equivalence is not guaranteed by testbenches (corner cases may be missed), assertion-based filtering depends on correct property synthesis, and extremely complex RTL designs (multi-clock, co-simulation) are underrepresented. Licensing and IP constraints may also limit broader distribution and adoption.
7. Licensing, Access, and Future Directions
Licensing for NL→RTL datasets is typically Apache 2.0 (RTLCoder, RTLLM), CC-BY, or similar open licenses, with some resources (e.g., DeepCircuitX, OpenLLM-RTL) recommending direct review of repository LICENSE files. Most datasets are available via GitHub or dedicated repositories, with scripts provided for download, preprocessing, and baseline evaluation.
A plausible implication for future work is the continued expansion of scale, diversity, and functional annotation in NL-to-RTL datasets, with emphasis on agentic flows, assertion-based verification, and fine-grained error reporting. Such resources, spanning RTLCoder-Data, RTLLM 2.0, AssertEval, and beyond, have demonstrably accelerated the empirical and methodological rigor of LLM-based hardware design research, enabling both apples-to-apples benchmarking and robust model development.