NL-to-RTL Datasets: Scale, Synthesis & Benchmarks
- Natural language-to-RTL datasets are structured resources that pair NL specifications with RTL code, testbenches, and annotations to benchmark hardware synthesis models.
- They are constructed via manual authoring, LLM-driven generation, and verification-based filtering, methodologies intended to ensure both syntactic correctness and functional validity of the included hardware designs.
- Rigorous benchmarking protocols measuring syntactic correctness, functional accuracy, and design quality reveal that scale and data diversity directly enhance model performance.
Natural language-to-RTL datasets are structured resources pairing natural-language specifications with register-transfer level (RTL) hardware code—typically Verilog or SystemVerilog—and, increasingly, with testbenches and functional annotations. These datasets enable the training, benchmarking, and evaluation of LLM-based systems for automatic hardware code generation and verification. The rapid evolution of corpus size, annotation methodology, and benchmarking frameworks reflects the emerging demand for robust, large-scale resources suitable for supervised learning and agentic code generation flows.
1. Corpus Taxonomy and Scale
Datasets for natural language-to-RTL (NL→RTL) tasks span training sources (for supervised fine-tuning), held-out benchmarks (for evaluation), and hybrid frameworks that blend data generation with verification. The major open resources released since 2023 include:
| Name/Version | Size | Composition |
|---|---|---|
| RTLCoder-Data (raw) | 80,000 | NL instructions, Verilog code |
| RTLCoder-Data (verified) | 7,000 | Functionally verified NL-Verilog |
| VeriCoder | 125,777 | NL spec, Verilog, self-checking testbench |
| RTLLM 2.0 | 50 | Hand-crafted NL, Verilog, testbench |
| DeepCircuitX | >4,000 repos | Repo/module/block NL, code, PPA |
| VerilogEval (ICCAD/2023) | 156 | NL prompt, full RTL, testbench |
| RTLLM (ASP-DAC/2024) | 22–30 | NL spec, Verilog, testbench |
Training datasets (e.g., RTLCoder-Data, VeriCoder, DeepCircuitX) provide tens to hundreds of thousands of NL-code pairs, with schemas expanding to include functional validation and hierarchical annotations. Benchmarks (RTLLM, VerilogEval) favor hand-crafted examples with rigorous pass/fail testbenches.
A plausible implication is that the scale and diversity of training corpora directly condition the upper bounds on LLM quality and generalization for NL→RTL tasks.
2. Data Construction Methodologies
Dataset compilation spans manual authoring, automated LLM-based generation, and hybrid pipelines with formal or testbench-based verification:
- Manual Specification: Hand-written NL descriptions, RTL, and testbenches (RTLLM, RTLLM 2.0).
- LLM-driven Synthesis: Use of GPT-3.5, GPT-4o-mini, or comparable models for prompt extension, code synthesis, mutation, and automated NL generation (RTLCoder-Data, VeriCoder).
- Verification-Driven Filtering: Automated simulation with Icarus Verilog/VCS; JasperGold-based formal property checking; assertion-based pruning (VeriCoder, OpenLLM-RTL, RTLCoder-Data verified subset).
- Functional Validation Pipeline (VeriCoder; Wei et al., 22 Apr 2025): given the corpus, each example is processed for up to a fixed number of repair attempts:
  1. Generate a unit test for the design via an LLM prompt ("GenTestTpl").
  2. Simulate the design against the generated testbench; on failure, collect the error log.
  3. Refine the testbench using the error feedback ("RefineTpl").
  4. Accept the example only if the refined testbench passes in simulation.
  - Result: each dataset entry contains an NL spec, syntactically correct RTL, and a testbench that passes under simulation.
Such pipelines enable the construction of functionally verified datasets at scale, but require sophisticated LLM prompting and high-throughput simulation environments. A plausible implication is that datasets without functional validation tend to produce models with high syntactic but lower functional correctness.
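A minimal sketch of such a generate-simulate-refine loop, assuming a hypothetical `llm.generate` call, placeholder prompt templates standing in for "GenTestTpl"/"RefineTpl", and Icarus Verilog (`iverilog`/`vvp`) as the simulator; the "ALL TESTS PASSED" pass marker is likewise an assumption, not the exact VeriCoder convention:

```python
import os
import subprocess
import tempfile

# Placeholder prompt templates (hypothetical stand-ins for "GenTestTpl"/"RefineTpl").
GEN_TEST_TPL = "Write a self-checking Verilog testbench for this spec and design:\n{spec}\n{design}"
REFINE_TPL = "The testbench failed with these errors:\n{errors}\nFix the testbench for:\n{spec}\n{design}"

def simulate(design: str, testbench: str):
    """Compile design + testbench with Icarus Verilog, run it, and return (passed, log)."""
    with tempfile.TemporaryDirectory() as tmp:
        d_path = os.path.join(tmp, "design.v")
        t_path = os.path.join(tmp, "tb.v")
        out = os.path.join(tmp, "sim.out")
        with open(d_path, "w") as f:
            f.write(design)
        with open(t_path, "w") as f:
            f.write(testbench)
        compile_ = subprocess.run(["iverilog", "-o", out, d_path, t_path],
                                  capture_output=True, text=True)
        if compile_.returncode != 0:
            return False, compile_.stderr
        run = subprocess.run(["vvp", out], capture_output=True, text=True, timeout=60)
        log = run.stdout + run.stderr
        # Assumed convention: the self-checking testbench prints this marker on success.
        return run.returncode == 0 and "ALL TESTS PASSED" in log, log

def validate_example(spec: str, design: str, llm, max_attempts: int = 3):
    """Generate-simulate-refine loop; returns a validated entry or None if it never passes."""
    testbench = llm.generate(GEN_TEST_TPL.format(spec=spec, design=design))
    for _ in range(max_attempts):
        passed, log = simulate(design, testbench)
        if passed:
            return {"spec": spec, "design": design, "test": testbench}
        testbench = llm.generate(REFINE_TPL.format(spec=spec, design=design, errors=log))
    return None  # discard examples that never pass simulation
```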
3. Annotation Schemas and Data Formats
Recent corpora emphasize rich schema that support multi-task learning and interpretable benchmarking:
- Flat NL→RTL Pairs: a simple JSONL row with `id`, `instruction` (NL), `code` (Verilog), `keywords`, and `category`. Example (RTLCoder-Data):
  `{ "id": 123, "instruction": "...", "code": "...", "keywords": [...], "category": "arithmetic" }`
- Validated Examples with Testbenches (VeriCoder): each example is `{ "id": ..., "spec": "...", "design": "...", "test": "..." }`, where `test` is a self-checking Verilog testbench.
- Hierarchical Annotations (DeepCircuitX; Li et al., 25 Feb 2025): a repository-level `repo_annotation`; per module, `module_code` and `module_annotation`, plus a list of blocks `{ block_type, block_code, block_annotation }`.
- Benchmark Directories (RTLLM, VerilogEval): `design_description.txt`, `testbench.v`, and `designer_RTL.v`, plus supporting synthesis/simulation scripts, organized in a separate directory per module.
Functional testbenches are typically written in Verilog, using `$display` or assertion statements to encode pass/fail criteria. A plausible implication is that data formats supporting hierarchical context (repo → module → block) facilitate both code understanding and hybrid generation tasks.
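A brief loading sketch for the flat and validated JSONL schemas above; the field names follow the example rows, while the file names and the completeness check are assumptions rather than part of any released tooling:

```python
import json

# Required fields per schema, following the example rows shown above.
FLAT_KEYS = {"id", "instruction", "code"}          # RTLCoder-Data style
VALIDATED_KEYS = {"id", "spec", "design", "test"}  # VeriCoder style

def load_jsonl(path, required):
    """Load a JSONL dataset, keeping only rows that contain all required fields."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            if required.issubset(row.keys()):
                rows.append(row)
    return rows

# Hypothetical file names; build (instruction, code) pairs for supervised fine-tuning.
flat = load_jsonl("rtlcoder_data.jsonl", FLAT_KEYS)
pairs = [(row["instruction"], row["code"]) for row in flat]
validated = load_jsonl("vericoder_data.jsonl", VALIDATED_KEYS)
```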
4. Benchmarks, Metrics, and Evaluation Protocols
Benchmarking protocols for NL→RTL models employ stringent validation:
- Syntax Accuracy: Parsing and synthesis of generated RTL without error (Synopsys Design Compiler, VCS).
- Functional Accuracy: Simulation of RTL against a golden testbench; must pass all test vectors.
- Pass@k: given $n$ candidate generations of which $c$ pass the testbench, $\text{pass@}k = 1 - \binom{n-c}{k}/\binom{n}{k}$, averaged over benchmark problems (a short computation sketch follows this list).
- Design Quality (PPA): Measurement of post-synthesized area, power, and worst negative slack (WNS); normalized against reference designs.
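A minimal computation sketch for the pass@k estimator above; the per-problem success counts are hypothetical:

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generations of which c pass the testbench, is functionally correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem success counts out of n = 20 generations per problem.
successes = [0, 3, 20, 1, 7]
pass_at_1 = sum(pass_at_k(20, c, 1) for c in successes) / len(successes)
```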
Modern evaluation suites (RTLLM, VerilogEval) also classify failure modes as syntax errors, simulation timeouts, reset mismatches, etc.—providing actionable diagnostics and insights into systematic model weaknesses.
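As one way to automate such failure labeling, a small log classifier; the regular-expression patterns are illustrative assumptions, not the exact rules used by RTLLM or VerilogEval:

```python
import re

# Illustrative patterns only; real suites derive categories from their own tool logs.
FAILURE_RULES = [
    ("syntax_error",   re.compile(r"syntax error|parse error", re.IGNORECASE)),
    ("sim_timeout",    re.compile(r"timeout|did not finish", re.IGNORECASE)),
    ("reset_mismatch", re.compile(r"reset value|wrong value after reset", re.IGNORECASE)),
    ("value_mismatch", re.compile(r"mismatch|expected .* got", re.IGNORECASE)),
]

def classify_failure(log: str) -> str:
    """Map a compile/simulation log onto a coarse failure category."""
    for label, pattern in FAILURE_RULES:
        if pattern.search(log):
            return label
    return "unknown"
```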
Human-in-the-loop augmentation is applied where testbenches are missing (e.g., the PEFA-AI authors manually wrote testbenches for RTLLM 1.1).
A plausible implication is that benchmarks lacking functional verification or failure classification may yield inflated pass rates, masking deeper code defects.
5. Impact of Data Quality, Coverage, and Verification
Empirical studies demonstrate substantial gains in functional correctness for models trained on functionally validated corpora:
- The VeriCoder dataset (functionally validated, ~125k examples) yields pass@1 of 55.7% on VerilogEval, versus 35.9% for the syntax-only OriGen dataset (Wei et al., 22 Apr 2025).
- Ablation shows that validated data improves functional correctness by ~3.5 percentage points over unvalidated sets.
- Scaling dataset size (e.g., RTLCoder-Data: 5k→80k) monotonically increases pass rates, with no observed saturation (OpenLLM-RTL; Liu et al., 19 Mar 2025).
- Filtering for prompt–benchmark overlap (Rouge-L < 0.3) minimizes leakage into evaluation benchmarks and supports reliable generalization estimates (see the sketch after this list).
- Verification-based curation (JasperGold SVA) reduces false positives but may discard correct samples if assertion synthesis fails.
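A sketch of the Rouge-L overlap filter described above, using a plain longest-common-subsequence F-measure over whitespace tokens; actual pipelines may tokenize or score slightly differently:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(candidate: str, reference: str) -> float:
    """Rouge-L F-measure over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def keep_training_example(instruction, benchmark_prompts, threshold=0.3):
    """Drop a training instruction that overlaps too much with any benchmark prompt."""
    return all(rouge_l_f(instruction, p) < threshold for p in benchmark_prompts)
```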
A plausible implication is that functional correctness validation—either via simulation-based testbenches or formal property checking—is essential for training models suited to deployment in real EDA flows.
6. Best Practices and Open Challenges
The literature synthesizes several best practices for future dataset development:
- Use clear, concise NL specifications, including explicit port lists and well-specified timing/state-machine behavior.
- Ensure coverage across combinational logic, sequential circuits, parameterization, and complex control (FSMs, pipelines).
- Automate failure labeling and error analysis (via tool-driven logs and classifier scripts).
- Prefer functionally validated training samples and low-overlap with benchmarks to prevent data leakage.
- Adopt hierarchical annotation for repository-level datasets to enable multi-granular training (DeepCircuitX).
Challenges persist: true functional equivalence is not guaranteed by testbenches (corner cases may be missed), assertion-based filtering depends on correct property synthesis, and extremely complex RTL designs (multi-clock, co-simulation) are underrepresented. Licensing and IP constraints may also limit broader distribution and adoption.
7. Licensing, Access, and Future Directions
Licensing for NL→RTL datasets is typically Apache 2.0 (RTLCoder, RTLLM), CC-BY, or similar open licenses, with some resources (e.g., DeepCircuitX, OpenLLM-RTL) recommending direct review of repository LICENSE files. Most datasets are available via GitHub or dedicated repositories, with scripts provided for download, preprocessing, and baseline evaluation.
A plausible implication for future work is the continued expansion of scale, diversity, and functional annotation in NL-to-RTL datasets, with emphasis on agentic flows, assertion-based verification, and fine-grained error reporting. Such resources, spanning RTLCoder-Data, RTLLM 2.0, AssertEval, and beyond, have demonstrably accelerated the empirical and methodological rigor of LLM-based hardware design research, enabling both apples-to-apples benchmarking and robust model development.