iDATA Dataset: AI-Driven EDA Insights

Updated 11 November 2025

iDATA is a large-scale, multi-modal dataset providing structured vector representations from 50 real 28nm designs for AI-driven EDA research.
It uses a comprehensive design-to-vector pipeline that converts raw EDA outputs into standardized JSON formats across design, net, graph, path, and patch modalities.
The dataset demonstrates significant predictive performance improvements in benchmarks such as delay prediction and routing mask generation, underlining its research impact.

iDATA is a large-scale, multi-level dataset specifically designed for AI-driven electronic design automation (AI-EDA) research. Built using the AiEDA open-source library, iDATA encapsulates structured vector representations and foundation data extracted from 50 real 28nm integrated circuit designs. The dataset consists of approximately 600 GB of data supporting AI tasks such as prediction, generation, optimization, and comparative analysis across multiple abstraction layers, including design, net, graph, path, and patch modalities. Both the dataset and supporting tools are intended to mitigate traditional EDA data pipeline issues by providing standardized, programmatic access for end-to-end workflows (Qiu et al., 8 Nov 2025).

1. Dataset Generation Pipeline

The iDATA generation process utilizes AiEDA's design-to-vector pipeline, which transforms raw EDA tool outputs (from both open-source and commercial flows) into structured, multi-modal foundation data. This pipeline spans the following stages:

Input Artifacts: Register Transfer Level (RTL) code, library files (.lef, .lib), netlists (.blif, .v), timing constraints (.sdc), parasitic files (.spef), and layout outputs (.gds).
Flow Engines: Full digital ASIC flows using OpenROAD, iEDA, Innovus, or PrimeTime to produce synthesis, floorplan, placement, clock tree synthesis, routing, and signoff data.
Data Extraction APIs: At every tool stage, AiEDA extracts key signals:
- Scalar metrics: HPWL (half-perimeter wirelength), RWL (routed wirelength), power, timing (WNS/TNS)
- Spatial maps: congestion, DRC, IR-drop metrics
- Logical information: netlists and timing paths
- Geometric primitives: wires, vias, polygons
Vectorization: Conversion algorithms decompose native EDA outputs into five structured modalities:
1. Design (die, summary statistics)
2. Net (net-specific, gate-level detail)
3. Graph (nodes as instances, edges as connectivity)
4. Path (timing-critical subgraphs)
5. Patch (spatial grid features)

The pipeline formalizes extraction and representation steps such as netlist-to-hypergraph conversion and layout discretization, using vector primitives for all data objects to enable downstream ML tasks.

2. Data Schema and Modalities

All structured data in iDATA is provided in JSON format within a standardized "workspace" directory. The five principal modalities and their fields are as follows:

Level	Structure	Example Data Types
Design	design.json	Die size, cell/net count, HPWL, congestion/IR-drop maps
Net	/vector/net/net_{id}.json	Pin coordinates, wire segments, parasitics, delay/slew/fanout
Graph	graph.json	Nodes (instance features), edges (connections/features)
Path	/vector/path/path_{id}.json	Stagewise RC, delays, segment wire info
Patch	/vector/patch/patch_{i,j}.json	Cell/net/pin density, RUDY, congestion, per-layer wire features

Notably, net-level files represent nets as sparse incidence lists (hypergraph formalism), while layout features are rasterized as multi-channel binary masks per metal or via layer. Spatial tensors are provided as NumPy arrays for cross-modal ML.
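The per-layer rasterization mentioned above can be sketched as follows. This is a minimal illustration and not AiEDA's implementation: the `[x1, y1, x2, y2, layer]` segment layout is taken from the net-level JSON example, while the function name `rasterize_segments`, the grid step, the die size, and the assumption that every segment is axis-aligned are hypothetical.

```python
import math

def rasterize_segments(segments, layers, die_w, die_h, step):
    """Return {layer: rows x cols grid of 0/1} marking grid cells touched by wires.

    Assumes axis-aligned segments given as [x1, y1, x2, y2, layer]
    (an illustrative convention, not a documented AiEDA format).
    """
    cols = math.ceil(die_w / step)
    rows = math.ceil(die_h / step)
    masks = {layer: [[0] * cols for _ in range(rows)] for layer in layers}
    for x1, y1, x2, y2, layer in segments:
        if layer not in masks:
            continue  # skip layers that were not requested
        # Grid-cell index range covered by the segment, clamped to the die.
        c_lo = min(int(min(x1, x2) // step), cols - 1)
        c_hi = min(int(max(x1, x2) // step), cols - 1)
        r_lo = min(int(min(y1, y2) // step), rows - 1)
        r_hi = min(int(max(y1, y2) // step), rows - 1)
        for r in range(r_lo, r_hi + 1):
            for c in range(c_lo, c_hi + 1):
                masks[layer][r][c] = 1
    return masks

# Two segments on layers 2 and 4, rasterized onto an 8 x 5 grid.
segments = [
    [12.3, 45.6, 30.0, 45.6, 2],
    [30.0, 45.6, 30.0, 20.0, 4],
]
masks = rasterize_segments(segments, layers=[2, 4], die_w=80.0, die_h=50.0, step=10.0)
```

Stacking the per-layer grids in a fixed layer order gives one binary channel per layer. A non-axis-aligned segment would be over-approximated by its bounding box in this sketch.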

A representative snippet of a net-level JSON:

{
  "net_id": "N123",
  "pins": [
    {"inst_id": "U5", "pin_name": "A", "x": 12.3, "y": 45.6},
    {"inst_id": "U8", "pin_name": "Z", "x": 78.9, "y": 10.2}
  ],
  "bbox": [12.3, 10.2, 78.9, 45.6],
  "rc_parasitics": {"R_total": 120.5, "C_total": 0.34},
  "wire_segments": [
    [12.3,45.6,30.0,45.6,2],
    [30.0,45.6,30.0,20.0,4]
  ],
  "delay": 0.125,
  "slew": 0.081,
  "fanout": 3
}
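As an illustration of how such a record can be consumed, the sketch below parses a reduced copy of the snippet and derives the pin bounding box and HPWL (half-perimeter wirelength) from the pin coordinates. The helper names `pin_bbox` and `hpwl` are hypothetical, and the `[x_min, y_min, x_max, y_max]` ordering of `bbox` is an assumption inferred from the example values.

```python
import json

# Reduced copy of the net-level snippet (only the fields used here).
net_json = """
{
  "net_id": "N123",
  "pins": [
    {"inst_id": "U5", "pin_name": "A", "x": 12.3, "y": 45.6},
    {"inst_id": "U8", "pin_name": "Z", "x": 78.9, "y": 10.2}
  ],
  "bbox": [12.3, 10.2, 78.9, 45.6]
}
"""

def pin_bbox(pins):
    """Bounding box of all pins as [x_min, y_min, x_max, y_max] (assumed order)."""
    xs = [p["x"] for p in pins]
    ys = [p["y"] for p in pins]
    return [min(xs), min(ys), max(xs), max(ys)]

def hpwl(bbox):
    """Half-perimeter wirelength: bounding-box width plus height."""
    x_min, y_min, x_max, y_max = bbox
    return (x_max - x_min) + (y_max - y_min)

net = json.loads(net_json)
bbox = pin_bbox(net["pins"])
net_hpwl = hpwl(bbox)  # (78.9 - 12.3) + (45.6 - 10.2), about 102.0
```

For this example the recomputed bounding box matches the stored `bbox` field.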

3. Data Representation and Feature Extraction

AiEDA's representational framework supports hierarchical embeddings suitable for modern deep learning:

Cell/Instance Embeddings: Each standard cell or macro is represented as a feature vector (e.g., area, pin count, one-hot cell type).
Net-Level Graphs: Hypergraph (gates as vertices, nets as hyperedges) encodings, convertible to bipartite or incidence list formats for graph neural network (GNN) models.
Layout Tensors: Binary or multi-valued matrices for each layer; in the binary case, tensors $\mathbf{L} \in \{0,1\}^{L \times H \times W}$, with $L$ layers on an $H \times W$ grid.
Patch Grids: Channel-wise feature tensors per spatial patch, $\mathbf{P} \in \mathbb{R}^{C \times H_p \times W_p}$, with $C$ feature channels.
Path Embeddings: Sequential embedding using multi-scale 1D convolution, self-attention (MHA), and cross-attention layers for modeling timing delay.
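The hypergraph-to-bipartite conversion mentioned for net-level graphs can be sketched as below. This is an assumption-laden illustration, not the AiEDA routine: the input format (a mapping from net ID to connected instance IDs) and the function name `hypergraph_to_bipartite` are hypothetical.

```python
def hypergraph_to_bipartite(nets):
    """Convert {net_id: [inst_id, ...]} into a bipartite cell-net edge list.

    Returns (cell_index, net_index, edges), where each edge is a
    (cell_idx, net_idx) pair linking a cell vertex to a net vertex.
    """
    cell_index = {}
    net_index = {}
    edges = []
    for net_id, insts in nets.items():
        # Assign the next free integer index on first sight.
        n = net_index.setdefault(net_id, len(net_index))
        for inst in insts:
            c = cell_index.setdefault(inst, len(cell_index))
            edges.append((c, n))
    return cell_index, net_index, edges

# Two hyperedges sharing instance U2.
nets = {"N1": ["U1", "U2", "U3"], "N2": ["U2", "U4"]}
cell_index, net_index, edges = hypergraph_to_bipartite(nets)
```

The edge list is a sparse incidence list: each `(cell, net)` pair is one nonzero entry of the incidence matrix. This form can be passed to a GNN library as two index arrays.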

Pseudocode for net vectorization:

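def vectorize_net(def_net):
    """Flatten a DEF net's routing segments into fixed-format wire vectors."""
    # Attribute names on `segment` are illustrative, not a documented API.
    wires = []
    for segment in def_net.segments:
        if segment.type == 'metal':
            # Metal wire: start point, end point, and routing layer.
            wires.append([segment.x_start, segment.y_start,
                          segment.x_end, segment.y_end, segment.layer])
        elif segment.type == 'via':
            # Via: center point plus the two layers it connects.
            wires.append([segment.x_center, segment.y_center,
                          segment.bottom_layer, segment.top_layer])
    return wires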

This organization enables feature extraction at variable granularity for training and evaluating ML models for EDA.

4. Dataset Statistics

iDATA comprises, across 50 real designs:

23.26 million total standard cells
21.47 million nets represented as individual files (236 GB for net-level data)
347.15 million wire segments
1.63 million path-level vectors (150 GB)
1.61 million patch-level vectors (207 GB)

Design complexity distribution ranges from 135 to over 1 million cells per circuit. Layer analysis indicates that Metal4 accounts for approximately 40% of total wirelength, followed by Metal2 and Metal6 (each ~20%), with via layers <5%. Instance type breakdown: 85% logic cells, 15% clock cells, <0.1% macros, and <0.1% IO pads.
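A few derived figures follow directly from the statistics above. The sketch below only does arithmetic on the reported numbers; the variable names are illustrative, and the per-file size assumes 1 GB = 10^9 bytes.

```python
# Reported statistics (from the list above).
designs = 50
cells = 23_260_000
nets = 21_470_000
wire_segments = 347_150_000
net_gb, path_gb, patch_gb = 236, 150, 207

# Vector-level storage, consistent with the "approximately 600 GB" total.
vector_total_gb = net_gb + path_gb + patch_gb   # 593

# Averages; the designs range from 135 to over 1 million cells,
# so the per-design mean hides a wide spread.
avg_cells_per_design = cells / designs          # 465,200
avg_segments_per_net = wire_segments / nets     # about 16.2

# Rough mean size of one net-level JSON file, assuming 1 GB = 1e9 bytes.
avg_net_file_kb = net_gb * 1e9 / nets / 1e3     # about 11 KB
```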

5. Storage, Tools, and Programmatic Access

Each design's workspace follows a standardized directory tree including configuration, tool reports, scalar features, and all vector modalities. All vectorized data is accessible via published Python APIs in the AiEDA library:

from aieda.workspace import workspace_create
from aieda.data import DataVectors

ws = workspace_create("runs/designX", tool="Innovus")
data = DataVectors(ws)

nets = data.load_nets()
patches = data.load_patchs()
timing_paths = data.load_timing_paths()

Workspaces include all supporting configuration and report files for traceability and reproducibility. All data files use open, machine-parsable formats.
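Because the vector files are plain JSON, they can also be read without the AiEDA API. The sketch below is a minimal illustration that assumes the `vector/net/net_{id}.json` layout from Section 2 sits directly under the workspace root. The helper `iter_nets` is hypothetical, and the example builds a small temporary workspace so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path

def iter_nets(workspace):
    """Yield parsed net records from <workspace>/vector/net/net_*.json (assumed layout)."""
    net_dir = Path(workspace) / "vector" / "net"
    for path in sorted(net_dir.glob("net_*.json")):
        with path.open() as f:
            yield json.load(f)

# Build a toy workspace with two net files, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    net_dir = Path(tmp) / "vector" / "net"
    net_dir.mkdir(parents=True)
    for i, fanout in enumerate([3, 1]):
        record = {"net_id": f"N{i}", "fanout": fanout}
        (net_dir / f"net_{i}.json").write_text(json.dumps(record))
    fanouts = [net["fanout"] for net in iter_nets(tmp)]
```

Reading files through a generator keeps memory use bounded, which matters when a design has millions of net files.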

6. Benchmark Tasks and Empirical Validation

iDATA's effectiveness has been evaluated on seven AI-EDA benchmark tasks. The models used and the reported results are:

Net-Level Wirelength Prediction: A two-stage TabNet predicts routed wirelength with $R^2=0.94$; adding via features reduces the mean relative error by 6%.
Path-Level Delay Prediction: Transformer plus multi-scale Conv1D and gated residual network (GRN); achieved MAE=0.023 ns and MRE=7.7% (vs baseline MRE of 50%).
Graph-Level Delay Prediction: GIN+Transformer architecture on critical subgraphs; MSE=0.025, $R^2=0.965$.
Patch-Level Congestion Prediction: U-Net (sliding window), NRMSE=0.18 (minimum 0.12 on large designs).
Net & Map-Level Routing Mask Generation: U-Net, F1=0.81, IoU=0.67.
Design-Level Parameter Optimization: Multi-objective TPE improved HPWL in 8/10 and WNS/TNS in 10/10 benchmarks.
Design-Level Metric Analysis & Tool Comparison: Comparative evaluation of iEDA vs Innovus over 30 designs established that favorable intermediate metrics (e.g., RSMT, congestion) do not guarantee superior final PPA, highlighting the need for holistic, multi-metric optimization.
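For reference, the metrics quoted above can be computed as in the following sketch. These are standard textbook definitions and not necessarily the exact variants used in the benchmarks. In particular, NRMSE is normalized here by the range of the ground truth, which is only one common convention, and all function names are illustrative.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mre(y_true, y_pred):
    """Mean relative error (ground truth must be nonzero)."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def nrmse(y_true, y_pred):
    """RMSE divided by the ground-truth range (assumed normalization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))

def f1_and_iou(mask_true, mask_pred):
    """F1 score and intersection-over-union for flattened binary masks."""
    tp = sum(1 for t, p in zip(mask_true, mask_pred) if t and p)
    fp = sum(1 for t, p in zip(mask_true, mask_pred) if p and not t)
    fn = sum(1 for t, p in zip(mask_true, mask_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn), tp / (tp + fp + fn)

# Toy regression targets and binary masks.
y_true, y_pred = [1.0, 2.0, 4.0], [1.0, 2.5, 3.5]
f1, iou = f1_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

On the toy masks, one true positive, one false positive, and one false negative give F1 = 0.5 and IoU = 1/3. IoU is never larger than F1 on the same mask.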

7. Applications and Impact

iDATA directly supports research in predictive modeling (delay, wirelength, congestion), generative tasks (layout, routing mask), optimization (placement parameter autotuning), and analysis (tool benchmarking). Integration via open-source AiEDA APIs enables reproducible pipelines for both academic and industrial ML-EDA research. The dataset addresses fragmentation in cross-tool EDA data conversion, providing standardized, universal representations for direct consumption by ML models. Data, code, and tools are available at https://github.com/OSCC-Project/AiEDA, with the full dataset prepared for public release (Qiu et al., 8 Nov 2025).

A plausible implication is that iDATA will serve as a foundation for benchmarking, pretraining, and transfer learning in emerging AI-EDA workflows, facilitating comparative research and reducing data engineering overhead in chip design automation.

References

Qiu et al. (8 Nov 2025). AiEDA: An Open-Source AI-Aided Design Library for Design-to-Vector.
