SVG-CAD Datasets: Scalable Editable Graphics

Updated 2 September 2025
  • SVG-CAD datasets are annotated collections of scalable vector and CAD graphics that enable detailed AI modeling for design synthesis and automated reconstruction.
  • They integrate multimodal inputs such as images, text, CAD command sequences, and point clouds to support diverse design and engineering applications.
  • Advanced preprocessing techniques including canonicalization, componentization, and hierarchical sequencing enhance precision, editability, and semantic reasoning.

SVG-CAD datasets are collections of Scalable Vector Graphics (SVG) and computer-aided design (CAD) representations built to support machine learning and generative modeling of editable, resolution-independent, parameterized graphics and engineering drawings. These datasets typically include SVG icons, engineering sketches, floor plans, and full document layouts, often paired with structured annotations, CAD command sequences, or multimodal inputs spanning images, text, and point clouds. They underpin the training and evaluation of neural and hybrid models for tasks such as automated CAD design, SVG generation, reasoning-based graphic synthesis, parametric reconstruction, and industrial design automation.

1. Canonical Datasets: Scope, Structure, and Data Modalities

Recent work has produced a number of SVG-CAD datasets, each distinguished by its data modalities, annotations, and target applications.

| Dataset | Size / Domain | Content & Modality |
|---|---|---|
| SVG-Icons8 | 100,000 SVG icons, 56 categories | Preprocessed, consistently scaled SVG paths |
| FloorPlanCAD | 10,000+ architectural floor plans | SVG with fine-grained, 30-category line annotations |
| Crello | 23,182 design templates (SVG/PDF) | Canvas + multi-modal element attributes |
| OpenECAD | 130k 3D designs, 628k 2D sketches | Images, code, and natural language annotations |
| ABC-mono | 200,000+ 3D CAD models (7.5M samples) | CAD models, rendered images, synthetic sketches |
| KOCAD | ~300 real-world, photograph–CAD pairs | Photos and ground-truth printed object CADs |
| Omni-CAD | 450,000 CAD instances | Command sequences, text, multi-view image, point clouds |
| SVGX-Dataset | 240,000 curated SVGs | Multi-source, hierarchical SVG, cleaned of redundancy |
| ColorSVG-100K | 100,000 colored SVGs, 500 categories | Component-aligned, color-normalized vector graphics |
| UniSVG | 525,741 tasks (2.2M in full) | Text, image, SVG code, and understanding benchmarks |
| CAD-VGDrawing | Large-scale paired technical drawings | Parametric SVG-like primitives ↔ CAD command seq. |
| SVGX-DwT-10k | 10,000 prompt–reasoning–SVG triplets | Natural language design rationale and SVG code |
| LayerTracer | 20,000+ layered SVG sequences | Sequential design operation grids from workflows |

Most datasets incorporate either synthetic or professionally curated vector graphics, frequently paired with auxiliary modalities (rasterizations, text descriptions, CAD command logs, point clouds). The SVG content typically spans icons and emoji, floor plans, engineering layouts, and design templates. Annotations can include instance or semantic categories, spatial attributes, hierarchy, and sequence-level construction logic.

2. Annotation Strategies and Representational Preprocessing

SVG-CAD datasets undergo elaborate preprocessing to maximize interpretability and support model training:

  • Canonicalization and Normalization: Datasets such as SVG-Icons8 (Carlier et al., 2020) and ColorSVG-100K (Chen et al., 13 Dec 2024) enforce consistent scales, coordinate normalization, and geometric centering to ensure stylistic coherence and robust statistical training properties.
  • Componentization: SVGBuilder (Chen et al., 13 Dec 2024) decomposes SVGs into discrete, reusable “components” (paths), normalizing and deduplicating via Jaccard index and union-find clustering for high reusability across designs.
  • Fine-Grained Annotation: FloorPlanCAD (Fan et al., 2021) delivers line-level labeling across up to 30 object types (walls, doors, appliances, etc.), supporting dense architectural reasoning tasks.
  • Hierarchical Sequencing: LayerTracer (Song et al., 3 Feb 2025) records full sequential design processes as grid-aligned blueprints, isolating layers and supporting reconstruction of the designer’s workflow via process-centric data representations.
  • Multi-Modality Augmentation: Datasets such as Omni-CAD (Xu et al., 7 Nov 2024) and UniSVG (Li et al., 11 Aug 2025) align SVG code, image renderings, natural language descriptions, point clouds, and instructive QA pairs, enabling multimodal LLM (MLLM) training and evaluation.
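
The deduplication idea behind the componentization bullet above can be sketched in a few lines. This is not SVGBuilder's implementation: it assumes each path has already been rasterized into a set of occupied grid cells (a hypothetical representation), and the 0.8 threshold and the name `cluster_components` are illustrative choices only.

```python
from typing import Dict, FrozenSet, List, Tuple

Cell = Tuple[int, int]  # assumed representation: one occupied raster cell


def jaccard(a: FrozenSet[Cell], b: FrozenSet[Cell]) -> float:
    """Jaccard index |a & b| / |a | b|; two empty sets count as identical."""
    union = len(a | b)
    return 1.0 if union == 0 else len(a & b) / union


class UnionFind:
    """Minimal disjoint-set structure with path halving."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # The smaller root index wins, so each cluster is keyed by its first member.
            self.parent[max(rx, ry)] = min(rx, ry)


def cluster_components(
    shapes: List[FrozenSet[Cell]], threshold: float = 0.8
) -> List[List[int]]:
    """Group shape indices whose pairwise Jaccard index meets the threshold."""
    uf = UnionFind(len(shapes))
    for i in range(len(shapes)):
        for j in range(i + 1, len(shapes)):
            if jaccard(shapes[i], shapes[j]) >= threshold:
                uf.union(i, j)
    groups: Dict[int, List[int]] = {}
    for i in range(len(shapes)):
        groups.setdefault(uf.find(i), []).append(i)
    return list(groups.values())


square = frozenset((x, y) for x in range(4) for y in range(4))
near_square = square - {(3, 3)}           # differs from the square by one cell
bar = frozenset((x, 0) for x in range(10))
clusters = cluster_components([square, near_square, bar])
```

Here the two nearly identical squares fall into one cluster and the bar stays alone. Union-find makes the grouping transitive, so a chain of near-duplicates collapses into a single reusable component even when its two ends fall below the threshold.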

Preprocessing frequently employs data cleaning (removal of redundant or invisible elements, perceptual hashing for deduplication), bounding box normalization, class calibration (possibly with vision-language models such as CLIP), and attribute quantization (e.g., binning numeric features and one-hot encoding the resulting categories) (Yamaguchi, 2021).
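
A minimal sketch of two of these steps, bounding box normalization and attribute quantization, is given below. The function names, the 256-bin default, and the choice to anchor the box at its minimum corner (rather than centering it) are assumptions made for illustration, not details taken from any of the cited datasets.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_to_unit_box(points: List[Point]) -> List[Point]:
    """Translate and uniformly scale points so the longer bbox side spans [0, 1]."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    extent = max(max(xs) - min_x, max(ys) - min_y)
    if extent == 0:
        # Degenerate case: all points coincide.
        return [(0.0, 0.0) for _ in points]
    return [((x - min_x) / extent, (y - min_y) / extent) for x, y in points]


def quantize(value: float, bins: int = 256) -> int:
    """Map a value in [0, 1] to a bin index in [0, bins - 1]."""
    return min(max(int(value * bins), 0), bins - 1)


def one_hot(index: int, size: int) -> List[int]:
    """Encode a bin or category index as a one-hot vector."""
    return [1 if i == index else 0 for i in range(size)]


raw = [(10.0, 20.0), (30.0, 20.0), (30.0, 60.0)]
unit = normalize_to_unit_box(raw)
tokens = [(quantize(x), quantize(y)) for x, y in unit]
```

Uniform scaling by the longer side preserves the aspect ratio, which matters when the same shape must remain recognizable after normalization.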

3. SVG–CAD Pairing and Command Sequence Datasets

To support parametric modeling, many datasets explicitly pair SVG-like vector graphics with command sequences suitable for direct import into CAD environments:

  • Drawing2CAD (CAD-VGDrawing) (Qin et al., 26 Aug 2025) aligns tokenized SVG primitives (line, arc, etc.) with sequences of CAD operations, providing both geometric accuracy and design intent preservation.
  • OpenECAD (Yuan et al., 14 Jun 2024) converts public CAD datasets into (image, code) pairs, with code representations honoring executable CAD command APIs (add_line, add_arc, extrude, etc.), explicitly engineered for downstream integration.
  • Img2CAD (ABC-mono, KOCAD) (Chen et al., 4 Oct 2024) relates raster/sketch views to parametric CAD command outputs via Structured Visual Geometry (SVG), a wireframe-based vectorization mapped to sketch–extrude pipelines. In this context the acronym does not denote Scalable Vector Graphics.
  • Omni-CAD (Xu et al., 7 Nov 2024) merges command sequences, textual prompts, raster renderings, and point clouds for each CAD design, encompassing a full multimodal generative and understanding suite.
  • LayerTracer (Song et al., 3 Feb 2025) and SVGX-DwT-10k (Xing et al., 30 May 2025) extend beyond flat annotation, capturing not only code but also the sequential creative/stylistic rationale (chain-of-thought) and intermediate representations.

A key development is the pairing of drawings with editable command logs, shifting datasets from representing static geometry towards supporting generation, editing, and reverse engineering with full parametric fidelity.
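
To make the idea of a drawing paired with an editable command log concrete, the sketch below builds one hypothetical record. The command names `add_line` and `extrude` echo those mentioned for OpenECAD above, but the argument order, the `Command` and `DrawingCadPair` classes, and the `rectangle_pair` helper are all assumptions for illustration and do not reproduce any dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Command:
    name: str                # e.g. "add_line" or "extrude"
    args: Tuple[float, ...]  # argument order is an illustrative assumption


def format_command(cmd: Command) -> str:
    """Render one command as a call-like line of text."""
    rendered = ", ".join(format(a, "g") for a in cmd.args)
    return f"{cmd.name}({rendered})"


@dataclass
class DrawingCadPair:
    svg_path: str            # SVG path data describing the 2D outline
    commands: List[Command]  # parametric construction sequence

    def to_code(self) -> str:
        return "\n".join(format_command(c) for c in self.commands)


def rectangle_pair(width: float, height: float, depth: float) -> DrawingCadPair:
    """Build the outline and the command log from the same three parameters."""
    corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    path = "M " + " L ".join(f"{x:g} {y:g}" for x, y in corners) + " Z"
    commands = [
        Command("add_line", (*corners[i], *corners[(i + 1) % 4])) for i in range(4)
    ]
    commands.append(Command("extrude", (depth,)))
    return DrawingCadPair(svg_path=path, commands=commands)


pair = rectangle_pair(4.0, 2.0, 1.0)
print(pair.svg_path)
print(pair.to_code())
```

Because the drawing and the command log are derived from the same parameters, changing `width` regenerates both consistently, which is the property that static geometry alone does not provide.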

4. Benchmark Tasks and Evaluation Metrics

SVG-CAD datasets serve as the basis for a spectrum of benchmark tasks, spanning SVG generation, parametric CAD reconstruction, and reasoning-based graphic synthesis.

Models trained on these datasets are evaluated with metrics including structural similarity (SSIM), FID, CLIPScore, BLEU for attribute sequences, layout mIoU, semantic alignment (cosine similarity with CLIP embeddings), and topology-aware scores (e.g., flux enclosure error, self-intersection ratio (Xu et al., 7 Nov 2024)). Reasoning-centric datasets introduce reward-based learning with hybrid metrics combining renderability, semantic alignment, and aesthetic quality.
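
Two of the metrics named above, layout mIoU and cosine similarity between embeddings, are simple enough to sketch directly. The version below is a simplification: it assumes predicted and reference layout elements are axis-aligned boxes already matched one-to-one, and it takes embedding vectors as plain lists rather than calling an actual CLIP model.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred: List[Box], target: List[Box]) -> float:
    """Average IoU over element pairs assumed to be matched by position."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum(box_iou(p, t) for p, t in zip(pred, target)) / len(pred)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


pred_boxes = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
true_boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0)]
layout_score = mean_iou(pred_boxes, true_boxes)
alignment = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```

In this example the first box pair matches exactly and the second overlaps by one third, so the mean IoU is 2/3. Published benchmarks may define mIoU per class or with a matching step, so this sketch should not be read as any specific benchmark's protocol.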

5. Technical and Application Implications

SVG-CAD datasets enable a range of technical innovations:

  • Editability: Datasets with command or primitive-level annotation (e.g., ABC-mono, CAD-VGDrawing, OpenECAD’s code pairs) allow direct editability, critical for engineering workflows.
  • Semantic Reasoning: Datasets like SVGX-DwT-10k foster step-wise reasoning and interpretable code generation—key for AI-human collaboration and design transparency.
  • Multimodal Model Training: Datasets such as UniSVG and Omni-CAD allow MLLMs to be trained for unified SVG understanding and generation (U&G) from multi-source cues, yielding models reported to outperform GPT-4V on SVG U&G tasks (Li et al., 11 Aug 2025).
  • Integration with Industry Tools: Datasets explicitly constructed for CAD API compatibility (OpenECAD, Drawing2CAD) lower integration barriers for industrial automation.
  • Accelerating the Design Pipeline: High-fidelity, componentized, and attribute-rich datasets like ColorSVG-100K and SVGX-Dataset make rapid, high-quality, and editable design generation viable for CAD, UI, and creative industries.

These capabilities have enabled new paradigms in reasoning-augmented design (“Aha moments” (Xing et al., 30 May 2025)), hierarchical process modeling, and efficient integration across creative and engineering workflows.

6. Ongoing Challenges and Prospective Developments

Despite these advances, SVG-CAD datasets continue to face domain-specific challenges:

  • Numerical Precision: Accurate recovery of floating-point shape and transform parameters, particularly for high-precision engineering or architectural CAD models, remains nontrivial and computationally demanding.
  • Hierarchical and Semantic Annotation: Scaling datasets with fine-grained, hierarchical labels (nested groups, layered processes, semantic tags) is labor intensive but critical for deep structural modeling.
  • Cross-Domain Generalization: Ensuring datasets support transfer across domains (artistic illustration, technical design, layout, animation) is essential for broad adoption; datasets like UniSVG address this by providing multi-task composition.
  • Reasoning Complexity: As demonstrated by Reason-SVG (Xing et al., 30 May 2025), explicitly modeling multi-stage creative process chains and integrating them with reward-based optimization is complex but necessary for interpretable and controllable generative systems.
  • Benchmark Harmonization: With diverse evaluation metrics for structure, style, semantics, editability, and topological correctness, establishing universal benchmarks remains an open area of research.
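
The numerical precision challenge listed above can be made concrete with a short calculation. The sketch assumes, for illustration only, that a model represents each coordinate as one of a fixed number of discrete bins. Under that assumption the round-trip error is bounded by half a bin width, which may be negligible for icons but significant for engineering tolerances. The function names and the 1000 mm extent are hypothetical.

```python
from typing import Iterable


def quantize(value: float, bins: int, lo: float, hi: float) -> int:
    """Map a value in [lo, hi] to a bin index in [0, bins - 1]."""
    scaled = (value - lo) / (hi - lo) * bins
    return min(max(int(scaled), 0), bins - 1)


def dequantize(index: int, bins: int, lo: float, hi: float) -> float:
    """Recover the center of the bin as the reconstructed value."""
    return lo + (index + 0.5) * (hi - lo) / bins


def max_roundtrip_error(
    values: Iterable[float], bins: int, lo: float, hi: float
) -> float:
    """Largest absolute difference between a value and its reconstruction."""
    return max(abs(v - dequantize(quantize(v, bins, lo, hi), bins, lo, hi)) for v in values)


# A 1000 mm extent sampled every 0.5 mm, encoded with 256 versus 4096 bins.
samples = [i * 0.5 for i in range(2001)]
coarse = max_roundtrip_error(samples, 256, 0.0, 1000.0)
fine = max_roundtrip_error(samples, 4096, 0.0, 1000.0)
```

With 256 bins over 1000 mm the bin width is about 3.9 mm, so the worst-case error approaches 1.95 mm; 4096 bins reduce the bound to roughly 0.12 mm at the cost of a larger coordinate vocabulary.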

Future trajectories likely include the convergence of interactive, reasoning-augmented design datasets; deeper fusion with semantic, multimodal, and hierarchical CAD logs; and further refinement of evaluation standards for industrial relevance and scientific rigor across SVG-CAD automation.