SketchGraphs: CAD Sketch Dataset
- SketchGraphs is a dataset of CAD sketches represented as richly structured geometric constraint graphs derived from 15 million real-world CAD models.
- It enables generative modeling and constraint inference with machine learning, supporting tasks like sketch autocompletion and automated constraint suggestion.
- An open-source processing pipeline with constraint-preserving transformations ensures robust data extraction, augmentation, and seamless CAD integration.
SketchGraphs is a large-scale, open dataset designed for modeling relational geometry in parametric computer-aided design (CAD). Distinguished by its representation of 2D CAD sketches as richly structured geometric constraint graphs, SketchGraphs serves as a foundational resource for the development and benchmarking of machine learning and program synthesis methods aimed at understanding, generating, and manipulating CAD sketches. As the benchmark dataset in this domain, SketchGraphs has catalyzed a wave of research into CAD design automation, generative modeling, and constraint inference.
1. Dataset Structure and Representation
SketchGraphs comprises 15 million sketches extracted from real-world parametric CAD models, specifically sourced from Onshape (Seff et al., 2020). Each sketch is formalized as a geometric constraint graph $G = (V, E)$:
- Nodes $V$: Represent geometric primitives such as Point, Line, Circle, Arc, and, less frequently, Spline and Ellipse. Each primitive is richly parameterized; for example, a Line includes a direction vector, a point, and interval parameters, while a Circle has center coordinates, a unit direction, radius, and directional flags. Every primitive may carry an isConstruction Boolean, distinguishing physically realized elements from construction aids.
- Edges $E$: Denote explicit geometric constraints imposed by designers, such as Coincident, Projected, Distance, Horizontal, Mirror, Vertical, Parallel, Length, Perpendicular, Tangent, etc. Constraints have parameters: numerical (distance, angle), categorical (direction, halfSpace), or Boolean.
- Construction Sequence: Although stored as graphs, sketches include sequential construction histories, aligning with the operation-based workflow of parametric CAD tools. Earlier primitives often serve as anchors and exhibit higher connectivity.
The dataset tabulates global statistics, such as the frequency of primitive and constraint types, as well as histograms for length and angle parameters.
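The graph representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's actual schema: the class names `Primitive`, `Constraint`, and `Sketch`, and the parameter keys, are hypothetical. It shows how a single-member constraint (Horizontal) and multiple constraints between the same pair of primitives (Parallel and Distance) fit the same structure, and how per-sketch type frequencies can be tallied.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Primitive:
    """A node: one geometric primitive (hypothetical, simplified schema)."""
    kind: str                      # e.g. "Point", "Line", "Circle", "Arc"
    params: dict                   # numeric parameters for this primitive
    is_construction: bool = False  # mirrors the isConstruction flag


@dataclass
class Constraint:
    """An edge: one designer-imposed constraint between primitives."""
    kind: str                      # e.g. "Coincident", "Parallel", "Distance"
    members: tuple                 # indices of the primitives it relates
    params: dict = field(default_factory=dict)


@dataclass
class Sketch:
    primitives: list
    constraints: list

    def type_counts(self):
        """Frequency of primitive and constraint types in this sketch."""
        return (Counter(p.kind for p in self.primitives),
                Counter(c.kind for c in self.constraints))

    def degree(self, index):
        """Number of constraints that reference the primitive at `index`."""
        return sum(index in c.members for c in self.constraints)


sketch = Sketch(
    primitives=[
        Primitive("Line", {"pnt": (0.0, 0.0), "dir": (1.0, 0.0), "start": 0.0, "end": 2.0}),
        Primitive("Line", {"pnt": (0.0, 1.0), "dir": (1.0, 0.0), "start": 0.0, "end": 2.0}),
        Primitive("Circle", {"center": (1.0, 0.5), "radius": 0.25}, is_construction=True),
    ],
    constraints=[
        Constraint("Horizontal", (0,)),
        Constraint("Parallel", (0, 1)),
        Constraint("Distance", (0, 1), {"length": 1.0}),
    ],
)
primitive_counts, constraint_counts = sketch.type_counts()
```

Here the first line acts as an anchor: it participates in all three constraints, while the construction circle is unconstrained.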
2. Data Processing Pipeline
SketchGraphs includes an open-source pipeline for data extraction, transformation, and rendering (Seff et al., 2020):
- Sketch Extraction: Sketches are retrieved via the Onshape API, keeping only those with at least one primitive and one constraint.
- Parsing and Representation: Transformation from Onshape’s over-parameterized formats to canonical representations, supporting multi-edges and hyperedges for complex constraints.
- Domain Classes: Custom classes encode primitives and constraints, supporting conversion, manipulation, and rendering.
- Rendering Utilities: Visualization tools depict construction lines versus physically realized geometry for model interpretability.
- Deep Learning Integration: Canonical sequences (e.g., alternating primitives and constraints) facilitate downstream tasks, enabling compatibility with autoregressive generative models.
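To make the idea of a canonical alternating sequence concrete, the sketch below flattens a constraint graph into one token list. The ordering rule used here (emit each constraint immediately after the last primitive it references, with ties broken by sorting) is an assumption chosen for illustration, and `to_sequence` is a hypothetical helper rather than a function from the released pipeline.

```python
def to_sequence(primitives, constraints):
    """Interleave primitives and constraints into one token sequence.

    primitives:  list of primitive type names, in construction order.
    constraints: list of (type name, tuple of primitive indices).
    A constraint is emitted right after the last primitive it references.
    """
    # Group constraints by the highest primitive index they touch.
    by_last = {}
    for kind, members in constraints:
        by_last.setdefault(max(members), []).append((kind, members))

    sequence = []
    for index, kind in enumerate(primitives):
        sequence.append(("primitive", kind, index))
        # Sort for a deterministic (canonical) order among ties.
        for c_kind, members in sorted(by_last.get(index, [])):
            sequence.append(("constraint", c_kind, members))
    return sequence


tokens = to_sequence(
    ["Line", "Line", "Circle"],
    [("Parallel", (0, 1)), ("Horizontal", (0,)), ("Tangent", (1, 2))],
)
```

Under this rule every constraint token refers only to primitives that already appear earlier in the sequence, which is the property an autoregressive model needs.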
3. Benchmarks, Use Cases, and Machine Learning Integration
SketchGraphs underpins two primary modeling tasks (Seff et al., 2020):
- Generative Modeling: Unconditional models aim to synthesize sketches as sequences of primitives and constraints. Performance is quantified via negative log-likelihood (NLL; baseline: 28.2 bits per graph), comparison against LZMA compression as a reference code length, and distributional alignment with ground-truth statistics (primitive types, sketch sizes, degrees of freedom).
- Constraint Inference (“Autoconstrain”): Models predict plausible constraint sets from unconstrained sets of primitives. An autoregressive message-passing network achieves precision and recall of 0.74 (F1 = 0.71) and NLL of 0.495 bits per edge, aligning with human design correction and autocompletion workflows.
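The evaluation quantities named above can be computed as follows. This is a generic sketch of set-based precision, recall, and F1 plus average NLL in bits; the edge encoding as `(type, i, j)` tuples and the toy probabilities are assumptions for illustration, not the benchmark's evaluation code.

```python
import math


def edge_metrics(predicted, reference):
    """Precision, recall, and F1 over sets of predicted vs. reference edges."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def bits_per_edge(edge_probabilities):
    """Average negative log2-likelihood the model assigns to the true edges."""
    return -sum(math.log2(p) for p in edge_probabilities) / len(edge_probabilities)


reference = {("Coincident", 0, 1), ("Parallel", 0, 2), ("Horizontal", 0, 0), ("Tangent", 2, 3)}
predicted = {("Coincident", 0, 1), ("Parallel", 0, 2), ("Vertical", 1, 1)}
precision, recall, f1 = edge_metrics(predicted, reference)

# Probabilities a hypothetical model assigned to each of four true edges.
nll = bits_per_edge([0.5, 0.25, 0.5, 1.0])
```

In this toy case two of three predictions are correct and two of four reference edges are recovered, so precision is 2/3, recall is 1/2, and F1 is their harmonic mean, 4/7.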
Machine learning applications enabled by SketchGraphs include sketch autocompletion, automated constraint suggestion, CAD model inference from raster and hand-drawn images, and learning of semantic latent representations for search, recommendation, and interactive assistance.
4. Advances in Generative and Inference Models
Several models leverage SketchGraphs for both foundational research and state-of-the-art benchmarks:
- Vitruvion (Seff et al., 2021): Trains separate autoregressive transformers for primitives and constraints, using tokenization and quantization of continuous parameters. The factorization supports conditional synthesis, such as sketch completion from images or partial inputs. Models are compatible with native CAD constraint graphs for seamless downstream editing and solving.
- DAVINCI (Karadeniz et al., 30 Oct 2024): Introduces a single-stage, end-to-end encoder/decoder transformer that predicts both parameterized primitives and constraints directly from raster images. Key contributions include the use of “Constraint-Preserving Transformations” (CPTs) as a data augmentation mechanism, resulting in CPTSketchGraphs (80 million augmented sketches) to facilitate data efficiency and robust generalization, especially in low-data regimes (notably achieving 89% of full-data accuracy using only 0.1% of the native dataset).
- SketchDNN (Chereddy et al., 15 Jul 2025): Pioneers a joint continuous-discrete diffusion process, “Gaussian-Softmax diffusion,” for generative modeling of CAD primitives. This method yields new state-of-the-art results on SketchGraphs, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and NLL from 84.8 to 81.33 by addressing the permutation invariance of primitive orderings and the heterogeneity in parameterizations.
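Tokenizing continuous parameters, as mentioned for Vitruvion, typically means uniform quantization into a fixed number of bins. The sketch below illustrates that idea only; the value range of [-0.5, 0.5], the 6-bit resolution, and the function names are assumptions for this example and are not taken from the source.

```python
def quantize(value, low=-0.5, high=0.5, bits=6):
    """Map a continuous parameter to an integer token in [0, 2**bits - 1]."""
    levels = 2 ** bits
    clipped = min(max(value, low), high)
    # Scale to [0, levels - 1] and round to the nearest bin.
    return round((clipped - low) / (high - low) * (levels - 1))


def dequantize(token, low=-0.5, high=0.5, bits=6):
    """Recover the representative value of a token."""
    levels = 2 ** bits
    return low + token / (levels - 1) * (high - low)


tokens = [quantize(v) for v in (-0.5, 0.0, 0.123, 0.5, 0.9)]
restored = [dequantize(t) for t in tokens]
```

Values outside the assumed range are clipped, and the round trip loses at most half a bin width, which is the price paid for treating coordinates as discrete tokens in a transformer vocabulary.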
5. Relevance to Constraint Programming and Program Synthesis
SketchGraphs positions parametric CAD as a rich instance of constraint programming, where the order and relationships encoded in the constraint graph directly parallel program induction frameworks (Seff et al., 2020). ML models trained on SketchGraphs can be viewed as learning to synthesize “programs” (i.e., ordered sequences of construction and constraint operations) that specify fully constrained, editable geometry. This connection opens opportunities for research that bridges design automation, combinatorial optimization, and neural program synthesis.
6. Data Augmentation and the CPTSketchGraphs Resource
Two related resources address the challenge of limited annotated data and promote robust model performance, particularly in industrial scenarios:
- Constraint-Preserving Transformations (CPTs): Applied via FreeCAD-integrated APIs, CPTs generate sketch variants by perturbing primitive parameters within bounding boxes while maintaining the geometric constraints unchanged. This ensures that augmented data lies on the manifold of valid CAD sketches (Karadeniz et al., 30 Oct 2024).
- CPTSketchGraphs: Encompassing 80 million CPT-augmented sketches, this extension provides a diverse resource to support transfer learning, data-efficient training, and further research into invariance and generalization in CAD sketch inference tasks.
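The core property of a constraint-preserving transformation is that the geometry changes while every constraint still holds. The toy sketch below demonstrates this with a deliberately simple stand-in: a random uniform scale plus translation, which preserves relational constraints such as Horizontal, Perpendicular, and Coincident. It is not the FreeCAD-based procedure described above, it would not preserve dimensional constraints such as Distance or Length, and all function names are hypothetical.

```python
import math
import random


def transform(points, scale, dx, dy):
    """Uniformly scale and translate a list of (x, y) points."""
    return [(scale * x + dx, scale * y + dy) for x, y in points]


def is_horizontal(line, tol=1e-9):
    (_, y0), (_, y1) = line
    return abs(y0 - y1) <= tol


def is_perpendicular(line_a, line_b, tol=1e-9):
    (ax0, ay0), (ax1, ay1) = line_a
    (bx0, by0), (bx1, by1) = line_b
    dot = (ax1 - ax0) * (bx1 - bx0) + (ay1 - ay0) * (by1 - by0)
    return abs(dot) <= tol


def is_coincident(p, q, tol=1e-9):
    return math.dist(p, q) <= tol


def augment(lines, rng):
    """Return a randomly scaled and shifted copy of a list of line segments."""
    scale = rng.uniform(0.5, 2.0)
    dx, dy = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    return [tuple(transform(line, scale, dx, dy)) for line in lines]


# An L-shaped corner: a horizontal line meeting a vertical one at (2, 0).
base = [((0.0, 0.0), (2.0, 0.0)), ((2.0, 0.0), (2.0, 1.0))]
variant = augment(base, random.Random(0))
```

Checking the constraints on `variant` with the same predicates used on `base` is the essential step: an augmentation is only useful if the original constraint labels remain valid for the new geometry.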
7. Technical Details and Formalization
SketchGraphs supports precise mathematical formalism:
- Graph Structure: $G = (V, E)$, where $V$ encodes primitive types and $E$ encodes designer-imposed constraints.
- Autoconstrain Model: Edge prediction factorizes autoregressively as $P(E \mid V) = \prod_{i} P(e_i \mid e_{<i}, V)$.
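- Message Passing Update: In each round, messages from neighboring nodes are aggregated and each node's hidden state is then updated, with the message function a linear layer and the update function a GRU cell.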
- Diffusion Approaches: In SketchDNN, discrete-valued variables use Gaussian-Softmax diffusion, with forward transitions that perturb logits with Gaussian noise and project them onto the probability simplex via softmax, and the reverse dynamics specified in log-space before projection by softmax (Chereddy et al., 15 Jul 2025).
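As a rough illustration of the Gaussian-Softmax idea for a discrete variable, the sketch below adds Gaussian noise to the logits of a class label and maps the result back to the probability simplex with softmax, producing a blended label. The logit encoding (`sharpness`), the fixed noise level, and the function names are assumptions for this example; the actual noise schedule and transition used in SketchDNN are not reproduced here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_label(class_index, num_classes, noise_std, rng, sharpness=4.0):
    """Blend a one-hot class label by adding Gaussian noise to its logits."""
    # Hypothetical logit encoding: +sharpness for the true class, 0 elsewhere.
    logits = [sharpness if k == class_index else 0.0 for k in range(num_classes)]
    perturbed = [z + rng.gauss(0.0, noise_std) for z in logits]
    return softmax(perturbed)


rng = random.Random(0)
clean = noisy_label(2, 5, noise_std=0.0, rng=rng)    # no noise: peaked on class 2
blended = noisy_label(2, 5, noise_std=3.0, rng=rng)  # heavy noise: a random point on the simplex
```

Because the output is always a valid probability vector, discrete primitive types can be noised and denoised alongside continuous parameters within one diffusion process.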
8. Impact and Future Directions
SketchGraphs stands as the de facto benchmark for CAD sketch generation, constraint inference, augmentation, and program synthesis research. Its extensible structure, detailed parameterization, and growing ecosystem of augmentation (CPTSketchGraphs) and baseline benchmarks enable rapid progress in both academic and applied CAD automation. Anticipated directions include direct generative modeling of full sketches (including constraints), cross-domain transfer (image-to-CAD, UI/layout design), and continued innovation at the intersection of structured prediction, geometric deep learning, and design intent inference.