DeepCAD-240: CAD Benchmark for Long Sequences
- DeepCAD-240 is a large-scale benchmark dataset for long-form parametric CAD sequence generation, featuring up to 240 sketch–extrusion operations.
- It is built by mining the ABC dataset with custom FeatureScript routines, followed by tokenization and filtering steps that enforce semantic and syntactic consistency.
- Comparative analysis reveals that models like GeoFusion-CAD, using geometric state space diffusion, deliver superior accuracy and efficiency on complex CAD tasks.
DeepCAD-240 is a large-scale benchmark dataset and evaluation protocol for long-form parametric Computer-Aided Design (CAD) sequence generation, introduced to stress-test generative models on command sequences ranging from moderate to extreme length (up to 240 sketch–extrusion operations). It builds upon and extends the DeepCAD line of benchmarks to provide a quantifiable, rigorous testbed for scalable, structure-aware CAD generation, particularly under hierarchical and long-range dependencies prevalent in industrial 3D modeling tasks (Zhou et al., 23 Mar 2026).
1. Construction and Structure of DeepCAD-240
DeepCAD-240 is constructed by mining the ABC dataset—containing over one million Constructive Solid Geometry (CSG) models—via the Onshape API with custom FeatureScript routines. Each CSG primitive or Boolean tree is algorithmically transformed into an explicit stepwise “sketch–extrusion” history. For each CAD program in DeepCAD-240:
- Sketch Extraction: 2D profiles are tokenized as ordered sequences of primitive types (lines, arcs, circles), with the end of each curve, loop, face, and sketch marked by a dedicated termination token.
- Extrusion Parameterization: Each extrusion step is parameterized by Euler angles, translations, a scale, extrusion distances, and a Boolean operation type, and is closed by a terminating token.
- Filtering: Programs are retained if sketches form at least one closed loop, reconstruct as watertight solids, and contain no invalid/redundant operations. This rigorous filtration enforces semantic and syntactic consistency.
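The filtering criteria above can be sketched as a simple predicate. This is only an illustration under stated assumptions: the `Curve`, `Step`, and `Program` types are hypothetical, the `watertight` flag is assumed to come from a CAD kernel after rebuilding the solid, and a zero-length extrusion stands in for an "invalid or redundant" operation. The actual FeatureScript-based pipeline is not described at this level of detail in the source.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Curve:
    kind: str      # "line", "arc", or "circle"
    start: Point
    end: Point     # for a full circle, start == end


@dataclass
class Step:
    loops: List[List[Curve]]   # each loop is an ordered list of curves
    extrude_distance: float


@dataclass
class Program:
    steps: List[Step]
    watertight: bool           # assumed to be reported by the CAD kernel


def loop_is_closed(loop: List[Curve], tol: float = 1e-6) -> bool:
    """A loop is closed when every curve ends where the next one starts, cyclically."""
    if not loop:
        return False
    for cur, nxt in zip(loop, loop[1:] + loop[:1]):
        if abs(cur.end[0] - nxt.start[0]) > tol or abs(cur.end[1] - nxt.start[1]) > tol:
            return False
    return True


def keep_program(prog: Program) -> bool:
    """Retain a program only if it passes all three (simplified) checks."""
    if not prog.steps or not prog.watertight:
        return False
    for step in prog.steps:
        # Every sketch must contain at least one closed loop.
        if not any(loop_is_closed(loop) for loop in step.loops):
            return False
        # Hypothetical stand-in for an invalid or redundant operation.
        if step.extrude_distance == 0.0:
            return False
    return True
```

A unit square made of four lines passes `loop_is_closed`, while the same path with its last edge removed does not.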
A summary of overall dataset statistics:
| Dataset | Total Sequences | Avg. Length | Max. Length | % ≤ 40 | % 40–60 | % 60–80 | % 80–160 | % 160–240 |
|---|---|---|---|---|---|---|---|---|
| DeepCAD | 178,238 | 15 | 60 | 44.6 | 55.4 | — | — | — |
| DeepCAD-240 | 215,914 | 36.2 | 240 | 76.6 | 12.0 | 5.9 | 5.2 | 0.21 |
Table 1. Sequence length statistics in DeepCAD-240 (Zhou et al., 23 Mar 2026).
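To make the long tail in Table 1 concrete, the percentage shares of DeepCAD-240 can be converted into approximate program counts. The snippet below is a back-of-the-envelope calculation on the table's rounded figures (which sum to 99.91%, not 100%), so the counts are estimates rather than reported values.

```python
# Figures copied from the DeepCAD-240 row of Table 1.
total_sequences = 215_914
shares_percent = {
    "<=40": 76.6,
    "40-60": 12.0,
    "60-80": 5.9,
    "80-160": 5.2,
    "160-240": 0.21,
}

# Approximate number of programs per length bucket.
approx_counts = {
    bucket: round(total_sequences * pct / 100)
    for bucket, pct in shares_percent.items()
}

# Share of programs longer than the original DeepCAD cap of 60.
beyond_60_percent = sum(
    shares_percent[b] for b in ("60-80", "80-160", "160-240")
)

for bucket, count in approx_counts.items():
    print(f"{bucket:>8}: ~{count}")
print(f"beyond 60: {beyond_60_percent:.2f}%")
```

On these numbers, roughly 11.3% of programs exceed the 60-command cap of the original DeepCAD, and the 160–240 bucket holds only about 450 programs, which is why the extreme-length regime is a stress test rather than the bulk of the data.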
Programs span a wide range of categories (mechanical, free-form, assembly). Each design step comprises a (potentially multi-token) sketch block and an extrusion block, yielding fine-grained structural and parametric expressivity.
2. Tokenization, Vocabulary, and Command Semantics
The DeepCAD-240 token vocabulary unifies structural, sketch, and extrusion abstractions:
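- Structural tokens: pad, cls, and five further delimiter tokens (plausibly the curve, loop, face, sketch, and extrusion terminators introduced in Section 1)
- Sketch parameters: coordinates, arc curvature, a flip flag, and circle radius
- Extrusion parameters: distances, translations, Euler angles, scale, and Boolean operation type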
Each sketch–extrusion step typically yields 15–40 tokens, split roughly evenly between sketch primitives and extrusion parameters. The explicit separation of planarity, curvature, topological, and Boolean semantics supports compositional hierarchies and long-term dependencies, modeling the latent logic of human CAD design.
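A minimal sketch of how one sketch–extrusion step might be flattened into tokens is given below. The end-marker names (`<ec>`, `<el>`, `<ef>`, `<es>`, `<ee>`) are hypothetical placeholders, since the actual token symbols are not given here. The parameter counts (three Euler angles, three translations, one scale, two distances, one Boolean type) and the two-coordinates-per-line encoding are likewise assumptions modeled on the original DeepCAD parameterization.

```python
# Hypothetical end markers: curve, loop, face, sketch, extrusion.
END_CURVE, END_LOOP, END_FACE, END_SKETCH, END_EXTRUDE = (
    "<ec>", "<el>", "<ef>", "<es>", "<ee>"
)


def sketch_tokens(faces):
    """Flatten faces -> loops -> curves -> quantized parameter values."""
    out = []
    for face in faces:
        for loop in face:
            for curve in loop:
                out.extend(f"p{v}" for v in curve)
                out.append(END_CURVE)
            out.append(END_LOOP)
        out.append(END_FACE)
    out.append(END_SKETCH)
    return out


def extrusion_tokens(euler, translation, scale, distances, boolean_op):
    """Flatten the extrusion parameters and close with the end marker."""
    params = [*euler, *translation, scale, *distances, boolean_op]
    return [f"p{v}" for v in params] + [END_EXTRUDE]


# One face with one rectangular loop; each line is stored as its
# quantized start point (x, y), which is an assumed encoding.
rectangle = [[[(0, 0), (64, 0), (64, 32), (0, 32)]]]

step = sketch_tokens(rectangle) + extrusion_tokens(
    euler=(128, 128, 128),
    translation=(128, 128, 128),
    scale=200,
    distances=(160, 128),
    boolean_op=0,
)
print(len(step), step)
```

Under these assumptions a single rectangular extrusion costs 26 tokens (15 for the sketch, 11 for the extrusion), which sits inside the 15–40 range quoted above.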
3. Evaluation Protocols and Metrics
DeepCAD-240 assesses candidate generative models through both procedural and geometric distribution metrics:
Procedural accuracy:
- Command Type Accuracy (ACC_cmd)
- Parameter Accuracy (ACC_param)
- Primitive-type accuracies (lines, arcs, circles, extrusions), each computed analogously.
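The exact formulas are not reproduced here, so the following is a hedged sketch that follows the convention of the original DeepCAD evaluation as an assumption: command accuracy is the fraction of positions with the correct command type, and parameter accuracy counts a quantized parameter as correct when it lies within a tolerance `tol` of the ground truth, measured only over correctly predicted commands. The function names and the default `tol=3` are illustrative choices.

```python
def command_accuracy(pred_cmds, true_cmds):
    """Fraction of positions whose predicted command type is correct."""
    if len(pred_cmds) != len(true_cmds):
        raise ValueError("sequences must be aligned")
    hits = sum(p == t for p, t in zip(pred_cmds, true_cmds))
    return hits / len(true_cmds)


def parameter_accuracy(pred, true, tol=3):
    """pred/true: aligned lists of (command, [quantized params]).

    Only commands whose type was predicted correctly are scored.
    """
    correct = total = 0
    for (p_cmd, p_params), (t_cmd, t_params) in zip(pred, true):
        if p_cmd != t_cmd:
            continue
        for a, b in zip(p_params, t_params):
            total += 1
            correct += abs(a - b) <= tol
    return correct / total if total else 0.0


def primitive_accuracy(pred_cmds, true_cmds, kind):
    """Command accuracy restricted to positions whose true type is `kind`."""
    idx = [i for i, t in enumerate(true_cmds) if t == kind]
    if not idx:
        return None
    return sum(pred_cmds[i] == kind for i in idx) / len(idx)


true = [("L", [10, 20]), ("A", [30, 40, 5, 1]), ("E", [128, 128])]
pred = [("L", [11, 20]), ("A", [30, 50, 5, 1]), ("R", [1, 2, 3])]

acc_cmd = command_accuracy([c for c, _ in pred], [c for c, _ in true])
acc_param = parameter_accuracy(pred, true)
print(acc_cmd, acc_param)
```

In this toy case two of three commands match, and five of the six parameters belonging to those matched commands fall within tolerance.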
Geometric/distribution metrics:
- Chamfer Distance
- Minimum Matching Distance (MMD)
- Coverage (COV)
- Jensen–Shannon Divergence (JSD) between voxelized distributions
Hausdorff distance and IoU are not reported explicitly; topological validity is gauged via procedural accuracies and watertightness of B-reps (Zhou et al., 23 Mar 2026).
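The geometric metrics can be illustrated with a small pure-Python sketch. It uses the definitions common in the point-cloud generation literature, which is an assumption about this benchmark's exact protocol: squared-distance Chamfer Distance, COV as the fraction of reference shapes that are the nearest neighbor of at least one generated shape, MMD as the mean distance from each reference shape to its closest generated shape, and a base-2 JSD over occupancy histograms. Sampling and normalization details vary between papers.

```python
import math


def chamfer(a, b):
    """Symmetric Chamfer Distance (squared Euclidean) between point sets."""
    def one_way(src, dst):
        total = 0.0
        for s in src:
            total += min(sum((x - y) ** 2 for x, y in zip(s, d)) for d in dst)
        return total / len(src)
    return one_way(a, b) + one_way(b, a)


def coverage_and_mmd(generated, reference):
    """Return (COV, MMD) using Chamfer Distance as the shape distance."""
    dist = [[chamfer(g, r) for r in reference] for g in generated]
    # COV: reference shapes matched as nearest neighbor of some generated shape.
    matched = {min(range(len(reference)), key=row.__getitem__) for row in dist}
    cov = len(matched) / len(reference)
    # MMD: for each reference shape, distance to its closest generated shape.
    mmd = sum(
        min(dist[i][j] for i in range(len(generated)))
        for j in range(len(reference))
    ) / len(reference)
    return cov, mmd


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two histograms."""
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


shape_a = [(0, 0, 0), (1, 0, 0)]
shape_b = [(0, 0, 2), (1, 0, 2)]
cov, mmd = coverage_and_mmd([shape_a], [shape_a, shape_b])
print(cov, mmd, jsd([1, 0], [0, 1]))
```

With one generated shape and two references, only one reference is covered, so COV is 0.5. Disjoint histograms give the maximum base-2 JSD of 1.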
4. Comparative Analysis and Model Baselines
DeepCAD-240’s expanded length and complexity directly expose the scalability limitations of Transformer-based and recurrent generative CAD models developed for prior benchmarks:
| Model | ACC_cmd ↑ | ACC_param ↑ | COV ↑ | MMD ↓ | JSD ↓ | Memory | FLOPs |
|---|---|---|---|---|---|---|---|
| DeepCAD | 75.2 | 72.5 | 64.5 | 1.85 | 4.09 | 8.20GB | 52.8G |
| SkexGen | 81.4 | 78.3 | 68.9 | 1.78 | 3.97 | 11.2GB | 91.2G |
| HNC-CAD | 82.8 | 78.5 | 71.2 | 1.71 | 3.81 | 10.3GB | 87.3G |
| GeoFusion-CAD | 91.2 | 89.3 | 73.9 | 1.12 | 2.97 | 5.20GB | 34.6G |
Table 2. Model comparison under DeepCAD-240 test range (40–240 commands) (Zhou et al., 23 Mar 2026).
- Transformer models (DeepCAD, SkexGen, HNC-CAD) show rapid degradation as sequence length increases (15–20 point drop in command accuracy, lower coverage).
- GeoFusion-CAD employs a geometric state space diffusion framework with linear-time C-Mamba blocks, sustaining high accuracy (91.2%), superior coverage, and substantially reduced memory/FLOPs footprint on long-sequence tasks.
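The size of the gap in Table 2 is easier to read as relative differences. The following arithmetic uses only the figures in the table and compares GeoFusion-CAD against HNC-CAD, the strongest baseline on accuracy. The choice of HNC-CAD as the reference point is an editorial one made for this illustration.

```python
# Values copied from Table 2.
table = {
    "DeepCAD":       {"acc_cmd": 75.2, "acc_param": 72.5, "mem_gb": 8.20, "gflops": 52.8},
    "SkexGen":       {"acc_cmd": 81.4, "acc_param": 78.3, "mem_gb": 11.2, "gflops": 91.2},
    "HNC-CAD":       {"acc_cmd": 82.8, "acc_param": 78.5, "mem_gb": 10.3, "gflops": 87.3},
    "GeoFusion-CAD": {"acc_cmd": 91.2, "acc_param": 89.3, "mem_gb": 5.20, "gflops": 34.6},
}


def compare(model, baseline):
    """Absolute accuracy gains (points) and relative cost reductions (%)."""
    m, b = table[model], table[baseline]
    return {
        "acc_cmd_gain": m["acc_cmd"] - b["acc_cmd"],
        "acc_param_gain": m["acc_param"] - b["acc_param"],
        "memory_reduction_pct": (1 - m["mem_gb"] / b["mem_gb"]) * 100,
        "flops_reduction_pct": (1 - m["gflops"] / b["gflops"]) * 100,
    }


for name, value in compare("GeoFusion-CAD", "HNC-CAD").items():
    print(f"{name}: {value:.1f}")
```

Against HNC-CAD this gives gains of about 8.4 points in command accuracy and 10.8 points in parameter accuracy, with roughly 49.5% less memory and 60.4% fewer FLOPs.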
5. Historical Context and Relation to Prior Benchmarks
The original DeepCAD benchmark introduced a fixed-length (N=60) tokenization and transformer-based generative architecture for 3D sketch–extrude CAD programs (Wu et al., 2021). Each model in DeepCAD is represented as a sketch–extrusion history tokenized into commands (types: SOL, L, A, R, E, EOS), each carrying a 16-parameter vector; sequences are normalized, quantized, and padded for transformer processing.
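The normalize, quantize, and pad pipeline can be sketched as follows. The command list, the length of 60, and the 16-slot parameter vector come from the description above. The 256 quantization levels, the `[-1, 1]` normalization range, the `-1` filler for unused slots, and padding with EOS rows are assumptions in the spirit of the original DeepCAD setup, not details confirmed here.

```python
N_MAX = 60       # fixed sequence length stated in the text
N_PARAMS = 16    # parameters per command stated in the text
LEVELS = 256     # assumed 8-bit quantization
UNUSED = -1      # assumed filler for unused parameter slots

COMMANDS = ["SOL", "L", "A", "R", "E", "EOS"]


def quantize(x, lo=-1.0, hi=1.0, levels=LEVELS):
    """Map a normalized value in [lo, hi] to an integer in [0, levels - 1]."""
    x = min(max(x, lo), hi)
    return min(int((x - lo) / (hi - lo) * levels), levels - 1)


def encode(program, n_max=N_MAX):
    """program: list of (command, [normalized params]).

    Returns n_max rows of (command_index, 16 quantized params).
    """
    rows = []
    for cmd, params in program:
        if len(params) > N_PARAMS:
            raise ValueError("too many parameters for one command")
        q = [quantize(p) for p in params]
        rows.append((COMMANDS.index(cmd), q + [UNUSED] * (N_PARAMS - len(q))))
    if len(rows) >= n_max:
        raise ValueError("program does not fit in the fixed-length window")
    eos = COMMANDS.index("EOS")
    while len(rows) < n_max:
        rows.append((eos, [UNUSED] * N_PARAMS))
    return rows


encoded = encode([("SOL", []), ("L", [0.0, 1.0]), ("E", [-1.0, 0.5])])
print(len(encoded), encoded[1])
```

Calling `encode(program, n_max=240)` would mirror the longer cap described for DeepCAD-240, at the cost of four times as many rows per program when padding is used.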
DeepCAD-240 extends this paradigm to:
- Maximum program length of 240 (4× extension).
- Higher average and median command sequence lengths (mean 36.2, median ~25).
- Wider support for hierarchical, nested, and structurally long CAD programs—capturing intricate dependencies across broader semantic and geometric contexts.
A direct implication is that long-range consistency, hierarchical structural modeling, and efficient memory- and compute-scalable architectures are essential for performance on DeepCAD-240 and related real-world industrial use-cases.
6. Impact on Model Development and Benchmarks
By providing a benchmark with true long-tail sequence distribution and explicit sketch–extrude semantic structure, DeepCAD-240 enables:
- Precise disambiguation of generative failures arising from lost context, memory saturation, or architectural inefficiency.
- Quantitative comparison of next-generation models that move beyond transformer architectures, e.g., diffusion in geometric state space (GeoFusion-CAD), Mamba blocks, etc.
- Rigorous evaluation of command, parameter, and geometric output fidelity in a realistic industrial setting.
Empirical results demonstrate that advances at the architectural level (e.g., GeoFusion-CAD) yield marked gains in both performance and efficiency over transformers, especially as sequence length, hierarchy, and semantic complexity scale (Zhou et al., 23 Mar 2026).
7. Significance and Future Directions
DeepCAD-240 formalizes the challenge of long-sequence, structure-aware, and topologically complex CAD program generation. It highlights the necessity for:
- Scalable architectures capable of maintaining long-range procedural and geometric consistency.
- Richer tokenization and semantic abstractions aligned with real-world CAD modeling practices.
- Comprehensive evaluation protocols capturing both procedural and geometric fidelity.
A plausible implication is that benchmarks such as DeepCAD-240 will become standard in the evaluation pipeline for neural program synthesis, reverse engineering, and foundation models targeting AI-aided design, 3D manufacturing, and engineering automation domains. Researchers are expected to leverage datasets of this scope to iterate on model architectures and training paradigms that generalize to even longer, more complex CAD programs.