CadQuery Code Generation
- CadQuery code generation is the automated conversion of user design intents into executable, parametric CAD scripts using Python and advanced AI techniques.
- It employs multi-stage pipelines, proactive clarification, and self-refinement loops to enhance code validity and maintain strict geometric fidelity.
- Robust datasets, reinforcement learning objectives, and metrics like Chamfer Distance and IoU critically drive improvements in performance and reliability.
CadQuery code generation refers to the automated synthesis of executable CadQuery scripts—Python-based, parametric CAD programs—from diverse human inputs such as natural language, images, or engineering drawings. This domain draws upon advances in LLMs, reinforcement learning, computer vision, and geometric reasoning, with the central goal of bridging the gap between intuitive design intent and precise, manufacturable 3D models. Research in this area consistently emphasizes workflow reliability, geometric fidelity, code validity, and the seamless integration of data-driven reasoning with the symbolic constraints inherent to mechanical design tasks.
1. Problem Formulation and Challenges
CadQuery code generation takes a user-specified intent, typically in natural language or image form, and produces a Python script that uses CadQuery's fluent API to build a parametric solid model. Unlike general text-to-code tasks, the problem is complicated by (a) strict geometric constraints, (b) the need for numerical precision, and (c) the expressiveness of the underlying design language. Major challenges include:
- Specification ambiguity: Free-form prompts often omit critical parameters or contain conflicting requirements, leading to under- or over-constrained problems (Yuan et al., 3 Feb 2026).
- High code invalidity rates: Models untrained for CAD semantics frequently generate syntactically or semantically invalid code, resulting in non-executable scripts (Xie et al., 10 May 2025, Yuan et al., 3 Feb 2026).
- Geometric fidelity evaluation: Token-level losses do not adequately capture 3D shape resemblance between outputs and ground truth; mesh-based or point-cloud metrics such as Chamfer Distance and Intersection-over-Union (IoU) are preferred (Guan et al., 26 May 2025, Niu et al., 29 Dec 2025, Xie et al., 10 May 2025).
- Multi-step reasoning and constraint satisfaction: Complex designs require decomposition, intermediate CoT (Chain-of-Thought) reasoning, workplane management, and integration of multiple primitives and Boolean operations (Niu et al., 13 Aug 2025, Niu et al., 29 Dec 2025).
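To make the invalidity problem concrete, the following is a minimal sketch of a static pre-check that a pipeline might run before executing a generated script. The function name `precheck_script` and the specific rules (a `cadquery` import and an assignment to `result`) are assumptions chosen for illustration, not a check taken from any of the cited systems. It only catches a subset of failures; errors raised by the geometry kernel still require actually running the script.

```python
import ast


def precheck_script(source: str) -> list[str]:
    """Return a list of problems found WITHOUT executing the script.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    imports_cadquery = False
    assigned_names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Covers `import cadquery` and `import cadquery as cq`.
            if any(alias.name.split(".")[0] == "cadquery" for alias in node.names):
                imports_cadquery = True
        elif isinstance(node, ast.ImportFrom):
            # Covers `from cadquery import Workplane`.
            if (node.module or "").split(".")[0] == "cadquery":
                imports_cadquery = True
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned_names.add(target.id)

    problems = []
    if not imports_cadquery:
        problems.append("cadquery is never imported")
    if "result" not in assigned_names:
        problems.append("no assignment to 'result'")
    return problems
```

A script that passes this check can still fail at run time, so a check of this kind is best treated as a cheap filter ahead of full execution.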
2. End-to-End System Architectures
Recent state-of-the-art systems have introduced multi-agent, multi-stage pipelines reflecting the engineering process:
- Proactive clarification pipelines: ProCAD implements a two-agent architecture: a clarifying agent detects ambiguities and unresolvable constraints, interactively querying the user before code synthesis; a coding agent then translates the clarified specification into CadQuery code. This pipeline drastically reduces both mean Chamfer distance and program invalidity rate compared to direct end-to-end generation (Yuan et al., 3 Feb 2026).
- Multi-expert collaborative learning: CME-CAD utilizes a heterogeneous multi-expert setup, where several large models (with distinct system prompts and knowledge bases) collaboratively produce reasoning traces and candidate CadQuery scripts. Knowledge is transferred via KL-divergence minimization and hard-negative buffer replay, culminating in a unified policy that is robust across input modalities (Niu et al., 29 Dec 2025).
- Self-refinement and feedback-driven loops: Frameworks such as Query2CAD and Text-to-CadQuery incorporate self-refinement mechanisms, using feedback (rendered images, vision-LLMs, or human-in-the-loop correction) to iteratively repair and improve generated scripts post hoc, addressing both syntactic errors and semantic mismatches (Badagabettu et al., 2024, Xie et al., 10 May 2025).
- Chain-of-Thought-augmented models: Multiple works explicitly prepend a CoT or reasoning prefix to the prompt, inducing the LLM to decompose the modeling task into subgoals before emitting CAD code, measurably increasing execution and accuracy metrics (Niu et al., 13 Aug 2025, Guan et al., 26 May 2025, Niu et al., 29 Dec 2025).
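The feedback-driven loop described above can be sketched as a small generate, execute, repair cycle. Everything here is a hypothetical stand-in: `refine`, `toy_generator`, and `syntax_only_executor` are illustrative names, the generator is a stub rather than an LLM call, and the executor only compiles the script instead of running CadQuery and rendering the result as the cited frameworks do.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    script: str
    error: Optional[str]  # None means the executor reported success


def refine(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    execute: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Regenerate a script until it executes or the round budget is spent."""
    history: List[Attempt] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        script = generate(prompt, feedback)
        error = execute(script)
        history.append(Attempt(script, error))
        if error is None:
            break
        # Feed the failure back so the next generation can repair it.
        feedback = f"Previous script failed with: {error}"
    return history


def syntax_only_executor(script: str) -> Optional[str]:
    """Stub executor (assumption): checks syntax only, never runs CadQuery."""
    try:
        compile(script, "<generated>", "exec")
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"
    return None


def toy_generator(prompt: str, feedback: Optional[str]) -> str:
    """Stub generator (assumption): emits a broken script first, then a fix."""
    head = 'import cadquery as cq\nresult = cq.Workplane("XY").box(30, 30, 5'
    return head if feedback is None else head + ")"


history = refine("30 x 30 x 5 plate", toy_generator, syntax_only_executor)
```

In a real system the feedback string would carry richer signals, such as a traceback or a vision-model critique of the rendered part, but the control flow stays the same.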
3. Data Curation and Annotation Pipelines
Dataset scale, quality, and annotation pipelines are critical determinants of generation performance:
- Large-scale pairings: Representative datasets include GenCAD-Code (163k image–CadQuery pairs) (Doris et al., 20 May 2025), ExeCAD (16.5k routines with NL and structured design specs) (Niu et al., 13 Aug 2025), a 170k sample text–CadQuery set constructed via JSON-to-CadQuery prompting with self-correction (Xie et al., 10 May 2025), and the CADExpert benchmark (17,299 orthographic projections with expert annotation) (Niu et al., 29 Dec 2025).
- Automated and semi-automated annotation: Annotation typically entails rendering a 3D model from a baseline representation (sketch–extrude, JSON command sequence), auto-generating one or more CadQuery candidates via LLMs, enforcing execution success, and selecting the candidate that minimizes Chamfer Distance vs. ground truth. Post-processing includes leakage filtering (no API names in prompt), completeness checks, and periodic manual review for ambiguous or failed cases (Xie et al., 10 May 2025, Yuan et al., 3 Feb 2026).
- CoT and hard case augmentation: High-quality CoT reasoning traces are preferentially added to the hardest, high-Chamfer-difference samples, resulting in fine-tuning datasets with explicit planning supervision (Guan et al., 26 May 2025, Niu et al., 29 Dec 2025).
- Continuous benchmark evolution: Standardized metrics, code format skeletons, and multi-modality (NL, sketches, rendered meshes) allow robust cross-system evaluation and ablation (Niu et al., 29 Dec 2025).
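The candidate-selection step in the annotation pipeline above can be sketched as follows. This is an illustration under stated assumptions: the `Candidate` record, the `max_chamfer` review threshold, and the list of API names used for the leakage check are all hypothetical, and the Chamfer values are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    script: str
    executed: bool            # did the script run without error?
    chamfer: Optional[float]  # distance to ground truth; None if it did not run


def select_annotation(
    candidates: Sequence[Candidate],
    max_chamfer: Optional[float] = None,
) -> Optional[Candidate]:
    """Keep only executable candidates and return the closest one.

    Returns None when nothing qualifies, which a pipeline could route
    to manual review (threshold value is an assumption).
    """
    valid = [c for c in candidates if c.executed and c.chamfer is not None]
    if not valid:
        return None
    best = min(valid, key=lambda c: c.chamfer)
    if max_chamfer is not None and best.chamfer > max_chamfer:
        return None
    return best


def leaks_api_names(
    prompt: str,
    api_names: Sequence[str] = ("cq.", "Workplane", "cutThruAll"),
) -> bool:
    """Flag prompts that mention API identifiers (illustrative list only)."""
    lowered = prompt.lower()
    return any(name.lower() in lowered for name in api_names)
```

Returning `None` rather than the least-bad candidate keeps low-quality pairs out of the training set, at the cost of more samples needing review.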
4. Training, Objective Functions, and Optimization Techniques
The learning strategy for CadQuery code generation comprises supervised fine-tuning (SFT), reinforcement learning (RL) with custom geometric rewards, collaborative expert learning, and refinement-stage loss components:
- Supervised fine-tuning (SFT): Models are initially trained to maximize the log-likelihood of ground-truth code (and optionally CoT) given the input, leveraging transformer LLMs such as Qwen2.5-7B-Instruct, Vicuna, Gemma-3B, and Mistral-7B (Xie et al., 10 May 2025, Doris et al., 20 May 2025, Guan et al., 26 May 2025).
- Reinforcement learning with geometric reward: RL objectives are constructed using mesh-based Chamfer Distance or IoU between generated and target geometries, often gated by code executability; some methods incorporate policy-gradient or GRPO (Group Relative Policy Optimization) with token-level or group-level advantage estimation (Niu et al., 13 Aug 2025, Niu et al., 29 Dec 2025, Guan et al., 26 May 2025).
- Collaborative and multi-expert RL: CME-CAD jointly trains multiple expert branches, applying per-expert advantage estimation and KL-divergence penalties to maximize knowledge transfer and diversity in reasoning/coding strategies (Niu et al., 29 Dec 2025).
- Specialized optimizations: Trust Region Stretch relaxes PPO clipping for better exploration; Precision Token Loss up-weights the importance of numeric or geometry-critical tokens; overlong (truncated) sequences are filtered from RL updates to avoid noisy supervision (Niu et al., 13 Aug 2025).
- Format and semantic alignment rewards: Output format correctness (e.g., presence of CoT block and code block) and external semantic evaluation (using LLMs or vision-LLMs) provide additional RL reward channels (Niu et al., 13 Aug 2025, Guan et al., 26 May 2025).
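As a rough illustration of how an executability-gated reward and group-level advantages fit together, consider the sketch below. The weights `w_format` and `w_geometry`, the choice to give zero reward to non-executable scripts, and the function names are assumptions for this example; the cited papers define their own reward shapes. The advantage normalization (subtract the group mean, divide by the group standard deviation) is the standard group-relative form used in GRPO.

```python
import statistics
from typing import List, Sequence


def shaped_reward(
    executable: bool,
    has_think_block: bool,
    has_code_block: bool,
    iou: float,
    w_format: float = 0.2,    # assumed weight, not from the source
    w_geometry: float = 0.8,  # assumed weight, not from the source
) -> float:
    """Combine format and geometry rewards, gated by executability."""
    if not executable:
        return 0.0  # assumption: a script that does not run earns nothing
    format_score = (int(has_think_block) + int(has_code_block)) / 2
    return w_format * format_score + w_geometry * iou


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every sample earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are computed relative to the group, a prompt for which every sample fails (or every sample succeeds equally) contributes no learning signal, which is one reason executability gating and reward shaping interact strongly in practice.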
5. Evaluation Metrics and Empirical Results
Model output is evaluated along multiple axes emphasizing executable correctness and geometric fidelity:
| Metric | Definition | Typical Range (SOTA) |
|---|---|---|
| Chamfer Distance (CD) | Sum of squared nearest-neighbor distances between point clouds; lower is better (Xie et al., 10 May 2025, Yuan et al., 3 Feb 2026, Guan et al., 26 May 2025) | Down to 6.5e-3 (CAD-Coder RL) |
| Intersection-over-Union (IoU) | Volumetric overlap of generated vs. ground truth solids (Niu et al., 29 Dec 2025, Niu et al., 13 Aug 2025) | Up to 80.7% (CME-CAD) |
| Invalidity Ratio (IR) | Fraction of scripts that fail to execute or produce invalid CAD (Xie et al., 10 May 2025, Yuan et al., 3 Feb 2026, Guan et al., 26 May 2025) | Down to 0.9% (ProCAD, CAD-RL) |
| Executability Rate | Fraction of outputs that parse and run without error (Doris et al., 20 May 2025, Niu et al., 13 Aug 2025, Niu et al., 29 Dec 2025) | Up to 99.6% (CAD-RL) |
| Top-1 Exact Match | Binary equivalence via LLM or rendered mesh judge (Xie et al., 10 May 2025, Guan et al., 26 May 2025) | Up to 69.3% (Text-to-CadQuery) |
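The three core metrics in the table can be sketched in a few lines of dependency-free Python. Two assumptions are worth flagging: the Chamfer variant below averages squared nearest-neighbor distances in each direction and adds the two terms, whereas normalization conventions differ between papers; and IoU is computed on sets of occupied voxel indices rather than on exact solids. The brute-force nearest-neighbor search is only suitable for small point sets.

```python
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def _mean_sq_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average squared distance from each point in src to its nearest in dst."""
    total = 0.0
    for p in src:
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
    return total / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (mean-normalized variant, an assumption)."""
    return _mean_sq_nearest(a, b) + _mean_sq_nearest(b, a)


def voxel_iou(a: Iterable[Voxel], b: Iterable[Voxel]) -> float:
    """Intersection-over-Union of two sets of occupied voxel indices."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here: two empty shapes count as a miss
    return len(set_a & set_b) / len(union)


def invalidity_ratio(executed_flags: Sequence[bool]) -> float:
    """Fraction of scripts that failed to execute."""
    failures = sum(1 for ok in executed_flags if not ok)
    return failures / len(executed_flags)
```

For example, two single-point clouds one unit apart give a Chamfer distance of 2.0 under this variant, since each direction contributes a squared distance of 1.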
Notable empirical findings include:
- Proactive clarification before code generation (ProCAD) yields a 79.9% reduction in mean Chamfer distance and a reduction in IR from 4.8% to 0.9% compared to top closed-source baselines (Yuan et al., 3 Feb 2026).
- Increasing model scale and leveraging code-pretrained LLMs steadily improve geometric accuracy and reduce syntax errors; Qwen2.5-3B achieves F1 ≈ 0.984, IoU ≈ 0.987 on code generation tasks (Xie et al., 10 May 2025).
- Multi-expert RL delivers significant improvements: CME-CAD achieves IoU 80.7%, IR 1.75%, outperforming single-model and non-collaborative approaches (Niu et al., 29 Dec 2025).
- Self-refinement loops and error-repair stages further raise success rates: a single refinement iteration boosts Query2CAD success by 19.6 percentage points (from 53.6% to 73.2%) (Badagabettu et al., 2024).
6. Representative CadQuery Code Generation Patterns
Generated scripts typically follow CadQuery's idioms: starting with import cadquery as cq, instantiating a cq.Workplane (specifying plane and origin), sequencing primitives and sketch operations (e.g., .box(), .circle(), .rect(), .polyline()), applying solid operations (.extrude(), .cut(), .union()), and ending with an assignment to 'result' or 'assembly'. Note that .box() creates a 3D solid directly, whereas .circle(), .rect(), and .polyline() produce 2D profiles that must be extruded or cut. Advanced models handle multi-feature assemblies, CoT reasoning blocks, and geometric constraints.
Examples:
- Simple plate with hole:
```python
import cadquery as cq
plate = cq.Workplane("XY").box(30, 30, 5)
result = (plate.faces(">Z").workplane().circle(5).cutThruAll())
```
- Object with explicit CoT reasoning:
```
<think>
1. Sketch base rectangle
2. Extrude upward
3. Drill four holes at corners
</think>
```
```python
import cadquery as cq
# ... code here ...
```
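Downstream tooling has to separate the reasoning block from the script before execution. The sketch below assumes the `<think>...</think>` tag format shown above and that the code follows the closing tag as plain text; the function name `split_response` and the behavior when no tag is present are illustrative choices, not part of any cited system.

```python
import re
from typing import Optional, Tuple

# Non-greedy match so only the first reasoning block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(text: str) -> Tuple[Optional[str], str]:
    """Split a model response into (reasoning, code).

    If no <think> block is found, reasoning is None and the whole
    response is treated as code (an assumption made for this sketch).
    """
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    code = text[match.end():].strip()
    return reasoning, code
```

The same split can double as a format check: a response with a missing reasoning block or an empty code section is easy to detect from the returned pair.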
7. Implementation, Limitations, and Prospective Directions
Research artifacts are largely open-source, supporting straightforward reproduction: installation instructions, pretrained weights, annotation scripts, and inference notebooks are provided by leading groups (Yuan et al., 3 Feb 2026, Doris et al., 20 May 2025, Xie et al., 10 May 2025). Most systems can be run on a single or few A100/H200 GPUs. Practical caveats include:
- Residual geometric errors: Despite sub-percent invalidity rates, Chamfer/IoU gaps to perfectly matching ground truth remain, especially for complex assemblies (Niu et al., 13 Aug 2025).
- Real-world deployment challenges: Generalization to real photographs, out-of-distribution primitives, and complex parameterizations is incomplete, especially outside the synthetic/rendered data regime (Doris et al., 20 May 2025).
- Prompt sensitivity and error modes: Catastrophic failures can result from ambiguous or adversarial prompts; interactive clarification, as in ProCAD’s architecture, significantly mitigates this (Yuan et al., 3 Feb 2026).
- The need for richer, multimodal, and interactive datasets: Future work is suggested in extending datasets to photorealistic renderings, supporting user-drawn sketches, integrating constraint-solving techniques, and accommodating human-in-the-loop correction (Xie et al., 10 May 2025, Badagabettu et al., 2024).
Continued progress in LLM-driven CadQuery code generation is expected to further democratize CAD modeling, enabling a rapid, reliable, and automated transition from high-level intent to parametric, editable, and manufacturing-ready 3D designs.