cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning (2505.22914v1)

Published 28 May 2025 in cs.CV and cs.LG

Abstract: Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLMs), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by LLM training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state-of-the-art on three challenging datasets, including a real-world one.

Summary

Multi-modal CAD Reconstruction with Online Reinforcement Learning

The paper "cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning" presents an innovative approach to computer-aided design (CAD) model reconstruction by leveraging a multi-modal input mechanism and advanced training methodologies. Herein, CAD reconstruction refers to the generation of CAD models from various data inputs, including point clouds, multi-view images, and textual descriptions. This approach is significant for the engineering and manufacturing sectors, where precise and editable 3D models are necessary.

Technical Approach

The paper introduces a model named cadrille, which integrates multiple input modalities to improve the robustness and generalizability of CAD reconstruction. Unlike traditional methods that handle only a single modality (e.g., point clouds or images), cadrille builds on vision-language models (VLMs) to process point clouds, images, and text simultaneously. It follows a two-stage training pipeline, akin to the standard LLM training recipe:

  1. Supervised Fine-tuning (SFT): The model is first fine-tuned on large-scale, procedurally generated data, learning to translate each of the three input modalities into valid CAD models.
  2. Reinforcement Learning (RL): The model is then refined with online RL, specifically Group Relative Policy Optimization (GRPO), which proves more effective than offline alternatives; the reward signal is obtained programmatically from the generated models (a simplified sketch follows this list).
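
The summary does not spell out the RL stage, so the following is a minimal sketch of how GRPO-style fine-tuning with programmatic rewards could look. The reward helpers (`execute_cad`, `voxel_iou`) are hypothetical placeholders for a CAD interpreter and a shape-similarity metric; only the group-relative advantage and clipped policy-gradient pieces follow the standard GRPO formulation.

```python
import torch

def cad_reward(program: str, gt_shape) -> float:
    """Online, programmatic feedback for one generated CAD program.
    `execute_cad` and `voxel_iou` are hypothetical helpers standing in for
    a CAD executor and a geometric similarity metric."""
    try:
        pred_shape = execute_cad(program)   # run the generated CAD code
    except Exception:
        return 0.0                          # invalid programs earn zero reward
    return float(voxel_iou(pred_shape, gt_shape))

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, group_size). Each completion is scored relative
    to the other completions sampled for the same prompt, so no separate
    value network is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std

def grpo_loss(logprobs: torch.Tensor,
              old_logprobs: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped objective applied with group-relative advantages."""
    ratio = (logprobs - old_logprobs).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In practice the objective would be applied per token and typically regularized toward the SFT model (e.g., with a KL penalty); those details are omitted from this sketch.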

Numerical Results

Empirical evaluations show consistent gains across multiple benchmarks. On the DeepCAD benchmark, the SFT-trained cadrille model outperforms existing single-modal approaches across all three input modalities. After the RL fine-tuning stage, it sets new state-of-the-art results on three challenging datasets, including the real-world CC3D dataset, indicating robustness in practical scenarios.
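
The Key Contributions section below refers to IoU and the rate of invalid predictions. A small, self-contained sketch of these two metrics, assuming shapes are compared as boolean voxel grids (the paper's exact evaluation protocol may differ), is:

```python
import numpy as np

def voxel_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two boolean occupancy grids."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 1.0

def invalid_ratio(predictions: list) -> float:
    """Fraction of generated CAD programs that failed to execute
    (represented here simply as None entries)."""
    return sum(p is None for p in predictions) / max(len(predictions), 1)
```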

Key Contributions

  • Multi-modal Processing: cadrille processes point clouds, images, and text within a single unified framework, yielding substantial improvements in reconstruction quality and robustness (an illustrative input-packing sketch follows this list).
  • Novel Training Paradigm: The work is, by its own account, the first to apply RL fine-tuning of LLMs to CAD reconstruction, showing that online RL (GRPO) outperforms offline alternatives and that the approach scales to realistic settings.
  • Enhanced Accuracy and Robustness: The proposed method substantially reduces invalid predictions and improves IoU scores across varied datasets, demonstrating effective generalization across domains.
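
As noted in the first bullet, all three modalities share one model. Below is a sketch of how the inputs could be packed into a single VLM-style chat prompt. The message format mimics common VLM chat templates, and serializing the point cloud as text is only one plausible option (a learned point encoder is another); none of these specifics are confirmed by this summary.

```python
from typing import List, Optional, Sequence

def build_cad_prompt(text: Optional[str] = None,
                     image_paths: Optional[List[str]] = None,
                     points: Optional[Sequence[Sequence[float]]] = None,
                     max_points: int = 256) -> list:
    """Pack whichever modalities are available into one chat-style request.
    The dict structure follows common VLM chat conventions; it is illustrative,
    not the exact interface used by cadrille."""
    content = []
    for path in image_paths or []:                      # multi-view renderings
        content.append({"type": "image", "image": path})
    if points is not None:                              # e.g. an (N, 3) point cloud
        coords = " ".join(f"({x:.2f},{y:.2f},{z:.2f})"
                          for x, y, z in list(points)[:max_points])
        content.append({"type": "text", "text": f"Point cloud samples: {coords}"})
    if text:                                            # free-form description
        content.append({"type": "text", "text": text})
    content.append({"type": "text",
                    "text": "Reconstruct the shape as an editable CAD program."})
    return [{"role": "user", "content": content}]
```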

Implications and Future Research

The innovations presented by cadrille could lead to broader applications in fields where CAD models are pivotal, such as product design, automotive, aerospace, and construction industries. With improved robustness and efficiency, tools like cadrille could democratize access to complex design applications, making them more accessible to non-experts.

Future research could dive deeper into optimizing LLMs specifically for CAD purposes, integrating more sophisticated feedback mechanisms, or exploring further applications in real-world scenarios. Moreover, adaptations of this framework could guide advancements in other multimodal AI applications by utilizing similar cross-modality processing approaches.

In conclusion, the multi-modal CAD reconstruction approach, enhanced via online RL, marks a substantial step forward for computational methods in CAD and lays the groundwork for future AI-assisted design tools.
