Effect of large-scale supervised fine-tuning on implicit depth and spatial learning
Determine whether, in the GACO-CAD setting (single-view image-to-CAD code generation with the Qwen2VL-2B-Instruct multimodal large language model), large-scale supervised fine-tuning (SFT) on RGB images lets the model implicitly learn accurate depth and spatial relationships, so that incorporating depth and surface normal priors yields a smaller marginal performance benefit than it does under small-scale fine-tuning.
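One way to make the question concrete, offered as a sketch and not as the paper's own formulation: let M(c, N) be a performance metric where higher is better, measured for input configuration c after fine-tuning on N training samples. The symbols M, N, and Δ and the two scale labels are notation assumed here for illustration.

```latex
\Delta(N) = M(\text{RGB + depth + normals},\, N) - M(\text{RGB only},\, N),
\qquad
\text{conjecture: } \Delta(N_{\text{large}}) < \Delta(N_{\text{small}})
```

Under this reading, checking the conjecture requires all four configurations (with and without priors, at both scales). For a metric where lower is better, the sign of Δ flips.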
References
We conjecture that large-scale SFT allows the model to implicitly learn accurate depth and spatial relationships from RGB views, partially reducing the marginal benefit of geometric priors.
— GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image
(2510.17157 - Wang et al., 20 Oct 2025) in Experiments, Subsection Performance (Quantitative Results)