Effect of large-scale supervised fine-tuning on implicit depth and spatial learning

Determine whether, in the GACO-CAD setting, where the Qwen2VL-2B-Instruct multimodal large language model generates CAD code from a single-view image, large-scale supervised fine-tuning on RGB images enables the model to implicitly learn accurate depth and spatial relationships, such that the marginal performance benefit of incorporating depth and surface-normal priors is smaller than under small-scale fine-tuning.

Background

The paper introduces geometric priors, namely depth and surface normal maps, into the supervised fine-tuning (SFT) stage to enhance spatial understanding for single-view CAD generation. Quantitative results show consistent improvements over baselines, with notably larger gains from the geometric priors in the small-scale (≈80k samples) SFT setting than in the large-scale (≈1M samples) SFT setting.
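One way to read this pattern concretely is as a 2x2 design over SFT scale and input modality, where the quantity of interest is the with-priors score minus the RGB-only score at each scale. A minimal sketch follows; the scores are arbitrary placeholders, not the paper's numbers, and serve only to show how the marginal gain is computed.

    # 2x2 design: SFT scale x input modality.  Scores are arbitrary
    # placeholders, NOT results from the paper; they illustrate only
    # how the marginal benefit of geometric priors is read off.
    scores = {
        ("80k", "rgb"):              0.60,  # placeholder
        ("80k", "rgb+depth+normal"): 0.66,  # placeholder
        ("1M",  "rgb"):              0.72,  # placeholder
        ("1M",  "rgb+depth+normal"): 0.74,  # placeholder
    }

    for scale in ("80k", "1M"):
        gain = scores[(scale, "rgb+depth+normal")] - scores[(scale, "rgb")]
        print(f"SFT scale {scale}: marginal gain from geometric priors = {gain:+.3f}")
    # The reported pattern corresponds to the 1M gain being smaller
    # than the 80k gain.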

To explain this pattern, the authors explicitly conjecture that large-scale SFT may enable the model to implicitly learn depth and spatial relationships from RGB inputs, thereby diminishing the marginal advantage of adding explicit geometric priors. This hypothesis links training scale to the necessity and impact of geometric modalities beyond RGB.
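If the conjecture holds, it should also be visible upstream of the end task: hidden states of the RGB-only model after large-scale SFT should encode depth in a more linearly decodable way than after small-scale SFT. Below is a minimal sketch of such a test, assuming a linear probe trained on frozen per-patch features; the extract_features hook, feature dimensions, and dummy data are illustrative assumptions, as the paper does not report a probing experiment.

    # Hedged sketch: linear depth probe over frozen VLM features.
    # The hook, dimensions, and data below are assumptions made for
    # illustration only.
    import torch
    import torch.nn as nn

    def extract_features(model, rgb_batch: torch.Tensor) -> torch.Tensor:
        """Hypothetical hook returning per-patch hidden states (B, N, D).
        Stubbed with random features so the sketch runs standalone."""
        return torch.randn(rgb_batch.shape[0], 196, 1536)  # illustrative sizes

    class DepthProbe(nn.Module):
        """Linear head mapping frozen features to per-patch depth."""
        def __init__(self, dim: int = 1536):
            super().__init__()
            self.head = nn.Linear(dim, 1)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.head(feats).squeeze(-1)  # (B, N) depth estimates

    def probe_step(model, probe, opt, rgb, depth_targets):
        with torch.no_grad():                    # keep the VLM frozen
            feats = extract_features(model, rgb)
        loss = nn.functional.l1_loss(probe(feats), depth_targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Train one probe per checkpoint (80k-SFT vs. 1M-SFT).  A lower
    # held-out probe error for the 1M checkpoint would support the
    # implicit-depth conjecture; comparable errors would argue against it.
    probe = DepthProbe()
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    rgb = torch.randn(4, 3, 224, 224)            # dummy RGB batch
    depth = torch.rand(4, 196)                   # dummy per-patch depth
    print(probe_step(model=None, probe=probe, opt=opt, rgb=rgb, depth_targets=depth))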

References

"We conjecture that large-scale SFT allows the model to implicitly learn accurate depth and spatial relationships from RGB views, partially reducing the marginal benefit of geometric priors."

GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image (Wang et al., arXiv:2510.17157, 20 Oct 2025), Experiments, subsection "Performance (Quantitative Results)".