Enhancing 3D metric-grounded capabilities of VLMs without sacrificing language reasoning
Establish approaches that improve the 3D metric-grounded spatial understanding and measurement capabilities of vision–language models (VLMs) while preserving their natural-language reasoning performance. Specifically, determine training strategies and representations that enable accurate absolute-scale predictions (e.g., depth and distances) and 3D positional sequence generation without degrading general instruction-following and reasoning abilities.
References
These results indicate that enhancing the 3D metric-grounded capabilities of VLMs, while maintaining their language understanding for reasoning, remains an open and challenging problem.
— RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
(2512.13660 - Zhou et al., 15 Dec 2025) in Supplementary, Section "Discussion on Limitations and Future Work"