Enhancing 3D metric-grounded capabilities of VLMs without sacrificing language reasoning

Establish approaches to improve the 3D metric-grounded spatial understanding and measurement capabilities of vision–language models (VLMs) while preserving their natural-language reasoning performance. Specifically, determine training strategies and representations that enable accurate absolute-scale predictions (e.g., depth and distances) and 3D positional sequence generation without degrading general instruction-following and reasoning abilities.
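To make the problem statement concrete, the sketch below shows one common way such a training strategy could couple metric supervision with language modeling: a next-token cross-entropy loss plus a scale-invariant log-depth term (in the style of Eigen et al., 2014), mixed with a small weight so the metric term does not swamp the language objective. The function, tensor shapes, and the weight `alpha` are illustrative assumptions, not RoboTracer's actual objective.

```python
import torch
import torch.nn.functional as F

def joint_loss(lm_logits, lm_targets, depth_pred, depth_gt, alpha=0.1):
    """Illustrative joint objective: language modeling + metric depth.

    lm_logits:  (B, T, V) next-token logits; lm_targets: (B, T) token ids.
    depth_pred: (B, H, W) predicted metric depth in meters (> 0).
    depth_gt:   (B, H, W) ground-truth depth; 0 marks invalid pixels.
    """
    # Standard next-token cross-entropy preserves instruction following.
    lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                         lm_targets.reshape(-1), ignore_index=-100)
    # Scale-invariant log-depth loss: penalizes relative (not absolute)
    # error, so supervision is meaningful from centimeters to meters.
    valid = depth_gt > 0
    d = torch.log(depth_pred[valid]) - torch.log(depth_gt[valid])
    si = (d ** 2).mean() - 0.5 * d.mean() ** 2
    # A small alpha keeps the metric term from degrading language reasoning.
    return lm + alpha * si

# Example with random tensors (batch 2, sequence 8, vocab 32000, 64x64 depth):
loss = joint_loss(torch.randn(2, 8, 32000),
                  torch.randint(0, 32000, (2, 8)),
                  torch.rand(2, 64, 64) * 10 + 0.1,
                  torch.rand(2, 64, 64) * 10)
```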

Background

The paper introduces RoboTracer, a 3D-aware vision–language model designed for spatial tracing via multi-step, metric-grounded reasoning, including 3D spatial referring and measuring. Despite its architectural and training innovations (a universal spatial encoder, a scale decoder, and metric-sensitive process rewards) and the large-scale TraceSpatial dataset, the authors observe that gains in 3D metric-grounded understanding are limited relative to gains in 2D spatial reasoning.
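The paper does not spell out the form of its metric-sensitive process rewards here; as one plausible shape, the sketch below scores an intermediate metric prediction (say, a distance in meters) by its symmetric relative error, so a reasoning step is rewarded smoothly for landing near the true scale. The function name and the tolerance ratio are hypothetical.

```python
import math

def metric_step_reward(pred_m: float, gt_m: float, tol: float = 1.25) -> float:
    """Hypothetical scale-aware reward for one reasoning step.

    Returns a value in (0, 1]: 1.0 for an exact metric prediction,
    ~0.37 when pred/gt (or gt/pred) reaches the tolerance ratio, and
    decaying toward 0 beyond it. Using relative error makes a 10 cm
    miss count far more at 0.5 m than at 5 m.
    """
    if pred_m <= 0 or gt_m <= 0:
        return 0.0
    ratio = max(pred_m / gt_m, gt_m / pred_m)  # symmetric, >= 1
    return math.exp(-(math.log(ratio) / math.log(tol)) ** 2)

# e.g. metric_step_reward(2.1, 2.0) ~= 0.95; metric_step_reward(4.0, 2.0) ~= 0.0
```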

They attribute this gap to missing ingredients: a scene-level 3D representation that aligns naturally with language, and comprehensive, scale-aware supervision signals. Accordingly, the authors state that strengthening 3D metric-grounded capabilities while maintaining language reasoning remains an open and challenging problem, and they point to future work on better input representations and richer supervision at both the input and output levels.
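One way to read "a scene-level 3D representation naturally aligned with language" is to lift metric depth into camera-frame points and serialize coordinates as plain text at a fixed metric resolution, so absolute scale flows through ordinary tokens at both the input and output. The pinhole back-projection below is standard geometry; the token format and the 1 cm quantization step are illustrative assumptions, not the paper's proposal.

```python
import numpy as np

def backproject(depth_m, fx, fy, cx, cy):
    """Lift a metric depth map (H, W) to camera-frame points (H, W, 3)
    with the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    return np.stack([(u - cx) * depth_m / fx,
                     (v - cy) * depth_m / fy,
                     depth_m], axis=-1)

def to_metric_token(point_m, step_m=0.01):
    """Serialize one 3D point as text at fixed metric resolution (1 cm),
    so absolute-scale coordinates become ordinary language tokens."""
    x, y, z = (round(c / step_m) * step_m for c in point_m)
    return f"<p> {x:.2f} {y:.2f} {z:.2f} </p>"

# A flat wall 2 m away; the center pixel maps to "<p> 0.00 0.00 2.00 </p>".
pts = backproject(np.full((480, 640), 2.0), fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(to_metric_token(pts[240, 320]))
```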

References

These results indicate that enhancing the 3D metric-grounded capabilities of VLMs, while maintaining their language understanding for reasoning, remains an open and challenging problem.

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics (arXiv:2512.13660, Zhou et al., 15 Dec 2025), Supplementary Material, Section "Discussion on Limitations and Future Work"