
Optimizing data mixtures for training the navigation VLM

Develop principled methods to select and weight heterogeneous data sources (such as SCAND, CODa, TartanDrive 2, an in-domain Spot dataset, human sketch augmentations, egocentric videos with CoTracker-derived trajectories, and iPhone ARKit-labeled videos) when fine-tuning the high-level vision-language model for 2D path prediction in VAMOS. The aim is to improve generalization while avoiding the performance degradation observed when these sources are included naively or subsampled uniformly.


Background

The paper trains a high-level vision-language planner (based on PaliGemma 2) to predict 2D navigation paths from images and text-encoded goals, using a curated mixture of robot navigation datasets (SCAND, CODa, TartanDrive 2, and a small in-domain Spot dataset) plus human sketch augmentations. The authors found that heuristic down-weighting and filtering were necessary to achieve strong results.
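The heuristic down-weighting described above amounts to assigning each source a sampling weight. Below is a minimal PyTorch sketch of that idea, assuming map-style datasets; the dataset stand-ins and weight values are illustrative, not the paper's actual mixture.

```python
# Per-source down-weighting via weighted sampling (illustrative sketch).
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Stand-ins for SCAND, CODa, TartanDrive 2, and the small in-domain Spot set.
sources = {
    "scand": TensorDataset(torch.randn(1000, 4)),
    "coda": TensorDataset(torch.randn(800, 4)),
    "tartandrive2": TensorDataset(torch.randn(1200, 4)),
    "spot": TensorDataset(torch.randn(50, 4)),
}
# Hypothetical weights: up-weight scarce in-domain data, down-weight
# large out-of-domain corpora (values are not from the paper).
mix_weights = {"scand": 0.5, "coda": 0.5, "tartandrive2": 0.25, "spot": 4.0}

combined = ConcatDataset(list(sources.values()))
# Expand each source weight to one weight per example in that source.
per_example = torch.cat([
    torch.full((len(ds),), mix_weights[name])
    for name, ds in sources.items()
])
sampler = WeightedRandomSampler(per_example, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=32, sampler=sampler)
```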

They experimented with additional non-robot sources—egocentric videos processed via CoTracker and smartphone videos labeled with ARKit odometry—but naive inclusion of these data hurt performance. Although these sources are scalable and promising for multi-modal training, the authors did not identify an optimal mixture strategy and cite prior work on data mixture optimization as a potential direction.
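One widely used baseline from the data-mixture literature the authors point to (not the paper's method) is temperature-scaled sampling: a source with n_i examples is drawn with probability proportional to n_i**alpha for some alpha in (0, 1], which flattens the mixture so that large non-robot corpora cannot swamp scarce robot data. A hedged sketch, with illustrative source names and sizes:

```python
# Temperature-scaled mixture probabilities (common baseline, illustrative).
def mixture_probs(sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    # p_i proportional to n_i**alpha; alpha < 1 flattens toward uniform.
    scaled = {name: n ** alpha for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: v / total for name, v in scaled.items()}

# Hypothetical corpus sizes, not reported in the paper.
sizes = {"robot_nav": 50_000, "egocentric_cotracker": 500_000, "arkit_iphone": 200_000}
print(mixture_probs(sizes, alpha=0.3))  # keeps the huge non-robot corpora from dominating
```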

Consequently, determining better ways to select and weight diverse data sources for fine-tuning the high-level VLM planner remains unresolved and is explicitly deferred to future work.

References

We leave finding better ways to select data mixes from diverse sources such as the ones described in this section for future work.

VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation (arXiv:2510.20818, Castro et al., 23 Oct 2025), Appendix: High-Level Training Details – Dataset Preparation and Mixtures (end of section).