Optimizing data mixtures for training the navigation VLM
Develop principled methods to select and weight heterogeneous data sources when fine-tuning the high-level vision-language model for 2D path prediction in VAMOS. Candidate sources include SCAND, CODa, TartanDrive 2, an in-domain Spot dataset, human sketch augmentations, egocentric videos with CoTracker-derived trajectories, and iPhone ARKit-labeled videos. The goal is to improve generalization while avoiding the performance degradation observed with naive inclusion or uniform subsampling of these sources.
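One common baseline for this kind of problem is temperature-scaled mixture weighting, in which per-source sampling probabilities interpolate between proportional-to-size sampling and uniform-over-sources sampling. The sketch below is a minimal illustration of that idea, not the method used in the VAMOS paper; all dataset names map to the sources above, but the sizes and the temperature value are hypothetical.

```python
# Minimal sketch: temperature-scaled mixture weights over heterogeneous
# navigation data sources. Sizes and temperature are illustrative
# assumptions, not values from the VAMOS paper.
from typing import Dict
import numpy as np

def mixture_weights(sizes: Dict[str, int], temperature: float = 2.0) -> Dict[str, float]:
    """Turn raw per-source example counts into sampling probabilities.

    temperature = 1.0 reproduces proportional (uniform-over-examples)
    sampling; larger temperatures flatten the mix toward
    uniform-over-sources, upweighting small in-domain data such as the
    Spot dataset.
    """
    names = list(sizes)
    p = np.array([sizes[n] for n in names], dtype=np.float64)
    p /= p.sum()                   # proportional-to-size probabilities
    p = p ** (1.0 / temperature)   # temperature scaling
    p /= p.sum()                   # renormalize to a distribution
    return dict(zip(names, p))

# Hypothetical example counts per source (e.g., trajectory-labeled frames).
sizes = {
    "SCAND": 120_000,
    "CODa": 90_000,
    "TartanDrive2": 200_000,
    "spot_in_domain": 15_000,
    "sketch_aug": 30_000,
    "egocentric_cotracker": 250_000,
    "iphone_arkit": 60_000,
}
for name, w in mixture_weights(sizes, temperature=3.0).items():
    print(f"{name}: {w:.3f}")
```

In practice, such weights would drive an example-level sampler (for instance, torch.utils.data.WeightedRandomSampler), and the temperature itself becomes a hyperparameter tuned against held-out navigation performance rather than fixed a priori.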
References
"We leave finding better ways to select data mixes from diverse sources such as the ones described in this section for future work."
— VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation
(2510.20818 - Castro et al., 23 Oct 2025) in Appendix: High-Level Training Details – Dataset Preparation and Mixtures (end of section)