Papers

Topics

Authors

Recent

View all

Detailed Answer

Quick Answer

Concise responses based on abstracts only

Detailed Answer

Well-researched responses based on abstracts and relevant paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses

Gemini 2.5 Flash

Gemini 2.5 Flash 77 tok/s

Gemini 2.5 Pro 33 tok/s Pro

GPT-5 Medium 25 tok/s Pro

GPT-5 High 27 tok/s Pro

GPT-4o 75 tok/s Pro

Kimi K2 220 tok/s Pro

GPT OSS 120B 465 tok/s Pro

Claude Sonnet 4 36 tok/s Pro

2000 character limit reached

Automated Data Curation Using GPS & NLP to Generate Instruction-Action Pairs for Autonomous Vehicle Vision-Language Navigation Datasets (2505.03174v1)

Published 6 May 2025 in cs.RO, cs.CV, and cs.LG

Abstract: Instruction-Action (IA) data pairs are valuable for training robotic systems, especially autonomous vehicles (AVs), but having humans manually annotate this data is costly and time-inefficient. This paper explores the potential of using mobile application Global Positioning System (GPS) references and NLP to automatically generate large volumes of IA commands and responses without having a human generate or retroactively tag the data. In our pilot data collection, by driving to various destinations and collecting voice instructions from GPS applications, we demonstrate a means to collect and categorize the diverse sets of instructions, further accompanied by video data to form complete vision-language-action triads. We provide details on our completely automated data collection prototype system, ADVLAT-Engine. We characterize collected GPS voice instructions into eight different classifications, highlighting the breadth of commands and referentialities available for curation from freely available mobile applications. Through research and exploration into the automation of IA data pairs using GPS references, the potential to increase the speed and volume at which high-quality IA datasets are created, while minimizing cost, can pave the way for robust vision-language-action (VLA) models to serve tasks in vision-language navigation (VLN) and human-interactive autonomous systems.