- The paper introduces the Impromptu VLA Dataset, an expansive collection of annotated video clips designed to improve Vision-Language-Action models by focusing on challenging unstructured driving scenarios.
- Experiments demonstrate that VLA models trained on Impromptu VLA show significant performance improvements in closed-loop metrics like NeuroNCAP scores and collision rates, indicating safer driving policies.
- This dataset addresses a critical gap in driving data for unstructured environments, providing the data needed to train robust autonomous systems capable of navigating complex, unpredictable real-world conditions.
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
The paper "Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models" presents a significant advancement to address challenges faced by autonomous driving systems in unstructured corner case scenarios. Here, the authors introduce the Impromptu VLA Dataset, an expansive collection designed to bolster Vision-Language-Action (VLA) models by focusing on four critical types of unstructured scenarios: roads with unclear boundaries, temporary traffic rule changes, unconventional dynamic obstacles, and challenging road conditions.
Technical Contributions
- Dataset Creation: The Impromptu VLA Dataset comprises over 80,000 video clips distilled from an initial pool of more than two million clips drawn from eight large-scale open-source datasets. Each clip is annotated with planning-oriented question-answering pairs and action trajectories, covering scene understanding, prediction, meta-planning, and trajectory planning; a sketch of what one annotated sample might look like appears after this list.
- Enhanced Model Performance: VLA models trained on the Impromptu VLA Dataset show marked gains on established benchmarks, achieving higher NeuroNCAP scores and lower collision rates in closed-loop evaluations while remaining competitive with state-of-the-art models in open-loop trajectory prediction accuracy.
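To make the annotation format concrete, here is a minimal sketch of what one annotated sample might contain. The field names and structure are illustrative assumptions derived from the four annotation tasks named above, not the dataset's published schema:

```python
# Hypothetical structure of one Impromptu VLA sample; field names are
# illustrative assumptions based on the annotation tasks described in
# the paper, not the dataset's actual schema.
from dataclasses import dataclass, field


@dataclass
class ImpromptuSample:
    clip_id: str                      # source clip identifier
    source_dataset: str               # one of the eight open-source datasets
    category: str                     # e.g. "unclear_road_boundaries"
    frames: list[str]                 # paths to the clip's video frames
    qa_pairs: list[dict]              # planning-oriented Q&A pairs, e.g.
                                      # {"task": "scene_understanding",
                                      #  "question": "...", "answer": "..."}
    trajectory: list[tuple[float, float]] = field(default_factory=list)
                                      # future ego waypoints (x, y) in meters


sample = ImpromptuSample(
    clip_id="clip_000123",
    source_dataset="example_source",  # hypothetical value
    category="temporary_traffic_rule_change",
    frames=["frames/000.jpg", "frames/001.jpg"],
    qa_pairs=[{"task": "meta_planning",
               "question": "What should the ego vehicle do next?",
               "answer": "Slow down and merge left around the cone closure."}],
    trajectory=[(1.2, 0.0), (2.3, 0.1), (3.5, 0.3)],
)
```

Structuring each clip this way lets a single sample serve all four VLA tasks, since the question-answering pairs and the action trajectory share one scene context.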
Experimental Findings
The experimental setup rigorously evaluates the benefits of pre-training models on the Impromptu VLA Dataset followed by fine-tuning on the nuScenes dataset. Key results include:
- Closed-loop Improvements: On the NeuroNCAP benchmark, pre-training on the dataset yields a substantially higher average score together with a marked reduction in collision rate. This suggests the model develops a more nuanced understanding of complex road interactions, leading to safer driving policies (see the scoring sketch after this list).
- Open-loop Performance: In trajectory prediction, models fine-tuned with Impromptu VLA data achieve L2 errors close to those of leading methods, despite training on less data than competitors that rely on larger, proprietary datasets (an L2 computation sketch also follows this list).
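For context on the closed-loop metric: the NeuroNCAP score is commonly described as a 5-point scale awarding full marks for fully avoiding a collision and partial credit proportional to how much the impact speed is reduced. A minimal sketch under that assumption (the constants follow the commonly cited formulation from the NeuroNCAP paper; verify against the official benchmark code before relying on it):

```python
def neuroncap_score(collided: bool, impact_speed: float,
                    reference_speed: float) -> float:
    """Sketch of the NeuroNCAP scoring rule: 5.0 for full collision
    avoidance, otherwise up to 4.0 points scaled by the relative
    reduction in impact speed. Constants reflect the commonly cited
    formulation, an assumption here; check the benchmark for the
    exact rule."""
    if not collided:
        return 5.0
    return 4.0 * max(0.0, 1.0 - impact_speed / reference_speed)
```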
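The open-loop L2 metric averages the Euclidean distance between predicted and ground-truth ego waypoints, typically reported at 1 s, 2 s, and 3 s horizons on nuScenes. A minimal NumPy sketch, assuming the common nuScenes protocol of 2 waypoints per second and averaging errors up to each horizon (conventions vary between papers):

```python
import numpy as np


def l2_errors(pred: np.ndarray, gt: np.ndarray, hz: int = 2) -> dict[str, float]:
    """Average L2 error at 1s/2s/3s horizons.

    pred, gt: (T, 2) arrays of future ego waypoints in meters, sampled
    at `hz` waypoints per second (2 Hz is assumed here, matching the
    common nuScenes planning protocol)."""
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-waypoint L2 distance
    return {f"{t}s": float(dists[: t * hz].mean()) for t in (1, 2, 3)}


# Example: six waypoints covering a 3-second horizon at 2 Hz.
pred = np.array([[1.0, 0.0], [2.1, 0.1], [3.0, 0.2],
                 [4.2, 0.4], [5.1, 0.5], [6.3, 0.8]])
gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.1],
               [4.0, 0.3], [5.0, 0.5], [6.0, 0.7]])
print(l2_errors(pred, gt))  # {'1s': ..., '2s': ..., '3s': ...}
```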
Implications and Future Directions
The Impromptu VLA Dataset fills a critical gap in driving data resources focused on unstructured environments, pushing VLA models beyond structured urban settings. It contributes to autonomous driving research by providing the data needed to train robust models capable of navigating complex, unpredictable scenarios.
Looking forward, the dataset's scope could be expanded to cover additional environmental contexts, or combined with synthetic data to simulate rare yet safety-critical driving conditions. Research could also pair this augmented VLA paradigm with multi-modal sensor data to give autonomous vehicles more comprehensive situational awareness.
In conclusion, the paper presents a well-curated dataset that not only enhances model accuracy in challenging driving conditions but also serves as a crucial diagnostic tool for evaluating perception, prediction, and planning capabilities. This supports the broader aim of developing dependable autonomous driving systems equipped to handle real-world complexities.