Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru (2503.07587v1)

Published 10 Mar 2025 in cs.CV, cs.AI, and cs.RO

Abstract: As multimodal foundational models start being deployed experimentally in self-driving cars, a reasonable question we ask ourselves is how similar to humans do these systems respond in certain driving situations -- especially those that are out-of-distribution? To study this, we create the Robusto-1 dataset that uses dashcam video data from Peru, a country with some of the most aggressive drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual Language Models (VLMs) compare to humans in driving, we move away from bounding boxes, segmentation maps, occupancy maps, and trajectory estimation to multimodal Visual Question Answering (VQA), comparing both humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we show in which cases VLMs and humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of questions asked to each type of system (humans vs. VLMs), highlighting a gap in their alignment.

Summary

Analysis of the Robusto-1 Dataset: Evaluating Human and VLM Responses in Autonomous Driving Scenarios

The paper under analysis presents the Robusto-1 dataset, a resource for understanding the differences between human perception and the responses of Vision-Language Models (VLMs) in the context of autonomous driving. The paper leverages out-of-distribution (OOD) driving scenarios from Peru, characterized by unusual traffic conditions and atypical street objects, to probe the cognitive alignment between humans and VLMs.

The Robusto-1 dataset moves beyond conventional methods like bounding boxes and trajectory estimation. Instead, it employs a Visual Question Answering (VQA) framework to assess the comprehension exhibited by both humans and VLMs under challenging driving conditions. By curating questions that test for reasoning beyond simple object recognition, the dataset allows for a deeper comparison at a cognitive level. Representational Similarity Analysis (RSA) is then employed to identify where human and machine representations align or diverge.
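RSA compares two systems not by matching their raw outputs but by correlating the *structure* of their pairwise response dissimilarities. A minimal sketch of that idea is below; the random embeddings, array shapes, and `rdm` helper are illustrative assumptions, not the paper's actual pipeline, which operates on embeddings of free-text VQA answers:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Hypothetical answer embeddings: one row per video clip.
rng = np.random.default_rng(0)
human_embeddings = rng.normal(size=(200, 64))  # e.g. 200 clips, 64-dim embeddings
vlm_embeddings = rng.normal(size=(200, 64))

def rdm(embeddings):
    """Representational dissimilarity matrix (condensed form):
    pairwise cosine distances between clip-level responses."""
    return pdist(embeddings, metric="cosine")

# RSA score: rank correlation between the two systems' RDMs.
# High rho means the two systems treat the same clips as similar/dissimilar.
rho, _ = spearmanr(rdm(human_embeddings), rdm(vlm_embeddings))
print(f"RSA (Spearman rho): {rho:.3f}")
```

Because RSA works on dissimilarity structure rather than the embeddings themselves, it can compare systems whose representations live in entirely different spaces, which is what makes human-versus-VLM comparison possible here.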

Strong Numerical Results and Bold Claims

  1. Structural Design of the Dataset:
    • Comprising 200 five-second video clips, captured using dashcams in Peru, the dataset is intended to expose AI systems to conditions far removed from those typically encountered in their training contexts. This expansive variety is a crucial factor in testing AI's robustness.
  2. VQA Methodology:
    • The study queries both humans and VLMs with 15 questions per clip, divided into variable, multiple-choice, and counterfactual categories. Each category is designed to assess different perceptual and cognitive abilities.
  3. Findings from RSA:
    • The research finds that alignment varies markedly with question type. VLMs exhibit a high degree of similarity in their responses, especially among themselves, whereas human responses show substantial variability: humans align closely on multiple-choice questions yet diverge on counterfactual and hypothetical ones, suggesting that human cognitive processing involves nuanced contextual interpretation beyond current VLM capabilities.
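The within-group agreement contrast described above can be quantified as mean pairwise similarity over answer embeddings for a given question. A small illustration with synthetic vectors follows; the offsets that make the "VLM" answers clustered and the "human" answers dispersed are assumptions standing in for real embedding data:

```python
import numpy as np

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of answer vectors."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = v @ v.T
    iu = np.triu_indices(len(v), k=1)  # upper triangle: each pair once
    return sim[iu].mean()

rng = np.random.default_rng(1)
# Hypothetical answer embeddings for one question, one row per respondent.
vlm_answers = rng.normal(size=(5, 32)) + 2.0    # clustered around a shared direction
human_answers = rng.normal(size=(5, 32)) * 3.0  # dispersed, no shared direction

print("VLM agreement:  ", mean_pairwise_cosine(vlm_answers))
print("Human agreement:", mean_pairwise_cosine(human_answers))
```

Computed per question category, such an agreement score makes the paper's qualitative finding (high VLM-VLM similarity, human divergence on counterfactuals) directly measurable.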

Implications

Practical Implications

  • The paper positions the Robusto-1 dataset as a critical tool for testing and enhancing the robustness of VLMs in autonomous driving systems in real-world scenarios that are inherently unpredictable. It can drive improvements in VLM performance by providing training data that are highly reflective of the unexpected events typical of diverse driving environments.

Theoretical Implications

  • The significant mismatch in cognitive alignment between humans and VLMs signals a gap in AI systems' ability to generalize understanding in complex scenarios. This underscores the necessity of integrating more nuanced cognitive models and reasoning capabilities into AI training processes.

Future Developments

  • Future research could focus on integrating advanced behavioral analysis, such as eye-tracking or neural activation studies, with human subjects to develop models that more closely imitate human-like cognition and decision-making processes. Such methodologies may prove to be transformative in developing VLMs capable of safely navigating diverse real-world environments.

In conclusion, the analysis of the Robusto-1 dataset not only highlights the shortcomings of current VLMs' decision-making under OOD scenarios, but also serves as a pivotal step toward developing more resilient and human-aligned autonomous systems. By extending testing beyond controlled environments into the real-world chaos found in places like Peru, this research advances the path toward truly adaptive autonomous systems.