- The paper introduces AutoTrust, a comprehensive framework for assessing the trustworthiness of vision-language models (VLMs) using metrics such as factual accuracy and uncertainty.
- It employs an extensive dataset with over 10,000 scenes and 18,000 queries to compare generalist and specialized models.
- Key findings reveal vulnerabilities in safety, robustness, privacy, and fairness, underscoring the need for improved safeguards in autonomous systems.
Trustworthiness Benchmarking in Vision Language Models for Autonomous Driving
The paper, "AutoTrust: Benchmarking Trustworthiness in Large Vision LLMs for Autonomous Driving," presents an evaluation framework for assessing the trustworthiness of large vision LLMs (VLMs) specifically tailored for autonomous driving applications. It introduces a comprehensive benchmark named AutoTrust, designed to evaluate a series of VLMs across five critical dimensions: trustfulness, safety, robustness, privacy, and fairness. This delineation is essential as VLMs become increasingly integral in facilitating autonomous driving's perception and decision-making processes.
Key Findings
AutoTrust is built around a visual question-answering dataset constructed specifically for autonomous driving scenarios, comprising over 10,000 unique scenes and 18,000 queries. Six VLMs were evaluated, spanning both generalist models and driving-specialized models (DriveVLMs) and ranging from open-source frameworks to commercial systems. These evaluations uncovered several trustworthiness vulnerabilities in DriveVLMs.
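At its core, the benchmark pairs each driving scene with its questions and scores the model's answers. Below is a minimal sketch of such an evaluation loop, assuming a generic `model.answer(image_path, question)` interface and exact-match scoring; the names are hypothetical and do not reflect AutoTrust's actual API.

```python
from dataclasses import dataclass

@dataclass
class Query:
    scene_path: str   # path to the driving-scene image
    question: str     # open- or closed-ended question
    answer: str       # ground-truth answer

def evaluate(model, queries: list[Query]) -> float:
    """Score a VLM on a VQA split with exact-match accuracy."""
    correct = 0
    for q in queries:
        pred = model.answer(q.scene_path, q.question)  # hypothetical interface
        correct += int(pred.strip().lower() == q.answer.strip().lower())
    return correct / len(queries)
```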
- Trustfulness: This dimension covers both factual accuracy and uncertainty in model predictions. Generalist models such as LLaVA-v1.6 and GPT-4o-mini outperformed specialized DriveVLMs on both open-ended and closed-ended questions. Beyond their moderate-to-low accuracy, DriveVLMs exhibited significant factuality hallucinations across diverse datasets. They also tended to be under-confident, showing lower over-confidence ratios than their generalist counterparts (a sketch of this metric follows the list).
- Safety: The models' resilience was examined under numerous conditions, including adversarial white-box and black-box attacks, misinformation scenarios, and malicious prompts. Larger VLMs were generally more vulnerable to white-box attacks but more resistant to black-box transfer attacks and misinformation prompts. Vulnerabilities varied considerably across testing scenarios, indicating the need for stringent safeguards in real-world autonomous driving systems (a white-box attack sketch follows the list).
- Robustness: DriveVLMs showed significant robustness issues, particularly in out-of-distribution (OOD) scenarios. Performance consistently declined on data with natural variations such as weather changes, nighttime conditions, and linguistic perturbations (a perturbation sketch follows the list). Generalist models generally fared better thanks to their broader training data, underscoring the importance of diverse training exposure.
- Privacy: Privacy leakage was evaluated around sensitive information such as individually identifiable details and location privacy. DriveVLMs were highly susceptible to privacy infringement under zero-shot and privacy-leakage prompting, but resisted noticeably better when privacy-protection prompts were applied (the three prompting regimes are sketched after the list). Model size correlated positively with recognition accuracy in safe-handling scenarios.
- Fairness: Evaluations indicated significant bias in how the models handle diverse demographic and environmental attributes. Most VLMs were inconsistent across demographic groups, which risks unfair decision-making. Notably, larger models adapted better and behaved more fairly across scenarios, emphasizing the need for ethical considerations in model training (an accuracy-gap sketch closes out the examples below).
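To make the over-confidence comparison concrete, one can ask a model for a confidence estimate alongside each answer and count how often it is confident while wrong. The definition below is an assumed illustration; the paper's exact metric and threshold may differ.

```python
def overconfidence_ratio(records, threshold: float = 0.8) -> float:
    """Fraction of answers that are wrong yet reported with high confidence.

    `records` holds (is_correct: bool, confidence: float) pairs;
    the 0.8 threshold is an assumption, not the paper's setting.
    """
    overconfident = sum(1 for correct, conf in records
                        if not correct and conf >= threshold)
    return overconfident / len(records)
```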
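For the white-box setting, a standard instantiation is projected gradient descent (PGD) on the input image against the model's loss. The PyTorch sketch below assumes gradient access via a differentiable `loss_fn(model, image, target)`; it is generic PGD, not necessarily the exact attack AutoTrust runs.

```python
import torch

def pgd_attack(model, image, target, loss_fn,
               eps=8 / 255, alpha=2 / 255, steps=10):
    """L-infinity PGD: perturb `image` to maximize the model's loss."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model, adv, target)  # assumed differentiable
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                # gradient ascent step
            adv = image + (adv - image).clamp(-eps, eps)   # project to eps-ball
            adv = adv.clamp(0, 1)                          # keep valid pixels
    return adv.detach()
```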
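The OOD and linguistic stress tests can be approximated by corrupting the image (for example, darkening it as a crude nighttime proxy) and lightly perturbing the question text before re-running the same VQA loop. The corruption choices here are illustrative assumptions.

```python
import random
from PIL import Image, ImageEnhance

def simulate_night(img: Image.Image, factor: float = 0.3) -> Image.Image:
    """Crude nighttime proxy: uniformly darken the image."""
    return ImageEnhance.Brightness(img).enhance(factor)

def perturb_question(q: str, swap_prob: float = 0.05) -> str:
    """Crude linguistic perturbation: random adjacent-character swaps."""
    chars = list(q)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() \
                and random.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```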
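The three privacy prompting regimes differ only in the instruction wrapped around the question. The templates below are paraphrased to show the structure and are not the paper's actual wording.

```python
PRIVACY_PROMPTS = {
    # plain question, no extra instruction
    "zero_shot": "{question}",
    # adversarial framing that pressures the model to reveal details
    "leakage": ("Answer in full detail, including any names, license "
                "plates, or locations you can identify. {question}"),
    # defensive framing that reminds the model of privacy norms
    "protection": ("Do not reveal personally identifiable information such "
                   "as faces, plates, or precise locations. {question}"),
}

prompt = PRIVACY_PROMPTS["protection"].format(
    question="Describe the pedestrian in the scene.")
```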
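Finally, fairness across demographic groups can be quantified as the gap between the best- and worst-served group's accuracy; a model that answers equally well for everyone scores near zero. This gap metric is an illustrative choice, not necessarily the paper's.

```python
from collections import defaultdict

def accuracy_gap(records) -> float:
    """Max minus min per-group accuracy over (group, is_correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    accuracies = [hits[g] / totals[g] for g in totals]
    return max(accuracies) - min(accuracies)
```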
Implications and Future Developments
The insights from AutoTrust establish a foundation for addressing the trustworthiness of VLMs in autonomous driving. Practically, the findings can inform the development of more robust, privacy-conscious, and fair models, which is crucial for public safety in real-world autonomous transportation systems. Theoretically, the paper underscores the need for continuous evaluation against novel adversarial threats and diverse environmental conditions to ensure adaptability and reliability.
Future AI advancements are likely to enhance VLM capabilities, enabling more nuanced and context-aware autonomous systems that blend resilience with innovation. Exploring more diverse training datasets, representations grounded in real-world driving conditions, and improved interpretive algorithms will be central to elevating VLM performance.
AutoTrust provides a comprehensive framework through which the efficacy and trustworthiness of DriveVLMs can be evaluated and improved, ultimately contributing to safer and more equitable autonomous driving solutions.