Multimodal Framework for Explainable Autonomous Driving: Integrating Video, Sensor, and Textual Data for Enhanced Decision-Making and Transparency (2507.07938v1)

Published 10 Jul 2025 in cs.MM

Abstract: Autonomous vehicles (AVs) are poised to redefine transportation by enhancing road safety, minimizing human error, and optimizing traffic efficiency. The success of AVs depends on their ability to interpret complex, dynamic environments through diverse data sources, including video streams, sensor measurements, and contextual textual information. However, seamlessly integrating these multimodal inputs and ensuring transparency in AI-driven decisions remain formidable challenges. This study introduces a novel multimodal framework that synergistically combines video, sensor, and textual data to predict driving actions while generating human-readable explanations, fostering trust and regulatory compliance. By leveraging VideoMAE for spatiotemporal video analysis, a custom sensor fusion module for real-time data processing, and BERT for textual comprehension, our approach achieves robust decision-making and interpretable outputs. Evaluated on the BDD-X (21113 samples) and nuScenes (1000 scenes) datasets, our model reduces training loss from 5.7231 to 0.0187 over five epochs, attaining an action prediction accuracy of 92.5% and a BLEU-4 score of 0.75 for explanation quality, outperforming state-of-the-art methods. Ablation studies confirm the critical role of each modality, while qualitative analyses and human evaluations highlight the model's ability to produce contextually rich, user-friendly explanations. These advancements underscore the transformative potential of multimodal integration and explainability in building safe, transparent, and trustworthy AV systems, paving the way for broader societal adoption of autonomous driving technologies.
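
For readers unfamiliar with the explanation-quality metric quoted in the abstract, the following is a minimal sketch of sentence-level BLEU-4 against a single reference. It is an illustration only: the abstract does not state which BLEU implementation, tokenization, or smoothing the authors used, and published scores are usually computed at corpus level. The function names `ngrams` and `bleu4` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with one reference and no smoothing.

    Assumptions (not taken from the paper): lowercase whitespace
    tokenization, uniform weights over 1- to 4-grams, and a score of 0.0
    whenever any n-gram order has no match.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the four modified precisions.
    return bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    reference = "the car slows down because the light is red"
    candidate = "the car slows down because the light is green"
    print(f"BLEU-4: {bleu4(candidate, reference):.4f}")
```

Because this sketch applies no smoothing, any candidate shorter than four tokens scores 0.0, which is one reason sentence-level scores computed this way are not directly comparable to a reported corpus-level figure.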
