
DriveSOTIF: Advancing Perception SOTIF Through Multimodal Large Language Models (2505.07084v1)

Published 11 May 2025 in cs.RO

Abstract: Human drivers naturally possess the ability to perceive driving scenarios, predict potential hazards, and react instinctively due to their spatial and causal intelligence, which allows them to perceive, understand, predict, and interact with the 3D world both spatially and temporally. Autonomous vehicles, however, lack these capabilities, leading to challenges in effectively managing perception-related Safety of the Intended Functionality (SOTIF) risks, particularly in complex and unpredictable driving conditions. To address this gap, we propose an approach that fine-tunes multimodal LLMs (MLLMs) on a customized dataset specifically designed to capture perception-related SOTIF scenarios. Model benchmarking demonstrates that this tailored dataset enables the models to better understand and respond to these complex driving situations. Additionally, in real-world case studies, the proposed method correctly handles challenging scenarios that even human drivers may find difficult. Real-time performance tests further indicate the potential for the models to operate efficiently in live driving environments. This approach, along with the dataset generation pipeline, shows significant promise for improving the identification, cognition, prediction, and reaction to SOTIF-related risks in autonomous driving systems. The dataset and information are available: https://github.com/s95huang/DriveSOTIF.git


Summary

Advancing Safety of the Intended Functionality through Multimodal LLMs in Autonomous Driving

The paper "DriveSOTIF: Advancing Perception SOTIF Through Multimodal LLMs" presents a novel approach to enhancing the Safety of the Intended Functionality (SOTIF) of autonomous driving systems through the application of Multimodal LLMs (MLLMs). Recognizing that human drivers perceive, predict, and respond to complex and dynamic driving environments with relative ease, the research seeks to close this capability gap in autonomous vehicles by fine-tuning MLLMs on perception-related SOTIF scenarios.

Model and Methodology

The authors propose a specialized dataset and a fine-tuning process for MLLMs to address perception-related SOTIF risks in autonomous driving. They introduce the DriveSOTIF dataset, which is tailored to capture the nuances of safety-critical driving scenarios and enables the models to better understand and react to complex driving situations that pose risks because of the perception limitations inherent in autonomous systems.
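
The page does not reproduce the dataset schema, but a captioning-plus-VQA corpus of this kind could be organized roughly as follows; the field names and example values in this sketch are illustrative assumptions, not the actual DriveSOTIF format.

```python
# Hypothetical record layout for a perception-SOTIF captioning/VQA sample.
# Field names and values are illustrative only, not the DriveSOTIF schema.
sample = {
    "image": "scenes/rain_night_0041.jpg",  # camera frame of the scenario
    "caption": "Heavy rain at night; a pedestrian partially occluded by a parked van.",
    "qa_pairs": [
        {
            "question": "What perception-related SOTIF risk is present?",
            "answer": "Low visibility and occlusion may delay pedestrian detection.",
        },
        {
            "question": "How should the ego vehicle respond?",
            "answer": "Reduce speed and widen the monitored gap around the van.",
        },
    ],
}
```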

The process involves fine-tuning MLLMs on the DriveSOTIF dataset using Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA). These approaches are chosen for their ability to adapt large-scale models with reduced computational overhead, making them suitable for real-world applications where system resources are constrained.
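
As a rough sketch of this kind of setup (not the authors' exact configuration), QLoRA-style adaptation of an open multimodal model can be wired up with the Hugging Face transformers and peft libraries; the base checkpoint, rank, and target modules below are assumptions chosen for illustration.

```python
import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load a base multimodal model in 4-bit precision (QLoRA-style); the checkpoint
# name is an assumption, not necessarily the one used in the paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# The processor prepares image-text pairs during fine-tuning and inference.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Attach low-rank adapters to the attention projection layers; rank, alpha,
# and target modules are illustrative hyperparameters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are updated
```

Because only the low-rank adapter weights are trained while the quantized base model stays frozen, memory and compute costs remain modest, which is the main appeal of PEFT for resource-constrained settings like this one.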

Evaluation and Results

The fine-tuned models demonstrated improved performance over baseline models in both image captioning and visual question answering (VQA) tasks. The fine-tuned BLIP-2 6.7B model achieved significant gains in metrics such as ROUGE-L, CIDEr, and SPICE, indicating enhancements in generating rich, accurate, and contextually relevant descriptions of driving scenarios. For VQA, the LLaVA-1.5 model, after fine-tuning, showed marked improvements, notably in the BLEU-4 score, which increased by 146.95%.
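
For context on how such scores are typically computed, the snippet below scores generated text against references with the Hugging Face evaluate library; it is a generic scoring sketch rather than the authors' pipeline, and CIDEr/SPICE would require the separate pycocoevalcap toolkit.

```python
import evaluate

# Example prediction and reference texts; these are placeholders,
# not samples from the DriveSOTIF dataset.
predictions = ["a pedestrian crosses in heavy rain near a stopped bus"]
references = [["a pedestrian is crossing the wet road in front of a stopped bus"]]

# BLEU-4 (max_order=4) and ROUGE-L are the n-gram / longest-common-subsequence
# metrics cited in the results; higher is better for both.
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
```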

A real-world case study further validated the efficacy of the proposed approach. Fine-tuned MLLMs adeptly handled complex scenarios involving adverse weather conditions and unexpected road objects, situations that typically challenge conventional autonomous driving perception systems. The insights from the model's responses showcase its potential utility in real-time driving environments, where perception-related risks are prevalent.

Implications

The integration of MLLMs into SOTIF risk assessment and mitigation processes provides a framework for improving the safety and reliability of autonomous driving systems. By equipping these systems with enhanced perception and reasoning capabilities, the research addresses critical gaps in SOTIF, particularly in environments characterized by uncertainty and unpredictability.

The implications of this research are twofold:

  1. Practical: It offers a pathway for deploying advanced AI systems in the field of autonomous vehicles, allowing for more responsive and adaptive navigation strategies.
  2. Theoretical: It lays the groundwork for further exploration into the application of MLLMs in safety-critical autonomous driving applications, paving the way for future research into AI-driven risk assessment and decision-making mechanisms.

Future Directions

Future research could expand upon this work by integrating additional sensor modalities such as LiDAR and radar data into the MLLM framework. Another avenue for exploration involves developing methods to reduce model hallucinations and improve the interpretability of AI systems in decision-critical contexts. The adaptation of lightweight models optimized for deployment in embedded systems also holds promise for enhancing the operational efficiency of autonomous driving platforms.

Overall, the insights and methodologies presented in this paper highlight the potential for MLLMs to transform how autonomous vehicles perceive and respond to their environments, ultimately contributing to safer roadways and more reliable autonomous systems.
