- The paper demonstrates that multimodal early fusion significantly outperforms single-modality models on benchmark tasks in the CARLA simulator.
- The paper employs Conditional Imitation Learning to map combined sensor data directly to driving actions, enhancing adaptability in complex scenarios.
- The paper explores cost-effective single-sensor configurations and future integration of additional modalities to improve system reliability and scalability.
Multimodal End-to-End Autonomous Driving
The paper "Multimodal End-to-End Autonomous Driving" investigates the potential enhancements in autonomous vehicle (AV) performance by leveraging multimodal sensor input in end-to-end driving systems. The central focus of this research is determining whether combining RGB camera data with depth information, typically acquired through LiDAR sensors, can improve the efficacy of AI-driven autonomous navigation compared to reliance on a singular modality. The paper further explores the effectiveness of RGBD data integration at various stages through early, mid, and late fusion strategies, with a particular emphasis on end-to-end models deployed in autonomous driving.
Core Concepts
The paper contrasts two primary methodologies for building AI drivers: modular pipelines and end-to-end approaches. Modular systems decompose driving into discrete sub-tasks, such as perception, path planning, and control, each of which typically requires its own annotated training data. End-to-end approaches instead map sensor data directly to driving actions without intermediate representations, which reduces the need for costly sub-task annotations. Previous work predominantly relied on RGB images alone; this paper tests the hypothesis that incorporating depth data improves performance.
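To make the contrast concrete, the sketch below shows what "mapping sensor data directly to driving actions" looks like in code: a single network trained end-to-end from camera frames to control values. This is a minimal PyTorch-style illustration; the layer sizes and the class name `EndToEndPolicy` are assumptions for exposition, not the paper's architecture.

```python
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    """Minimal end-to-end policy: raw camera frame in, controls out."""

    def __init__(self):
        super().__init__()
        # Perception and control are learned jointly; there is no explicit
        # detection, mapping, or planning stage in between.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 3),  # steering, throttle, brake
        )

    def forward(self, image):
        # image: (B, 3, H, W) camera frame -> (B, 3) control vector
        return self.head(self.encoder(image))
```

Training then reduces to supervised regression on expert driving demonstrations, for example minimizing an L1 loss between predicted and demonstrated controls.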
Methodological Framework
The research uses the CARLA simulator to conduct experiments under varied conditions, simulating urban driving environments with differing levels of complexity, dynamic obstacles, and weather. The paper adopts Conditional Imitation Learning (CIL), in which the model's output is conditioned on high-level navigation commands (for example, turn left at the next intersection). This allows the AV to adapt its driving strategy to the current navigation goal, such as turning at an intersection or navigating through traffic.
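The following is a minimal sketch of how command conditioning can be implemented: the high-level command indexes one of several control branches built on top of shared perception features, in the spirit of branched CIL. The branch count, feature width, and the class name `BranchedCILHead` are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BranchedCILHead(nn.Module):
    """Command-conditioned control head: the navigation command selects a branch."""

    def __init__(self, feat_dim=512, num_commands=4, num_actions=3):
        super().__init__()
        # One control branch per high-level command
        # (e.g. follow-lane, turn-left, turn-right, go-straight).
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                          nn.Linear(256, num_actions))
            for _ in range(num_commands)
        ])

    def forward(self, features, command):
        # features: (B, feat_dim) perception features from the image encoder
        # command:  (B,) integer index of the active high-level command
        outputs = torch.stack([b(features) for b in self.branches], dim=1)  # (B, C, A)
        idx = command.view(-1, 1, 1).expand(-1, 1, outputs.size(-1))
        return outputs.gather(1, idx).squeeze(1)  # (B, A) controls for that command
```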
The paper compares three fusion strategies for combining RGB and depth (a code sketch follows the list):
- Early Fusion: RGB and depth are concatenated at the input, forming a four-channel RGBD tensor that feeds the first layer of the neural network.
- Mid Fusion: Processes RGB and depth data separately in the initial convolutional layers before combining feature maps at a deeper network layer.
- Late Fusion: RGB and depth streams are kept separate through most of the network, combining only at the end to produce the driving command.
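The sketch below illustrates where each strategy merges the two modalities. All layer widths and class names are assumptions for illustration and do not reproduce the paper's exact networks.

```python
import torch
import torch.nn as nn

def conv_stem(in_ch):
    """Small convolutional encoder returning a flattened 64-dim feature vector."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=5, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class EarlyFusion(nn.Module):
    # Depth is appended as a fourth input channel before the first convolution.
    def __init__(self):
        super().__init__()
        self.net = conv_stem(in_ch=4)
        self.head = nn.Linear(64, 3)

    def forward(self, rgb, depth):
        return self.head(self.net(torch.cat([rgb, depth], dim=1)))

class MidFusion(nn.Module):
    # Separate convolutional stems; features are concatenated mid-network.
    def __init__(self):
        super().__init__()
        self.rgb_stem, self.depth_stem = conv_stem(3), conv_stem(1)
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, rgb, depth):
        return self.head(torch.cat([self.rgb_stem(rgb), self.depth_stem(depth)], dim=1))

class LateFusion(nn.Module):
    # Two full streams produce their own control estimates, merged only at the end.
    def __init__(self):
        super().__init__()
        self.rgb_stream = nn.Sequential(conv_stem(3), nn.Linear(64, 3))
        self.depth_stream = nn.Sequential(conv_stem(1), nn.Linear(64, 3))
        self.merge = nn.Linear(6, 3)

    def forward(self, rgb, depth):
        return self.merge(torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1))
```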
Experimental Insights
The empirical results show that multimodal early fusion outperforms single-modality models across the CARLA benchmark tasks, most notably in adverse weather and in scenarios with dynamic obstacles. The early-fusion model consistently achieves higher success rates at completing driving tasks than its RGB-only or depth-only counterparts, supporting the hypothesis that integrating depth information with visual data improves the system's environmental understanding and control precision.
The paper also investigates a single-sensor configuration in which depth maps are estimated from the RGB images themselves rather than measured by an active sensor. While this setup does not outperform the active-sensor configuration, it remains competitive and could reduce sensor dependencies and system cost in real-world applications.
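A rough sketch of how such a single-sensor pipeline could be wired: a monocular depth network predicts a depth map from the RGB frame, which then replaces sensor depth at the early-fusion input. The `depth_estimator` argument and the `estimate_depth` helper are hypothetical placeholders, not the paper's estimator.

```python
import torch

@torch.no_grad()
def estimate_depth(rgb, depth_estimator):
    """rgb: (B, 3, H, W) tensor; returns a (B, 1, H, W) predicted depth map."""
    depth = depth_estimator(rgb)      # any monocular depth model (assumed interface)
    if depth.dim() == 3:
        depth = depth.unsqueeze(1)    # ensure an explicit channel dimension
    # Normalise the prediction so its scale is comparable across frames.
    return (depth - depth.amin()) / (depth.amax() - depth.amin() + 1e-6)

# Illustrative use with the EarlyFusion sketch above:
# controls = policy(frame, estimate_depth(frame, depth_model))
```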
Theoretical and Practical Implications
These findings underscore the importance of multimodal approaches for improving the robustness of end-to-end autonomous driving systems. Such strategies open pathways for enhancing perceptual capabilities and decision-making processes in AVs by accurately capturing a wider array of environmental features.
The results suggest future work could focus on refining single-sensor models to minimize reliance on costly LiDAR systems, thus increasing accessibility and scalability of autonomous technologies. Additionally, incorporating other sensory modalities, such as RADAR or GNSS, could further enhance system reliability.
In conclusion, the paper provides significant insight into how multimodal input can advance end-to-end autonomous driving, proposing strategies that are both theoretically meaningful and practically relevant. Adopting such multimodal sensing could ultimately improve the safety and efficiency of autonomous vehicles deployed in real-world scenarios.