CarLLaVA: Vision language models for camera-only closed-loop driving

Published 14 Jun 2024 in cs.CV and cs.RO | (2406.10165v1)

Abstract: In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0 outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.

Summary

The paper introduces CarLLaVA, a framework that leverages vision-language pretraining to enable efficient camera-only closed-loop autonomous driving.
The model’s semi-disentangled architecture combines time-conditioned waypoint and space-conditioned path predictions to enhance steering and collision avoidance.
Experimental results show first place in the sensor track of the CARLA Autonomous Driving Challenge 2.0, with a 32.6% improvement over the best concurrent submission and 458% over the previous state of the art.

Vision-Language Models for Autonomous Driving: An Examination of CarLLaVA

Introduction

Vision language models (VLMs) are increasingly applied to autonomous driving, where they combine visual and linguistic inputs. The paper "CarLLaVA: Vision language models for camera-only closed-loop driving" (2406.10165) introduces CarLLaVA, a framework developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses vision-language pre-training to drive end to end from camera input alone, without the need for complex or expensive labels.

Model Architecture

CarLLaVA uses the vision encoder of the LLaVA VLM together with the LLaMA architecture as its backbone. Its output is a semi-disentangled representation that predicts both a path and waypoints: the path serves lateral control (steering) and the waypoints serve longitudinal control (speed). The vision encoder is pre-trained on internet-scale data, in contrast to ResNet-style backbones pre-trained on ImageNet.

Figure 1: CarLLaVA base model architecture (C1T1). The images are split in two, and each split is independently encoded and then concatenated, downsampled and projected into a pre-trained LLM. The output utilises a semi-disentangled representation with both time-conditioned waypoints and space-conditioned path waypoints for improved lateral control.
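
The data flow in the caption (split, encode each half, concatenate, downsample, project) can be illustrated with a toy, shape-level sketch. Everything here is an assumption made for illustration: the function names, the mean-pooling "encoder", the patch size, and the projection weights are hypothetical stand-ins and are not taken from the CarLLaVA or LLaVA code.

```python
# Toy, shape-level sketch of the Figure 1 data flow.
# All helpers are hypothetical stand-ins, not the real vision encoder.

def split_in_two(image):
    """Split an H x W image (list of rows) into left and right halves."""
    mid = len(image[0]) // 2
    return [row[:mid] for row in image], [row[mid:] for row in image]

def encode(view, patch=2):
    """Stand-in encoder: one 1-D token (mean intensity) per patch x patch block."""
    tokens = []
    for r in range(0, len(view), patch):
        for c in range(0, len(view[0]), patch):
            block = [view[i][j]
                     for i in range(r, r + patch)
                     for j in range(c, c + patch)]
            tokens.append([sum(block) / len(block)])
    return tokens

def downsample(tokens, factor=2):
    """Average every `factor` consecutive tokens to shorten the sequence."""
    out = []
    for k in range(0, len(tokens), factor):
        group = tokens[k:k + factor]
        dim = len(group[0])
        out.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return out

def project(tokens, weight):
    """Linear projection to the LLM embedding size (weight: in_dim x out_dim)."""
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(len(weight[0]))]
            for t in tokens]

image = [[float(r * 8 + c) for c in range(8)] for r in range(4)]  # 4 x 8 toy image
left, right = split_in_two(image)                 # two 4 x 4 views
tokens = encode(left) + encode(right)             # encode independently, then concatenate
tokens = downsample(tokens)                       # 8 tokens -> 4 tokens
llm_inputs = project(tokens, [[1.0, 0.5, -1.0]])  # 1-D tokens -> 3-D embeddings
```

In this sketch, downsampling shortens the token sequence before it reaches the language model; the real model's token counts and embedding sizes are not given in this summary.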

Training Methodology

The training recipe avoids spending compute on easy, trivial driving data and concentrates on challenging scenarios. CarLLaVA needs no complex or expensive labels; it trains on camera images and easily obtainable driving trajectory data. Training draws from buckets of interesting samples, which makes each epoch more informative and lowers compute cost.
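
One way to picture bucket-based sampling is the minimal sketch below. It is an assumption-laden illustration, not the paper's recipe: the bucket names, the per-bucket fractions, and the `sample_epoch` helper are all hypothetical.

```python
import random

def sample_epoch(buckets, fractions, seed=0):
    """Build one epoch by keeping `fractions[name]` of each bucket.

    `buckets` maps a bucket name to a list of sample ids; `fractions` maps
    the same names to the share of that bucket used per epoch.
    Both the names and the shares are hypothetical.
    """
    rng = random.Random(seed)
    epoch = []
    for name, samples in buckets.items():
        keep = round(len(samples) * fractions[name])
        epoch.extend(rng.sample(samples, keep))  # sample without replacement
    rng.shuffle(epoch)
    return epoch

buckets = {
    "straight_cruising": [f"easy_{i}" for i in range(1000)],  # trivial data
    "hard_braking": [f"brake_{i}" for i in range(100)],
    "sharp_steering": [f"steer_{i}" for i in range(100)],
}
fractions = {"straight_cruising": 0.1, "hard_braking": 1.0, "sharp_steering": 1.0}

epoch = sample_epoch(buckets, fractions)
print(len(epoch))  # 300 samples instead of the full 1200
```

With these made-up numbers the epoch is a quarter of the full dataset, yet every hard sample is still seen once per epoch.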

Experimental Results

CarLLaVA ranked first in the sensor track of the CARLA Autonomous Driving Challenge 2.0. It outperforms the previous state of the art by 458% and the best concurrent submission by 32.6%, which supports its design choices for closed-loop driving.

Moreover, within the semi-disentangled framework, taking lateral control from the path prediction rather than from the waypoints improved steering behavior and reduced collisions in the simulated environments.
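
The split between lateral and longitudinal control can be sketched as follows. This is a simplified illustration under stated assumptions, not the controller used in the paper: `target_speed`, `steering_angle`, the lookahead distance, the time step, and the example coordinates are all hypothetical.

```python
import math

def target_speed(waypoints, dt):
    """Longitudinal cue: waypoints are time-conditioned, so spacing / dt is a speed."""
    (x0, y0), (x1, y1) = waypoints[0], waypoints[1]
    return math.hypot(x1 - x0, y1 - y0) / dt

def steering_angle(path, lookahead):
    """Lateral cue: aim at the first path point at least `lookahead` metres away."""
    for x, y in path:
        if math.hypot(x, y) >= lookahead:
            return math.atan2(y, x)
    x, y = path[-1]  # fall back to the farthest point on a short path
    return math.atan2(y, x)

# Ego frame: x forward, y left, in metres. Values are made up for illustration.
path = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.5), (4.0, 1.5)]  # space-conditioned points
waypoints = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]         # one point every dt seconds

speed = target_speed(waypoints, dt=0.25)      # 0.5 m / 0.25 s = 2.0 m/s
angle = steering_angle(path, lookahead=3.0)   # heading toward (3.0, 0.5)
```

One plausible reading of the design is that time-conditioned waypoints bunch together at low speed, which makes them a weak steering signal, while path points keep their spatial spacing regardless of speed.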

Figure 2: Qualitative examples of generated language. Red: predicted path, Green: predicted waypoints, Blue: Target Points.

Discussion

The paper also reports preliminary results on generating language commentary alongside the driving output, a step toward broader multi-modal capabilities in autonomous vehicles. Because CarLLaVA avoids expensive label requirements while benefiting from vision-language pre-training, its design is a practical candidate for scaling to large driving datasets.

Conclusion

CarLLaVA represents a sophisticated approach to integrating vision and language in autonomous driving. By effectively circumventing the need for expensive sensors and extensive labeled datasets, CarLLaVA offers an efficient, high-performance paradigm for end-to-end driving systems. Moving forward, further exploration of multi-camera and temporal data integration will be crucial for addressing outstanding challenges in high-speed maneuvers and rear-end collision avoidance in complex driving scenarios.