CarLLaVA: Vision language models for camera-only closed-loop driving (2406.10165v1)

Published 14 Jun 2024 in cs.CV and cs.RO

Abstract: In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0 outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.

Citations (6)

Summary

  • The paper introduces a camera-only driving approach that achieves a 458% performance boost over previous methods on the CARLA simulator.
  • It presents a semi-disentangled output design combining time-conditioned and space-conditioned waypoints for precise longitudinal and lateral control.
  • The study demonstrates a cost-effective training strategy using high-resolution image patches and curated challenging driving scenarios.

CarLLaVA: Vision Language Models for Camera-Only Closed-Loop Driving

The paper "CarLLaVA: Vision Language Models for Camera-Only Closed-Loop Driving" presents an end-to-end autonomous driving system that relies purely on camera input. The approach achieves state-of-the-art closed-loop performance on the CARLA driving simulator while emphasizing efficiency and cost-effectiveness.

CarLLaVA leverages the vision encoder of the LLaVA Vision Language Model (VLM) and the LLaMA architecture as its backbone. On the sensor track of the CARLA Autonomous Driving Challenge 2.0, it ranked first, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.

Key Approach and Innovations

The methodology introduces several key components that advance camera-only autonomous driving:

  1. Camera-Only Input and No Expensive Labels:
    • CarLLaVA relies solely on camera images, eliminating the need for labels such as BEV semantics, depth, or segmentation maps, which are typically expensive and labor-intensive to obtain. This makes the approach more scalable and economically viable for real-world application.
  2. Semi-Disentangled Representation:
    • The authors propose a semi-disentangled output representation that combines time-conditioned waypoints for longitudinal control with space-conditioned path waypoints for lateral control. This lets the model execute accurate lateral and longitudinal maneuvers without heuristic post-processing of a single entangled output (a minimal sketch of this split follows the list below).
  3. Vision-Language Pretraining and High-Resolution Input:
    • CarLLaVA uses the vision encoder of LLaVA-NeXT, pre-trained on internet-scale vision-language data, harnessing the strong feature extraction capabilities of VLMs. Input images are divided into high-resolution patches, improving the recognition of fine details crucial for driving, such as distant traffic lights (see the patching sketch after this list).
  4. Efficient Training Recipe:
    • The training strategy reduces the compute required by focusing on non-trivial data samples. The training data is curated for variety and weighted toward challenging, rare situations rather than redundant, easy driving segments (see the sampling sketch after this list).
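
To make the output split concrete, the following is a minimal PyTorch-style sketch of how time-conditioned waypoints and space-conditioned path points could feed separate longitudinal and lateral controllers. The hidden size, point counts, head design, and controller logic are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SemiDisentangledHead(nn.Module):
    """Illustrative sketch only: dimensions and head design are assumptions,
    not CarLLaVA's exact architecture."""

    def __init__(self, hidden_dim=768, n_waypoints=8, n_path_points=10):
        super().__init__()
        self.n_waypoints = n_waypoints
        self.n_path_points = n_path_points
        # Time-conditioned waypoints: (x, y) at fixed future time steps -> longitudinal control.
        self.waypoint_head = nn.Linear(hidden_dim, n_waypoints * 2)
        # Space-conditioned path: (x, y) at fixed distances along the route -> lateral control.
        self.path_head = nn.Linear(hidden_dim, n_path_points * 2)

    def forward(self, features):
        # features: (batch, hidden_dim) pooled output of the VLM backbone
        waypoints = self.waypoint_head(features).view(-1, self.n_waypoints, 2)
        path = self.path_head(features).view(-1, self.n_path_points, 2)
        return waypoints, path


def control_from_outputs(waypoints, path, dt=0.25, lookahead_idx=2):
    """Hypothetical controller split: target speed from the temporal spacing
    of the waypoints, steering toward a lookahead point on the spatial path."""
    deltas = waypoints[:, 1:] - waypoints[:, :-1]
    target_speed = deltas.norm(dim=-1).mean(dim=-1) / dt          # (batch,) in m/s
    lookahead = path[:, lookahead_idx]                            # (batch, 2) point ahead of the ego
    steer_angle = torch.atan2(lookahead[:, 1], lookahead[:, 0])   # heading-error proxy for steering
    return target_speed, steer_angle
```

The key point the sketch illustrates is that the spatial path is insensitive to predicted speed, so lateral control stays stable even when the vehicle slows down, while the timed waypoints still carry the speed signal.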
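The high-resolution input can likewise be illustrated with a small sketch that crops a wide camera image into a grid of patches and resizes each to the vision encoder's native resolution. The grid layout and resolution below are assumed values for illustration; the paper's exact patching scheme may differ.

```python
import torch
import torch.nn.functional as F

def split_into_patches(image, patch_grid=(1, 2), encoder_size=336):
    """Illustrative sketch: split a (3, H, W) camera image into crops and
    resize each crop to the encoder input size, so small but safety-critical
    details (e.g. distant traffic lights) keep more pixels.
    patch_grid and encoder_size are assumptions, not the paper's settings."""
    _, h, w = image.shape
    rows, cols = patch_grid
    ph, pw = h // rows, w // cols
    crops = []
    for r in range(rows):
        for c in range(cols):
            crop = image[:, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            crop = F.interpolate(crop.unsqueeze(0), size=(encoder_size, encoder_size),
                                 mode="bilinear", align_corners=False)
            crops.append(crop.squeeze(0))
    return torch.stack(crops)   # (rows * cols, 3, encoder_size, encoder_size)
```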
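The efficient training recipe is described only at a high level; the sketch below shows one common way such a recipe can be implemented, using a weighted sampler that down-weights trivial samples and up-weights rare, interaction-heavy scenarios. The scenario labels, field names, and weights are hypothetical.

```python
from torch.utils.data import WeightedRandomSampler

def build_sampler(samples, rare_weight=5.0, trivial_weight=0.2):
    """Illustrative sketch: weight each training sample by how 'interesting'
    it is, so compute is not spent re-learning easy straight driving.
    The 'is_trivial' / 'scenario' fields and the weights are hypothetical."""
    weights = []
    for s in samples:
        if s.get("is_trivial", False):          # e.g. constant-speed straight driving
            weights.append(trivial_weight)
        elif s.get("scenario") in {"lane_change", "yield_emergency", "construction"}:
            weights.append(rare_weight)         # rare / challenging scenarios
        else:
            weights.append(1.0)
    return WeightedRandomSampler(weights, num_samples=len(samples), replacement=True)
```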

Results and Performance

Table 1 of the paper summarizes CarLLaVA's performance on the CARLA Leaderboard 2.0, where it significantly surpasses prior methods in Driving Score (DS), Route Completion (RC), and Infraction Score (IS). The ablation studies on output representation (Table 2a) and vision encoder pretraining (Table 2b) further confirm the contribution of the semi-disentangled output representation and internet-scale pretraining.
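
For reference, the relation among these metrics follows the standard CARLA Leaderboard definition; the formulas below are background on the benchmark, not a result of the paper, and the notation is ours.

```latex
% Standard CARLA Leaderboard definitions (background, not paper-specific):
% per-route driving score = route completion x infraction penalty, where the
% penalty starts at 1 and is multiplied by a coefficient p_j < 1 for each
% infraction of type j (n_{i,j} infractions on route i); the global score
% averages over the N evaluated routes.
\mathrm{DS}_i = \mathrm{RC}_i \cdot \mathrm{IS}_i,
\qquad
\mathrm{IS}_i = \prod_{j} p_j^{\,n_{i,j}},
\qquad
\mathrm{DS} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{DS}_i
```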

Implications and Future Directions

The implications of this research are multifaceted, involving both theoretical insights and practical enhancements:

  1. Theoretical Implications:
    • This work underscores the strength of VLMs when transferred to domain-specific tasks like autonomous driving. The utilization of semi-disentangled representations advances understanding in balancing control mechanisms within a singular architectural framework.
  2. Practical Enhancements:
    • The findings encourage the broader adoption of cost-effective, camera-only autonomous driving solutions by demonstrating superior performance without dependency on additional expensive sensory inputs. This approach can accelerate the deployment of autonomous driving technologies in a real-world context where cost and scalability remain significant constraints.

Speculating on future developments, further enhancements might involve integrating richer temporal dynamics and multi-view camera systems to strengthen situational awareness and decision-making under complex driving conditions. Additionally, the fusion of VLMs with other modalities, such as radar and LiDAR, while ensuring cost-effectiveness, can introduce new dimensions of robustness and reliability.

Conclusion

The CarLLaVA framework presents a notable advance in autonomous driving, demonstrating closed-loop driving from camera input alone and achieving state-of-the-art results on the CARLA Leaderboard. The findings emphasize efficient, scalable, and cost-conscious driving strategies. Future research can build on these results, for example by integrating temporal and multi-view configurations to further improve performance.
