Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving (2403.19838v2)
Abstract: Vision-Language Models (VLMs) and Multi-Modal Language Models (MMLMs) have become prominent in autonomous driving research, as these models can provide interpretable textual reasoning and responses for end-to-end autonomous driving safety tasks using traffic scene images and other data modalities. However, current approaches rely on expensive large language model (LLM) backbones and image encoders, making them unsuitable for real-time autonomous driving systems, where memory is tightly constrained and fast inference is necessary. To address these limitations, we develop EM-VLM4AD, an efficient, lightweight, multi-frame vision-language model that performs Visual Question Answering for autonomous driving. Compared to previous approaches, EM-VLM4AD requires at least 10 times less memory and fewer floating-point operations, while also achieving higher CIDEr and ROUGE-L scores than the existing baseline on the DriveLM dataset. EM-VLM4AD can also extract prompt-relevant information from traffic views and answer questions across a variety of autonomous driving subtasks. We release our code to train and evaluate our model at https://github.com/akshaygopalkr/EM-VLM4AD.
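The abstract does not spell out the architecture, but a minimal PyTorch sketch of the described design — a frozen image encoder applied to each camera view, a learned pooling that fuses the per-view embeddings, and a lightweight text-to-text LM backbone — might look like the following. The module names (`AttentionPool`, `MultiFrameVLM`), the checkpoint choices, the CLS-token pooling, and the single-token image projection are all illustrative assumptions, not the released implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, ViTModel


class AttentionPool(nn.Module):
    """Learned softmax pooling over per-view embeddings (hypothetical;
    one simple stand-in for the paper's multi-view fusion)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, dim)
        weights = torch.softmax(self.score(views), dim=1)  # (b, v, 1)
        return (weights * views).sum(dim=1)                # (b, dim)


class MultiFrameVLM(nn.Module):
    """Sketch: frozen ViT per camera view -> pooled image embedding ->
    prepended to the question's token embeddings -> T5 backbone."""

    def __init__(self):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch32-224-in21k")
        self.vit.requires_grad_(False)  # keep the image encoder frozen
        self.lm = T5ForConditionalGeneration.from_pretrained("t5-base")
        self.pool = AttentionPool(self.vit.config.hidden_size)
        self.proj = nn.Linear(self.vit.config.hidden_size, self.lm.config.d_model)

    def forward(self, images, input_ids, attention_mask, labels=None):
        # images: (batch, num_views, 3, H, W) from the surround cameras
        b, v = images.shape[:2]
        # CLS embedding for each view, then fuse the views into one vector
        feats = self.vit(pixel_values=images.flatten(0, 1)).last_hidden_state[:, 0]
        fused = self.pool(feats.view(b, v, -1))              # (b, vit_dim)
        img_tok = self.proj(fused).unsqueeze(1)              # (b, 1, d_model)
        txt = self.lm.shared(input_ids)                      # question embeddings
        inputs_embeds = torch.cat([img_tok, txt], dim=1)
        img_mask = attention_mask.new_ones(b, 1)
        mask = torch.cat([img_mask, attention_mask], dim=1)
        # When labels are given, Hugging Face T5 computes the LM loss internally
        return self.lm(inputs_embeds=inputs_embeds, attention_mask=mask,
                       labels=labels)
```

At inference time the same `inputs_embeds` and mask can be passed to `self.lm.generate(...)` to produce a free-form answer; the frozen encoder and small T5 backbone are what keep the memory and FLOP budget low in this kind of design.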
- Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- The Falcon series of open language models, 2023.
- METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
- nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
- Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving. arXiv preprint arXiv:2310.01957, 2023.
- Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, pages 1931–1942. PMLR, 2021.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- A survey of vision-language pre-trained models. arXiv preprint arXiv:2202.10936, 2022.
- Compressing visual-linguistic model via knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1428–1438, 2021.
- Towards explainable, safe autonomous driving with language embeddings for novelty identification and active learning: Framework and experimental analysis with real-world data sets. arXiv preprint arXiv:2402.07320, 2024.
- Robust traffic light detection using salience-sensitive loss: Computational framework and evaluations. In 2023 IEEE Intelligent Vehicles Symposium (IV), pages 1–7. IEEE, 2023.
- The why, when, and how to use active learning in large-data-driven 3d object detection for safe autonomous driving: An empirical exploration. arXiv preprint arXiv:2401.16634, 2024a.
- Patterns of vehicle lights: Addressing complexities of camera-based vehicle light datasets and metrics. Pattern Recognition Letters, 178:209–215, 2024b.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Textual explanations for self-driving vehicles. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–578, 2018.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
- Distilling large vision-language model with out-of-distribution generalizability. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2492–2503, 2023b.
- LoftQ: LoRA-fine-tuning-aware quantization for large language models, 2023c.
- ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- GPT-Driver: Learning to drive with GPT. arXiv preprint arXiv:2310.01415, 2023.
- Trajectory prediction for autonomous driving based on multi-head attention with joint agent-map representation. In 2021 IEEE Intelligent Vehicles Symposium (IV), pages 165–170. IEEE, 2021.
- BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
- LanguageMPC: Large language models as decision makers for autonomous driving, 2023.
- DriveLM: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
- DriveMLM: Aligning multi-modal large language models with behavioral planning states for autonomous driving, 2023.
- Chain-of-thought prompting elicits reasoning in large language models, 2023.
- MIVC: Multiple instance visual component for visual-language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8117–8126, 2024.
- DriveGPT4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412, 2023.
- Akshay Gopalkrishnan
- Ross Greer
- Mohan Trivedi