Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models (2401.00988v1)

Published 2 Jan 2024 in cs.CV

Abstract: The rise of multimodal LLMs (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on a limited set of tasks and often omits the multi-view and temporal information that is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method to generate instruction-response pairs automatically, inspired by the logical progression of human driving. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features that are language-aligned for LLMs. BEV-InMLLM integrates multi-view, spatial awareness, and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g., around a 9% improvement on various tasks. We plan to release NuInstruct for future research development.
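The abstract describes the plug-and-play BEV injection module only at a high level. The sketch below is a minimal, hypothetical illustration of one way such an injection step could be realized: a cross-attention adapter in which the MLLM's visual tokens attend to flattened BEV features before being passed to the language model. The class name, dimensions, and residual design are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class BEVInjection(nn.Module):
    """Hypothetical plug-and-play adapter: visual tokens of an existing MLLM
    attend to flattened BEV features via cross-attention (illustrative only)."""

    def __init__(self, token_dim: int, bev_dim: int, num_heads: int = 8):
        super().__init__()
        self.bev_proj = nn.Linear(bev_dim, token_dim)   # align BEV channels to token width
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, vis_tokens: torch.Tensor, bev_feats: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, token_dim) visual tokens fed to the LLM
        # bev_feats:  (B, H*W, bev_dim) flattened BEV grid from a BEV encoder (e.g., BEVFormer)
        bev_kv = self.bev_proj(bev_feats)
        injected, _ = self.cross_attn(query=vis_tokens, key=bev_kv, value=bev_kv)
        # Residual add keeps the original MLLM token stream recoverable (plug-and-play spirit).
        return self.norm(vis_tokens + injected)

# Usage sketch with assumed shapes
module = BEVInjection(token_dim=1024, bev_dim=256)
tokens = torch.randn(2, 32, 1024)       # 32 visual tokens per sample
bev = torch.randn(2, 200 * 200, 256)    # 200x200 BEV grid, flattened
out = module(tokens, bev)               # same shape as tokens: (2, 32, 1024)
```

Because the adapter only adds a residual cross-attention branch on top of the existing token stream, it could in principle be bolted onto a frozen MLLM without retraining the base model, which matches the "plug-and-play" framing in the abstract.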

