HE-Drive: Human-Like End-to-End Driving with Vision Language Models

Published 7 Oct 2024 in cs.CV and cs.RO | (2410.05051v1)

Abstract: In this paper, we propose HE-Drive: the first human-like-centric end-to-end autonomous driving system to generate trajectories that are both temporally consistent and comfortable. Recent studies have shown that imitation learning-based planners and learning-based trajectory scorers can effectively generate and select accurate trajectories that closely mimic expert demonstrations. However, such trajectory planners and scorers face the dilemma of generating temporally inconsistent and uncomfortable trajectories. To solve these problems, our HE-Drive first extracts key 3D spatial representations through sparse perception, which then serve as conditional inputs for a Conditional Denoising Diffusion Probabilistic Models (DDPMs)-based motion planner to generate temporally consistent multi-modal trajectories. A Vision-Language Models (VLMs)-guided trajectory scorer subsequently selects the most comfortable trajectory from these candidates to control the vehicle, ensuring human-like end-to-end driving. Experiments show that HE-Drive not only achieves state-of-the-art performance (i.e., reduces the average collision rate by 71% compared to VAD) and efficiency (i.e., 1.9x faster than SparseDrive) on the challenging nuScenes and OpenScene datasets but also provides the most comfortable driving experience on real-world data. For more information, visit the project website: https://jmwang0117.github.io/HE-Drive/.

References (44)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  11621–11631, 2020.
  3. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243, 2024.
  4. Pluto: Pushing the limit of imitation learning-based planning for autonomous driving. arXiv preprint arXiv:2404.14327, 2024.
  5. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2024.
  6. Neat: Neural attention fields for end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  15793–15803, 2021.
  7. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):12878–12895, 2022.
  8. OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023.
  9. Parting with misconceptions about learning-based vehicle motion planning. In Conference on Robot Learning, pp.  1268–1281. PMLR, 2023.
  10. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. arXiv, 2406.15349, 2024.
  11. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  12. Baidu apollo em motion planner. arXiv preprint arXiv:1807.08048, 2018.
  13. An efficient spatial-temporal trajectory planner for autonomous vehicles in unstructured environments. IEEE Transactions on Intelligent Transportation Systems, 2023.
  14. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  15. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  16. Safe local motion planning with self-supervised freespace forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12732–12741, 2021.
  17. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, pp.  533–549. Springer, 2022.
  18. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  17853–17862, 2023.
  19. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  17853–17862, 2023.
  20. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  8340–8350, 2023.
  21. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  8340–8350, 2023.
  22. Differentiable raycasting for self-supervised occupancy forecasting. In European Conference on Computer Vision, pp.  353–369. Springer, 2022.
  23. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978, 2024.
  24. Skilldiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16467–16476, 2024.
  25. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  26. Potential based diffusion motion planning. arXiv preprint arXiv:2407.06169, 2024.
  27. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
  28. Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  7077–7087, 2021.
  29. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp.  729–736, 2006.
  30. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  15120–15130, 2024.
  31. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
  32. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  33. Nomad: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.  63–70. IEEE, 2024.
  34. Sparsedrive: End-to-end autonomous driving via sparse scene representation. arXiv preprint arXiv:2405.19620, 2024.
  35. Hpnet: Dynamic trajectory forecasting with historical prediction attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  15261–15270, 2024.
  36. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  8406–8415, 2023.
  37. Congested traffic states in empirical observations and microscopic simulations. Physical review E, 62(2):1805, 2000.
  38. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024a.
  39. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817, 2024b.
  40. 3d diffusion policy. arXiv preprint arXiv:2403.03954, 2024.
  41. Tnt: Target-driven trajectory prediction. In Conference on Robot Learning, pp.  895–904. PMLR, 2021.
  42. Occworld: Learning a 3d occupancy world model for autonomous driving. In European Conference on Computer Vision. Springer, 2024a.
  43. Genad: Generative end-to-end autonomous driving. arXiv preprint arXiv: 2402.11502, 2024b.
  44. Query-centric trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  17863–17873, 2023.

Summary

  • The paper introduces HE-Drive, a novel autonomous driving system that generates human-like trajectories with enhanced temporal consistency.
  • It employs sparse 3D perception and conditional denoising diffusion models to produce multi-modal motion plans for improved safety and comfort.
  • Vision-language models are used to score trajectories, achieving a 71% reduction in collision rates and ensuring a smooth driving experience.

The paper "HE-Drive: Human-Like End-to-End Driving with Vision Language Models" introduces an approach to autonomous driving that prioritizes temporal consistency and passenger comfort. The method, called HE-Drive, leverages recent advances in machine learning to address trajectory-prediction issues commonly faced by imitation learning-based planners.

Key Contributions:

  1. End-to-End Autonomous Driving: HE-Drive is designed as a human-like-centric system to improve the performance and quality of autonomous vehicle navigation by generating trajectories that feel more natural and comfortable to passengers.
  2. Sparse Perception and Conditional Inputs: The approach starts by extracting significant 3D spatial data through sparse perception methods. These data points are used as conditional inputs for the motion planning phase.
  3. Conditional Denoising Diffusion Probabilistic Models (DDPMs): The paper utilizes Conditional DDPMs to facilitate the generation of temporally consistent and multi-modal trajectories. This technique ensures that the trajectories align well over time, addressing the problem of temporal inconsistency.
  4. Vision-Language Models (VLMs) for Scoring: A key innovation is the use of Vision-Language Models to guide trajectory scoring. These models assess the candidate trajectories and select the most comfortable one, yielding smoother vehicle control.
  5. Performance Metrics: Experimental results demonstrate HE-Drive's superior performance. It reduces the average collision rate by 71% compared to the VAD system and runs 1.9 times faster than SparseDrive. It also provides the most comfortable driving experience in real-world evaluations.
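
To make the generate-then-score pipeline concrete, here is a minimal, self-contained sketch (not the authors' implementation) of the two core steps: DDPM-style reverse sampling of candidate trajectories conditioned on a scene feature, followed by a simple comfort-based selection. The toy denoiser, diffusion schedule, and jerk-based score are illustrative assumptions; the paper's planner uses a learned network conditioned on sparse 3D representations, and its scorer is guided by a VLM.

```python
import numpy as np

T_DIFF = 50    # number of diffusion steps (assumed)
HORIZON = 8    # number of 2D trajectory waypoints (assumed)

# Standard linear noise schedule from the DDPM paper.
betas = np.linspace(1e-4, 0.02, T_DIFF)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t, cond):
    """Hypothetical noise predictor eps_theta(x_t, t, cond).
    A real system learns this network from data; here we use a crude
    stand-in that pulls waypoints toward the conditioning feature."""
    return x_t - cond[None, :]

def ddpm_sample(cond, rng):
    """Reverse diffusion: start from Gaussian noise, iteratively denoise
    using the DDPM posterior mean, adding noise at every step but the last."""
    x = rng.standard_normal((HORIZON, 2))
    for t in reversed(range(T_DIFF)):
        eps = toy_denoiser(x, t, cond)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

def comfort_score(traj, dt=0.5):
    """Lower mean squared jerk (third derivative of position) is a common
    proxy for ride comfort; higher score = smoother trajectory."""
    vel = np.diff(traj, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return -float(np.mean(jerk ** 2))

rng = np.random.default_rng(0)
cond = np.array([1.0, 0.0])                              # stand-in scene feature
candidates = [ddpm_sample(cond, rng) for _ in range(3)]  # multi-modal candidates
best = max(candidates, key=comfort_score)                # scorer picks the smoothest
print(best.shape)  # (8, 2)
```

The design point this illustrates is the division of labor in HE-Drive: the diffusion planner is responsible only for proposing diverse, temporally consistent candidates, while a separate scorer (here a jerk heuristic, in the paper a VLM) encodes the comfort preference.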

Application and Impact:

HE-Drive marks a significant step towards more sophisticated, human-like driving experiences in autonomous vehicles. By focusing on comfort and efficiency, the system could contribute substantially to the adoption of autonomous technologies in consumer markets.

Datasets:

The authors tested HE-Drive on challenging datasets such as nuScenes and OpenScene, underlining its robustness and applicability in diverse scenarios.

Overall, HE-Drive represents a compelling synthesis of cutting-edge technologies in machine learning, positioning itself as a potential leader in end-to-end autonomous driving systems.
