A Survey of Vision Transformers in Autonomous Driving: Current Trends and Future Directions (2403.07542v1)

Published 12 Mar 2024 in cs.CV and cs.LG

Abstract: This survey explores the adaptation of visual transformer models in Autonomous Driving, a transition inspired by their success in Natural Language Processing. Surpassing traditional Recurrent Neural Networks in tasks like sequential image processing and outperforming Convolutional Neural Networks in global context capture, as evidenced in complex scene recognition, Transformers are gaining traction in computer vision. These capabilities are crucial in Autonomous Driving for real-time, dynamic visual scene processing. Our survey provides a comprehensive overview of Vision Transformer applications in Autonomous Driving, focusing on foundational concepts such as self-attention, multi-head attention, and encoder-decoder architecture. We cover applications in object detection, segmentation, pedestrian detection, lane detection, and more, comparing their architectural merits and limitations. The survey concludes with future research directions, highlighting the growing role of Vision Transformers in Autonomous Driving.
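
The abstract names self-attention and multi-head attention as the foundational operations behind these models. As a point of reference only (not code from the survey), the minimal sketch below shows scaled dot-product self-attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, applied to a toy sequence of patch embeddings; all function names, shapes, and values are illustrative assumptions.

# Minimal sketch of scaled dot-product self-attention (Vaswani et al., 2017).
# Names and shapes are illustrative; this is not code from the survey.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_*: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)            # pairwise token affinities
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v                         # weighted mixture of value vectors

# Toy usage: 4 image-patch tokens with an 8-dimensional embedding.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)         # shape (4, 8)

Multi-head attention repeats this operation with several independent projection sets and concatenates the outputs, which is what lets Vision Transformers mix global context across all patches of a driving scene in a single layer.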

Authors (1)
  1. Quoc-Vinh Lai-Dang (4 papers)