
Multi-task Learning for Real-time Autonomous Driving Leveraging Task-adaptive Attention Generator (2403.03468v1)

Published 6 Mar 2024 in cs.CV

Abstract: Real-time processing is crucial in autonomous driving systems due to the need for instantaneous decision-making and rapid response. In real-world scenarios, autonomous vehicles must continuously interpret their surroundings, analyze intricate sensor data, and make decisions within a split second to ensure safety through numerous computer vision tasks. In this paper, we present a new real-time multi-task network adept at three vital autonomous driving tasks: monocular 3D object detection, semantic segmentation, and dense depth estimation. To counter the challenge of negative transfer, a prevalent issue in multi-task learning, we introduce a task-adaptive attention generator. This generator is designed to automatically discern interrelations across the three tasks and arrange the task-sharing pattern, all while leveraging the efficiency of the hard-parameter sharing approach. To the best of our knowledge, the proposed model is the first capable of concurrently handling multiple tasks, notably 3D object detection, while maintaining real-time processing speeds. Our rigorously optimized network, when tested on the Cityscapes-3D dataset, consistently outperforms various baseline models. Moreover, an in-depth ablation study substantiates the efficacy of the methodologies integrated into our framework.
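To make the abstract's core mechanism concrete, below is a minimal PyTorch sketch of what a task-adaptive attention generator over a hard-parameter-shared backbone might look like. The module names, tensor shapes, and the sigmoid-gated 1x1-convolution design are assumptions for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch: task-adaptive attention over a shared backbone.
# All design choices here are assumptions, not the authors' architecture.
import torch
import torch.nn as nn


class TaskAdaptiveAttention(nn.Module):
    """Generates one attention mask per task from shared backbone features."""

    def __init__(self, channels: int, num_tasks: int):
        super().__init__()
        # One lightweight gate per task (hypothetical design): a 1x1
        # convolution followed by a sigmoid produces a [0, 1] mask that
        # decides how much of each shared feature the task consumes.
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
            for _ in range(num_tasks)
        )

    def forward(self, shared: torch.Tensor) -> list[torch.Tensor]:
        # Each task re-weights the same shared tensor with its own mask,
        # so related tasks can keep shared features while unrelated tasks
        # suppress them: the intuition behind reducing negative transfer.
        return [gate(shared) * shared for gate in self.gates]


class MultiTaskNet(nn.Module):
    """Hard-parameter sharing: one encoder, per-task attention, per-task heads."""

    def __init__(self, backbone: nn.Module, channels: int, heads: nn.ModuleList):
        super().__init__()
        self.backbone = backbone  # shared encoder (e.g. a ResNet trunk)
        self.attention = TaskAdaptiveAttention(channels, len(heads))
        self.heads = heads        # e.g. 3D detection, segmentation, depth heads

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        shared = self.backbone(x)
        feats = self.attention(shared)
        return [head(f) for head, f in zip(self.heads, feats)]
```

Note that in this simplification the gating is a fixed per-task module; the paper's generator is described as learning inter-task relations to arrange the sharing pattern automatically, which this sketch does not attempt to capture.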

