S$^3$M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving
Abstract: Semantic segmentation and stereo matching are two essential components of 3D environmental perception systems for autonomous driving. Nevertheless, conventional approaches often address these two problems independently, employing separate models for each task. This separation imposes practical limitations in real-world scenarios, particularly when computational resources are scarce or real-time performance is imperative. Hence, in this article, we introduce S$^3$M-Net, a novel joint learning framework developed to perform semantic segmentation and stereo matching simultaneously. Specifically, S$^3$M-Net shares the features extracted from RGB images between both tasks, resulting in an improved overall scene understanding capability. This feature sharing process is realized using a feature fusion adaptation (FFA) module, which effectively transforms the shared features into semantic space and subsequently fuses them with the encoded disparity features. The entire joint learning framework is trained by minimizing a novel semantic consistency-guided (SCG) loss, which emphasizes structural consistency across both tasks. Extensive experiments conducted on the vKITTI2 and KITTI datasets demonstrate the effectiveness of our proposed joint learning framework and its superior performance compared to other state-of-the-art single-task networks. Our project webpage is accessible at mias.group/S3M-Net.
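The abstract describes the FFA module only at a high level: shared RGB features are transformed into semantic space and then fused with encoded disparity features. A minimal PyTorch sketch of that idea follows, assuming a 1x1 convolution for the semantic-space transform and a 3x3 convolution for the fusion; the class name, layer choices, and channel sizes are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureFusionAdaptation(nn.Module):
    """Illustrative FFA-style module (hypothetical, not the paper's code):
    adapts shared RGB features into semantic space, then fuses them with
    encoded disparity features."""

    def __init__(self, shared_ch: int, disp_ch: int, out_ch: int):
        super().__init__()
        # 1x1 convolution re-projects the shared features into semantic space.
        self.adapt = nn.Sequential(
            nn.Conv2d(shared_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # 3x3 convolution fuses the adapted features with disparity features.
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch + disp_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, shared_feat: torch.Tensor, disp_feat: torch.Tensor) -> torch.Tensor:
        semantic_feat = self.adapt(shared_feat)           # (B, out_ch, H, W)
        fused = torch.cat([semantic_feat, disp_feat], dim=1)
        return self.fuse(fused)                           # (B, out_ch, H, W)
```

In a joint network, a module of this shape would sit between the shared encoder and the semantic decoder, letting both tasks draw on the same RGB features; the real FFA design may differ substantially.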
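Similarly, the abstract states only that the SCG loss emphasizes structural consistency across both tasks, without giving its form. The sketch below shows one plausible shape for such a joint objective under stated assumptions: standard per-task losses plus an edge-aware consistency term that discourages disparity discontinuities away from predicted semantic boundaries. Every term, weight, and name here (`joint_loss`, `lambda_scg`) is an assumption for illustration, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def joint_loss(seg_logits: torch.Tensor,   # (B, C, H, W) class logits
               seg_labels: torch.Tensor,   # (B, H, W) integer class labels
               disp_pred: torch.Tensor,    # (B, 1, H, W) predicted disparity
               disp_gt: torch.Tensor,      # (B, 1, H, W) ground-truth disparity
               lambda_scg: float = 0.5) -> torch.Tensor:
    """Hypothetical joint objective; NOT the paper's SCG loss."""
    # Standard per-task terms.
    loss_seg = F.cross_entropy(seg_logits, seg_labels)
    loss_disp = F.smooth_l1_loss(disp_pred, disp_gt)

    # Illustrative structural-consistency term: penalize horizontal disparity
    # gradients where the predicted semantics are locally uniform, using the
    # softmax edge strength as a gate (strong semantic edges lift the penalty).
    seg_prob = seg_logits.softmax(dim=1)
    seg_edge_x = (seg_prob[:, :, :, 1:] - seg_prob[:, :, :, :-1]).abs().sum(dim=1, keepdim=True)
    disp_grad_x = (disp_pred[:, :, :, 1:] - disp_pred[:, :, :, :-1]).abs()
    scg_term = (disp_grad_x * torch.exp(-seg_edge_x)).mean()

    return loss_seg + loss_disp + lambda_scg * scg_term
```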