RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing (2309.10356v4)

Published 19 Sep 2023 in cs.CV and cs.RO

Abstract: Recent advances in deep convolutional neural networks have shown significant promise in the domain of road scene parsing. Nevertheless, existing works focus primarily on freespace detection, with little attention given to hazardous road defects that could compromise both driving safety and comfort. In this paper, we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused and recalibrated heterogeneous features, which are subsequently processed by a Transformer decoder to produce the final semantic prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset, which contains 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset, as well as on three public datasets (KITTI road, Cityscapes, and ORFD), demonstrate that RoadFormer outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first on the KITTI road benchmark. Our source code, dataset, and demo video are publicly available at mias.group/RoadFormer.
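
Two ingredients of the abstract's pipeline can be made concrete. First, since SYN-UDTIRI ships dense depth images, the surface normals fed to the network can be derived from depth. The sketch below shows one common gradient-based approximation in NumPy, treating depth as a height field; it is illustrative only, not the specific depth-to-normal translator the authors use.

```python
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Estimate per-pixel surface normals from a dense depth map.

    Height-field approximation: n is proportional to (-dZ/du, -dZ/dv, 1),
    then normalized to unit length. Projection-aware depth-to-normal
    translators are more accurate; this only conveys the idea.
    """
    dz_dv, dz_du = np.gradient(depth)  # row (v) and column (u) gradients
    n = np.stack([-dz_du, -dz_dv, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)  # shape (H, W, 3)

# Toy check: a plane tilted along u yields the same normal at every pixel.
if __name__ == "__main__":
    v, u = np.mgrid[0:4, 0:4].astype(float)
    print(normals_from_depth(0.1 * u + 5.0)[2, 2])  # ~ [-0.0995, 0, 0.995]
```

Second, the duplex encoder, synergy block, and decoder can be sketched as two parallel encoders whose features are fused and channel-recalibrated before per-pixel classification. Everything below is an illustrative stand-in assuming PyTorch: the names `SynergyBlock` and `DuplexRoadParser` are hypothetical, the gating is a generic squeeze-and-excitation-style recalibration, and the real network uses stronger backbones plus Transformer pixel and mask decoders (see the released source code for the actual design).

```python
import torch
import torch.nn as nn

class SynergyBlock(nn.Module):
    """Fuse RGB and normal features, then recalibrate channels.
    A generic stand-in for the paper's heterogeneous feature synergy block."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(  # SE-style channel gating
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_normal: torch.Tensor) -> torch.Tensor:
        fused = self.mix(torch.cat([f_rgb, f_normal], dim=1))
        return fused * self.gate(fused)  # channel-wise recalibration

class DuplexRoadParser(nn.Module):
    """Two parallel encoders (RGB, surface normals) -> fusion -> per-pixel logits."""
    def __init__(self, channels: int = 64, num_classes: int = 3):
        super().__init__()
        def encoder() -> nn.Sequential:  # tiny placeholder backbone
            return nn.Sequential(
                nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
        self.enc_rgb, self.enc_normal = encoder(), encoder()
        self.fuse = SynergyBlock(channels)
        self.head = nn.Conv2d(channels, num_classes, 1)

    def forward(self, rgb: torch.Tensor, normals: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(self.enc_rgb(rgb), self.enc_normal(normals))
        logits = self.head(fused)
        return nn.functional.interpolate(  # back to input resolution
            logits, size=rgb.shape[-2:], mode="bilinear", align_corners=False)

# Smoke test: three classes could stand for background / freespace / road defect.
if __name__ == "__main__":
    model = DuplexRoadParser()
    out = model(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
    print(out.shape)  # torch.Size([1, 3, 128, 128])
```
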

Authors (6)
  1. Jiahang Li (19 papers)
  2. Yikang Zhang (18 papers)
  3. Peng Yun (15 papers)
  4. Guangliang Zhou (3 papers)
  5. Qijun Chen (49 papers)
  6. Rui Fan (114 papers)
Citations (20)
