
Complementary Random Masking for RGB-Thermal Semantic Segmentation (2303.17386v2)

Published 30 Mar 2023 in cs.CV, cs.AI, and cs.RO

Abstract: RGB-thermal semantic segmentation is one potential solution for reliable semantic scene understanding in adverse weather and lighting conditions. However, previous studies mostly focus on designing a multi-modal fusion module without considering the nature of multi-modality inputs. As a result, networks easily become over-reliant on a single modality, making it difficult to learn complementary and meaningful representations for each modality. This paper proposes 1) a complementary random masking strategy for RGB-T images and 2) a self-distillation loss between clean and masked input modalities. The proposed masking strategy prevents over-reliance on a single modality. It also improves the accuracy and robustness of the network by forcing it to segment and classify objects even when a modality is only partially available. In addition, the proposed self-distillation loss encourages the network to extract complementary and meaningful representations from a single modality or from complementary masked modalities. With the proposed method, we achieve state-of-the-art performance on three RGB-T semantic segmentation benchmarks. Our source code is available at https://github.com/UkcheolShin/CRM_RGBTSeg.
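
The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the authors' repository. The patch size, the keep ratio, the zero-fill for masked regions, the L1 distance between class probabilities, and all function names are assumptions chosen for illustration. The only property carried over from the abstract is that the RGB and thermal masks are complementary, so every patch stays visible in exactly one modality, and that a loss compares predictions from clean and masked inputs.

```python
import numpy as np


def complementary_random_masks(height, width, patch=16, ratio=0.5, rng=None):
    """Return two binary (H, W) masks that are complementary at patch level.

    Hypothetical helper: `patch` and `ratio` are illustrative defaults,
    not values reported in the paper.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")
    rng = np.random.default_rng() if rng is None else rng
    grid_h, grid_w = height // patch, width // patch
    # 1 = patch stays visible in RGB, 0 = patch is masked in RGB
    # (and therefore stays visible in the thermal image).
    grid = (rng.random((grid_h, grid_w)) < ratio).astype(np.float32)
    # Expand each grid cell to a patch x patch block of pixels.
    rgb_mask = np.kron(grid, np.ones((patch, patch), dtype=np.float32))
    thermal_mask = 1.0 - rgb_mask
    return rgb_mask, thermal_mask


def apply_masks(rgb, thermal, rgb_mask, thermal_mask):
    """Zero out masked pixels; rgb is (H, W, 3), thermal is (H, W, 1)."""
    return rgb * rgb_mask[..., None], thermal * thermal_mask[..., None]


def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_distillation_loss(clean_logits, masked_logits):
    """Mean absolute difference between class probabilities.

    Assumed form of a clean-vs-masked consistency term; the paper's
    exact loss may differ.
    """
    diff = softmax(clean_logits) - softmax(masked_logits)
    return float(np.abs(diff).mean())


# Usage with random stand-in data (no real model is involved here).
rng = np.random.default_rng(0)
rgb = rng.random((64, 64, 3)).astype(np.float32)
thermal = rng.random((64, 64, 1)).astype(np.float32)
rgb_mask, thermal_mask = complementary_random_masks(64, 64, rng=rng)
masked_rgb, masked_thermal = apply_masks(rgb, thermal, rgb_mask, thermal_mask)
```

In a training loop, the clean pair and the masked pair would each be passed through the same segmentation network, and the consistency term would be added to the usual supervised segmentation loss. How the two terms are weighted is left open here.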

Authors (4)
  1. Ukcheol Shin
  2. Kyunghyun Lee
  3. In So Kweon
  4. Jean Oh
Citations (12)
