
MMSFormer: Multimodal Transformer for Material and Semantic Segmentation (2309.04001v4)

Published 7 Sep 2023 in cs.CV and cs.LG

Abstract: Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal Segmentation TransFormer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material and semantic segmentation tasks. MMSFormer outperforms current state-of-the-art models on three different datasets. Starting from a single input modality, performance improves progressively as additional modalities are incorporated, showcasing the effectiveness of the fusion block in combining useful information from diverse input modalities. Ablation studies show that the different modules in the fusion block are crucial for overall model performance. The ablation studies also highlight the capacity of different input modalities to improve the identification of different types of materials. The code and pretrained models will be made available at https://github.com/csiplab/MMSFormer.

Authors (3)
  1. Md Kaykobad Reza (5 papers)
  2. Ashley Prater-Bennette (14 papers)
  3. M. Salman Asif (54 papers)