
Fusion-Mamba for Cross-modality Object Detection (2404.09146v1)

Published 14 Apr 2024 in cs.CV and cs.AI

Abstract: Fusing complementary information from different modalities effectively improves object detection performance, making detectors more useful and robust across a wider range of applications. Existing fusion strategies either combine different types of images or merge features from different backbones through elaborate neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance: features from modalities captured with different camera focal lengths, placements, and angles are hard to fuse. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) that maps cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of the fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) module enables deep fusion in a hidden state space. In extensive experiments on public datasets, our approach outperforms state-of-the-art methods in mAP by 5.9% on $M^3FD$ and 4.9% on FLIR-Aligned, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion, and it establishes a new baseline for cross-modality object detection.
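
The abstract names channel swapping as the shallow-fusion step but gives no details, so the following is only a minimal sketch of the general idea of exchanging channels between two modality feature maps. It is not the authors' SSCS module: the function name `swap_channels`, the `ratio` parameter, and the choice to swap the trailing channels are all assumptions made for illustration, and the state space processing that the paper pairs with swapping is omitted entirely.

```python
def swap_channels(rgb, ir, ratio=0.5):
    """Exchange the trailing `ratio` fraction of channels between two feature maps.

    Each feature map is a list of channels; a channel can be any object
    (for example a 2-D list of activations). `ratio` is a hypothetical
    parameter, not a value taken from the paper.
    """
    if len(rgb) != len(ir):
        raise ValueError("both modalities must have the same number of channels")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    # Number of leading channels each modality keeps for itself.
    keep = len(rgb) - int(len(rgb) * ratio)
    rgb_mixed = rgb[:keep] + ir[keep:]
    ir_mixed = ir[:keep] + rgb[keep:]
    return rgb_mixed, ir_mixed


# Toy example: four labelled "channels" per modality.
rgb = ["r0", "r1", "r2", "r3"]
ir = ["i0", "i1", "i2", "i3"]
rgb_mixed, ir_mixed = swap_channels(rgb, ir)
print(rgb_mixed)  # ['r0', 'r1', 'i2', 'i3']
print(ir_mixed)   # ['i0', 'i1', 'r2', 'r3']
```

After the swap, each branch carries some channels from the other modality, so later layers in either branch see mixed information. In a real detector the channels would be tensors rather than strings, and the swap would be followed by learned layers.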

Authors (9)
  1. Wenhao Dong
  2. Haodong Zhu
  3. Shaohui Lin
  4. Xiaoyan Luo
  5. Yunhang Shen
  6. Xuhui Liu
  7. Juan Zhang
  8. Guodong Guo
  9. Baochang Zhang