FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba (2404.09498v2)
Abstract: Multi-modal image fusion aims to combine information from different modalities into a single image with comprehensive information and detailed textures. However, fusion models based on convolutional neural networks are limited in capturing global image features because of their reliance on local convolution operations. Transformer-based models, while excelling at global feature modeling, face computational challenges stemming from their quadratic complexity. Recently, the Selective Structured State Space Model (Mamba) has exhibited significant potential for long-range dependency modeling with linear complexity, offering a promising avenue out of this dilemma. In this paper, we propose FusionMamba, a novel dynamic feature enhancement method for multimodal image fusion with Mamba. Specifically, we devise an improved, efficient Mamba model for image fusion that integrates an efficient visual state space model with dynamic convolution and channel attention. This refined model retains Mamba's performance and global modeling capability while reducing channel redundancy and strengthening local feature enhancement. Additionally, we devise a dynamic feature fusion module (DFFM) comprising two dynamic feature enhancement modules (DFEM) and a cross-modality fusion Mamba module (CMFM). The former provides dynamic texture enhancement and dynamic difference perception, whereas the latter enhances correlated features between modalities and suppresses redundant inter-modal information. FusionMamba achieves state-of-the-art (SOTA) performance across multimodal medical image fusion tasks (CT-MRI, PET-MRI, SPECT-MRI), the infrared and visible image fusion task (IR-VIS), and a multimodal biomedical image fusion dataset (GFP-PC), demonstrating the generalization ability of our model. The code for FusionMamba is available at https://github.com/millieXie/FusionMamba.
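To make the module composition concrete, below is a minimal PyTorch sketch of how a DFFM could wire two DFEMs and a CMFM together, following the structure described in the abstract. The internals here (the depthwise-convolution texture branch, squeeze-and-excitation channel attention, and a 1x1 fusion convolution standing in for the cross-modality Mamba scan) are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

```python
# Structural sketch of the DFFM (two DFEMs + one CMFM), assuming PyTorch.
# All module internals are hypothetical stand-ins for the paper's design.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (common stand-in)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)  # rescale channels by learned importance

class DFEM(nn.Module):
    """Dynamic feature enhancement: texture enhancement + difference
    perception, approximated by a depthwise conv branch plus an explicit
    cross-modal difference term, gated by channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.texture = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.attn = ChannelAttention(channels)

    def forward(self, x, other):
        diff = x - other  # difference perception between the two modalities
        return x + self.attn(self.texture(x) + diff)

class CMFM(nn.Module):
    """Cross-modality fusion: enhance correlated features and suppress
    redundancy. A 1x1 fusion conv stands in for the Mamba scan here."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.attn = ChannelAttention(channels)

    def forward(self, a, b):
        return self.attn(self.fuse(torch.cat([a, b], dim=1)))

class DFFM(nn.Module):
    """Two DFEMs (one per modality) followed by cross-modality fusion."""
    def __init__(self, channels):
        super().__init__()
        self.dfem_a = DFEM(channels)
        self.dfem_b = DFEM(channels)
        self.cmfm = CMFM(channels)

    def forward(self, feat_a, feat_b):
        ea = self.dfem_a(feat_a, feat_b)
        eb = self.dfem_b(feat_b, feat_a)
        return self.cmfm(ea, eb)

# Example: fuse 64-channel feature maps from two modalities.
fused = DFFM(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```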
Authors: Xinyu Xie, Yawen Cui, Chio-In Ieong, Tao Tan, Xiaozhi Zhang, Xubin Zheng, Zitong Yu