From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion (2401.00421v1)
Abstract: With the rapid progress of deep learning, multi-modality image fusion has become increasingly prevalent in object detection tasks. Despite its popularity, the inherent disparities in how different sources depict scene content make fusion challenging. Current fusion methods identify shared characteristics between the two modalities and integrate them within this shared domain using either iterative optimization or deep learning architectures. These approaches often neglect the intricate semantic relationships between modalities, yielding a superficial understanding of inter-modal connections and, consequently, suboptimal fusion results. To address this, we introduce a text-guided multi-modality image fusion method that leverages high-level semantics from textual descriptions to integrate the semantics of infrared and visible images. The method exploits the complementary characteristics of the two modalities, improving both the accuracy and robustness of object detection. A codebook is used to obtain a streamlined, compact representation of the fused intra- and inter-domain dynamics, fine-tuned for detection tasks. We present a bilevel optimization strategy that couples the joint fusion and detection problems, optimizing both concurrently. Furthermore, we introduce the first dataset of paired infrared and visible images accompanied by text prompts, paving the way for future research. Extensive experiments on several datasets demonstrate that our method not only produces visually superior fusion results but also attains higher detection mAP than existing methods, achieving state-of-the-art results.
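The abstract names two concrete mechanisms worth unpacking: a codebook that compresses the fused representation, and a bilevel scheme that ties fusion to detection. The sketches below are illustrative only; the abstract does not specify the architecture, so every class, function, and hyperparameter name here (`Codebook`, `fusion_net`, `detector`, `lam`, and so on) is an assumption rather than the paper's actual design.

First, a minimal VQ-VAE-style codebook, in the spirit of the neural discrete representation learning line of work that the abstract's codebook suggests: continuous fused features are snapped to their nearest discrete code vectors, giving the streamlined, compact depiction the abstract describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Codebook(nn.Module):
    """VQ-style codebook: maps continuous fused features to their nearest
    discrete embeddings. A hypothetical stand-in for the paper's codebook."""

    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.embed = nn.Embedding(num_codes, dim)
        self.embed.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z: torch.Tensor):
        # z: (B, C, H, W) fused feature map from the fusion backbone.
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)  # one vector per spatial site
        # Squared L2 distance from every feature vector to every code.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.embed.weight.t()
                + self.embed.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)  # nearest code index per position
        z_q = self.embed(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        # Standard VQ-VAE codebook + commitment losses.
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()  # straight-through gradient estimator
        return z_q, vq_loss
```

For the bilevel strategy coupling fusion and detection, a faithful implementation would alternate a lower-level fusion solve with an upper-level detection update; the single-loop sketch below collapses that into one weighted gradient step purely for illustration. The text conditioning is modeled as an extra embedding input (for example, from a pretrained text encoder such as CLIP's, which the abstract hints at but does not confirm), and the element-wise-max fusion proxy loss is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def joint_step(fusion_net, detector, fuse_opt, det_opt,
               ir, vis, text_emb, targets, lam: float = 0.5) -> float:
    """One joint fusion-and-detection step (hypothetical components throughout).

    fusion_net : maps (ir, vis, text_emb) -> (fused image, codebook loss)
    detector   : maps (fused, targets)    -> detection loss (upper level)
    """
    fused, vq_loss = fusion_net(ir, vis, text_emb)  # text-conditioned fusion
    det_loss = detector(fused, targets)             # task (detection) objective
    # Proxy lower-level objective: keep the fused image close to the
    # element-wise max of the two inputs (a common, simple fusion prior).
    fuse_loss = F.l1_loss(fused, torch.max(ir, vis))
    loss = det_loss + lam * fuse_loss + vq_loss
    fuse_opt.zero_grad()
    det_opt.zero_grad()
    loss.backward()
    fuse_opt.step()
    det_opt.step()
    return loss.item()
```

In a true bilevel formulation the upper-level detection gradient would be taken with the fusion network held at its lower-level optimum, rather than summing the two losses as above; the sketch only shows how the detection signal can flow back into the fusion parameters.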
Authors: Xingyuan Li, Yang Zou, Jinyuan Liu, Zhiying Jiang, Long Ma, Xin Fan, Risheng Liu