TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion (2312.14209v2)
Abstract: Advanced image fusion methods aim to generate fused results by aggregating the complementary information conveyed by the source images. However, the differences in how the imaged scene content manifests in each source modality make it difficult to design a robust and controllable fusion process. We argue that this issue can be alleviated with the help of higher-level semantics, conveyed by the text modality, which enables us to generate fused images for different purposes, such as visualisation and downstream tasks, in a controllable way. This is achieved by exploiting a vision-and-language model to build a coarse-to-fine association mechanism between the text and image signals. Guided by the resulting association maps, an affine fusion unit is embedded in the transformer network to fuse the text and vision modalities at the feature level. As another ingredient of this work, we propose the use of textual attention to adapt image quality assessment to the fusion task. To facilitate the implementation of the proposed text-guided fusion paradigm, and its adoption by the wider research community, we release a text-annotated image fusion dataset, IVT. Extensive experiments demonstrate that our approach (TextFusion) consistently outperforms traditional appearance-based fusion methods. Our code and dataset will be publicly available at https://github.com/AWCXV/TextFusion.
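To make the described mechanism concrete, the sketch below illustrates one plausible form of a text-guided affine fusion unit: a pooled text embedding produces per-channel scale and shift terms, and a text-image association map gates where the modulation is applied to the visual features. This is a minimal, hypothetical PyTorch sketch; the class name, tensor shapes, and the exact modulation form are assumptions for illustration and are not taken from the released TextFusion code.

```python
import torch
import torch.nn as nn

class AffineFusionUnit(nn.Module):
    """Illustrative text-guided affine fusion unit (assumed design, not the official one)."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        # Project the pooled text embedding to per-channel scale and shift terms.
        self.to_scale = nn.Linear(txt_dim, vis_dim)
        self.to_shift = nn.Linear(txt_dim, vis_dim)

    def forward(self, vis_feat, txt_feat, assoc_map):
        # vis_feat:  (B, C, H, W) visual features from the fusion backbone
        # txt_feat:  (B, D) pooled text embedding from a vision-and-language model
        # assoc_map: (B, 1, H, W) coarse-to-fine text-image association map in [0, 1]
        scale = self.to_scale(txt_feat).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        shift = self.to_shift(txt_feat).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        modulated = vis_feat * (1 + scale) + shift
        # Apply the text-conditioned modulation only where the text attends.
        return assoc_map * modulated + (1 - assoc_map) * vis_feat

# Usage with dummy tensors (shapes chosen purely for illustration).
unit = AffineFusionUnit(vis_dim=64, txt_dim=512)
fused = unit(torch.randn(2, 64, 32, 32), torch.randn(2, 512), torch.rand(2, 1, 32, 32))
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

The gating by the association map reflects the paper's stated goal of controllability: regions described by the text are reshaped by the text-conditioned affine transform, while the remaining regions keep their original visual features.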
Authors: Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Hui Li, Xi Li, Zhangyong Tang, Josef Kittler