DiffX: Guide Your Layout to Cross-Modal Generative Modeling (2407.15488v5)
Abstract: Diffusion models have made significant strides in language-driven and layout-driven image generation. However, most diffusion models are limited to generating visible RGB images. Human perception of the world, by contrast, is enriched by diverse viewpoints such as chromatic contrast, thermal illumination, and depth information. In this paper, we introduce DiffX, a novel diffusion model for general layout-guided cross-modal generation. DiffX presents a compact and effective cross-modal generative modeling pipeline that conducts the diffusion and denoising processes in a modality-shared latent space. Moreover, we introduce the Joint-Modality Embedder (JME), which incorporates a gated attention mechanism to enhance the interaction between layout and text conditions. To facilitate user-instructed training, we construct cross-modal image datasets with detailed text captions generated by a Large Multimodal Model (LMM) and refined through a human-in-the-loop process. In extensive experiments, DiffX demonstrates robustness in cross-modal "RGB+X" image generation on the FLIR, MFNet, and COME15K datasets under various layout conditions. It also shows strong potential for the adaptive generation of "RGB+X+Y(+Z)" images, or more diverse modalities, on the FLIR, MFNet, COME15K, and MCXFace datasets. To our knowledge, DiffX is the first model for layout-guided cross-modal image generation. Our code and constructed cross-modal image datasets are available at https://github.com/zeyuwang-zju/DiffX.
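The abstract does not specify how the gated attention in the JME is formulated. As a rough illustration only, the sketch below assumes a tanh-gated residual update in which text tokens attend to layout tokens, with the learned projections omitted. The function names (`attend`, `gated_fusion`), the scalar gate `gamma`, and the toy token values are all hypothetical and are not taken from the DiffX code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def gated_fusion(text_tokens, layout_tokens, gamma):
    """Assumed form: text + tanh(gamma) * attention(text -> layout).

    gamma is a scalar gate (learned in practice, fixed here). With
    gamma = 0 the layout branch is switched off and the text tokens
    pass through unchanged.
    """
    gate = math.tanh(gamma)
    attended = attend(text_tokens, layout_tokens, layout_tokens)
    return [[t + gate * a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text_tokens, attended)]


# Toy 2-dimensional tokens, chosen only for illustration.
text = [[1.0, 0.0], [0.0, 1.0]]
layout = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

closed = gated_fusion(text, layout, gamma=0.0)  # gate closed
opened = gated_fusion(text, layout, gamma=1.0)  # gate partly open
```

With the gate closed the output equals the text tokens, so a gate of this kind can be initialized at zero and opened gradually during training without disturbing a pretrained text pathway. Whether DiffX uses this particular form is not stated in the abstract.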