Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net (2311.16488v1)
Abstract: Diffusion models have recently been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling and overlook both the inefficiency of joint training and the interference between modalities. We develop the Partially Shared U-Net (PS-U-Net) architecture, an efficient multimodal diffusion model in which text and image inputs pass through dedicated layers and skip connections that preserve modality-specific fine-grained details. Inspired by image inpainting, we also propose a new, efficient multimodal sampling method that enables new conditional-generation scenarios while requiring only a simple joint distribution to be learned. Our empirical study on the MS-COCO dataset demonstrates that our method generates multimodal text and image data of higher quality than existing multimodal diffusion models of comparable size, while offering faster training, faster multimodal sampling, and more flexible generation.
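The core idea of the abstract — dedicated per-modality layers and skip connections around a shared cross-modal core — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the layer sizes, the single shared layer, and the function name `ps_unet_step` are all hypothetical, and real diffusion denoisers would operate on timestep-conditioned tensors rather than flat vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the actual PS-U-Net sizes are not given here.
D_IMG, D_TXT, D_SHARED = 8, 4, 6

# Dedicated (modality-specific) input/output projections and one shared layer.
W_img_in = rng.standard_normal((D_IMG, D_SHARED))
W_txt_in = rng.standard_normal((D_TXT, D_SHARED))
W_shared = rng.standard_normal((D_SHARED, D_SHARED))
W_img_out = rng.standard_normal((D_SHARED, D_IMG))
W_txt_out = rng.standard_normal((D_SHARED, D_TXT))

def ps_unet_step(x_img, x_txt):
    """One denoising step of a partially shared network:
    dedicated encoders, a shared cross-modal middle, and
    dedicated decoders with per-modality skip connections."""
    h_img = np.tanh(x_img @ W_img_in)        # dedicated image branch
    h_txt = np.tanh(x_txt @ W_txt_in)        # dedicated text branch
    h = np.tanh((h_img + h_txt) @ W_shared)  # shared cross-modal layer
    # Skip connections stay inside each modality, so fine-grained
    # modality-specific detail bypasses the shared bottleneck.
    eps_img = (h + h_img) @ W_img_out
    eps_txt = (h + h_txt) @ W_txt_out
    return eps_img, eps_txt

x_img = rng.standard_normal(D_IMG)
x_txt = rng.standard_normal(D_TXT)
eps_img, eps_txt = ps_unet_step(x_img, x_txt)
print(eps_img.shape, eps_txt.shape)
```

The inpainting-inspired sampling described in the abstract would then fix (clamp) the observed modality at each denoising step while only the missing modality is updated, so a single jointly trained model covers unconditional, text-to-image, and image-to-text generation.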
- Zizhao Hu
- Shaochong Jia
- Mohammad Rostami