Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net (2311.16488v1)

Published 28 Nov 2023 in cs.CV and cs.AI

Abstract: Diffusion models have recently been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling and overlook both inefficiency and interference between modalities. We develop the Partially Shared U-Net (PS-U-Net) architecture, an efficient multimodal diffusion model in which text and image inputs pass through dedicated layers and skip-connections that preserve modality-specific fine-grained details. Inspired by image inpainting, we also propose a new efficient multimodal sampling method that enables new conditional-generation scenarios while requiring only a simple joint distribution to be learned. Our empirical evaluation on the MS-COCO dataset demonstrates that our method generates multimodal text and image data of higher quality than existing multimodal diffusion models of comparable size, with faster training, faster multimodal sampling, and more flexible generation.
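The core architectural idea in the abstract (modality-specific encoders, decoders, and skip-connections around a single shared bottleneck) can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: all layer shapes, names, and the use of dense layers in place of the real convolutional/attention blocks are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    """A dense layer with tanh nonlinearity -- stand-in for a real U-Net block."""
    return np.tanh(x @ w)

class PSUNetSketch:
    """Toy partially shared U-Net: each modality keeps its own encoder,
    decoder, and skip connection, while only the middle block is shared."""

    def __init__(self, dims, hidden=32, shared=16):
        # dims: per-modality input widths, e.g. {"image": 64, "text": 24}
        self.enc = {m: rng.normal(0, 0.1, (d, hidden)) for m, d in dims.items()}
        self.dec = {m: rng.normal(0, 0.1, (2 * hidden, d)) for m, d in dims.items()}
        self.mid_in = {m: rng.normal(0, 0.1, (hidden, shared)) for m in dims}
        self.shared = rng.normal(0, 0.1, (shared, shared))  # the only shared weights
        self.mid_out = {m: rng.normal(0, 0.1, (shared, hidden)) for m in dims}

    def forward(self, inputs):
        # Each modality passes through the shared bottleneck, but its skip
        # connection feeds only its own decoder, preserving fine-grained detail.
        out = {}
        for m, x in inputs.items():
            h = layer(x, self.enc[m])                            # modality-specific encoder
            z = layer(layer(h, self.mid_in[m]), self.shared)     # shared bottleneck
            u = layer(z, self.mid_out[m])                        # back to hidden width
            out[m] = np.concatenate([u, h], axis=-1) @ self.dec[m]  # skip + decode
        return out

net = PSUNetSketch({"image": 64, "text": 24})
batch = {"image": rng.normal(size=(2, 64)), "text": rng.normal(size=(2, 24))}
eps = net.forward(batch)  # one noise prediction per modality, input-shaped
```

The design choice being illustrated: parameter sharing is confined to the bottleneck, so the two modalities interact there while their dedicated encoder/decoder paths avoid cross-modal interference.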

Authors (3)
  1. Zizhao Hu
  2. Shaochong Jia
  3. Mohammad Rostami
