Many-to-many Image Generation with Auto-regressive Diffusion Models (2404.03109v1)

Published 3 Apr 2024 in cs.CV

Abstract: Recent advancements in image generation have made significant progress, yet existing models present limitations in perceiving and generating an arbitrary number of interrelated images within a broad context. This limitation becomes increasingly critical as the demand for multi-image scenarios, such as multi-view images and visual narratives, grows with the expansion of multimedia platforms. This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images, offering a scalable solution that obviates the need for task-specific solutions across different multi-image scenarios. To facilitate this, we present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images. Utilizing Stable Diffusion with varied latent noises, our method produces a set of interconnected images from a single caption. Leveraging MIS, we learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework. Throughout training on the synthetic MIS, the model excels in capturing style and content from preceding images - synthetic or real - and generates novel images following the captured patterns. Furthermore, through task-specific fine-tuning, our model demonstrates its adaptability to various multi-image generation tasks, including Novel View Synthesis and Visual Procedure Generation.

Summary

  • The paper introduces a many-to-many image generation framework with auto-regressive diffusion to create interconnected image series.
  • It employs a novel MIS dataset of 12 million synthetic multi-image samples and two architectural variants to enhance visual coherence.
  • Zero-shot generalization and fine-tuning experiments validate the model's robustness in tasks like novel view synthesis and visual procedure generation.

Many-to-many Image Generation with Auto-regressive Diffusion Models

Introduction

Recent progress in image generation has yielded sophisticated models capable of producing visually compelling single images. However, generating multiple interrelated images in a cohesive manner remains a relatively unexplored frontier. Addressing this gap, the paper introduces a domain-general framework for many-to-many image generation. Underpinned by auto-regressive diffusion models, the framework generates a series of interconnected images from a given set of initial images, without relying on task-specific solutions, offering a versatile approach to multi-image scenarios.

Methodology

Multi-Image Set (MIS)

A pivotal contribution of this work is the Multi-Image Set (MIS), a novel dataset comprising 12 million synthetic multi-image samples. Each sample consists of 25 interconnected images produced by running Stable Diffusion on a single caption with varied latent noise, so the images within a sample share semantics and style while differing in detail. MIS serves both as training data for the proposed model and as a benchmark for evaluating many-to-many image generation tasks.
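
As a concrete illustration of how such a sample could be produced, the sketch below uses the Hugging Face diffusers library to render one caption many times under different latent noise seeds. The checkpoint, seed scheme, and loop structure are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: build one MIS-style sample -- a set of images tied to a single
# caption but started from different initial latent noise.
# Checkpoint and seeds are assumptions, not the paper's settings.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "a watercolor painting of a lighthouse at dusk"
num_images = 25  # each MIS sample contains 25 interconnected images

images = []
for seed in range(num_images):
    # A different seed gives different initial latent noise, so outputs
    # vary in detail while staying tied to the same caption.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    images.append(pipe(caption, generator=generator).images[0])
```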

Many-to-Many Diffusion (M2M) Model

At the core of the proposed solution is the Many-to-Many Diffusion (M2M) model, an auto-regressive model that generates images sequentially, modeling each image with a diffusion process conditioned on the latent representations of the preceding ones. Two architectural variants are explored: M2M with Self-encoder (M2M-Self) and M2M with DINO encoder (M2M-DINO). The former reuses the U-Net-based denoising model to encode both the preceding and the noisy latent images, enabling refined cross-attention across spatial dimensions. The latter instead conditions on an external vision model, a DINOv2 encoder, to represent the preceding images with more discriminative visual features.
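
The auto-regressive factorization can be made concrete with a short sketch. The loop below is schematic: `denoiser`, `cond_encoder`, and `scheduler` are hypothetical stand-ins (the conditioning encoder would be the denoising U-Net itself in M2M-Self, or a frozen DINOv2 in M2M-DINO), not the authors' code.

```python
# Schematic autoregressive diffusion loop: each new image's latent is
# denoised conditioned on all preceding latents, i.e.
# p(x_1..x_N) = prod_i p(x_i | x_<i). All names are hypothetical.
import torch

@torch.no_grad()
def generate_sequence(denoiser, cond_encoder, scheduler,
                      context_latents, num_new, latent_shape, device="cuda"):
    latents = list(context_latents)  # latents of the given (context) images
    for _ in range(num_new):
        # Encode the growing prefix of images into conditioning tokens.
        cond = cond_encoder(torch.stack(latents)) if latents else None
        x = torch.randn(latent_shape, device=device)  # start from pure noise
        for t in scheduler.timesteps:
            eps = denoiser(x, t, cond)                 # predict the noise
            x = scheduler.step(eps, t, x).prev_sample  # one denoising step
        latents.append(x)  # the new image joins the context for the next step
    return latents
```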

Evaluation and Results

Extensive experiments demonstrate the model's proficiency in capturing and reproducing style and content across interconnected images. Notably, the model generalizes zero-shot to real images as conditioning inputs, evidencing its robustness and versatility. Task-specific fine-tuning further showcases its adaptability to downstream tasks such as Novel View Synthesis and Visual Procedure Generation. Image quality and contextual consistency are evaluated with Fréchet Inception Distance (FID) and CLIP scores, with M2M-DINO performing best at maintaining coherence among generated image series.
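
For reference, both metric families are available off the shelf. The sketch below uses torchmetrics with common default settings (Inception feature dimension, CLIP variant) that may differ from the paper's exact configuration; random tensors stand in for decoded images.

```python
# Sketch: FID for image quality, CLIPScore for caption consistency.
# Random uint8 tensors stand in for real and generated images.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

real = torch.randint(0, 255, (8, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 255, (8, 3, 256, 256), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

captions = ["a watercolor painting of a lighthouse at dusk"] * 8
print("CLIPScore:", clip_score(fake, captions).item())
```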

Discussion

The research highlights the potential of auto-regressive diffusion models in the field of many-to-many image generation, offering a significant leap towards more flexible and context-aware image synthesis. The MIS dataset emerges as a valuable asset for further explorations in this field. However, challenges remain, particularly in generating human faces with higher fidelity and maintaining image quality over prolonged generative sequences. These areas indicate promising directions for future research.

Conclusion

This paper presents a compelling approach to many-to-many image generation using auto-regressive diffusion models. Through the introduction of the MIS dataset and the development of the M2M model, the research opens new pathways for generating complex image sets. The demonstrated adaptability to various multi-image generation tasks, combined with robust zero-shot generalization, marks a notable advance in generative AI. Future work will likely explore refinements and applications of the M2M model, propelling the field toward ever more sophisticated image generation capabilities.
