Many-to-many Image Generation with Auto-regressive Diffusion Models

Published 3 Apr 2024 in cs.CV | (2404.03109v1)

Abstract: Recent advancements in image generation have made significant progress, yet existing models present limitations in perceiving and generating an arbitrary number of interrelated images within a broad context. This limitation becomes increasingly critical as the demand for multi-image scenarios, such as multi-view images and visual narratives, grows with the expansion of multimedia platforms. This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images, offering a scalable solution that obviates the need for task-specific solutions across different multi-image scenarios. To facilitate this, we present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images. Utilizing Stable Diffusion with varied latent noises, our method produces a set of interconnected images from a single caption. Leveraging MIS, we learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework. Throughout training on the synthetic MIS, the model excels in capturing style and content from preceding images - synthetic or real - and generates novel images following the captured patterns. Furthermore, through task-specific fine-tuning, our model demonstrates its adaptability to various multi-image generation tasks, including Novel View Synthesis and Visual Procedure Generation.

Summary

The paper introduces a many-to-many image generation framework with auto-regressive diffusion to create interconnected image series.
It employs a novel MIS dataset of 12 million synthetic multi-image samples and two architectural variants to enhance visual coherence.
Zero-shot generalization and fine-tuning experiments validate the model's robustness in tasks like novel view synthesis and visual procedure generation.

Introduction

Image generation models can now produce visually compelling single images, but generating multiple interrelated images cohesively remains relatively unexplored. To address this gap, the paper introduces a domain-general framework for many-to-many image generation. Built on auto-regressive diffusion models, the framework generates a series of interconnected images from a given set of input images without relying on task-specific solutions, so a single approach can serve different multi-image scenarios.

Methodology

Multi-Image Dataset (MIS)

A pivotal contribution of this work is the introduction of the Multi-Image Set (MIS), a novel dataset comprising 12 million synthetic multi-image samples. Each sample consists of 25 images generated by Stable Diffusion from a single caption with varied latent noise, so the images within a sample are interconnected through general semantic relationships. MIS serves not only as a training ground for the proposed model but also as a benchmark for evaluating many-to-many image generation tasks.
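
The construction of one such sample can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `generate_image` is a hypothetical stand-in for a Stable Diffusion call that accepts a caption and an initial latent, and the helper names, seeding scheme, and default latent size (4 x 64 x 64, the usual Stable Diffusion latent shape for 512 x 512 images) are assumptions.

```python
import random
from typing import Any, Callable, List

Latent = List[float]


def make_latents(num_images: int, latent_size: int, base_seed: int) -> List[Latent]:
    """Draw one independent Gaussian latent per image, each from its own seed."""
    latents = []
    for i in range(num_images):
        rng = random.Random(base_seed + i)  # a distinct seed gives a distinct noise draw
        latents.append([rng.gauss(0.0, 1.0) for _ in range(latent_size)])
    return latents


def build_multi_image_sample(
    caption: str,
    generate_image: Callable[[str, Latent], Any],  # hypothetical text-to-image call
    num_images: int = 25,                          # images per sample, as stated in the text
    latent_size: int = 4 * 64 * 64,                # assumed flattened latent size
    base_seed: int = 0,
) -> List[Any]:
    """Generate a set of related images: same caption, different initial noise."""
    latents = make_latents(num_images, latent_size, base_seed)
    return [generate_image(caption, z) for z in latents]
```

Because the caption is held fixed and only the initial noise changes, the images in a sample share content and style while differing in detail, which is the property the model is later trained to capture.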

Many-to-Many Diffusion (M2M) Model

At the core of the proposed solution is the Many-to-Many Diffusion (M2M) model, an auto-regressive model that processes and generates images sequentially based on their latent representations. Two novel architectural variants are explored: M2M with Self-encoder (M2M-Self) and M2M with DINO encoder (M2M-DINO). The former uses the same U-Net-based denoising model for both the preceding images and the noisy latent images, facilitating refined cross-attention across spatial dimensions. The latter incorporates an external vision model, specifically a DINOv2 encoder, to encode the preceding images with more discriminative visual features.
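
The sequential generation described above can be sketched as a loop in which each new latent is denoised from Gaussian noise while conditioning on all preceding latents, then appended to the context for the next image. This is an illustrative sketch under assumptions: `denoise_step` is a hypothetical callable standing in for the conditioned U-Net, and `toy_denoise_step` is a trivial placeholder (it pulls the latent toward the mean of the context) that only makes the loop runnable. It does not model cross-attention or a real noise schedule.

```python
import random
from typing import Callable, List, Sequence

Latent = List[float]
# Hypothetical signature: (noisy latent, timestep, preceding latents) -> less noisy latent
DenoiseStep = Callable[[Latent, int, Sequence[Latent]], Latent]


def sample_next_latent(context: Sequence[Latent], denoise_step: DenoiseStep,
                       num_steps: int, rng: random.Random) -> Latent:
    """Run a reverse diffusion loop for one image, conditioned on the context."""
    x = [rng.gauss(0.0, 1.0) for _ in range(len(context[0]))]  # start from pure noise
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t, context)
    return x


def generate_series(given: Sequence[Latent], num_new: int, denoise_step: DenoiseStep,
                    num_steps: int = 10, seed: int = 0) -> List[Latent]:
    """Auto-regressively generate num_new latents after the given ones."""
    if not given:
        raise ValueError("at least one preceding latent is required")
    rng = random.Random(seed)
    series = [list(z) for z in given]
    for _ in range(num_new):
        # Each generated latent joins the context used for the next one.
        series.append(sample_next_latent(series, denoise_step, num_steps, rng))
    return series[len(given):]


def toy_denoise_step(x: Latent, t: int, context: Sequence[Latent]) -> Latent:
    """Placeholder denoiser: interpolate toward the per-dimension context mean."""
    mean = [sum(col) / len(context) for col in zip(*context)]
    w = 1.0 / t  # reaches the mean exactly at the final step (t == 1)
    return [(1.0 - w) * xi + w * mi for xi, mi in zip(x, mean)]
```

In the two variants, the difference would lie in how the context is encoded before it reaches the denoiser (the U-Net itself for M2M-Self, a DINOv2 encoder for M2M-DINO); the outer loop stays the same.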

Evaluation and Results

Extensive experiments demonstrate the model's proficiency in capturing and reproducing the style and content across interconnected images. Notably, zero-shot generalization to real images was observed, evidencing the model's robustness and versatility. Task-specific fine-tuning further showcased the adaptability of the model to various generation tasks, such as Novel View Synthesis and Visual Procedure Generation. The study employs metrics such as Fréchet Inception Distance (FID) and CLIP scores to evaluate image quality and contextual consistency, with M2M-DINO showing superior performance in maintaining coherence among generated image series.
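
For intuition about the two metric families, the sketch below shows the arithmetic behind them. It is a simplification and not the paper's evaluation code. FID is the Fréchet distance between two Gaussians fitted to Inception features, ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)); the function here assumes diagonal covariances so that the matrix square root reduces to an element-wise one. CLIP-based scores are built on cosine similarity between embeddings, which are assumed to be precomputed.

```python
import math
from typing import Sequence


def frechet_distance_diagonal(mu1: Sequence[float], var1: Sequence[float],
                              mu2: Sequence[float], var2: Sequence[float]) -> float:
    """Fréchet distance between two Gaussians with diagonal covariances.

    Real FID uses full covariance matrices; the diagonal case is an assumption
    made here to keep the example dependency-free.
    """
    mean_term = sum((m1 - m2) ** 2 for m1, m2 in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (e.g. CLIP embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A lower Fréchet distance indicates that generated and reference feature distributions are closer, while a higher cosine similarity indicates closer agreement between two embeddings.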

Discussion

The research highlights the potential of auto-regressive diffusion models in the field of many-to-many image generation, offering a significant leap towards more flexible and context-aware image synthesis. The MIS dataset emerges as a valuable asset for further explorations in this field. However, challenges remain, particularly in generating human faces with higher fidelity and maintaining image quality over prolonged generative sequences. These areas indicate promising directions for future research.

Conclusion

This paper presents an approach to many-to-many image generation using auto-regressive diffusion models. By introducing the MIS dataset and the M2M model, the research opens new pathways for generating complex image sets. The model adapts to various multi-image generation tasks through fine-tuning and generalizes zero-shot to real images, a notable advance in generative AI. Future work may refine the M2M model and extend its applications.
