Move Anything with Layered Scene Diffusion (2404.07178v1)

Published 10 Apr 2024 in cs.CV

Abstract: Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second.

Authors (6)
  1. Jiawei Ren (33 papers)
  2. Mengmeng Xu (27 papers)
  3. Jui-Chieh Wu (4 papers)
  4. Ziwei Liu (368 papers)
  5. Tao Xiang (324 papers)
  6. Antoine Toisoul (9 papers)
Citations (6)

Summary

  • The paper introduces a training-free, layered scene representation that enables spatial-content disentanglement for interactive scene editing.
  • It employs diffusion sampling optimization to allow moving, resizing, cloning, and restyling of objects within generated images.
  • Benchmark tests with 1,000 text prompts and over 5,000 images demonstrate state-of-the-art performance in scene generation and editing tasks.

SceneDiffusion: Training-free Controllable Scene Generation with Text-to-Image Diffusion Models

Overview

Recent advances in diffusion models have demonstrated unprecedented quality in image generation. However, freely rearranging and editing the layout of a generated image remains challenging. This paper introduces SceneDiffusion, a framework that optimizes a layered scene representation during the diffusion sampling process to enable a wide range of spatial editing operations, such as moving, resizing, cloning, and restyling objects within generated scenes. The method achieves spatial-content disentanglement, allowing interactive manipulation and in-the-wild image editing without additional model training, paired data, or denoiser-specific architecture designs.
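
The paper describes the joint denoising of scene renderings at different layouts only at a high level here; the sketch below is one way to read that idea, not the authors' implementation. The toy `Layer` container, the `render` compositing via `torch.roll`, the averaging of noise predictions across layouts, the placeholder `eps_model`, and the simplified update (in place of a real DDIM step) are all illustrative assumptions.

```python
# A minimal, self-contained sketch of the joint-denoising idea described above.
# This is an illustration, not the authors' code: the layer structure, the
# compositing function, and the noise model are toy stand-ins for a real
# latent-diffusion pipeline.
import torch

class Layer:
    def __init__(self, latent, mask):
        self.latent = latent   # per-layer noisy latent, shape (C, H, W)
        self.mask = mask       # binary occupancy mask, shape (1, H, W)

def render(layers, shifts):
    """Composite layers in order after shifting each one; later layers occlude earlier ones."""
    canvas = torch.zeros_like(layers[0].latent)
    for layer, (dy, dx) in zip(layers, shifts):
        lat = torch.roll(layer.latent, shifts=(dy, dx), dims=(-2, -1))
        msk = torch.roll(layer.mask,   shifts=(dy, dx), dims=(-2, -1))
        canvas = msk * lat + (1 - msk) * canvas
    return canvas

def joint_denoise_step(layers, layouts, eps_model, t):
    """Denoise renderings of the same scene under several candidate layouts and
    average the per-layer noise estimates, so each layer's content stays
    consistent across layouts (spatial-content disentanglement)."""
    per_layer_eps = [torch.zeros_like(l.latent) for l in layers]
    for shifts in layouts:
        eps = eps_model(render(layers, shifts), t)            # frozen denoiser
        for i, (layer, (dy, dx)) in enumerate(zip(layers, shifts)):
            # map the shared prediction back into the layer's own frame
            eps_back = torch.roll(eps, shifts=(-dy, -dx), dims=(-2, -1))
            per_layer_eps[i] += layer.mask * eps_back / len(layouts)
    for layer, eps_l in zip(layers, per_layer_eps):
        layer.latent = layer.latent - 0.1 * eps_l             # toy update in place of a real DDIM step

# Toy usage: two layers, three candidate layouts, a random "denoiser".
C, H, W = 4, 16, 16
layers = [Layer(torch.randn(C, H, W), (torch.rand(1, H, W) > 0.5).float()) for _ in range(2)]
layouts = [[(0, 0), (0, 0)], [(0, 2), (0, 0)], [(0, -2), (1, 0)]]
joint_denoise_step(layers, layouts, lambda x, t: torch.randn_like(x), t=10)
```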

Key Contributions

  • Layered Scene Representation Optimization: SceneDiffusion leverages a layered representation in which each layer corresponds to an object characterized by its mask, position, and text description. This representation handles object occlusions through depth ordering and allows scene layouts to be optimized analytically during the diffusion process (a hedged code sketch of this representation and its edits follows the list).
  • Spatial Editing Capabilities: The method supports extensive spatial and appearance editing operations. Objects within a scene can be freely moved, resized, and cloned. Additionally, objects can undergo layer-wise appearance changes, including restyling and replacement, based on text descriptions.
  • Training-free Approach: SceneDiffusion optimizes the scene representation directly during the sampling process of a pretrained text-to-image diffusion model, eliminating the need for fine-tuning or test-time optimization on specific data. This training-free approach ensures compatibility with general diffusion models and achieves interactive performance on a single GPU.
  • Benchmark Development: An evaluation benchmark was created featuring 1,000 text prompts and over 5,000 images with associated metadata. SceneDiffusion demonstrates state-of-the-art performance on this benchmark for both scene generation and spatial editing tasks, showcasing the method's effectiveness and broad applicability.
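
As referenced in the first bullet above, the sketch below shows how such a layered representation and its edits could look in code: moving reduces to changing a layer's offset, cloning to duplicating a layer, and restyling to swapping its text prompt. The `SceneLayer` fields, the box-shaped mask stand-in, and the helper functions are hypothetical names chosen for illustration, not the paper's data structure or API.

```python
# A hedged sketch of how the editing operations listed above could map onto a
# layered scene representation. Field names and the `depth` convention are
# illustrative assumptions.
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class SceneLayer:
    prompt: str                      # per-layer text description
    mask: Tuple[int, int, int, int]  # object box (x, y, w, h) as a stand-in for a pixel mask
    offset: Tuple[int, int]          # layout position; moving an object only changes this
    depth: int                       # larger depth is drawn later and occludes smaller depth

def move(scene: List[SceneLayer], i: int, dx: int, dy: int) -> List[SceneLayer]:
    out = list(scene)
    ox, oy = out[i].offset
    out[i] = replace(out[i], offset=(ox + dx, oy + dy))
    return out

def clone(scene: List[SceneLayer], i: int, dx: int, dy: int) -> List[SceneLayer]:
    copy = replace(scene[i],
                   offset=(scene[i].offset[0] + dx, scene[i].offset[1] + dy),
                   depth=max(l.depth for l in scene) + 1)
    return list(scene) + [copy]

def restyle(scene: List[SceneLayer], i: int, new_prompt: str) -> List[SceneLayer]:
    out = list(scene)
    out[i] = replace(out[i], prompt=new_prompt)   # appearance edit: only the text changes
    return out

# Example: move the foreground object, then restyle it.
scene = [SceneLayer("a grassy field", (0, 0, 64, 64), (0, 0), depth=0),
         SceneLayer("a red car", (10, 30, 20, 12), (0, 0), depth=1)]
scene = restyle(move(scene, 1, 8, 0), 1, "a blue car")
```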

Implications and Speculations on Future AI Developments

SceneDiffusion's introduction of a training-free, optimization-based approach to manipulate scene layouts presents significant implications for both theoretical and practical developments in AI. Theoretically, it advances our understanding of spatial-content disentanglement in generative models, suggesting that high-fidelity scene manipulation is achievable without model retraining. Practically, it offers a new tool for creators and developers, potentially transforming content creation workflows in gaming, virtual reality, and film production by allowing for rapid prototype testing and iterative design directly on generated imagery.

Future research may explore the integration of SceneDiffusion with other generative frameworks beyond diffusion models, such as GANs or VQ-VAE-based models, to further enhance the flexibility and fidelity of generative content manipulation. Additionally, refining and extending the layered scene representation to support more granular control over complex features such as lighting and texture, and investigating the incorporation of real-world physics for more realistic interactions between objects, are promising directions. The fusion of SceneDiffusion's approach with recent advancements in unsupervised learning could also lead to more intuitive and human-like understanding and editing of generated scenes by AI systems.

Conclusion

SceneDiffusion represents a significant step forward in controllable scene generation. By optimizing a layered scene representation during the diffusion process, it enables a wide array of editing operations that were previously challenging to achieve. Its training-free nature, compatibility with general diffusion models, and interactive performance open new possibilities for creative and practical applications. The development of a dedicated evaluation benchmark and the method’s demonstrated superior performance underscore its potential to shape future research and applications in generative modeling and beyond.
