Move Anything with Layered Scene Diffusion (2404.07178v1)

Published 10 Apr 2024 in cs.CV

Abstract: Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second.

Authors (6)
  1. Jiawei Ren (33 papers)
  2. Mengmeng Xu (27 papers)
  3. Jui-Chieh Wu (4 papers)
  4. Ziwei Liu (368 papers)
  5. Tao Xiang (324 papers)
  6. Antoine Toisoul (9 papers)
Citations (6)

Summary

  • The paper introduces a training-free, layered scene representation that enables spatial-content disentanglement for interactive scene editing.
  • It employs diffusion sampling optimization to allow moving, resizing, cloning, and restyling of objects within generated images.
  • Benchmark tests with 1,000 text prompts and over 5,000 images demonstrate state-of-the-art performance in scene generation and editing tasks.

SceneDiffusion: Training-free Controllable Scene Generation with Text-to-Image Diffusion Models

Overview

Recent advances in diffusion models have demonstrated unprecedented quality in image generation. However, freely rearranging and editing the layout of a generated image remains challenging. This paper introduces SceneDiffusion, a framework that optimizes a layered scene representation during the diffusion sampling process to enable a wide range of spatial editing operations, such as moving, resizing, cloning, and restyling objects within generated scenes. The method achieves spatial-content disentanglement, allowing interactive manipulation and in-the-wild image editing without additional model training, paired data, or denoiser-specific architecture designs.
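
The paper describes the joint denoising of scene renderings at different layouts only at a high level here; the sketch below is one way to read that idea, not the authors' implementation. The toy `Layer` container, the `render` compositing via `torch.roll`, the averaging of noise predictions across layouts, the placeholder `eps_model`, and the simplified update (in place of a real DDIM step) are all illustrative assumptions.

```python
# A minimal, self-contained sketch of the joint-denoising idea described above.
# This is an illustration, not the authors' code: the layer structure, the
# compositing function, and the noise model are toy stand-ins for a real
# latent-diffusion pipeline.
import torch

class Layer:
    def __init__(self, latent, mask):
        self.latent = latent   # per-layer noisy latent, shape (C, H, W)
        self.mask = mask       # binary occupancy mask, shape (1, H, W)

def render(layers, shifts):
    """Composite layers in order after shifting each one; later layers occlude earlier ones."""
    canvas = torch.zeros_like(layers[0].latent)
    for layer, (dy, dx) in zip(layers, shifts):
        lat = torch.roll(layer.latent, shifts=(dy, dx), dims=(-2, -1))
        msk = torch.roll(layer.mask,   shifts=(dy, dx), dims=(-2, -1))
        canvas = msk * lat + (1 - msk) * canvas
    return canvas

def joint_denoise_step(layers, layouts, eps_model, t):
    """Denoise renderings of the same scene under several candidate layouts and
    average the per-layer noise estimates, so each layer's content stays
    consistent across layouts (spatial-content disentanglement)."""
    per_layer_eps = [torch.zeros_like(l.latent) for l in layers]
    for shifts in layouts:
        eps = eps_model(render(layers, shifts), t)            # frozen denoiser
        for i, (layer, (dy, dx)) in enumerate(zip(layers, shifts)):
            # map the shared prediction back into the layer's own frame
            eps_back = torch.roll(eps, shifts=(-dy, -dx), dims=(-2, -1))
            per_layer_eps[i] += layer.mask * eps_back / len(layouts)
    for layer, eps_l in zip(layers, per_layer_eps):
        layer.latent = layer.latent - 0.1 * eps_l             # toy update in place of a real DDIM step

# Toy usage: two layers, three candidate layouts, a random "denoiser".
C, H, W = 4, 16, 16
layers = [Layer(torch.randn(C, H, W), (torch.rand(1, H, W) > 0.5).float()) for _ in range(2)]
layouts = [[(0, 0), (0, 0)], [(0, 2), (0, 0)], [(0, -2), (1, 0)]]
joint_denoise_step(layers, layouts, lambda x, t: torch.randn_like(x), t=10)
```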

Key Contributions

  • Layered Scene Representation Optimization: SceneDiffusion leverages a layered representation in which each layer corresponds to an object characterized by its mask, position, and text description. This representation handles object occlusions through depth ordering and allows scene layouts to be optimized analytically during the diffusion process (a hedged code sketch of this representation and its edits follows the list).
  • Spatial Editing Capabilities: The method supports extensive spatial and appearance editing operations. Objects within a scene can be freely moved, resized, and cloned. Additionally, objects can undergo layer-wise appearance changes, including restyling and replacement, based on text descriptions.
  • Training-free Approach: SceneDiffusion optimizes the scene representation directly during the sampling process of a pretrained text-to-image diffusion model, eliminating the need for fine-tuning or test-time optimization on specific data. This training-free approach ensures compatibility with general diffusion models and achieves interactive performance on a single GPU.
  • Benchmark Development: An evaluation benchmark was created featuring 1,000 text prompts and over 5,000 images with associated metadata. SceneDiffusion demonstrates state-of-the-art performance on this benchmark for both scene generation and spatial editing tasks, showcasing the method's effectiveness and broad applicability.
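
As referenced in the first bullet above, the sketch below shows how such a layered representation and its edits could look in code: moving reduces to changing a layer's offset, cloning to duplicating a layer, and restyling to swapping its text prompt. The `SceneLayer` fields, the box-shaped mask stand-in, and the helper functions are hypothetical names chosen for illustration, not the paper's data structure or API.

```python
# A hedged sketch of how the editing operations listed above could map onto a
# layered scene representation. Field names and the `depth` convention are
# illustrative assumptions.
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class SceneLayer:
    prompt: str                      # per-layer text description
    mask: Tuple[int, int, int, int]  # object box (x, y, w, h) as a stand-in for a pixel mask
    offset: Tuple[int, int]          # layout position; moving an object only changes this
    depth: int                       # larger depth is drawn later and occludes smaller depth

def move(scene: List[SceneLayer], i: int, dx: int, dy: int) -> List[SceneLayer]:
    out = list(scene)
    ox, oy = out[i].offset
    out[i] = replace(out[i], offset=(ox + dx, oy + dy))
    return out

def clone(scene: List[SceneLayer], i: int, dx: int, dy: int) -> List[SceneLayer]:
    copy = replace(scene[i],
                   offset=(scene[i].offset[0] + dx, scene[i].offset[1] + dy),
                   depth=max(l.depth for l in scene) + 1)
    return list(scene) + [copy]

def restyle(scene: List[SceneLayer], i: int, new_prompt: str) -> List[SceneLayer]:
    out = list(scene)
    out[i] = replace(out[i], prompt=new_prompt)   # appearance edit: only the text changes
    return out

# Example: move the foreground object, then restyle it.
scene = [SceneLayer("a grassy field", (0, 0, 64, 64), (0, 0), depth=0),
         SceneLayer("a red car", (10, 30, 20, 12), (0, 0), depth=1)]
scene = restyle(move(scene, 1, 8, 0), 1, "a blue car")
```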

Implications and Speculations on Future AI Developments

SceneDiffusion's introduction of a training-free, optimization-based approach to manipulate scene layouts presents significant implications for both theoretical and practical developments in AI. Theoretically, it advances our understanding of spatial-content disentanglement in generative models, suggesting that high-fidelity scene manipulation is achievable without model retraining. Practically, it offers a new tool for creators and developers, potentially transforming content creation workflows in gaming, virtual reality, and film production by allowing for rapid prototype testing and iterative design directly on generated imagery.

Future research may explore the integration of SceneDiffusion with other generative frameworks beyond diffusion models, such as GANs or VQ-VAE-based models, to further enhance the flexibility and fidelity of generative content manipulation. Additionally, refining and extending the layered scene representation to support more granular control over complex features such as lighting and texture, and investigating the incorporation of real-world physics for more realistic interactions between objects, are promising directions. The fusion of SceneDiffusion's approach with recent advancements in unsupervised learning could also lead to more intuitive and human-like understanding and editing of generated scenes by AI systems.

Conclusion

SceneDiffusion represents a significant step forward in controllable scene generation. By optimizing a layered scene representation during the diffusion process, it enables a wide array of editing operations that were previously challenging to achieve. Its training-free nature, compatibility with general diffusion models, and interactive performance open new possibilities for creative and practical applications. The development of a dedicated evaluation benchmark and the method’s demonstrated superior performance underscore its potential to shape future research and applications in generative modeling and beyond.
