Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis (2401.09048v1)

Published 17 Jan 2024 in cs.CV

Abstract: Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/


Summary

  • The paper introduces Compose and Conquer, a novel framework that combines local and global fusers for depth-aware image synthesis.
  • The local fuser utilizes depth disentanglement training with synthetic image triplets to accurately capture the Z-axis placement of objects.
  • The global fuser applies 'soft guidance' to inject global semantic styles into specific regions without extra localization cues; the combined framework outperforms baseline models in replicating depth and structural details.

Introduction

The world of generative AI has made significant strides, especially with the advent of text-conditional diffusion models. These models take a text prompt and generate a corresponding image by gradually refining noise into detailed visuals. As their popularity has grown, researchers have worked to enhance the precision with which these models can be controlled, and recent work supplies additional conditioning signals to improve the layout fidelity of the generated images. Yet two major challenges remain. First, existing methods cannot effectively represent three-dimensional object placement, often producing images that fail to reflect the depth-aware positioning of objects. Second, applying global semantic styles from multiple images to specified regions of the target image has proven difficult to control.

Methodology

In response to these challenges, the authors develop Compose and Conquer (CnC), a framework built around two components: a local fuser and a global fuser. The local fuser captures the Z-axis positioning of objects through depth disentanglement training (DDT), which uses synthetic image triplets to teach the model the 3D spatial relationship between foreground objects and their background. The global fuser relies on a technique called 'soft guidance,' which localizes global semantic conditions, confining their influence to specific regions without depending on explicit structural signals.
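
To make the idea of synthetic image triplets concrete, the sketch below assembles a (source, foreground, background) triplet from a single image and derives relative depth maps for the two layers. The helper functions (`segment_salient`, `inpaint`, `estimate_depth`) are toy placeholders, not the components used in the paper; treat this as an illustrative reading of depth disentanglement training rather than the authors' pipeline.

```python
import numpy as np

# Toy stand-ins for a salient-object segmenter, an inpainting model, and a
# monocular depth estimator, so the sketch runs end to end. The paper's actual
# components are not specified here.
def segment_salient(image):
    # Crude "saliency": mark pixels brighter than the image mean.
    gray = image.mean(axis=-1)
    return (gray > gray.mean()).astype(np.float32)

def inpaint(image, mask):
    # Fill masked pixels with the mean color of the unmasked region.
    filled = image.copy()
    mean_color = image[mask < 0.5].mean(axis=0)
    filled[mask >= 0.5] = mean_color
    return filled

def estimate_depth(image):
    # Fake relative depth: normalized luminance.
    gray = image.mean(axis=-1)
    return (gray - gray.min()) / (np.ptp(gray) + 1e-8)

def build_triplet(source):
    """Assemble a synthetic (source, foreground, background) triplet plus
    relative depth maps for the foreground and background layers."""
    mask = segment_salient(source)            # 1 where the salient object sits
    foreground = source * mask[..., None]     # the object on an empty canvas
    background = inpaint(source, mask)        # the scene with the object removed
    depth_fg = estimate_depth(foreground)     # depth condition for the object layer
    depth_bg = estimate_depth(background)     # depth condition for the scene layer
    return (source, foreground, background), (depth_fg, depth_bg)

# Example usage on a random image.
src = np.random.rand(256, 256, 3).astype(np.float32)
triplet, depth_maps = build_triplet(src)
```

The key point is that the model is exposed to both the object layer and the scene it occludes, which is what gives it a handle on relative depth.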

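Soft guidance is only summarized above; one way to picture it is as cross-attention between the spatial tokens of the denoising network and the global embeddings of an exemplar image, with the result confined to a target region. The snippet below is that interpretation only: the tensor shapes and the masking scheme are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def soft_guided_cross_attention(latent_tokens, exemplar_tokens, region_mask):
    """Cross-attention whose effect is confined to a target region (illustrative).

    latent_tokens:   (B, N, C) spatial tokens inside the denoising U-Net
    exemplar_tokens: (B, M, C) global embeddings of one exemplar image
    region_mask:     (B, N)    1 where that exemplar's semantics should apply
    """
    scale = latent_tokens.shape[-1] ** -0.5
    # Similarity between every spatial position and every exemplar token.
    attn = torch.einsum("bnc,bmc->bnm", latent_tokens, exemplar_tokens) * scale
    weights = attn.softmax(dim=-1)

    # Global semantics gathered from the exemplar at each spatial position.
    injected = torch.einsum("bnm,bmc->bnc", weights, exemplar_tokens)

    # Keep the injection only inside the target region, so the exemplar's
    # style does not leak into the rest of the image.
    return latent_tokens + injected * region_mask.unsqueeze(-1)
```

With two exemplars and complementary masks (for instance, masks derived from the foreground and background depth layers), each exemplar's global semantics can be bound to a different region, which matches the paper's goal of composing localized objects with different global styles.
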
Results

The CnC model combines the local and global fusers so that users can inject different global semantics into localized objects within an image, offering extensive creative control. It performs strongly in quantitative evaluations, surpassing baseline models on several metrics, particularly in reproducing depth perspectives and adhering to the structural information of the given conditions.
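
The paper's evaluation protocol is not reproduced here, but one simple way to probe "reproducing depth perspectives" is to re-estimate depth from a generated image and compare it with the depth map used as the condition. The snippet below is a toy version of such a check, not one of the paper's reported metrics.

```python
import numpy as np

def depth_fidelity(cond_depth, gen_depth):
    """RMSE between the conditioning depth map and the depth re-estimated from
    the generated image, after min-max normalization. Lower means the generated
    image reproduced the intended depth layout more faithfully."""
    def normalize(d):
        d = d.astype(np.float64)
        return (d - d.min()) / (d.max() - d.min() + 1e-8)

    c, g = normalize(cond_depth), normalize(gen_depth)
    return float(np.sqrt(np.mean((c - g) ** 2)))
```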

Discussion

While CnC introduces an effective method for depth-aware image synthesis with precise control over placement and semantics, it has inherent limitations: the current framework handles only a limited number of conditions, and its spatial disentanglement is restricted mainly to a foreground and a background layer. Future work could extend the depth representation with intermediate spatial planes and combine the advantages of real-world datasets with user-preferred content generation.

Conclusion

Overall, CnC represents a significant advancement in the ability of AI to generate depth-aware images that accurately reflect the conditions derived from text, depth maps, and exemplar images. As the framework evolves, it holds the potential to transform content generation, making it an indispensable tool for creatives seeking to realize complex visual concepts grounded in three-dimensional reality.