Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis (2401.09048v1)

Published 17 Jan 2024 in cs.CV

Abstract: Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/


Summary

  • The paper introduces Compose and Conquer, a novel framework that combines local and global fusers for depth-aware image synthesis.
  • The local fuser utilizes depth disentanglement training with synthetic image triplets to accurately capture the Z-axis placement of objects.
  • The global fuser applies 'soft guidance' to inject global semantic styles into specific regions without extra localization cues; the combined framework outperforms baseline models in replicating depth and structural details.

Introduction

The world of generative AI has made significant strides, especially with the advent of text-conditional diffusion models. These models take a text prompt and generate a corresponding image by gradually refining noise into detailed visuals. As their popularity has grown, researchers have worked to enhance the precision with which these models can be controlled, and recent work supplies additional conditioning signals to improve the layout fidelity of the generated images. Yet two major challenges remain. First, existing methods cannot effectively represent three-dimensional object placement, often producing images that fail to reflect the depth-aware positioning of objects. Second, applying global semantic styles from multiple images to specified regions of the target image has proven difficult to control.

Methodology

In response to these challenges, the authors develop Compose and Conquer (CnC), a framework built around two components: a local fuser and a global fuser. The local fuser captures the Z-axis positioning of objects through depth disentanglement training (DDT), which uses synthetic image triplets to teach the model the 3D spatial relationship between foreground objects and their background. The global fuser relies on a technique called 'soft guidance,' which localizes global semantic conditions, confining their influence to specific regions without depending on explicit structural signals.
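
To make the idea of synthetic image triplets concrete, the sketch below assembles a (source, foreground, background) triplet from a single image and derives relative depth maps for the two layers. The helper functions (`segment_salient`, `inpaint`, `estimate_depth`) are toy placeholders, not the components used in the paper; treat this as an illustrative reading of depth disentanglement training rather than the authors' pipeline.

```python
import numpy as np

# Toy stand-ins for a salient-object segmenter, an inpainting model, and a
# monocular depth estimator, so the sketch runs end to end. The paper's actual
# components are not specified here.
def segment_salient(image):
    # Crude "saliency": mark pixels brighter than the image mean.
    gray = image.mean(axis=-1)
    return (gray > gray.mean()).astype(np.float32)

def inpaint(image, mask):
    # Fill masked pixels with the mean color of the unmasked region.
    filled = image.copy()
    mean_color = image[mask < 0.5].mean(axis=0)
    filled[mask >= 0.5] = mean_color
    return filled

def estimate_depth(image):
    # Fake relative depth: normalized luminance.
    gray = image.mean(axis=-1)
    return (gray - gray.min()) / (np.ptp(gray) + 1e-8)

def build_triplet(source):
    """Assemble a synthetic (source, foreground, background) triplet plus
    relative depth maps for the foreground and background layers."""
    mask = segment_salient(source)            # 1 where the salient object sits
    foreground = source * mask[..., None]     # the object on an empty canvas
    background = inpaint(source, mask)        # the scene with the object removed
    depth_fg = estimate_depth(foreground)     # depth condition for the object layer
    depth_bg = estimate_depth(background)     # depth condition for the scene layer
    return (source, foreground, background), (depth_fg, depth_bg)

# Example usage on a random image.
src = np.random.rand(256, 256, 3).astype(np.float32)
triplet, depth_maps = build_triplet(src)
```

The key point is that the model is exposed to both the object layer and the scene it occludes, which is what gives it a handle on relative depth.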

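Soft guidance is only summarized above; one way to picture it is as cross-attention between the spatial tokens of the denoising network and the global embeddings of an exemplar image, with the result confined to a target region. The snippet below is that interpretation only: the tensor shapes and the masking scheme are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def soft_guided_cross_attention(latent_tokens, exemplar_tokens, region_mask):
    """Cross-attention whose effect is confined to a target region (illustrative).

    latent_tokens:   (B, N, C) spatial tokens inside the denoising U-Net
    exemplar_tokens: (B, M, C) global embeddings of one exemplar image
    region_mask:     (B, N)    1 where that exemplar's semantics should apply
    """
    scale = latent_tokens.shape[-1] ** -0.5
    # Similarity between every spatial position and every exemplar token.
    attn = torch.einsum("bnc,bmc->bnm", latent_tokens, exemplar_tokens) * scale
    weights = attn.softmax(dim=-1)

    # Global semantics gathered from the exemplar at each spatial position.
    injected = torch.einsum("bnm,bmc->bnc", weights, exemplar_tokens)

    # Keep the injection only inside the target region, so the exemplar's
    # style does not leak into the rest of the image.
    return latent_tokens + injected * region_mask.unsqueeze(-1)
```

With two exemplars and complementary masks (for instance, masks derived from the foreground and background depth layers), each exemplar's global semantics can be bound to a different region, which matches the paper's goal of composing localized objects with different global styles.
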
Results

The CnC model combines the local and global fusers so that users can inject different global semantics into localized objects within an image, offering extensive creative control. It performs strongly in quantitative evaluations, surpassing baseline models on several metrics, particularly in reproducing depth perspectives and adhering to the structural information of the given conditions.
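
The paper's evaluation protocol is not reproduced here, but one simple way to probe "reproducing depth perspectives" is to re-estimate depth from a generated image and compare it with the depth map used as the condition. The snippet below is a toy version of such a check, not one of the paper's reported metrics.

```python
import numpy as np

def depth_fidelity(cond_depth, gen_depth):
    """RMSE between the conditioning depth map and the depth re-estimated from
    the generated image, after min-max normalization. Lower means the generated
    image reproduced the intended depth layout more faithfully."""
    def normalize(d):
        d = d.astype(np.float64)
        return (d - d.min()) / (d.max() - d.min() + 1e-8)

    c, g = normalize(cond_depth), normalize(gen_depth)
    return float(np.sqrt(np.mean((c - g) ** 2)))
```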

Discussion

While CnC introduces an effective method for depth-aware image synthesis with precise control over placement and semantics, it has inherent limitations: the current framework handles only a limited number of conditions, and its spatial disentanglement is restricted mainly to a foreground and a background layer. Future work could extend the depth representation with intermediate spatial planes and combine the advantages of real-world datasets with user-preferred content generation.

Conclusion

Overall, CnC represents a significant advancement in the ability of AI to generate depth-aware images that accurately reflect the conditions derived from text, depth maps, and exemplar images. As the framework evolves, it holds the potential to transform content generation, making it an indispensable tool for creatives seeking to realize complex visual concepts grounded in three-dimensional reality.