GLoD: Composing Global Contexts and Local Details in Image Generation (2404.15447v1)

Published 23 Apr 2024 in cs.CV and cs.AI

Abstract: Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) remains a significant challenge. Models often fail to understand complex descriptions involving multiple objects, attaching specified visual attributes to the wrong targets or ignoring them altogether. This paper presents Global-Local Diffusion (GLoD), a novel framework that allows simultaneous control over global contexts and local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes their noises to guide the denoising process of a pre-trained diffusion model. The framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other, unspecified identities. Quantitative and qualitative evaluations demonstrate that GLoD effectively generates complex images that adhere to both user-provided object interactions and object details.
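The abstract's core idea, composing noise predictions from a global prompt and per-layer local prompts during denoising, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the binary layer masks, and the classifier-free-guidance-style weighting are all assumptions for illustration.

```python
import numpy as np

def compose_noise(eps_uncond, eps_global, local_eps_masks, guidance_scale=7.5):
    """Combine an unconditional noise estimate with a global prompt's
    estimate and per-layer local estimates (illustrative sketch).

    eps_uncond, eps_global: noise predictions of shape (C, H, W).
    local_eps_masks: list of (eps_local, mask) pairs, where mask is a
        boolean array selecting the layer that a local prompt controls;
        masks of shape (1, H, W) broadcast over the channel axis.
    """
    # Guidance direction from the global prompt (layouts, interactions).
    delta = eps_global - eps_uncond
    # Inside each local layer, replace the guidance with that layer's
    # own prompt direction (colors, emotions), leaving the rest intact.
    for eps_local, mask in local_eps_masks:
        delta = np.where(mask, eps_local - eps_uncond, delta)
    # Classifier-free-guidance-style combination of the composed noise.
    return eps_uncond + guidance_scale * delta
```

In an actual pipeline, each `eps_*` would come from one forward pass of the pre-trained diffusion model conditioned on the corresponding prompt, and the composed noise would drive the next denoising step; outside every local mask the result reduces to ordinary global-prompt guidance.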

Authors (1)
  1. Moyuru Yamada (7 papers)
Citations (1)