Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
The paper introduces LayouSyn, a novel approach to text-to-layout generation for natural scenes that leverages diffusion Transformers and lightweight, open-source LLMs. It addresses a limitation of existing scene layout generation methods, which rely primarily on closed-vocabulary models or proprietary LLMs; both choices restrict modeling capability and limit applicability in controllable image generation, where users need explicit control over the spatial arrangement and number of objects in the generated images.
The Approach
LayouSyn advances open-vocabulary text-to-layout generation through a two-stage framework. The first stage uses an open-source LLM to extract scene elements from the text prompt, eschewing reliance on proprietary LLMs and thereby increasing transparency and accessibility. The second stage is an aspect-aware diffusion Transformer architecture tailored for conditional layout generation: conditioned on the extracted object descriptions, it denoises a set of bounding boxes into the final scene layout.
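The two-stage pipeline can be sketched as follows. This is a minimal illustration, not the paper's actual API: the function names, the fixed LLM output, and the placeholder box sampling are all assumptions standing in for the real LLM query and diffusion denoising loop.

```python
# Hypothetical sketch of a two-stage text-to-layout pipeline in the spirit of
# LayouSyn. All names and logic here are illustrative placeholders.
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Layout:
    labels: List[str]                                # open-vocabulary object names
    boxes: List[Tuple[float, float, float, float]]   # normalized (x, y, w, h)

def extract_objects(prompt: str) -> List[str]:
    """Stage 1: an open-source LLM parses the prompt into object phrases.

    A real system would query the LLM; here we return a fixed example."""
    return ["dog", "frisbee", "tree"]

def generate_layout(labels: List[str], aspect_ratio: float) -> Layout:
    """Stage 2: a diffusion Transformer would iteratively denoise random
    boxes conditioned on the labels and target aspect ratio; this placeholder
    just samples plausible normalized boxes."""
    rng = random.Random(0)
    boxes = [(rng.random() * 0.5, rng.random() * 0.5, 0.3, 0.3) for _ in labels]
    return Layout(labels=labels, boxes=boxes)

prompt = "a dog chasing a frisbee under a tree"
labels = extract_objects(prompt)
layout = generate_layout(labels, aspect_ratio=16 / 9)
print(layout.labels)
```

Keeping the two stages decoupled is what lets the LLM stay small: it only needs to name objects, while all geometric reasoning is delegated to the diffusion model.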
Numerical and Spatial Reasoning Capabilities
Extensive experiments validate LayouSyn's ability to generate semantically and geometrically plausible layouts. Notably, it achieves state-of-the-art performance on spatial and numerical reasoning benchmarks, generating layouts that adhere to the spatial constraints and object counts specified in the prompts. This capability is crucial for automated image editing and other applications requiring precise placement of scene elements.
Comparison with Existing Methods
LayouSyn is evaluated on several benchmarks, including the NSR-1K and COCO-GR datasets, and shows improved layout quality. The approach also applies a scaling factor to the noise schedule so that information is destroyed more gradually during the early diffusion steps; this refinement helps balance the generation of realistic bounding boxes against adherence to scene semantics.
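The effect of such a scaling factor can be seen in the signal-to-noise ratio of the forward process. The sketch below is an illustration of the general idea, not the paper's exact formulation: it assumes a standard cosine schedule and shows that scaling the clean layout by a factor before noising raises the SNR at every timestep, so early steps destroy less information.

```python
# Illustrative: scaling the clean data x0 by `scale` before the DDPM forward
# process x_t = sqrt(ab) * scale * x0 + sqrt(1 - ab) * eps multiplies the
# signal-to-noise ratio by scale**2. Not the paper's exact formulation.
import math

def alpha_bar(t: float) -> float:
    """Cosine cumulative noise schedule (Nichol & Dhariwal, 2021), t in [0, 1]."""
    return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

def snr(t: float, scale: float = 1.0) -> float:
    """SNR of x_t when x0 is pre-scaled by `scale`."""
    ab = alpha_bar(t)
    return (scale ** 2) * ab / (1.0 - ab)

# A scale > 1 keeps more signal at the same timestep:
t = 0.5
assert snr(t, scale=2.0) > snr(t, scale=1.0)
print(round(snr(t, 1.0), 3), round(snr(t, 2.0), 3))
```

Because bounding-box coordinates live in a narrow range compared to image pixels, adjusting where signal is destroyed along the schedule is a natural knob for layout diffusion.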
Key Findings and Implications
- Open-Source LLM Utilization: The research shows that smaller, open-source LLMs can effectively extract object descriptions from text prompts. This suggests open-source models could see large-scale adoption in scene layout generation, improving the transparency and cost-effectiveness of such applications.
- Flexible Aspect Ratio Generation: By conditioning on the aspect ratio during layout generation, LayouSyn demonstrates flexibility in adapting to various layout dimensions. This capability supports diverse applications where aspect ratio has semantic relevance, such as advertisement and media content creation.
- LLM Integration for Enhanced Results: The diffusion model can also refine layouts initialized by an LLM, demonstrating that LayouSyn and LLMs can be combined to improve layout generation efficiency and fidelity.
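The LLM-initialization idea in the last point can be sketched as a partial-noising scheme (in the style of SDEdit): start from the LLM's coarse layout, add a moderate amount of noise, and denoise from an intermediate timestep. Everything below is a hypothetical illustration; the function names, the fixed LLM proposal, and the clamp-based "denoiser" are assumptions, not the paper's method.

```python
# Hypothetical sketch: refine an LLM-proposed layout via partial noising.
# A real refiner would run the diffusion Transformer from an intermediate
# timestep; the names and logic here are illustrative placeholders.
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x, y, w, h)

def llm_initial_layout(prompt: str) -> List[Box]:
    """Pretend an LLM proposed coarse boxes for the prompt (fixed example)."""
    return [(0.1, 0.1, 0.4, 0.4), (0.6, 0.5, 0.3, 0.3)]

def refine(boxes: List[Box], noise_level: float = 0.3, seed: int = 0) -> List[Box]:
    """Perturb the boxes as if noised to an intermediate timestep, then
    'denoise'. The denoising step is a placeholder that clamps to [0, 1]."""
    rng = random.Random(seed)
    noised = [tuple(v + rng.gauss(0.0, noise_level * 0.1) for v in b) for b in boxes]
    return [tuple(min(max(v, 0.0), 1.0) for v in b) for b in noised]

initial = llm_initial_layout("a cat on a sofa")
refined = refine(initial)
assert len(refined) == len(initial)
```

The appeal of this hybrid is that the LLM supplies coarse semantic placement cheaply, while the diffusion model contributes geometric plausibility that LLMs alone tend to lack.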
Future Directions
The potential for LayouSyn in practical applications is vast, with suggested exploration into additional geometric constraints, such as depth maps, for enhanced occlusion handling. Moreover, its versatility in generating complex natural scene layouts paves the way for cross-disciplinary applications in fields such as virtual environment creation, autonomous navigation systems, and dynamic scene generation for visual storytelling.
In summary, LayouSyn delivers significant advances in natural scene layout generation, overcoming limitations of existing models and enabling transformative applications across AI-driven creative and analytical fields. The research opens further avenues for refining controllable image generation systems, with promising implications for both theoretical advances and practical deployments across industries.