- The paper introduces BeyondScene, which overcomes token limitations and low-res artifacts with a staged, hierarchical diffusion strategy.
- It employs detailed base image generation and instance-aware refinement to integrate human figures seamlessly into high-resolution scenes.
- Experimental validation confirms that BeyondScene outperforms prior methods in high-resolution, human-centric scene generation quality.
BeyondScene: Enhanced High-Resolution Human-Centric Scene Generation Leveraging Pretrained Diffusion Models
Introduction to BeyondScene Framework
The paper introduces "BeyondScene," a novel framework that addresses the challenges of generating high-resolution human-centric scenes with existing text-to-image (T2I) diffusion models. BeyondScene overcomes two constraints common to pretrained models (limited training image size and the restricted token capacity of text encoders) that hamper the production of detailed, complex scenes with multiple human figures. The method follows a staged, hierarchical approach that parallels an artist's workflow of building a detailed base and then refining it gradually, enabling scene generation at resolutions beyond 8K.
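The staged structure described above can be outlined as a simple control loop: build a detailed base image, then enlarge it hierarchically until the target resolution is reached. This is only an illustrative skeleton; the helper names (`generate_base_image`, `instance_aware_enlarge`) are hypothetical stand-ins, not the paper's API.

```python
def generate_base_image(prompt: str, res: int) -> dict:
    """Hypothetical stand-in for stage 1: detailed base image generation."""
    return {"prompt": prompt, "res": res}

def instance_aware_enlarge(image: dict, res: int) -> dict:
    """Hypothetical stand-in for one hierarchical enlargement stage."""
    return {**image, "res": res}

def beyondscene_pipeline(prompt: str, target_res: int = 8192,
                         base_res: int = 1024) -> dict:
    """Staged sketch: create a detailed base, then enlarge it
    hierarchically, doubling the resolution at each stage until the
    target (beyond 8K) is reached."""
    image = generate_base_image(prompt, base_res)
    res = base_res
    while res < target_res:
        res = min(res * 2, target_res)
        image = instance_aware_enlarge(image, res)
    return image
```

The doubling schedule here is a common choice for coarse-to-fine pipelines; the paper's actual stage sizes may differ.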
Core Challenges in Prior Models
Prior methods in T2I diffusion faced significant hurdles:
- Limited Resolution and Detail: Previous approaches were confined to their training image sizes, producing artifacts when outputs were scaled up.
- Text Encoder Capacity: Conventional models were impeded by the restrictive token counts in text encoders, limiting the complexity and detail that could be incorporated into scene descriptions.
- Generation of Human-Centric Details: Accurate human figure generation, including pose, anatomical fidelity, and multiple instances, was inadequately addressed, often yielding distorted or duplicated figures.
BeyondScene's Methodological Innovations
Detailed Base Image Generation
BeyondScene initially constructs a detailed base image that focuses on key elements:
- Detailed Instance Representation: The framework first generates detailed representations for human figures, overcoming token limitations by focusing on individual elements.
- Seamless Integration: Following the creation of human figures, these are integrated into a coherent scene where background and foreground elements are blended using advanced inpainting techniques.
- Tone Normalization: This process ensures consistency in style and lighting across the composited scene, which is crucial for natural appearance in the final high-resolution output.
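The tone normalization step can be illustrated with a simple channel-wise statistics match that pulls a pasted instance's colors toward the surrounding scene. This is a minimal sketch of the idea; the paper's actual normalization procedure is not reproduced here.

```python
import numpy as np

def normalize_tone(instance: np.ndarray, scene: np.ndarray) -> np.ndarray:
    """Match the per-channel mean/std of a composited instance crop to
    the surrounding scene, a simple stand-in for tone normalization so
    style and lighting stay consistent across the composite."""
    out = instance.astype(np.float64).copy()
    for c in range(out.shape[-1]):
        src = out[..., c]
        ref = scene[..., c].astype(np.float64)
        src_std = src.std() or 1.0  # guard against flat regions
        out[..., c] = (src - src.mean()) / src_std * ref.std() + ref.mean()
    return np.clip(out, 0, 255).astype(np.uint8)
```

A full histogram match (e.g. per-channel CDF matching) would be a natural upgrade over this mean/std version.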
Instance-Aware Hierarchical Enlargement
The transition from base image to high-resolution depiction involves:
- High Frequency-injected Forward Diffusion: During enlargement, this technique injects high-frequency detail into the noised, upsampled image so that textures and edges are not lost or blurred, maintaining fidelity at each stage.
- Adaptive Joint Diffusion: By adaptively adjusting the joint diffusion process according to content, for example applying finer-grained processing over human figures, the framework preserves the details that define realistic, natural human-centric scenes.
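The first bullet above can be sketched as: upsample the base image, split off its high-frequency residual, and re-inject a weighted copy of that residual while applying the standard forward-diffusion noising. The Gaussian blur used as a frequency splitter and the `hf_weight` parameter are illustrative assumptions, not the paper's exact filter or schedule.

```python
import numpy as np

def gaussian_blur(x: np.ndarray, sigma: float = 1.5) -> np.ndarray:
    """Separable Gaussian blur used to split low/high frequencies."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    k /= k.sum()
    for axis in (0, 1):
        x = np.apply_along_axis(np.convolve, axis, x, k, mode="same")
    return x

def hf_injected_forward_diffusion(x_up: np.ndarray, alpha_bar: float,
                                  hf_weight: float = 0.5, seed: int = 0):
    """Noise an upsampled image to a diffusion timestep (alpha_bar is the
    cumulative noise-schedule product) while adding back a weighted
    high-frequency residual, so denoising does not start from an
    over-smoothed input."""
    rng = np.random.default_rng(seed)
    high = x_up - gaussian_blur(x_up)       # high-frequency residual
    eps = rng.standard_normal(x_up.shape)   # forward-diffusion noise
    x_detail = x_up + hf_weight * high      # inject extra detail
    return np.sqrt(alpha_bar) * x_detail + np.sqrt(1 - alpha_bar) * eps
```

With `alpha_bar = 1.0` (no noise), the function reduces to a pure detail-sharpening step, which makes the injection term easy to inspect in isolation.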
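The adaptive joint diffusion bullet can be illustrated with an overlapping-window fusion step: each window is denoised independently and the results are averaged, with a finer window stride (more overlapping views) over regions flagged by a human mask. This is a minimal sketch of content-adaptive joint diffusion; the stride values, the mask test, and `denoise_fn` are illustrative assumptions.

```python
import numpy as np

def joint_diffusion_step(latent, denoise_fn, mask,
                         window=8, coarse_stride=8, fine_stride=4):
    """One fused denoising step over overlapping windows. Windows that
    overlap the human mask use a finer stride (more overlapping views),
    mimicking the adaptive allocation of processing to human figures.
    Pixels covered by no window are left at zero in this sketch."""
    h, w = latent.shape[:2]
    out = np.zeros_like(latent)
    count = np.zeros(latent.shape[:2] + (1,) * (latent.ndim - 2))

    def views(stride):
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                yield y, x

    # Coarse pass over background windows, fine pass over human windows.
    for stride, want_human in ((coarse_stride, False), (fine_stride, True)):
        for y, x in views(stride):
            on_human = bool(mask[y:y + window, x:x + window].any())
            if on_human != want_human:
                continue
            out[y:y + window, x:x + window] += denoise_fn(
                latent[y:y + window, x:x + window])
            count[y:y + window, x:x + window] += 1
    return out / np.maximum(count, 1)  # average the overlapping views
```

In a real pipeline `denoise_fn` would be one step of a pretrained diffusion U-Net; here any per-window function demonstrates the fusion logic.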
Experimental Validation and Results
The effectiveness of BeyondScene is demonstrated through rigorous evaluation against existing state-of-the-art models. The proposed method handles higher resolutions without loss of detail and significantly improves the correspondence between generated images and detailed text descriptions. Qualitative comparisons and user studies further show that BeyondScene consistently outperforms other approaches in producing realistic, natural-looking images.
Implications and Future Directions
BeyondScene sets a new standard for the generation of high-resolution human-centric images in the field of generative AI, particularly within the constraints of pretrained diffusion models. The staged approach of first creating a detailed base and then elaborately enhancing it offers a promising direction for generating complex scenes with multiple instances and interactions. Future research could explore extending this framework to other forms of media content generation or improving the efficiency of the hierarchical enlargement process for real-time applications.
In conclusion, BeyondScene provides a robust framework for advancing the capabilities of text-to-image generation models, pushing the boundaries of resolution, detail, and naturalness in digital image creation.