Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors (2203.13131v1)

Published 24 Mar 2022 in cs.CV, cs.AI, cs.CL, cs.GR, and cs.LG

Abstract: Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unanswered, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels, significantly improving visual quality. Through scene controllability, we introduce several new capabilities: (i) Scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story we wrote.

Summary

The paper proposes a scene-based model that improves controllability by integrating scene layouts with text prompts.
The paper leverages human priors to refine tokenization, ensuring generated images align closely with human visual perceptions.
The paper demonstrates state-of-the-art high-resolution outputs (512x512) with superior FID scores and strong qualitative evaluations.

The paper "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors" addresses three limitations of contemporary text-to-image generation methods: weak user control over the result, poor rendering of the regions that matter most to human viewers, and limited output quality. The authors introduce a system that improves controllability, alignment with human perception, and fidelity for image synthesis from natural language inputs.

Key Contributions

The authors propose a text-to-image generation model marked by three primary innovations:

Scene-Based Controllability: The method lets users supply a scene layout alongside the text, giving them control over the structure and arrangement of image elements. With text alone, the user has little influence over composition, and the layout of the result is largely left to chance.
Human Perception Alignment: By incorporating domain-specific knowledge, particularly regarding facial features and salient objects, the model enhances the tokenization process. These adjustments ensure that generated images align more closely with human perceptual priorities.
High-Resolution Output: The model generates images at $512 \times 512$ pixels while reporting state-of-the-art fidelity, whereas earlier text-to-image models typically produced $256 \times 256$ outputs.

Methodological Framework

The model uses an autoregressive transformer to synthesize images from text and an optional scene input. Segmentation maps are tokenized and supplied as conditioning, so the scene acts as an implicit guide rather than the explicit constraint typical of GAN-based models. This leaves the model free to produce diverse yet structured outputs.
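As a rough illustration of conditioning on text plus an optional scene, the sketch below builds a conditioning prefix and then generates image tokens one at a time. It is not the authors' implementation: the function names, the use of a padding ID when no scene is given, and the `next_token_fn` callback that stands in for the transformer are all assumptions made for the example.

```python
from typing import Callable, List, Optional


def build_prefix(text_tokens: List[int],
                 scene_tokens: Optional[List[int]],
                 scene_len: int,
                 scene_pad_id: int) -> List[int]:
    """Return the conditioning prefix: text tokens followed by scene tokens.

    When no scene is supplied, the scene slot is filled with a padding ID
    (an assumption of this sketch) so the prefix length stays fixed.
    """
    if scene_tokens is None:
        scene_part = [scene_pad_id] * scene_len
    else:
        if len(scene_tokens) != scene_len:
            raise ValueError("scene_tokens must have exactly scene_len entries")
        scene_part = list(scene_tokens)
    return list(text_tokens) + scene_part


def generate(prefix: List[int],
             num_image_tokens: int,
             next_token_fn: Callable[[List[int]], int]) -> List[int]:
    """Autoregressively extend the prefix and return only the image tokens.

    next_token_fn is a hypothetical stand-in for the transformer: it sees
    the whole sequence so far and returns the next token ID.
    """
    sequence = list(prefix)
    for _ in range(num_image_tokens):
        sequence.append(next_token_fn(sequence))
    return sequence[len(prefix):]
```

Because the scene tokens are only part of the context the model attends to, nothing in this setup forces the generated image tokens to match the layout exactly, which is the sense in which the conditioning is implicit.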

Key elements of the methodology include:

Scene Representation and Tokenization: A modified VQ-VAE processes panoptic, human, and face segmentation maps, enabling the model to honor both scene layout and text inputs.
Emphasis on Human-Centric Features: The model employs explicit loss functions targeting facial and object saliency, guided by pre-trained networks, to ensure critical image regions conform to human perceptual expectations.
Classifier-Free Guidance Adaptation: Extending guidance techniques common in diffusion models to the transformer context facilitates higher fidelity image generations without post-processing filtering.
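
The guidance step in the last item can be sketched for a single next-token prediction. This is a minimal example using the standard classifier-free guidance rule applied to logits, not the authors' code: the function names and the `guidance_scale` parameter are chosen for illustration, and the unconditional logits are assumed to come from running the same transformer with the text prompt replaced by an empty or padded prompt.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def guided_logits(cond_logits: np.ndarray,
                  uncond_logits: np.ndarray,
                  guidance_scale: float) -> np.ndarray:
    """Move the unconditional logits toward the conditional ones.

    guidance_scale = 1 reproduces the conditional logits, 0 reproduces the
    unconditional logits, and values above 1 amplify the effect of the text.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)


def sample_next_token(cond_logits: np.ndarray,
                      uncond_logits: np.ndarray,
                      guidance_scale: float,
                      rng: np.random.Generator) -> int:
    """Sample one token ID from the guided distribution."""
    probs = softmax(guided_logits(cond_logits, uncond_logits, guidance_scale))
    return int(rng.choice(len(probs), p=probs))
```

In an autoregressive setting this blend is applied at every decoding step, so each step needs one conditional and one unconditional forward pass.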

Experimental Results

The Make-A-Scene model demonstrates superior performance on quality and alignment benchmarks, notably achieving the lowest reported FID scores in both filtered and unfiltered settings. Human evaluation data corroborates these findings, showing significant preference for Make-A-Scene over baselines like DALL-E and CogView in qualitative aspects.
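
For readers unfamiliar with the metric, FID (Fréchet Inception Distance) compares Gaussian fits to Inception-network features of real and generated images, and lower values are better. The notation here is not from the paper: $\mu_r, \Sigma_r$ denote the feature mean and covariance for real images and $\mu_g, \Sigma_g$ those for generated images.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$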

The model's capacity for generating out-of-distribution exemplars and maintaining scene consistency establishes a pathway for story illustration applications, presenting new creative potential by allowing users to sketch scenes alongside textual descriptions.

Implications and Future Directions

The research highlights the value of integrating human-centric priors into generative models, both for technical performance and for user experience. Future work may focus on refining the segmentation methods or expanding the range of editable elements to encourage broader adoption in creative industries.

The added controllability and quality are promising footholds for further developments in human-computer interaction and creativity facilitation. Potential extensions could explore dynamic scene modifications in real-time or interactive storytelling environments.

In conclusion, Make-A-Scene sets a new standard for controllable text-to-image generation and contributes to the broader discussion of how machines interpret language and recreate it as visual content.
