
Generative Photomontage (2408.07116v3)

Published 13 Aug 2024 in cs.CV and cs.GR

Abstract: Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll and makes it difficult to achieve a single image that captures everything a user wants. In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images, in essence forming a Generative Photomontage. Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. Our method faithfully preserves the user-selected regions while compositing them harmoniously. We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines.

Summary

  • The paper introduces a user-guided framework that employs graph cut segmentation in diffusion feature space to integrate multiple ControlNet outputs seamlessly.
  • It uses a novel feature-space blending technique to harmonize composited image regions, minimizing visible seam artifacts and preserving local realism.
  • Quantitative evaluations and a user study show higher PSNR and lower masked LPIPS than competing approaches, affirming its effectiveness over existing image blending methods and baselines.

Generative Photomontage: A User-Guided Framework for Compositing Imagery

Authors: Sean J. Liu, Nupur Kumari, Ariel Shamir, Jun-Yan Zhu

Abstract

The paper "Generative Photomontage" presents a novel approach to fine-grained image control through compositing multiple outputs from ControlNet, enhancing user ability to create desired imagery where single-text-to-image outputs often fall short. This framework introduces a brush stroke interface allowing users to segment and blend preferred regions from a stack of ControlNet-generated images. Utilizing a graph-based optimization within diffusion feature space and a feature-space blending technique, this method seamlessly harmonizes user-selected image regions, affording greater control over image synthesis.

Introduction

Text-to-image diffusion models, especially when paired with spatial conditioning methods such as ControlNet, can generate high-quality images from textual prompts. However, due to inherent ambiguities in the mapping from text to pixel space, a single generated image often fails to fully meet user expectations, and because generation is probabilistic, repeated single-shot sampling suffers from the same limitation. The present paper advocates an alternative strategy: treating generated images as intermediate results and giving users the flexibility to composite different parts into the desired final image, an approach the authors call Generative Photomontage.
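As a concrete illustration of the starting point for this workflow, a stack of candidate images can be produced with ControlNet under a fixed input condition and varying seeds. The following is a minimal sketch using the Hugging Face diffusers library; the checkpoint names, conditioning image, and prompt are illustrative and not taken from the paper.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Checkpoint names and the conditioning image path are illustrative.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

condition = load_image("edge_map.png")   # the shared input condition
prompt = "a modern glass house on a hillside"

# Same prompt and condition, different seeds -> a stack of candidate images.
stack = []
for seed in range(8):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(prompt, image=condition, generator=generator,
                 num_inference_steps=30).images[0]
    stack.append(image)
```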

Methodology

Segmentation via Graph Cut in Feature Space

The core technical contribution is leveraging diffusion features for graph cut-based segmentation. The method starts with generating a stack of images using the same prompt but different seeds via ControlNet. Users then mark desired regions using brush strokes. A multi-label graph cut algorithm optimizes the segmentation based on diffusion features, ensuring that selected image regions conform to user inputs while minimizing visible seams.
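To make this step concrete, the sketch below illustrates the kind of energy such an optimization minimizes: a unary term that pins brushed pixels to their chosen image and a pairwise term that places seams where the source images' diffusion features agree. A simple iterated-conditional-modes loop stands in for the paper's multi-label graph-cut solver, so the formulation is illustrative rather than the authors' exact method.

```python
import numpy as np

def composite_labels(features, strokes, lam=10.0, n_iters=20):
    """Assign each spatial location to one image in the generated stack.

    features: (N, H, W, C) diffusion features for the N generated images.
    strokes:  (H, W) int array; strokes[y, x] = i if the user brushed pixel
              (y, x) on image i, and -1 where no stroke was drawn.
    Returns an (H, W) label map.

    Note: the paper solves a multi-label graph cut; the ICM loop below is a
    simple stand-in solver with illustrative energy terms.
    """
    n, h, w, _ = features.shape
    brushed = strokes >= 0

    # Unary term: brushed pixels strongly prefer the image they were drawn on.
    unary = np.zeros((h, w, n))
    unary[brushed] = lam
    unary[brushed, strokes[brushed]] = 0.0

    labels = np.where(brushed, strokes, 0)

    def seam_cost(p, q, lp, lq):
        # Pairwise term: seams are cheap where source features are similar.
        if lp == lq:
            return 0.0
        return float(np.linalg.norm(features[lp][p] - features[lq][q]))

    for _ in range(n_iters):
        changed = False
        for y in range(h):
            for x in range(w):
                costs = unary[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        for l in range(n):
                            costs[l] += seam_cost((y, x), (ny, nx), l, labels[ny, nx])
                best = int(np.argmin(costs))
                if best != labels[y, x]:
                    labels[y, x], changed = best, True
        if not changed:
            break
    return labels
```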

Feature-Space Blending

Once segmentation is achieved, a novel composite rendering technique blends these regions. This is accomplished by injecting composite self-attention features ($Q^{\text{comp}}$, $K^{\text{comp}}$, $V^{\text{comp}}$) into ControlNet's U-Net layers during the denoising process. This feature-space blending enables superior harmonization of selected regions, preserving local details more effectively compared to traditional pixel-space methods.
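The toy sketch below illustrates the idea in plain PyTorch: per-image self-attention features are composited according to a label map before attention is computed. It is a simplified stand-in, not the authors' implementation; in practice the features would be captured from and injected back into ControlNet's U-Net self-attention layers during denoising.

```python
import torch
import torch.nn.functional as F

def composite_self_attention(q_stack, k_stack, v_stack, label_map):
    """Toy illustration of blending self-attention features from an image stack.

    q_stack, k_stack, v_stack: (N, L, D) queries/keys/values captured for each
    of the N generated images at one self-attention layer (L = H*W tokens).
    label_map: (L,) long tensor assigning each token to its selected source.

    Each composite feature takes a token's Q/K/V from its selected image, and
    attention is then computed over the composite, loosely mirroring how the
    paper injects Q^comp, K^comp, V^comp. The real method's injection points
    inside ControlNet's U-Net are not reproduced here.
    """
    tokens = torch.arange(q_stack.shape[1])
    q_comp = q_stack[label_map, tokens]                     # (L, D)
    k_comp = k_stack[label_map, tokens]
    v_comp = v_stack[label_map, tokens]
    attn = F.softmax(q_comp @ k_comp.T / q_comp.shape[-1] ** 0.5, dim=-1)
    return attn @ v_comp                                    # blended features, (L, D)

# Example with random features: 4 source images, an 8x8 token grid, dim 64.
n, h, w, d = 4, 8, 8, 64
q, k, v = (torch.randn(n, h * w, d) for _ in range(3))
labels = torch.randint(0, n, (h * w,))
blended = composite_self_attention(q, k, v, labels)         # shape (64, 64)
```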

Results

The authors showcase the application's versatility across three primary domains:

  1. Appearance Mixing: Generating creative images by combining distinct attributes from multiple images. Examples include architectural design exploration and generating variants by mixing different colors and textures (e.g., combining different feather colors in birds).
  2. Shape and Artifacts Correction: Correcting undesired shapes and artifacts in generated images. For instance, adjusting object outlines or rectifying structural anomalies by replacing erroneous regions with more suitable segments from other images.
  3. Prompt Alignment: Leveraging the capability to align the final composite more closely with complex prompts by integrating simpler generated segments.

Evaluation

Quantitative assessments underscore the method's effectiveness. The authors report higher PSNR and lower masked LPIPS values than the baselines, indicating superior performance in preserving local realism and minimizing seam artifacts. Additionally, a user study further corroborates the method's blending efficacy and realism, affirming its advantage over existing baselines such as Interactive Digital Photomontage, Blended Latent Diffusion, and MasaCtrl+ControlNet.
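For reference, masked variants of these metrics can be computed along the following lines. This is a minimal sketch of the general idea using the lpips package; the paper's exact evaluation protocol (mask definitions, resolutions, normalization) is not reproduced here.

```python
import numpy as np
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex", spatial=True)   # per-pixel LPIPS distance map

def masked_metrics(result, reference, mask):
    """PSNR and LPIPS restricted to a region of interest.

    result, reference: (H, W, 3) uint8 arrays; mask: (H, W) boolean array
    marking the region of interest (e.g., a user-selected segment). The exact
    masking used in the paper's evaluation may differ from this sketch.
    """
    a = result[mask].astype(np.float64)
    b = reference[mask].astype(np.float64)
    psnr = 10 * np.log10(255.0 ** 2 / np.mean((a - b) ** 2))

    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    dist_map = lpips_fn(to_tensor(result), to_tensor(reference))   # (1, 1, H, W)
    masked_lpips = dist_map[0, 0][torch.from_numpy(mask)].mean().item()
    return psnr, masked_lpips
```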

Discussion

The research has significant practical implications, providing an enhanced user experience in image synthesis through increased specificity and control. Theoretically, it furthers the understanding of how diffusion models' intermediate representations can be effectively harnessed for post-generation refinement.

Future Work

Future investigations could focus on automating aspects of user intervention, improving spatial consistency across diverse image stacks, and expanding this compositing technique to video frames for dynamic content generation.

Conclusion

"Generative Photomontage" offers a compelling advancement in user-guided image creation, combining robust segmentation and seamless blending in diffusion space. This nuanced interplay of user input and model output paves the way for more precise and creative image synthesis applications, reflecting a significant step forward in the domain of computer-generated imagery.
