
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model (2407.16982v1)

Published 24 Jul 2024 in cs.CV and cs.AI

Abstract: This paper addresses an important problem of object addition for images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve the background consistency or involve cumbersome human intervention in specifying bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. To this end, we curate OABench, an exquisite synthetic dataset by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and object descriptions. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree uniquely predicts the position of the new object and achieves object addition with guidance from only text. Extensive experiments demonstrate that Diffree excels in adding new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.


Summary

  • The paper introduces a novel diffusion framework that guides object inpainting using text alone without the need for explicit masks.
  • It incorporates an Object Mask Predictor and the OABench dataset of 74K tuples to maintain consistent image context during object addition.
  • Diffree significantly outperforms prior methods, achieving over 98% success with superior spatial and background consistency metrics.

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

In "Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model," the authors present a novel approach to object addition in images using solely text guidance. This paper addresses the challenge of integrating new objects into existing images in a way that maintains consistency in visual context, such as lighting, texture, and spatial location. Existing methods either disrupt the image's background or require cumbersome human-drawn masks. Diffree introduces a diffusion model that predicts the new object's position and integrates it into the image without altering the background undesirably.

Methodology Overview

Diffree extends the capabilities of Text-to-Image (T2I) models by incorporating an object mask predictor module. The authors curated a new dataset, Object Addition Benchmark (OABench), which contains 74K real-world tuples, including original images, inpainted images with objects removed, object masks, and object descriptions. OABench is generated using advanced image inpainting techniques to remove objects from images, capturing the relationship between objects and their context effectively.
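
To make the data layout concrete, here is a minimal sketch of a single OABench tuple; the field names are illustrative rather than the dataset's actual schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OABenchTuple:
    # Field names are illustrative; the released dataset may use a
    # different schema.
    original_image: np.ndarray    # H x W x 3, scene containing the object
    inpainted_image: np.ndarray   # H x W x 3, same scene, object removed
    object_mask: np.ndarray       # H x W binary mask of the removed object
    object_description: str       # e.g. "a red umbrella on the table"
```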

The Diffree framework integrates a Stable Diffusion model, augmented with an Object Mask Predictor (OMP) module to achieve text-guided object addition without the need for explicit masks. The diffusion model iteratively denoises latents to generate object masks and subsequently inpaint the specified regions according to textual descriptions. The model is trained using custom loss functions tailored for the diffusion and OMP modules, optimizing the generation of contextually appropriate and visually consistent outputs.
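
The paper's exact losses and conditioning are not reproduced in this summary. As a rough sketch, assuming InstructPix2Pix-style channel concatenation of the object-free image latents and a simple weighted sum of the denoising and mask objectives (the `omp` module and the `lambda_mask` weight are placeholders), one training step might look like this:

```python
import torch
import torch.nn.functional as F

def diffree_training_step(unet, omp, latents, cond_latents, text_emb,
                          target_mask, scheduler, lambda_mask=1.0):
    # `latents`: VAE latents of the original image (object present).
    # `cond_latents`: latents of the object-free inpainted image.
    # `omp` and `lambda_mask` stand in for the paper's OMP module and
    # loss weighting, which may differ in the actual implementation.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)

    # Assumed conditioning: concatenate the object-free latents along
    # the channel dimension, as in InstructPix2Pix.
    unet_in = torch.cat([noisy, cond_latents], dim=1)
    noise_pred = unet(unet_in, t, encoder_hidden_states=text_emb).sample
    loss_diffusion = F.mse_loss(noise_pred, noise)

    # The OMP head predicts the binary object mask from the same inputs.
    mask_logits = omp(unet_in, t, text_emb)
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, target_mask)

    return loss_diffusion + lambda_mask * loss_mask
```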

Dataset and Training

OABench's construction leverages existing instance segmentation datasets (e.g., COCO), applying rules to filter and synthesize high-quality training tuples. The process ensures the generated inpainted images retain high background consistency, critical for training the Diffree model.
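
The paper's specific filtering rules are not restated here; a hypothetical rule over COCO-style instance annotations could look like the following, with all thresholds chosen purely for illustration:

```python
def keep_instance(ann, img_w, img_h, min_area_ratio=0.01, max_area_ratio=0.4):
    # Illustrative filter over a COCO-style annotation dict; thresholds
    # and criteria are hypothetical, not the paper's. The intent: keep
    # objects that are neither tiny specks nor scene-dominating, and
    # that sit fully inside the frame, so removing them by inpainting
    # leaves a plausible object-free background.
    x, y, w, h = ann["bbox"]
    area_ratio = ann["area"] / float(img_w * img_h)
    fully_inside = x >= 0 and y >= 0 and x + w <= img_w and y + h <= img_h
    return fully_inside and min_area_ratio <= area_ratio <= max_area_ratio
```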

The model is trained using Stable Diffusion 1.5 weights, with optimizations in batch sizes and learning rates to accommodate the unique demands of text-guided object addition. The training incorporates classifier-free guidance, blending conditional and unconditional diffusion models to balance sample quality and diversity effectively.
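
Classifier-free guidance follows the standard recipe: the final noise estimate extrapolates from the unconditional prediction toward the text-conditional one. A minimal sketch, assuming a diffusers-style UNet interface:

```python
import torch

def cfg_noise_prediction(unet, noisy_latents, t, text_emb, null_emb,
                         guidance_scale=7.5):
    # Standard classifier-free guidance; guidance_scale=7.5 is a common
    # Stable Diffusion default, used here only as an illustrative value.
    eps_cond = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    eps_uncond = unet(noisy_latents, t, encoder_hidden_states=null_emb).sample
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Guidance scales above 1 sharpen prompt adherence at some cost to sample diversity, which is the quality-diversity balance noted above.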

Evaluation Metrics

Traditional metrics are insufficient to evaluate this task comprehensively. Instead, the authors introduce a set of evaluation rules leveraging LPIPS for background consistency, GPT4V scores for the reasonableness of object location, Local CLIP Score for text-image correlation, and Local FID for object quality and diversity. A unified metric aggregates these scores with success rates to assess overall performance comprehensively.
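
The paper's aggregation formula is not reproduced here; purely as an illustration, one plausible scheme maps each metric to [0, 1] so that 1 is better, averages them, and gates the result by the success rate:

```python
import numpy as np

def unified_score(success_rate, lpips_bg, gpt4v_loc, local_clip, local_fid,
                  gpt4v_max=5.0, fid_cap=200.0):
    # All normalization constants here are hypothetical; the paper's
    # actual aggregation formula is not reproduced.
    parts = [
        1.0 - np.clip(lpips_bg, 0.0, 1.0),             # lower LPIPS = better
        np.clip(gpt4v_loc / gpt4v_max, 0.0, 1.0),      # higher GPT4V = better
        np.clip(local_clip, 0.0, 1.0),                 # higher CLIP = better
        1.0 - np.clip(local_fid / fid_cap, 0.0, 1.0),  # lower FID = better
    ]
    return success_rate * float(np.mean(parts))
```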

Numerical Results

Diffree markedly outperforms prior works in several critical metrics:

  • Success Rate: Achieves over 98% success in adding objects, substantially higher than the 17-19% success rates of InstructPix2Pix.
  • Background Consistency: Shows comparable performance with mask-guided methods (LPIPS ≈ 0.07), highlighting its ability to preserve the original context without explicit masks.
  • Location Reasonableness: Achieves higher GPT4V scores (~3.47), indicating better spatial appropriateness of added objects.
  • Correlation and Quality: Diffree maintains superior Local FID scores (~57-60) and competitive Local CLIP Scores, affirming the generated objects' quality and relevance.

Implications and Future Work

Diffree's method has significant practical and theoretical implications. Practically, it eliminates the need for labor-intensive mask creation, broadening its accessibility and usability in fields such as advertising, content creation, and virtual staging. Theoretically, it advances the understanding of combining diffusion models with auxiliary prediction modules to enhance image editing tasks.

Future developments could explore integrating Diffree with other methods, such as combining it with AnyDoor for specific object addition or with GPT4V for planning object placements in images. Moreover, continuous improvements in image inpainting techniques could further refine Diffree's outputs, ensuring higher fidelity and contextual relevance.

In conclusion, Diffree represents a substantial advancement in text-guided image editing, marrying diffusion models with innovative object mask prediction techniques to achieve high success in object addition while maintaining visual coherence and quality. The paper lays a robust foundation for future enhancements and applications in AI-driven image editing.
