SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model (2212.05034v1)

Published 9 Dec 2022 in cs.CV

Abstract: Generic image inpainting aims to complete a corrupted image by borrowing surrounding information, which barely generates novel content. By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content, e.g., a text prompt can be used to describe an object with richer attributes, and a mask can be used to constrain the shape of the inpainted object rather than being only considered as a missing area. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance. While previous works such as DALL-E 2 and Stable Diffusion can do text-guided inpainting, they do not support shape guidance and tend to modify the background texture surrounding the generated object. Our model incorporates both text and shape guidance with precision control. To preserve the background better, we propose a novel training and sampling strategy by augmenting the diffusion U-net with object-mask prediction. Lastly, we introduce a multi-task training strategy by jointly training inpainting with text-to-image generation to leverage more training data. We conduct extensive experiments showing that our model outperforms all baselines in terms of visual quality, mask controllability, and background preservation.

SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

The paper presents SmartBrush, a novel diffusion-based model for text- and shape-guided object inpainting. Unlike traditional image inpainting, which fills a missing region using surrounding pixel information, this work focuses on flexibly generating novel content from multi-modal cues. SmartBrush is distinguished by its ability to incorporate both textual descriptions and shape guidance when completing an image.

The key innovation of SmartBrush is the integration of text and precise shape information into a unified inpainting system. This sets it apart from previous models such as DALL-E 2 and Stable Diffusion, which focus predominantly on text-guided generation without shape adherence. In those methods, edits often produce undesirable deviations in which the generated object alters or mismatches the surrounding textures and object outlines. To counter this, SmartBrush introduces a diffusion U-net augmented with object-mask prediction, significantly enhancing mask controllability and background preservation.
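
To make the mask-prediction augmentation concrete, the sketch below shows a toy stand-in for a diffusion U-net that emits an object-mask output alongside the usual noise prediction. This is a minimal illustration, not the authors' architecture: the channel layout, the single-convolution heads, and the names MaskAwareUNet, noise_head, and mask_head are assumptions made for the example.

```python
import torch
import torch.nn as nn

class MaskAwareUNet(nn.Module):
    """Toy stand-in for a diffusion U-net with an extra object-mask head."""

    def __init__(self, in_channels=9, base_channels=64):
        super().__init__()
        # Assumed input layout: 4 noisy-latent channels + 4 masked-image
        # latent channels + 1 coarse-mask channel = 9 channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(base_channels, base_channels, 3, padding=1),
            nn.SiLU(),
        )
        self.noise_head = nn.Conv2d(base_channels, 4, 3, padding=1)  # predicts noise
        self.mask_head = nn.Conv2d(base_channels, 1, 3, padding=1)   # predicts object-mask logits

    def forward(self, noisy_latent, masked_latent, coarse_mask):
        x = torch.cat([noisy_latent, masked_latent, coarse_mask], dim=1)
        h = self.encoder(x)
        return self.noise_head(h), self.mask_head(h)

# One forward pass on random tensors, just to show the shapes involved.
model = MaskAwareUNet()
noise_pred, mask_logits = model(
    torch.randn(2, 4, 64, 64),   # noisy latent
    torch.randn(2, 4, 64, 64),   # masked-image latent
    torch.rand(2, 1, 64, 64),    # coarse input mask
)
```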

The authors address several challenges in multi-modal image inpainting, including text-semantic misalignment (the generated region does not match the specified text) and mask misalignment (the generated object does not follow the intended shape mask). Their approach introduces precision control over shape adherence through a training methodology that uses masks at different levels of precision, ranging from the exact instance shape to a coarse bounding box. This gives nuanced control over how closely the model's output should align with the user-provided mask; a sketch of one way to produce such multi-precision masks follows.
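
As a rough illustration of the precision levels mentioned above, one simple way to derive coarser guidance from an exact instance mask is to dilate it progressively until it collapses to the object's bounding box. The coarsen_mask helper and its dilation schedule below are illustrative assumptions, not the paper's exact augmentation recipe.

```python
import numpy as np
from scipy import ndimage

def coarsen_mask(instance_mask: np.ndarray, level: int, max_level: int = 4) -> np.ndarray:
    """Return a guidance mask at the requested precision level.

    level 0 keeps the exact instance mask; intermediate levels dilate it
    with a growing radius; the top level falls back to the bounding box.
    """
    if level <= 0:
        return instance_mask.astype(np.float32)
    if level >= max_level:
        # Coarsest guidance: axis-aligned bounding box of the object.
        ys, xs = np.nonzero(instance_mask)
        box = np.zeros_like(instance_mask, dtype=np.float32)
        box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1.0
        return box
    # Intermediate levels: dilate the mask further as the level increases.
    dilated = ndimage.binary_dilation(instance_mask, iterations=3 * level)
    return dilated.astype(np.float32)

# Toy usage: a 16x16 mask containing a small square object.
mask = np.zeros((16, 16), dtype=np.uint8)
mask[5:9, 6:10] = 1
coarse = coarsen_mask(mask, level=2)
```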

A critical advancement in SmartBrush is its background-preservation strategy: by predicting the appropriate object mask during inpainting, the model maintains the contextual integrity of the original image's background. During sampling, the coarse input mask is replaced with the predicted mask, ensuring that the newly generated content integrates harmoniously with the existing backdrop. Separately, a multi-task training scheme that jointly trains inpainting with text-to-image generation allows the model to leverage additional training data.
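
A minimal sketch of the background-preserving compositing idea described above, assuming the predicted mask is used to blend generated latents with the background latent at each sampling step; the function name blend_with_background and the exact blending rule are assumptions for illustration.

```python
import torch

def blend_with_background(generated_latent, background_latent, mask_logits):
    """Keep generated content inside the predicted object mask and copy the
    (appropriately noised) background latent elsewhere."""
    soft_mask = torch.sigmoid(mask_logits)  # soft mask in [0, 1]
    return soft_mask * generated_latent + (1.0 - soft_mask) * background_latent

# Toy usage with random latents standing in for one sampling step.
z_next = blend_with_background(
    torch.randn(1, 4, 64, 64),   # latent proposed by the diffusion model
    torch.randn(1, 4, 64, 64),   # background latent to preserve
    torch.randn(1, 1, 64, 64),   # predicted object-mask logits
)
```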

Quantitative evaluations show that SmartBrush surpasses existing baselines across several measures, including Local FID and CLIP score, on the OpenImages and MSCOCO datasets. These results are complemented by a user study in which the generated outputs are judged superior in quality, realism, and adherence to the textual description and shape mask.

The implications of SmartBrush's advancements are multifaceted. Practically, it could improve applications in digital content creation where precise control over image editing is paramount, such as professional photo editing and design. Theoretically, the work contributes to ongoing research on diffusion models, extending their use in multi-modal contexts and offering insights into generative models that accommodate more complex conditioning constraints.

Looking forward, developments inspired by this paper might extend such inpainting capabilities to video content, offering shape- and context-guided frame interpolation in dynamic scenes. Furthermore, user-interface tools that let non-expert users wield these sophisticated capabilities easily could broaden the accessibility and utility of inpainting technologies across a wider range of applications in AI.

Authors (5)
  1. Shaoan Xie (14 papers)
  2. Zhifei Zhang (156 papers)
  3. Zhe Lin (163 papers)
  4. Tobias Hinz (16 papers)
  5. Kun Zhang (353 papers)
Citations (176)