- The paper introduces a two-stage CNN with a novel contextual attention layer to enhance the synthesis of missing image regions.
- It uses patch extraction, softmax normalization, and spatial propagation to borrow and integrate features from surrounding areas.
- Experimental results demonstrate improved quality, with lower mean ℓ1/ℓ2 error and higher PSNR on datasets such as CelebA and ImageNet.
Generative Image Inpainting with Contextual Attention
The paper "Generative Image Inpainting with Contextual Attention" presents a novel approach to addressing the challenge of image inpainting, which involves filling in large missing regions in an image such that the synthesized structures and textures are both visually plausible and consistent with the surrounding areas. Traditional methods like texture synthesis and patch matching have limitations in hallucinating novel content and capturing high-level semantics. Conversely, convolutional neural networks (CNNs) struggle to model long-term correlations between distant spatial regions, often resulting in artifacts and blurry textures.
The authors overcome these limitations by proposing a two-stage, fully convolutional neural network architecture incorporating a novel contextual attention layer. This layer enhances the model's ability to explicitly attend to and use surrounding image features for filling in missing regions.
Core Network Architecture
The proposed network comprises two stages. The coarse stage generates a rough initial prediction with a dilated convolutional network trained on reconstruction loss. The refinement stage then takes this coarse result, applies the contextual attention mechanism, and produces the final, sharper output. A minimal sketch of this coarse-to-fine pipeline follows.
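The PyTorch sketch below illustrates the two-stage forward pass. The `TwoStageInpainter` class and its tiny convolutional stacks are illustrative stand-ins, not the paper's architecture, which uses much deeper dilated networks and a parallel attention branch in stage two.

```python
import torch
import torch.nn as nn

class TwoStageInpainter(nn.Module):
    """Illustrative coarse-to-fine inpainter; not the paper's full network."""

    def __init__(self):
        super().__init__()
        # Stage 1: dilated convolutions give a large receptive field for a
        # rough fill of the hole (the real network is much deeper).
        self.coarse = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=2, dilation=2), nn.ELU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
        # Stage 2: refinement; the paper adds a parallel contextual
        # attention branch here, omitted in this sketch.
        self.refine = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ELU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, image, mask):
        # mask is 1 inside the missing region, 0 over known pixels.
        masked = image * (1.0 - mask)
        x = torch.cat([masked, mask], dim=1)       # 3 RGB channels + 1 mask
        coarse = self.coarse(x)
        # Composite: keep known pixels, take the coarse prediction in the hole.
        merged = masked + coarse * mask
        refined = self.refine(torch.cat([merged, mask], dim=1))
        return coarse, refined
```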
The contextual attention layer, a key contribution of this work, enables the model to borrow feature patches from distant spatial locations in a fully differentiable, learned manner. It does so in three steps (a code sketch follows this list):
- Patch Extraction and Convolution: Extracting patches from known regions as convolutional filters to compute the similarity between generated and known patches.
- Softmax and Deconvolution: Applying softmax to normalize these similarities and reusing the patches for deconvolution to reconstruct the generated regions.
- Spatial Propagation: Encouraging spatial coherence through left-right and top-down propagation over the attention map.
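The following is a simplified, single-image PyTorch sketch of the first two steps (patch extraction/convolution and softmax/deconvolution). It assumes `fg` and `bg` share the same shape and omits the masking and spatial-propagation details; the final normalization is an assumption of this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contextual_attention(fg, bg, ksize=3, softmax_scale=10.0):
    """fg, bg: feature maps of shape (1, C, H, W); fg covers the hole,
    bg holds the known surroundings (same shape here for simplicity)."""
    B, C, H, W = bg.shape
    # 1) Extract ksize x ksize background patches to act as conv filters.
    patches = F.unfold(bg, kernel_size=ksize, padding=ksize // 2)  # (1, C*k*k, H*W)
    patches = patches.transpose(1, 2).reshape(H * W, C, ksize, ksize)
    # L2-normalize each patch so the convolution computes cosine similarity.
    norm = patches.flatten(1).norm(dim=1).clamp_min(1e-8)
    filters = patches / norm.view(-1, 1, 1, 1)
    # 2) Convolve foreground with patch filters: one similarity map per patch.
    scores = F.conv2d(fg, filters, padding=ksize // 2)             # (1, H*W, H, W)
    # 3) Scaled softmax over the patch dimension yields attention weights.
    attn = F.softmax(scores * softmax_scale, dim=1)
    # 4) Transposed convolution with the raw patches reconstructs the hole
    #    as a weighted sum of background patches (the deconvolution step).
    out = F.conv_transpose2d(attn, patches, padding=ksize // 2)
    return out / (ksize * ksize)  # roughly average overlapping contributions
```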
Implementation Details
Beyond the attention layer, the improvements include the two-stage architecture, ELUs as activation functions, mirror padding for all convolution layers, and WGAN-GP for adversarial training. The network is trained end-to-end with a combination of a pixel-wise ℓ1 reconstruction loss and two Wasserstein GAN losses, one global (whole image) and one local (hole region). This setup stabilizes training and achieves better visual quality than existing methods; a sketch of the loss combination follows.
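The sketch below shows the structure of this loss setup in PyTorch. The critic networks, the `crop` helper that extracts the local patch around the hole, and the loss weights are all hypothetical stand-ins; only the overall structure (ℓ1 plus global and local WGAN-GP terms) follows the paper.

```python
import torch

def gradient_penalty(critic, real, fake):
    # WGAN-GP term for critic training: penalize critic gradients whose
    # norm deviates from 1 at interpolations between real and fake samples.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(x).sum(), x, create_graph=True)[0]
    return ((grad.flatten(1).norm(dim=1) - 1.0) ** 2).mean()

def generator_loss(pred, target, crop, global_critic, local_critic,
                   l1_weight=1.0, adv_weight=0.001):
    # Pixel-wise l1 reconstruction plus adversarial terms from both critics;
    # `crop` and the weight values are hypothetical, not from the paper.
    l1 = (pred - target).abs().mean()
    adv = -global_critic(pred).mean() - local_critic(crop(pred)).mean()
    return l1_weight * l1 + adv_weight * adv
```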
Experimental Results
The proposed method is evaluated on multiple datasets, including CelebA, CelebA-HQ, DTD textures, ImageNet, and Places2. Both qualitatively and quantitatively, the network with contextual attention outperforms the baseline two-stage network and other state-of-the-art methods, reporting lower mean ℓ1 and ℓ2 errors, higher PSNR, and competitive total variation loss. Visualizations of the attention maps show that the network consistently attends to relevant background regions, which leads to more realistic inpainting.
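For reference, the PSNR figure cited in these comparisons is the standard definition; this NumPy snippet is a generic implementation, not code from the paper.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    # Peak signal-to-noise ratio in dB between a prediction and its target.
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```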
Implications and Future Work
This work has practical implications for various applications such as photo editing, computational photography, and image-based rendering. On a theoretical level, it highlights the importance of integrating long-range dependencies in convolutional network architectures, advancing the understanding of spatial attention mechanisms in deep learning.
Future work could explore extending this model to very high-resolution images using progressive training techniques or applying the contextual attention mechanism to other tasks like super-resolution and guided image editing.
Overall, this paper provides a comprehensive solution to generative inpainting by integrating a structured, learnable attention mechanism into a robust two-stage convolutional architecture, yielding significantly improved inpainting quality along with more stable and efficient training.