- The paper introduces a three-stage deep neural network framework that decomposes image inpainting into inference, matching, and translation tasks.
- It refines image textures by swapping patches to ensure high-frequency details are consistent with surrounding content.
- Experiments on benchmarks like COCO and ImageNet demonstrate improved perceptual similarity and structural coherence in the inpainted images.
Contextual-based Image Inpainting: Infer, Match, and Translate
The paper "Contextual-based Image Inpainting: Infer, Match, and Translate" by Yuhang Song et al. introduces a robust approach to the task of image inpainting, which involves filling in missing regions of an image with semantically and visually plausible content. The authors propose a multi-stage, learning-based framework that decomposes the high-dimensional image inpainting problem into two manageable sub-tasks, thus improving the training and inference processes of high-resolution images.
Summary of Methodology
The methodology comprises three stages: inference, matching, and translation. The first and third are learned neural networks, while the matching step is a deterministic operation applied to feature maps.
- Inference: An Image2Feature network first generates a coarse prediction of the missing region. Given the incomplete image, this convolutional network produces a feature-map representation in which the hole is filled with rough content, preserving the high-level structure of the scene in the generated content.
- Matching: A novel patch-swap operation is then applied to the feature maps to refine the texture of the coarse prediction: each neural patch inside the inpainted area is replaced by its closest match from the known boundary of the image, ensuring that high-frequency texture details are plausible and coherent with the surrounding context (a simplified sketch follows this list).
- Translation: Finally, a Feature2Image network translates the refined feature maps back into a complete image. This network outputs high-resolution results with sharp, consistent textures, improving on previous inpainting models that often produce artifacts and blurry regions.
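As a concrete illustration of the matching step, the following is a minimal PyTorch sketch of a patch-swap operation in the spirit described above. It is deliberately simplified: it pastes only the center of each matched context patch instead of averaging overlapping patches, and the patch size, mask convention, and the Image2Feature/Feature2Image calls in the usage comment are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def patch_swap(feat, hole_mask, patch_size=3):
    """Replace each neural patch inside the hole with its most similar
    context patch (cosine similarity). Simplified sketch, not the paper's code.

    feat:      (1, C, H, W) feature map of the coarse prediction
    hole_mask: (1, 1, H, W) mask, 1 inside the missing region, 0 outside
    """
    hole_mask = hole_mask.float()
    pad = patch_size // 2
    c = feat.shape[1]

    # Every overlapping patch of the feature map, one per spatial location.
    patches = F.unfold(feat, patch_size, padding=pad)                 # (1, C*k*k, H*W)
    kernels = patches.permute(0, 2, 1).reshape(-1, c, patch_size, patch_size)

    # Keep only patches that lie entirely in the known context region.
    overlap = F.unfold(hole_mask, patch_size, padding=pad).sum(1).squeeze(0)
    ctx_idx = (overlap == 0).nonzero(as_tuple=True)[0]
    ctx = kernels[ctx_idx]                                            # (N, C, k, k)
    ctx_norm = ctx / (ctx.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)

    # Cosine similarity between each location and every context patch,
    # computed as a cross-correlation with normalized patch kernels.
    scores = F.conv2d(feat, ctx_norm, padding=pad)                    # (1, N, H, W)
    best = scores.argmax(dim=1).view(-1)                              # (H*W,)

    # Paste the center of the best-matching context patch inside the hole;
    # keep the original features everywhere else.
    centers = ctx[best][:, :, pad, pad].t().reshape(feat.shape)
    return feat * (1 - hole_mask) + centers * hole_mask

# Hypothetical end-to-end usage with the two learned networks:
#   feat    = image2feature(torch.cat([masked_image, hole_mask_img], dim=1))
#   refined = patch_swap(feat, hole_mask_feat)   # mask downsampled to feature size
#   result  = feature2image(refined)
```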
Implementation and Results
The authors demonstrate the efficacy of their approach on several benchmark datasets, notably COCO and ImageNet CLS-LOC. They report competitive numerical results, balancing structural similarity (SSIM) against inception scores, metrics that correlate reasonably well with human judgment of visual fidelity and realism.
While the method does not always achieve the lowest mean ℓ1 error compared to baselines such as global-and-local inpainting (GLI), rigorous user studies validate its superior structural coherence and visual appeal. The approach also scales well to high-resolution inputs, demonstrating utility in practical applications such as object removal and image restoration in diverse, real-world scenes.
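For reference, the two quantitative measures mentioned above can be computed as in the short sketch below, which assumes NumPy and scikit-image are available and that predictions and ground-truth images are float arrays in [0, 1] with shape (H, W, 3); the function names are illustrative, not from the paper.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_l1_error(pred, target):
    """Mean absolute per-pixel difference over the whole image."""
    return float(np.abs(pred - target).mean())

def ssim_score(pred, target):
    """Structural similarity (SSIM) between prediction and ground truth."""
    return structural_similarity(pred, target, channel_axis=-1, data_range=1.0)

# Example usage on random stand-in data (not real results):
# gt, out = np.random.rand(256, 256, 3), np.random.rand(256, 256, 3)
# print(mean_l1_error(out, gt), ssim_score(out, gt))
```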
Implications and Future Directions
The work presents significant implications for the image inpainting domain. By effectively addressing the challenges of textural coherence and resolution in image inpainting, the proposed framework opens new pathways for refinements in generative image models. The concept of breaking down high-complexity tasks into simpler, manageable subtasks may inspire similar methodologies in related research areas like texture synthesis and style transfer.
Potential future research could investigate more advanced networks for the initial inference stage, or explore alternative feature-matching techniques that improve accuracy while reducing computational cost. Similarly, extending the model's generative capabilities to other image-to-image transformation tasks could leverage the strengths of the feature-based translation framework proposed in this paper.
In conclusion, this paper makes a valuable contribution to the research community by offering a comprehensive, scalable solution to image inpainting, utilizing the synergy of structured multi-stage training and deep learning architectures to deliver visually compelling results.