- The paper introduces a hybrid framework that combines transformers for global structure modeling with CNNs for local texture refinement to achieve pluralistic image completion.
- It demonstrates superior performance with improved PSNR, SSIM, and FID scores compared to state-of-the-art inpainting methods.
- The method effectively handles large missing regions while maintaining geometric coherence and diverse, realistic outputs.
High-Fidelity Pluralistic Image Completion with Transformers
Overview
The paper "High-Fidelity Pluralistic Image Completion with Transformers" addresses image completion by proposing a method that integrates the strengths of both Transformers and Convolutional Neural Networks (CNNs). CNNs have traditionally been favored for their texture modeling capabilities, but their local inductive biases limit their ability to capture global structure. Transformers, by contrast, model long-range dependencies and can generate diverse results, but their computational cost makes them impractical to apply directly at high resolution. This research introduces a hybrid approach that uses transformers for global structure understanding and pluralistic completion, coupled with CNNs for local texture refinement.
Methodology
The proposed framework is composed of two distinct phases:
- Appearance Priors Reconstruction with Transformers: This phase relies on a transformer to generate low-resolution representations called "appearance priors", which capture the essential global structure and coarse textures of the image. Using bi-directional attention trained with a BERT-style masked language modeling objective, the transformer estimates token distributions for missing regions while attending to the full context, which enables diverse sampling outcomes.
- Guided Upsampling with CNNs: This step focuses on enhancing the details of the lower-resolution priors obtained from the transformer. CNNs are employed to upsample these priors, refining the local texture and ensuring coherence with the non-missing parts of the input image. The upsampling network utilizes a combination of encoder, decoder, and residual blocks to transform these priors into high-fidelity reconstructed images.
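The two-phase pipeline above can be illustrated with a minimal sketch of the first phase: iteratively sampling discrete appearance-prior tokens for masked positions. The grid size, vocabulary size, the `transformer_logits` stub, and the top-k sampling strategy here are illustrative assumptions, not the paper's actual configuration; the real model would replace the stub with a trained bi-directional transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration (not the paper's configuration).
SEQ_LEN, VOCAB = 16, 8   # a 4x4 token grid, 8 discrete appearance tokens
MASK = -1                # marker for missing (masked) positions

def transformer_logits(tokens):
    """Stand-in for the bi-directional transformer: returns per-position
    logits over the token vocabulary. A random stub here, so the sampling
    loop below is runnable end to end."""
    return rng.normal(size=(len(tokens), VOCAB))

def sample_appearance_priors(tokens, top_k=3):
    """Fill masked positions one token at a time, so each newly sampled
    token conditions on all previously filled ones; restricting sampling
    to the top-k candidates trades diversity for fidelity."""
    tokens = tokens.copy()
    while (tokens == MASK).any():
        logits = transformer_logits(tokens)
        pos = int(np.flatnonzero(tokens == MASK)[0])  # next masked slot
        p = np.exp(logits[pos] - logits[pos].max())   # softmax
        p /= p.sum()
        top = np.argsort(p)[-top_k:]                  # top-k candidates
        tokens[pos] = rng.choice(top, p=p[top] / p[top].sum())
    return tokens

grid = np.arange(SEQ_LEN) % VOCAB
grid[5:11] = MASK                                     # simulate a hole
completed = sample_appearance_priors(grid)
assert (completed != MASK).all()
```

Running the sampler repeatedly with different seeds yields different completions of the same hole, which is the source of the method's pluralism; the second phase then upsamples each sampled prior with the CNN.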
Results and Analysis
The paper provides an exhaustive evaluation of the proposed method against state-of-the-art deterministic and pluralistic inpainting approaches such as DeepFillv2, EdgeConnect, and PIC on datasets including FFHQ, Places2, and ImageNet. The method demonstrates a significant improvement in terms of image fidelity, diversity of completion results, and generalization ability over large missing regions, with notable FID score improvements.
- Numerical Performance: The proposed model achieves superior PSNR, SSIM, and FID scores across different mask sizes, signaling both higher quality and diversity in image completion, particularly with larger missing areas.
- Qualitative Performance: The reconstructions produced are visually more realistic and semantically appropriate compared to other methods. Sampled appearance priors lead to plausible variations, enhancing the value of the model for tasks requiring diverse outputs.
- Analysis on Robustness and Geometry Understanding: The model's capability to handle large missing regions and retain geometric structures shows its enhanced understanding of global context compared to CNN-only architectures.
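For reference, PSNR, one of the fidelity metrics reported above, is straightforward to compute; a minimal sketch (the example images and values are illustrative, not from the paper):

```python
import numpy as np

def psnr(reference, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: one pixel off by 10 in a 4x4 image.
ref = np.full((4, 4), 100, dtype=np.uint8)
rec = ref.copy()
rec[0, 0] = 110
print(round(psnr(ref, rec), 2))   # → 40.17
```

Higher PSNR and SSIM indicate closer pixel-level and structural agreement with the ground truth, while lower FID indicates that the distribution of completions is closer to that of real images; the three together capture the fidelity and realism claims made above.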
Implications and Future Directions
The fusion of transformers and CNNs in this method combines global structural understanding with fine texture detail, solving image completion problems more effectively than either architecture alone. These results indicate substantial advancements in image inpainting, increasing its applicability in areas such as content creation and image restoration.
Looking ahead, further exploration into reducing the computational overhead of transformers in high-resolution scenarios could make this hybrid approach more accessible for broader use cases. Advances in efficient attention mechanisms could significantly ease the deployment of such systems in real-world applications. Additionally, extending such frameworks to other vision tasks where understanding of both semantic context and fine detail is necessary could further broaden the reach of this research.