- The paper introduces a transformer-based inpainting framework that leverages transformer structure restoration (TSR) to accurately recover image structures.
- It employs Zero-initialized Residual Addition (ZeroRA) for efficient incremental training, upgrading pretrained models at minimal computational cost.
- Masking Positional Encoding (MPE) combined with Fourier CNN texture restoration significantly improves artifact reduction and overall image fidelity.
Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding
This paper introduces a novel approach to image inpainting that improves on existing methods by using a transformer to recover holistic image structure and a zero-initialization strategy to fold that structure efficiently into pretrained networks. It addresses the recovery of corrupted images with attention to both textural and structural fidelity, using an attention-based model to overcome the limited receptive fields of Convolutional Neural Networks (CNNs).
Methodology
The method is centered around several key components:
- Transformer Structure Restoration (TSR): The TSR uses a transformer to learn holistic image structures in a low-resolution sketch space, specifically targeting edges and lines. This approach capitalizes on the transformer’s long-range dependency capabilities to accurately reconstruct structural elements, which are then upsampled to higher resolutions.
- Incremental Training with Zero-initialized Residual Addition (ZeroRA): To integrate the learned structures into existing inpainting models efficiently, the new branches are added through residual connections initialized to zero. This leaves the pretrained model's behavior unchanged at the start of training, enabling fast, stable convergence without retraining from scratch.
- Masking Positional Encoding (MPE): The MPE addresses the issue of missing positional information in large masked regions. It provides additional clues about distance and directionality within these regions, thus reducing artifacts related to spatial positioning.
- Fourier CNN Texture Restoration (FTR): The texture restoration utilizes a Fourier convolution-based network to handle frequency domain learning for robust inpainting results. This module benefits from the structural information provided by the TSR and is further augmented by MPE for spatial awareness.
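The long-range dependency that TSR relies on comes from self-attention, where every position of the low-resolution sketch attends to every other. The following is a minimal single-head sketch (no learned projections; Q = K = V = the token matrix), not the paper's exact TSR architecture:

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over a flattened
    low-resolution structure map.

    x: (n_tokens, d) array; each token is one grid cell of the sketch space.
    Illustrative simplification: Q = K = V = x, with no learned projections.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # every token scores every other
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                             # globally mixed structure features

# A 4x4 edge/line sketch flattened into 16 tokens of dimension 8:
tokens = np.random.default_rng(0).standard_normal((16, 8))
mixed = self_attention(tokens)
```

Because the score matrix couples all token pairs, a single layer can relate an edge on one side of the image to a line on the other, which is the capability CNN inpainting models lack at comparable depth.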
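The ZeroRA idea above can be sketched in a few lines: the new structure branch enters through a scalar gate initialized to zero, so at step 0 the augmented network is exactly the pretrained one. This is an illustrative sketch of the mechanism, not the paper's exact module:

```python
import numpy as np

class ZeroResidualGate:
    """Adds a new branch f(x) to a pretrained backbone output as x + alpha * f(x).

    alpha is a learnable scalar initialized to zero, so the pretrained model's
    behavior is untouched at initialization; alpha then grows during the
    incremental training phase. (Sketch; a real model would register alpha
    as a trainable parameter.)
    """
    def __init__(self):
        self.alpha = 0.0  # zero-initialized gate

    def __call__(self, x, branch):
        return x + self.alpha * branch(x)

gate = ZeroResidualGate()
x = np.arange(4.0)
structure_branch = lambda v: v ** 2      # stand-in for the new structure features
assert np.allclose(gate(x, structure_branch), x)  # exact identity at init
gate.alpha = 0.5                          # value after some training steps
```

The zero start is what makes incremental training stable: gradients flow into the new branch while the pretrained pathway keeps producing its original outputs.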
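The positional clue MPE supplies can be illustrated by computing, for each masked pixel, its distance to the nearest known pixel and embedding that distance sinusoidally. Both the two-pass distance sweep and the `sinusoidal_encode` frequency choice below are hypothetical stand-ins for the paper's actual encoding:

```python
import numpy as np

def mask_distance(mask):
    """Approximate L1 distance from each masked pixel to the nearest known pixel.

    mask: 2D bool array, True = missing. Uses a two-pass chamfer sweep,
    which is exact for 4-connected Manhattan distance.
    """
    h, w = mask.shape
    d = np.where(mask, h + w, 0).astype(float)
    for i in range(h):                    # forward raster pass
        for j in range(w):
            if i > 0: d[i, j] = min(d[i, j], d[i - 1, j] + 1)
            if j > 0: d[i, j] = min(d[i, j], d[i, j - 1] + 1)
    for i in range(h - 1, -1, -1):        # backward raster pass
        for j in range(w - 1, -1, -1):
            if i < h - 1: d[i, j] = min(d[i, j], d[i + 1, j] + 1)
            if j < w - 1: d[i, j] = min(d[i, j], d[i, j + 1] + 1)
    return d

def sinusoidal_encode(d, dim=4):
    """Sinusoidal embedding of the distance map (hypothetical frequency schedule)."""
    freqs = 1.0 / (10.0 ** (np.arange(dim // 2) / (dim // 2)))
    ang = d[..., None] * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
```

Concatenated to the features, such an encoding tells the network how deep inside the hole each pixel sits, which is exactly the information that vanishes in large masked regions.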
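The core of the Fourier-convolution idea behind FTR is a spectral branch: transform to the frequency domain, apply a pointwise complex weighting, and transform back. Since every frequency bin mixes information from the entire image, one layer gets a global receptive field. A minimal sketch (the weights here are placeholders for learned parameters, and real FFC layers also include a spatial branch and channel mixing):

```python
import numpy as np

def spectral_transform(x, w=None):
    """One spectral branch of a Fourier conv: per-frequency complex scaling.

    x: 2D real array (one channel). w: complex weights matching the rfft2
    output shape; with unit weights this is the identity.
    """
    X = np.fft.rfft2(x)
    if w is None:
        w = np.ones_like(X)        # identity placeholder; learned in a real model
    return np.fft.irfft2(X * w, s=x.shape)

img = np.random.default_rng(1).standard_normal((8, 8))
assert np.allclose(spectral_transform(img), img)  # identity with unit weights
```

This frequency-domain view is what lets the texture model propagate repeating patterns across large holes instead of growing them pixel by pixel.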
Experimental Results
The model demonstrates its effectiveness across several datasets, including Places2, ShanghaiTech, NYUDepthV2, and MatterPort3D, consistently outperforming state-of-the-art competitors. The improvements are most evident in:
- Higher Structural Accuracy: The use of transformers significantly enhances the ability to restore holistic structures, addressing both edges and lines with higher precision than traditional CNNs.
- Reduced Training Overheads: The ZeroRA strategy ensures that incremental training is both fast and stable, reducing the need for extensive computational resources.
- Improved Metric Scores: Quantitative measures such as PSNR, SSIM, and LPIPS indicate superior restored-image quality, particularly in structure preservation and artifact suppression.
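Of the metrics above, PSNR is the simplest to state: it is a log-scaled ratio of the signal peak to the mean squared reconstruction error (SSIM and LPIPS require windowed statistics and a perceptual network, so they are omitted here). A quick sketch:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")       # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR is purely pixelwise it rewards faithful structure but is blind to perceptual texture quality, which is why papers in this area also report SSIM and LPIPS.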
Implications
This approach integrates sophisticated structural recovery strategies within the image inpainting domain, promising improvements in recovering missing or corrupted areas. It presents practical implications for real-world applications, such as object removal or photo restoration, where high structural integrity is vital. Theoretically, the successful application of transformers to structural space in inpainting highlights their versatility beyond conventional text and low-resolution tasks.
Future Developments
Future exploration could focus on expanding the transformer’s capabilities to handle even larger image sizes or more complex structures with minimal computational costs. Additionally, further enhancement of the masking positional encoding strategy may unlock better performance in highly corrupted or uniform regions. Adjustments to the ZeroRA methodology could also improve adaptability to other neural architectures beyond FTR.
In conclusion, the paper presents a robust framework for image inpainting that challenges traditional methods by integrating a transformer-based long-range dependency structure with an innovative training strategy. The demonstrated efficacy across varied datasets underscores the potential for this approach to shape future developments in image restoration and related fields.