- The paper introduces a two-stage CNN with a novel contextual attention layer to enhance the synthesis of missing image regions.
- It uses patch extraction, softmax normalization, and spatial propagation to borrow and integrate features from surrounding areas.
- Experimental results demonstrate improved quality, with lower mean ℓ1/ℓ2 error and higher PSNR on datasets such as CelebA and ImageNet.
Generative Image Inpainting with Contextual Attention
The paper "Generative Image Inpainting with Contextual Attention" presents a novel approach to addressing the challenge of image inpainting, which involves filling in large missing regions in an image such that the synthesized structures and textures are both visually plausible and consistent with the surrounding areas. Traditional methods like texture synthesis and patch matching have limitations in hallucinating novel content and capturing high-level semantics. Conversely, convolutional neural networks (CNNs) struggle to model long-term correlations between distant spatial regions, often resulting in artifacts and blurry textures.
The authors overcome these limitations by proposing a two-stage, fully convolutional neural network architecture incorporating a novel contextual attention layer. This layer enhances the model's ability to explicitly attend to and use surrounding image features for filling in missing regions.
Core Network Architecture
The proposed network comprises two stages. The coarse stage generates a rough initial prediction with a dilated convolutional network trained on reconstruction loss. The refinement stage then takes this coarse result, applies the contextual attention mechanism, and produces the final, sharper output. A minimal sketch of this coarse-to-fine pipeline follows.
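The PyTorch sketch below illustrates the two-stage forward pass. The `TwoStageInpainter` class and its tiny convolutional stacks are illustrative stand-ins, not the paper's architecture, which uses much deeper dilated networks and a parallel attention branch in stage two.

```python
import torch
import torch.nn as nn

class TwoStageInpainter(nn.Module):
    """Illustrative coarse-to-fine inpainter; not the paper's full network."""

    def __init__(self):
        super().__init__()
        # Stage 1: dilated convolutions give a large receptive field for a
        # rough fill of the hole (the real network is much deeper).
        self.coarse = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=2, dilation=2), nn.ELU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
        # Stage 2: refinement; the paper adds a parallel contextual
        # attention branch here, omitted in this sketch.
        self.refine = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ELU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, image, mask):
        # mask is 1 inside the missing region, 0 over known pixels.
        masked = image * (1.0 - mask)
        x = torch.cat([masked, mask], dim=1)       # 3 RGB channels + 1 mask
        coarse = self.coarse(x)
        # Composite: keep known pixels, take the coarse prediction in the hole.
        merged = masked + coarse * mask
        refined = self.refine(torch.cat([merged, mask], dim=1))
        return coarse, refined
```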
The contextual attention layer, a key contribution of this work, enables the model to borrow feature patches from distant spatial locations in a fully differentiable, learned manner. It does so in three steps (a code sketch follows this list):
- Patch Extraction and Convolution: Extracting patches from known regions as convolutional filters to compute the similarity between generated and known patches.
- Softmax and Deconvolution: Applying softmax to normalize these similarities and reusing the patches for deconvolution to reconstruct the generated regions.
- Spatial Propagation: Encouraging spatial coherence through left-right and top-down propagation over the attention map.
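The following is a simplified, single-image PyTorch sketch of the first two steps (patch extraction/convolution and softmax/deconvolution). It assumes `fg` and `bg` share the same shape and omits the masking and spatial-propagation details; the final normalization is an assumption of this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contextual_attention(fg, bg, ksize=3, softmax_scale=10.0):
    """fg, bg: feature maps of shape (1, C, H, W); fg covers the hole,
    bg holds the known surroundings (same shape here for simplicity)."""
    B, C, H, W = bg.shape
    # 1) Extract ksize x ksize background patches to act as conv filters.
    patches = F.unfold(bg, kernel_size=ksize, padding=ksize // 2)  # (1, C*k*k, H*W)
    patches = patches.transpose(1, 2).reshape(H * W, C, ksize, ksize)
    # L2-normalize each patch so the convolution computes cosine similarity.
    norm = patches.flatten(1).norm(dim=1).clamp_min(1e-8)
    filters = patches / norm.view(-1, 1, 1, 1)
    # 2) Convolve foreground with patch filters: one similarity map per patch.
    scores = F.conv2d(fg, filters, padding=ksize // 2)             # (1, H*W, H, W)
    # 3) Scaled softmax over the patch dimension yields attention weights.
    attn = F.softmax(scores * softmax_scale, dim=1)
    # 4) Transposed convolution with the raw patches reconstructs the hole
    #    as a weighted sum of background patches (the deconvolution step).
    out = F.conv_transpose2d(attn, patches, padding=ksize // 2)
    return out / (ksize * ksize)  # roughly average overlapping contributions
```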
Implementation Details
Beyond the attention layer, the improvements include the two-stage architecture, ELUs as activation functions, mirror padding for all convolution layers, and WGAN-GP for adversarial training. The network is trained end-to-end with a combination of a pixel-wise ℓ1 reconstruction loss and two Wasserstein GAN losses, one global (whole image) and one local (hole region). This setup stabilizes training and achieves better visual quality than existing methods; a sketch of the loss combination follows.
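The sketch below shows the structure of this loss setup in PyTorch. The critic networks, the `crop` helper that extracts the local patch around the hole, and the loss weights are all hypothetical stand-ins; only the overall structure (ℓ1 plus global and local WGAN-GP terms) follows the paper.

```python
import torch

def gradient_penalty(critic, real, fake):
    # WGAN-GP term for critic training: penalize critic gradients whose
    # norm deviates from 1 at interpolations between real and fake samples.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(x).sum(), x, create_graph=True)[0]
    return ((grad.flatten(1).norm(dim=1) - 1.0) ** 2).mean()

def generator_loss(pred, target, crop, global_critic, local_critic,
                   l1_weight=1.0, adv_weight=0.001):
    # Pixel-wise l1 reconstruction plus adversarial terms from both critics;
    # `crop` and the weight values are hypothetical, not from the paper.
    l1 = (pred - target).abs().mean()
    adv = -global_critic(pred).mean() - local_critic(crop(pred)).mean()
    return l1_weight * l1 + adv_weight * adv
```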
Experimental Results
The proposed method is evaluated on multiple datasets, including CelebA, CelebA-HQ, DTD textures, ImageNet, and Places2. Both qualitatively and quantitatively, the network with contextual attention outperforms the baseline two-stage network and other state-of-the-art methods, reporting lower mean ℓ1 and ℓ2 errors, higher PSNR, and competitive total variation loss. Visualizations of the attention maps show that the network consistently attends to relevant background regions, which leads to more realistic inpainting.
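For reference, the PSNR figure cited in these comparisons is the standard definition; this NumPy snippet is a generic implementation, not code from the paper.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    # Peak signal-to-noise ratio in dB between a prediction and its target.
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```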
Implications and Future Work
This work has practical implications for various applications such as photo editing, computational photography, and image-based rendering. On a theoretical level, it highlights the importance of integrating long-range dependencies in convolutional network architectures, advancing the understanding of spatial attention mechanisms in deep learning.
Future work could explore extending this model to very high-resolution images using progressive training techniques or applying the contextual attention mechanism to other tasks like super-resolution and guided image editing.
Overall, this paper provides a comprehensive solution to generative inpainting by integrating a structured, learnable attention mechanism into a robust two-stage convolutional architecture, yielding significantly improved inpainting quality along with more stable and efficient training.