- The paper introduces SelectionGAN, a novel two-stage model that integrates external semantic guidance with multi-channel attention to refine image outputs.
- The model combines cycled semantic-guided generation with uncertainty-weighted pixel loss optimization to improve image detail and semantic consistency.
- Experiments demonstrate significant improvements in SSIM, PSNR, and FID across tasks like pose-guided person image generation and facial expression translation.
An Overview of Multi-Channel Attention Selection GANs for Guided Image-to-Image Translation
The paper presents a novel approach to guided image-to-image translation using a model named Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN). This work advances the field by incorporating external semantic guidance into the translation process, so that an input image can be transformed into a target domain while remaining consistent with the provided guidance.
Model Structure and Methodology
SelectionGAN Architecture:
SelectionGAN operates in two stages. In the first stage, a cycled semantic-guided generation network produces a coarse initial output from the input image and its conditional semantic guidance, and then reconstructs the guidance from that output; this cycle tightens the coupling between the image and guidance domains, improving semantic consistency while preserving enough detail in the generated images. The second stage refines the coarse result with a multi-scale spatial pooling and channel selection module together with a multi-channel attention selection module. A further innovation is the introduction of uncertainty maps, learned from the attention maps, that weight the pixel loss and thereby compensate for inaccuracies in the semantic guidance.
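To make the uncertainty-weighted pixel loss concrete, here is a minimal PyTorch sketch. The function name, tensor shapes, and the exact weighting form (which follows the common learned log-variance formulation of Kendall and Gal) are illustrative assumptions rather than the paper's exact equations:

```python
import torch

def uncertainty_weighted_l1(fake, real, log_var):
    """Pixel loss modulated by a learned per-pixel uncertainty map.

    `log_var` is a log-variance map predicted from the attention maps
    (an assumed interface). Pixels the network is uncertain about,
    e.g. where the semantic guidance is noisy, are down-weighted,
    and the additive log_var term keeps the predicted uncertainty
    from growing unboundedly.
    """
    l1 = torch.abs(fake - real)          # per-pixel reconstruction error
    weighted = l1 * torch.exp(-log_var)  # attenuate uncertain pixels
    return (weighted + log_var).mean()   # regularized, averaged loss
```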
The selection module enhances image detail by producing multiple intermediate outputs and letting spatial attention choose among them, which gives the network a more explicit way to model complex structural relationships across domains. This multi-channel approach is particularly effective when the overlap between input and output domains is limited, as in pose-guided person image generation.
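The selection mechanism itself can be sketched as follows: the network emits N candidate images and N spatial attention maps, normalizes the attention across candidates with a softmax, and composes the final image as a per-pixel weighted sum. The layer sizes and the number of candidates below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class MultiChannelAttentionSelection(nn.Module):
    """Sketch of multi-channel attention selection: fuse n candidate
    generations with n spatial attention maps that are softmax-normalized
    across the candidate dimension."""
    def __init__(self, feat_ch, n=10):
        super().__init__()
        self.n = n
        self.to_images = nn.Conv2d(feat_ch, 3 * n, kernel_size=3, padding=1)
        self.to_attn = nn.Conv2d(feat_ch, n, kernel_size=1)

    def forward(self, feat):
        b, _, h, w = feat.shape
        # n candidate RGB images, shape (b, n, 3, h, w)
        candidates = torch.tanh(self.to_images(feat)).view(b, self.n, 3, h, w)
        # attention normalized across the n candidates, shape (b, n, 1, h, w)
        attn = torch.softmax(self.to_attn(feat), dim=1).unsqueeze(2)
        # each output pixel is a convex combination of the n candidates
        return (candidates * attn).sum(dim=1)
```

Because the attention weights sum to one at every pixel, the output is always a convex combination of the candidates, which keeps refinement stable while still letting different candidates specialize in different image regions.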
Practical Implementation:
The authors demonstrate SelectionGAN's applicability across a range of tasks characterized by different forms of semantic guidance, such as segmentation maps in cross-view image translation or facial landmarks in expression-to-expression translation. These examples underscore the model's flexibility: it consistently outperforms a suite of baselines, including Pix2pix, X-Fork, and other state-of-the-art alternatives.
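One reason the model transfers across guidance types is that every form of guidance can be presented through the same interface: rendered as extra channels and concatenated with the input image. A minimal sketch, with channel counts chosen purely for illustration:

```python
import torch

# Input image and guidance share one interface: the guidance is rendered
# as extra channels and concatenated with the RGB input. A segmentation
# map would typically be one-hot encoded, and facial landmarks rendered
# as a heatmap; a single channel is used here for simplicity.
image = torch.randn(1, 3, 256, 256)       # input RGB image
guidance = torch.randn(1, 1, 256, 256)    # e.g. a single-channel semantic map
x = torch.cat([image, guidance], dim=1)   # (1, 4, 256, 256) generator input
```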
Numerical Results and Performance Evaluation
SelectionGAN's efficacy is substantiated through comprehensive experimentation on four tasks: cross-view image translation, facial expression generation, hand gesture translation, and person image generation. Rigorous comparisons with prior models illustrate the superiority of SelectionGAN in metrics such as SSIM, PSNR, and FID across multiple datasets, including CVUSA, Dayton, and ADE20K. Both the attention-guided refinement and the semantic-cycle strategy contribute noticeably to the model's generative capability.
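For readers reproducing such numbers, SSIM and PSNR are standard full-reference metrics; a minimal sketch using scikit-image follows. The data range and color handling here are assumptions, and the paper's exact evaluation settings may differ:

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

# Compare a generated image against its ground truth. Random arrays stand
# in for real data; images are assumed to be floats in [0, 1].
real = np.random.rand(256, 256, 3)
fake = np.random.rand(256, 256, 3)

ssim = structural_similarity(real, fake, channel_axis=-1, data_range=1.0)
psnr = peak_signal_noise_ratio(real, fake, data_range=1.0)
print(f"SSIM: {ssim:.4f}  PSNR: {psnr:.2f} dB")
```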
The paper demonstrates clear improvements in fine image quality and semantic alignment, as shown through detailed user studies and perceptual evaluations conducted via AMT, supporting the claim of superior detail retention and coherence in the synthesized outputs. Furthermore, by addressing semantic guidance inaccuracies through learned uncertainty maps, the authors offer a robust remedy for a common failure mode in the translation pipeline.
Academic and Practical Implications
The research holds substantial value, both theoretically and practically, for AI-driven multimedia content generation. The formal introduction of multi-channel attention selection and attention-based uncertainty learning opens avenues for future work on GAN architectures. Practically, because the design accommodates various semantic guidance formats, the model is flexible enough for real-world applications ranging from landscape change depiction to augmented reality scenarios and beyond.
SelectionGAN's advances invite further exploration of larger generation spaces that accommodate greater complexity and variance in digital media translation tasks. Future enhancements may explore adaptive attention mechanisms, richer handling of input modalities, and further optimization of translation consistency.
This paper not only contributes significantly to the domain of guided image-to-image translation but also sets a foundational precedent for developing subsequent models that integrate nuanced attention mechanisms and semantic guidance.