Deep Image Spatial Transformation for Person Image Generation (2003.00696v2)

Published 2 Mar 2020 in cs.CV and cs.AI

Abstract: Pose-guided person image generation aims to transform a source person image to a target pose. This task requires spatial manipulation of source data. However, Convolutional Neural Networks are limited by their lack of ability to spatially transform inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, flowed local patch pairs are extracted from the feature maps to calculate local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. Additional results in video animation and view synthesis show that our model is also applicable to other tasks requiring spatial transformation. Our source code is available at https://github.com/RenYurui/Global-Flow-Local-Attention.

Citations (168)

Summary

  • The paper proposes a differentiable global-flow local-attention framework for pose-guided person image generation.
  • It integrates global correlation computation with local attention mechanisms to optimize spatial feature warping and preserve visual fidelity.
  • Experimental results demonstrate significantly improved FID and LPIPS scores on datasets like DeepFashion, highlighting its practical benefits for image synthesis.

Deep Image Spatial Transformation for Person Image Generation

This paper presents a novel framework for pose-guided person image generation, which involves transforming a source image of a person into a new pose while maintaining visual fidelity. Traditional CNNs are constrained by their inherent inability to spatially transform inputs. To address this, the authors propose a differentiable global-flow local-attention framework.

Methodology

The proposed model integrates flow-based operations with attention mechanisms to reassemble source features in three stages:

  1. Global Correlation Computation: The model computes global correlations between source and target feature maps to predict flow fields. These fields specify, for every target position, where in the source the corresponding features should be sampled (a sketch follows this list).
  2. Local Attention Mechanism: Local patch pairs, extracted at positions linked by the predicted flow, are used to compute local attention coefficients, so that each target position attends to its most relevant source neighbourhood.
  3. Content-Aware Sampling: The source features are warped to the target pose with a sampling method weighted by the local attention coefficients rather than fixed bilinear weights, preserving contextually relevant detail (see the second sketch below).
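The following is a minimal PyTorch sketch of the first stage, assuming source and target feature maps `f_s` and `f_t` of shape (B, C, H, W); the function name `predict_flow`, the temperature `tau`, and the soft-argmax formulation are illustrative choices, not the paper's exact architecture:

```python
# Illustrative global correlation -> flow field (assumed shapes and naming;
# not the authors' exact network).
import torch
import torch.nn.functional as F

def predict_flow(f_s: torch.Tensor, f_t: torch.Tensor,
                 tau: float = 0.05) -> torch.Tensor:
    """Estimate a dense flow field from global source-target correlations.

    f_s, f_t: feature maps of shape (B, C, H, W) from the source image and
    the target pose. Returns a flow field of shape (B, 2, H, W) holding, for
    each target position, the normalized (x, y) source coordinate to sample.
    """
    B, C, H, W = f_s.shape
    src = F.normalize(f_s.flatten(2), dim=1)            # (B, C, H*W)
    tgt = F.normalize(f_t.flatten(2), dim=1)            # (B, C, H*W)
    corr = torch.bmm(tgt.transpose(1, 2), src)          # (B, HW_t, HW_s)
    attn = F.softmax(corr / tau, dim=-1)                # soft correspondences

    # Normalized source coordinates in [-1, 1], one (x, y) per spatial cell.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=f_s.device),
        torch.linspace(-1, 1, W, device=f_s.device),
        indexing="ij",
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(1, H * W, 2)

    # Soft-argmax: expected source coordinate for every target position.
    flow = torch.matmul(attn, coords)                   # (B, HW_t, 2)
    return flow.reshape(B, H, W, 2).permute(0, 3, 1, 2)
```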
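The local-attention sampling step can be sketched in the same spirit: a k x k source neighbourhood around each flowed location is gathered with bilinear sampling, scored against the target feature, and combined with the resulting attention weights. The paper predicts its attention kernels with a learned network; this simplified dot-product scoring stands in for that:

```python
def content_aware_sample(f_s: torch.Tensor, f_t: torch.Tensor,
                         flow: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Warp f_s to the target layout via local attention (simplified).

    For each target position, gather a k x k source patch centred at the
    flowed coordinate, score it against the target feature, and return the
    attention-weighted sum instead of fixed bilinear weights.
    """
    B, C, H, W = f_s.shape
    base = flow.permute(0, 2, 3, 1)                     # (B, H, W, 2), (x, y)

    # Pixel offsets of the k x k neighbourhood, in normalized grid units.
    d = torch.arange(k, device=f_s.device, dtype=f_s.dtype) - k // 2
    dy, dx = torch.meshgrid(d, d, indexing="ij")
    off = torch.stack([dx * 2 / max(W - 1, 1),
                       dy * 2 / max(H - 1, 1)], dim=-1).reshape(-1, 2)

    patches = [F.grid_sample(f_s, base + o, mode="bilinear",
                             align_corners=True) for o in off]
    patches = torch.stack(patches, dim=2)               # (B, C, k*k, H, W)

    # Attention coefficients from patch/target-feature similarity.
    logits = (patches * f_t.unsqueeze(2)).sum(dim=1) / C ** 0.5
    attn = F.softmax(logits, dim=1)                     # (B, k*k, H, W)
    return (patches * attn.unsqueeze(1)).sum(dim=2)     # (B, C, H, W)
```

A full warp is then `content_aware_sample(f_s, f_t, predict_flow(f_s, f_t))`, with every step differentiable end to end.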

Experimental Results

The model's effectiveness is evaluated both qualitatively and quantitatively across datasets like DeepFashion and Market-1501:

  • Quantitative Metrics: Lower FID and LPIPS scores than competing state-of-the-art methods indicate better perceptual realism and detail retention; for instance, the paper reports an FID of 10.573 on the DeepFashion dataset (a metric-computation sketch follows this list).
  • Subjective Evaluation: Human evaluation via a Just Noticeable Difference (JND) test shows a significant increase in perceived realism over competing models.
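As a hedged sketch of how such numbers are typically computed, the snippet below uses the `torchmetrics` and `lpips` packages rather than the authors' evaluation code; FID is a distribution-level statistic, so `update` must be called over the entire test set before `compute`:

```python
# Illustrative metric computation with torchmetrics / lpips (not the
# authors' evaluation code).
import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
lpips_fn = lpips.LPIPS(net="alex")        # AlexNet-based perceptual distance

def update_metrics(real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """real/fake: uint8 image batches of shape (B, 3, H, W)."""
    fid.update(real, real=True)           # accumulate real-image statistics
    fid.update(fake, real=False)          # accumulate generated-image stats
    to_signed = lambda x: x.float() / 127.5 - 1.0   # LPIPS expects [-1, 1]
    return lpips_fn(to_signed(real), to_signed(fake)).mean()

# After iterating over all test batches:
# print(float(fid.compute()))             # distribution-level FID score
```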

Implications and Future Directions

The proposed framework not only excels in pose-guided image generation but also shows promise for broader applications needing spatial transformation, such as:

  • Video Animation: The paper's additional results show that the model preserves consistency and detail across frames when animating a person from a driving pose sequence.
  • View Synthesis: The same machinery generates novel views from a single input viewpoint, with potential applications in mixed reality.

The integration of global and local spatial transformation methodologies could significantly influence future research in image generation and manipulation tasks. Further investigation may explore optimizing training stability and extending the model to handle more complex spatial transformations without reliance on supplementary information.

Overall, this work shows how pairing global flow estimation with local attention overcomes a fundamental limitation of CNNs in spatial manipulation, providing a practical foundation for future image generation and manipulation systems.