- The paper introduces ClothNet, a two-stage generative model that produces realistic 256x256 images of people in clothing, combining a VAE-based latent sketch module with a rendering (portray) module.
- The framework supports both conditional and unconditional generation, so images can be synthesized from random samples or guided by specific appearance cues such as pose and color.
- In the empirical evaluation, up to 24.7% of user-study participants mistook the synthetic images for real photographs, and a segmentation model trained on the generated data reached roughly 85% of the performance obtained with real training data.
A Generative Model of People in Clothing: A Technical Overview
The paper "A Generative Model of People in Clothing" introduces a novel approach to generating images of people wearing various types of clothing without relying on complex computer graphics pipelines or requiring high-quality 3D scans. This work tackles the significant challenges posed by human pose, shape, and appearance variance by implementing a two-stage learning process. The authors' approach consists of training a generative model directly from an extensive image database, effectively leveraging the data-driven nature of modern machine learning techniques in contrast to traditional graphics methods.
Key Contributions
The proposed method, named ClothNet, consists of two parts: the latent sketch module and the portray module. The latent sketch module employs a Variational Autoencoder (VAE) to generate a semantic segmentation of a person's body and clothing, capturing the high variance in clothing shape and allowing efficient sampling from a latent space. The portray module then renders realistic textures on top of the generated sketches. Combined, the two modules produce images of people in varied clothing styles with a level of realism that approaches that of real photographs.
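The PyTorch sketch below illustrates the latent-sketch idea under stated assumptions: a small VAE over one-hot part-segmentation maps whose decoder can also be sampled unconditionally. The class count, latent size, layer shapes, and all names (`LatentSketchVAE`, `N_PARTS`, etc.) are assumptions for illustration, not the authors' released architecture.

```python
import torch
import torch.nn as nn

N_PARTS = 22         # assumed number of body/clothing part classes
LATENT_DIM = 128     # assumed latent dimensionality
FEAT = 64 * 64 * 64  # flattened feature size for 256x256 inputs after two stride-2 convs

class LatentSketchVAE(nn.Module):
    """VAE over semantic segmentation sketches: encode a one-hot part map,
    sample a latent code, decode per-pixel class logits."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(N_PARTS, 32, 4, stride=2, padding=1), nn.ReLU(),  # 256 -> 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),       # 128 -> 64
            nn.Flatten(),
        )
        self.to_mu = nn.Linear(FEAT, LATENT_DIM)
        self.to_logvar = nn.Linear(FEAT, LATENT_DIM)
        self.dec = nn.Sequential(
            nn.Linear(LATENT_DIM, FEAT), nn.ReLU(),
            nn.Unflatten(1, (64, 64, 64)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 128
            nn.ConvTranspose2d(32, N_PARTS, 4, stride=2, padding=1),        # 128 -> 256
        )

    def forward(self, sketch_onehot):
        h = self.enc(sketch_onehot)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

    @torch.no_grad()
    def sample(self, n):
        """Unconditional generation: draw z ~ N(0, I), decode, take per-pixel argmax."""
        z = torch.randn(n, LATENT_DIM)
        return self.dec(z).argmax(dim=1)
```

In a sketch like this, training would combine a per-pixel cross-entropy reconstruction term with the usual KL divergence on `(mu, logvar)`.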
Two notable features of ClothNet include:
- Conditional and Unconditional Generation: The framework allows for both conditioned and unconditioned generation scenarios. This flexibility enables the model to produce images conditioned on specific appearance cues such as pose and colors, while also supporting random sampling for unconditioned image generation.
- High-Resolution Outputs: While many generative models traditionally operate at low resolutions, ClothNet generates images at a resolution of 256x256 pixels. This is accomplished with encoder-decoder architectures, such as U-Net and context encoders, whose skip connections help preserve fine-grained texture detail (see the sketch after this list).
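As a rough illustration of the skip-connection idea, the sketch below (assumed names, depth, and channel counts, not the paper's exact network) maps a part-segmentation sketch, optionally with extra conditioning channels, to a 256x256 RGB image; each decoder stage concatenates the matching encoder feature map.

```python
import torch
import torch.nn as nn

class PortrayUNet(nn.Module):
    """Encoder-decoder with skip connections: sketch (+ optional conditioning
    channels) in, 256x256 RGB image out."""
    def __init__(self, in_ch=22, cond_ch=0):  # in_ch matches the assumed part count above
        super().__init__()
        c = in_ch + cond_ch
        self.down1 = nn.Sequential(nn.Conv2d(c, 64, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down3 = nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2))
        self.up3 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(256, 64, 4, 2, 1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(128, 3, 4, 2, 1)

    def forward(self, sketch, cond=None):
        x = sketch if cond is None else torch.cat([sketch, cond], dim=1)
        d1 = self.down1(x)                       # 128x128
        d2 = self.down2(d1)                      # 64x64
        d3 = self.down3(d2)                      # 32x32
        u3 = self.up3(d3)                        # 64x64
        u2 = self.up2(torch.cat([u3, d2], 1))    # skip connection from d2
        rgb = self.up1(torch.cat([u2, d1], 1))   # skip connection from d1
        return torch.tanh(rgb)                   # RGB in [-1, 1]

# Unconditional use (hypothetical wiring): sample a sketch, one-hot encode it, render it.
# labels = LatentSketchVAE().sample(1)
# sketch = torch.nn.functional.one_hot(labels, 22).permute(0, 3, 1, 2).float()
# image = PortrayUNet()(sketch)
```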
Empirical Evaluation
The paper provides a comprehensive empirical evaluation of the ClothNet framework. A user study assessed the perceived realism of the generated images, revealing that up to 24.7% of participants could not distinguish the synthetic images from real ones. Furthermore, to validate the utility of the generated images as training data, a semantic segmentation model trained on the synthetic dataset achieved approximately 85% of the performance observed with real data.
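For context, a relative-performance comparison of this kind is usually reported with a segmentation metric such as mean intersection-over-union; the small helper below is an assumed metric and helper for illustration, not code from the paper.

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Mean intersection-over-union across classes for integer label maps."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Hypothetical comparison of a synthetic-trained vs. real-trained model on the same test set:
# ratio = mean_iou(pred_synthetic_trained, gt, n) / mean_iou(pred_real_trained, gt, n)
```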
In the color-conditioning experiments, the model follows the specified color cues while preserving fine texture details such as patterns and wrinkles. These results underline the potential of ClothNet to produce realistic imagery for applications that require synthetic datasets, such as training machine vision models.
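One simple way to feed such color cues to a renderer, offered purely as an assumption about the mechanism rather than the paper's implementation, is to paint each part of the segmentation sketch with its requested color and pass the result as extra input channels (the `cond_ch` argument of the `PortrayUNet` sketch above).

```python
import torch

def colour_cue_channels(part_labels, part_colours):
    """part_labels: (H, W) integer part map; part_colours: {part_id: (r, g, b) in [0, 1]}.
    Returns a (1, 3, H, W) tensor of per-pixel colour cues (zeros where no cue is given)."""
    h, w = part_labels.shape
    cue = torch.zeros(3, h, w)
    for part_id, rgb in part_colours.items():
        mask = part_labels == part_id
        for ch, value in enumerate(rgb):
            cue[ch][mask] = value  # paint this part's pixels with the requested colour
    return cue.unsqueeze(0)

# Hypothetical usage: condition the renderer on a red upper-body garment.
# image = PortrayUNet(cond_ch=3)(sketch, colour_cue_channels(labels, {upper_body_id: (0.8, 0.1, 0.1)}))
```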
Implications and Future Directions
The results of this paper suggest that image-based generative models can circumvent traditional rendering challenges and produce high-quality, realistic images of clothed people. This has practical implications in fields such as fashion, augmented reality, and virtual reality, where clothing simulation and realistic avatar creation are vital.
Theoretically, this research demonstrates the ability of VAEs and adversarial networks to capture complex visual appearance, opening avenues for their application in other image-synthesis settings. Future research could push the resolution further or integrate more dynamic environmental context to enhance realism. Training on larger and more diverse datasets could also extend ClothNet to a broader range of human appearances.
This work marks a significant step toward a cost-effective and versatile solution for realistic person image generation, highlighting the transformative potential of data-driven models in computer vision. The complete set of resources for ClothNet, including data and code, is planned to be released for academic use to accelerate continued exploration and development in this domain.