ARCH: Animatable Reconstruction of Clothed Humans
The paper "ARCH: Animatable Reconstruction of Clothed Humans" presents an advanced framework designed to create high-fidelity 3D human avatars from a single monocular image. The primary aim of ARCH is to tackle limitations of existing 3D human digitization techniques, especially in capturing detailed geometry and animation-ready models from images depicting various poses and natural conditions.
The authors introduce the ARCH framework, which leverages a pose-aware learned model to produce rigged, full-body human avatars. The cornerstone of this approach is the pairing of a Semantic Space (SemS) with a Semantic Deformation Field (SemDF). Using these, ARCH maps both 2D and 3D representations of clothed humans into a canonical (rest-pose) space, resolving the geometric ambiguities and inaccuracies that arise from pose variation and occlusion in typical datasets.
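To make the canonical-space warp concrete, here is a minimal sketch, assuming the deformation field is approximated by inverse linear blend skinning: each posed-space query point carries skinning weights over the body joints, and its weight-blended transform is inverted to carry it back to canonical space. The function name, weight source, and shapes are illustrative, not the paper's actual implementation.

```python
import numpy as np

def inverse_lbs_warp(points, weights, joint_transforms):
    """Warp posed-space points to canonical space via inverse
    linear blend skinning (one plausible reading of SemDF).

    points:           (N, 3) query points in posed space
    weights:          (N, J) per-point skinning weights (rows sum to 1)
    joint_transforms: (J, 4, 4) canonical-to-posed rigid transforms
    """
    # Blend each point's joint transforms by its skinning weights ...
    blended = np.einsum("nj,jkl->nkl", weights, joint_transforms)  # (N, 4, 4)
    # ... then apply the inverse blended transform to undo the pose.
    points_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    canonical = np.einsum("nkl,nl->nk", np.linalg.inv(blended), points_h)
    return canonical[:, :3]
```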
The model adopts an implicit function representation to capture fine surface geometry and texture, trained with per-pixel supervision enabled by opacity-aware differentiable rendering. This allows ARCH to surpass traditional methods in capturing intricate details such as clothing wrinkles and hair. The authors report a marked improvement in reconstruction fidelity, with reconstruction errors reduced by more than 50% relative to leading methods on benchmark datasets.
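As one way to picture the implicit representation, below is a toy occupancy network in PyTorch: a canonical-space point and a pixel-aligned image feature are concatenated and mapped to an occupancy probability. The architecture, feature dimension, and names are assumptions for illustration, not ARCH's actual network.

```python
import torch
import torch.nn as nn

class ImplicitOccupancy(nn.Module):
    """Toy stand-in for an implicit surface function: maps a
    canonical-space point plus a pixel-aligned image feature to an
    occupancy probability (1 inside the body, 0 outside)."""

    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, features):
        # points: (N, 3) canonical coordinates; features: (N, feat_dim)
        return torch.sigmoid(self.mlp(torch.cat([points, features], dim=-1)))

# Query a batch of canonical points with their projected image features.
model = ImplicitOccupancy()
occ = model(torch.rand(1024, 3), torch.rand(1024, 256))  # (1024, 1)
```

In methods of this family, the surface is then typically extracted as the 0.5 level set of the predicted occupancy, e.g. with marching cubes.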
Strong Numerical Results
The proposed method demonstrates strong quantitative outcomes: surface normal, point-to-surface (P2S), and Chamfer distance errors are significantly lower than those of existing methods such as BodyNet and PIFu. These metrics underscore ARCH's superior performance in reconstructing and animating full-body avatars with high realism when benchmarked on datasets such as RenderPeople and BUFF.
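For reference, these distance metrics can be approximated from point samples of the two surfaces. The sketch below uses nearest-neighbor distances between point clouds; the exact benchmark protocol (sampling density, normalization, mesh-based distances) may differ from the paper's.

```python
import numpy as np
from scipy.spatial import cKDTree

def p2s_and_chamfer(pred_pts, gt_pts):
    """Approximate point-to-surface (P2S) and symmetric Chamfer
    distances between two surface point samples.

    pred_pts, gt_pts: (N, 3) and (M, 3) points sampled from the
    reconstructed and ground-truth meshes, respectively.
    """
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]    # nearest GT point
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]    # nearest pred point
    p2s = d_pred_to_gt.mean()                            # one-sided: pred -> GT
    chamfer = 0.5 * (p2s + d_gt_to_pred.mean())          # symmetric average
    return p2s, chamfer
```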
Contributions and Implications
The paper delineates three primary contributions:
- The introduction and integration of SemS and SemDF for enhanced implicit function representation of 3D human avatars.
- An opacity-aware differentiable rendering scheme that further refines the representation through per-pixel Granular Render-and-Compare supervision (see the compositing sketch after this list).
- A demonstration that ARCH's output is directly animatable, driving the reconstructed avatars with motion capture data to produce compelling animations.
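To illustrate the per-pixel supervision idea, here is a minimal sketch of opacity-aware compositing: per-sample occupancies along each camera ray are alpha-composited into a single differentiable pixel opacity, which can then be compared against a ground-truth silhouette. This is one standard compositing form, not necessarily ARCH's exact formulation.

```python
import torch

def ray_opacity(occ):
    """Composite per-sample occupancies along each ray into a single
    differentiable pixel opacity (front-to-back alpha compositing).

    occ: (R, S) occupancy in [0, 1] for S depth samples on R rays.
    """
    trans = torch.cumprod(1.0 - occ + 1e-6, dim=-1)       # transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]),     # T_0 = 1
                       trans[:, :-1]], dim=-1)
    return (trans * occ).sum(dim=-1)                      # (R,)

# Per-pixel supervision against a (here random) silhouette mask.
occ = torch.rand(4096, 64, requires_grad=True)
loss = torch.nn.functional.mse_loss(ray_opacity(occ), torch.rand(4096))
loss.backward()
```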
The theoretical contribution lies in performing reconstruction in a canonical space, effectively decoupling pose from shape. Practically, this innovation facilitates automatic rigging, making it substantially easier to animate reconstructed models. The implications are particularly notable for digital content creation, AR/VR experiences, and potentially telepresence technologies.
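Animating the rigged output amounts to the forward counterpart of the canonical warp sketched earlier: blend each vertex's joint transforms by its skinning weights and apply the result. A minimal sketch, again with illustrative names and shapes:

```python
import numpy as np

def animate(canonical_verts, weights, pose_transforms):
    """Drive a rigged canonical avatar with a new pose via forward
    linear blend skinning.

    canonical_verts: (N, 3) rest-pose mesh vertices
    weights:         (N, J) skinning weights from automatic rigging
    pose_transforms: (J, 4, 4) per-joint transforms, e.g. from mocap
    """
    blended = np.einsum("nj,jkl->nkl", weights, pose_transforms)
    verts_h = np.concatenate(
        [canonical_verts, np.ones((len(canonical_verts), 1))], axis=1)
    return np.einsum("nkl,nl->nk", blended, verts_h)[:, :3]
```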
Speculations on Future Developments
Given the burgeoning interest in AI-driven 3D reconstruction, ARCH sets a precedent by bridging the gap between a static image input and a dynamic, animatable 3D model. Future work could extend the approach to video input, bringing temporal coherence and dynamic modeling into reach. Integration with generative models could also open pathways to more diverse and customizable avatars, expanding utility in content personalization and the entertainment industry.
In conclusion, the ARCH framework represents a substantive advance at the intersection of computer vision and graphics, offering both theoretical contributions and practical utility in animating digitized human forms directly from simple image inputs. The paper marks significant progress and sets the stage for future work on the automatic, efficient generation of detailed, animatable human models.