- The paper introduces a layered neural representation that separately models body and garment details for improved 3D reconstruction.
- The method leverages a virtual bone deformation module that enables accurate tracking of dynamic, free-form garment movements.
- Multi-layer differentiable volume rendering yields high-fidelity reconstructions, outperforming state-of-the-art baselines on challenging datasets.
ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild
The paper "ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild" presents a novel methodology for 3D human reconstruction from monocular video, with a specific focus on subjects wearing loose garments. This focus distinguishes it from prior methods that predominantly target tight-fitting clothing, which is easier to model because it adheres closely to the body's contours.
Core Contributions
The key contributions of this work are three-fold:
- Layered Neural Human Representation: The authors introduce a multi-layer representation that separately models the inner body and outer clothing layers using neural implicit functions. This decomposition improves model expressiveness and allows capturing intricate details of the garments, which are often lost in single-layer models.
- Virtual Bone Deformation Module: Unlike conventional methods that rely on skeletal deformations derived from body poses, the proposed virtual bone deformation module permits free-form movement, thereby accurately tracking dynamically deforming garments. This module applies non-hierarchical deformations that are not limited by the anatomical constraints of human skeletons, enabling the recovery of complex and dynamic garment motions.
- Multi-Layer Differentiable Volume Rendering: By extending standard volume rendering techniques to handle multiple neural layers, the authors achieve high-fidelity reconstruction of both the human body and outer clothing. This rendering approach ensures temporally consistent and detailed visual outputs.
Methodological Insights
Layered Representation
The inner body and outer garments are modeled through separate networks that predict Signed Distance Fields (SDFs) and radiance values. By decomposing the clothed human figure into these two layers, the model can capture the intricate details and large deformations of loose garments, which single-layer models struggle to represent.
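To make the layered idea concrete, here is a minimal PyTorch sketch of two independent implicit layers, each mapping a canonical 3D point to an SDF value and a radiance. The class name `ImplicitLayer` and the small MLP sizes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ImplicitLayer(nn.Module):
    """One neural layer: maps a 3D point in canonical space to an SDF value
    and an RGB radiance. Deliberately small MLP, for illustration only."""
    def __init__(self, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
        )
        self.sdf_head = nn.Linear(hidden, 1)                               # signed distance to the layer's surface
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())  # radiance in [0, 1]

    def forward(self, x):
        h = self.backbone(x)
        return self.sdf_head(h), self.rgb_head(h)

# Two independent layers: one for the inner body, one for the outer garment.
body_field = ImplicitLayer()
garment_field = ImplicitLayer()

points = torch.rand(1024, 3)                 # query points in canonical space
body_sdf, body_rgb = body_field(points)
garment_sdf, garment_rgb = garment_field(points)
```

Keeping the two fields separate is what allows the garment layer to be deformed and supervised independently of the body layer.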
Hybrid Deformation Modeling
The hybrid deformation strategy comprises:
- Skeletal Deformation: Uses Linear Blend Skinning (LBS) for inner body deformations, driven by SMPL model skeletal poses.
- Virtual Bone Deformation: Introduces a set of virtual bones for the garment layer, which follow non-hierarchical, free-form motions to accurately capture the movement of loose garments.
The virtual bones' positions are refined through a learning process that ensures their transformations accurately reflect the dynamics of the garment fabric under various motions.
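A rough sketch of the shared blending formula is given below: the same linear blend skinning operation serves both layers, but for the garment the per-bone transforms are free, non-hierarchical parameters rather than transforms derived from the SMPL skeleton. Variable names, bone counts, and the per-frame optimization of `virtual_transforms` are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def blend_skinning(points, weights, transforms):
    """Linear blend skinning: warp canonical points with a weighted sum of
    per-bone rigid transforms.
    points:     (N, 3)    canonical positions
    weights:    (N, B)    skinning weights per bone (rows sum to 1)
    transforms: (B, 4, 4) rigid transform of each bone for the current frame
    """
    homog = torch.cat([points, torch.ones(points.shape[0], 1)], dim=-1)  # (N, 4) homogeneous coords
    blended = torch.einsum('nb,bij->nij', weights, transforms)           # (N, 4, 4) per-point transform
    warped = torch.einsum('nij,nj->ni', blended, homog)
    return warped[:, :3]

# Inner body: weights and transforms would come from the SMPL skeleton.
# Garment layer: same formula, but the "bones" are virtual ones whose
# transforms are optimised freely per frame with no parent-child hierarchy,
# so the garment can move independently of the anatomical skeleton.
num_points, num_virtual_bones = 2048, 16
garment_points = torch.rand(num_points, 3)
virtual_weights = torch.softmax(torch.randn(num_points, num_virtual_bones), dim=-1)
virtual_transforms = torch.eye(4).repeat(num_virtual_bones, 1, 1)  # stand-in for learned transforms
deformed = blend_skinning(garment_points, virtual_weights, virtual_transforms)
```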
Differentiable Volume Rendering
The authors apply a multi-round sampling process in which points sampled within the body and garment layers are evaluated and then combined through a sorting and weighting mechanism. This allows complex occlusion scenarios to be rendered correctly, keeps the layers coherent, and enables realistic reconstruction from monocular images.
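The following is a minimal sketch of how samples from two layers can be merged along a single ray: samples are sorted by depth and alpha-composited so that whichever layer is closer occludes the other. It assumes a generic density-based composition rather than the paper's exact SDF-to-density conversion, and all tensor shapes and names are illustrative.

```python
import torch

def render_two_layers(depths_a, sigma_a, rgb_a, depths_b, sigma_b, rgb_b):
    """Composite samples from two neural layers along one ray.
    Samples from both layers are merged, sorted by depth, and alpha-composited.
    depths_*: (S,), sigma_*: (S,), rgb_*: (S, 3)
    """
    depths = torch.cat([depths_a, depths_b])
    sigma = torch.cat([sigma_a, sigma_b])
    rgb = torch.cat([rgb_a, rgb_b], dim=0)

    order = torch.argsort(depths)                              # sort merged samples by depth
    depths, sigma, rgb = depths[order], sigma[order], rgb[order]

    deltas = torch.diff(depths, append=depths[-1:] + 1e10)     # interval lengths between samples
    alpha = 1.0 - torch.exp(-sigma * deltas)                   # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                    # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)                 # final pixel colour

# One ray, 64 samples per layer; densities and radiances would come from each network.
da, db = torch.sort(torch.rand(64))[0], torch.sort(torch.rand(64))[0]
color = render_two_layers(da, torch.rand(64), torch.rand(64, 3),
                          db, torch.rand(64), torch.rand(64, 3))
```

Because the composition is fully differentiable, photometric losses on the rendered pixels can supervise both layers jointly.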
Experimental Validation
The efficacy of ReLoo was demonstrated on the newly introduced MonoLoose dataset as well as the existing DynaCap dataset. On sequences of humans wearing loose garments in highly dynamic scenarios, the method surpassed existing baselines (e.g., SelfRecon, Vid2Avatar, SCARF) on both 3D surface reconstruction and novel view synthesis across various metrics.
Quantitative Results
- In 3D surface reconstruction on the MonoLoose dataset, ReLoo achieved a Chamfer distance of 1.93 cm, outperforming baselines that recorded up to 3.13 cm (a generic sketch of this metric follows this list).
- For novel view synthesis, the method attained a PSNR of 29.2 and an SSIM of 0.970 on the MonoLoose dataset, clearly higher than competing approaches.
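For reference, a symmetric Chamfer distance between a reconstructed surface and a ground-truth scan can be computed as below. This is a generic sketch of the metric; the paper's exact convention (sum vs. average of the two directions, sampling density) may differ, and the point clouds here are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point clouds (same units, e.g. cm):
    average nearest-neighbour distance in both directions."""
    d_ab, _ = cKDTree(points_b).query(points_a)   # each point in A to its nearest in B
    d_ba, _ = cKDTree(points_a).query(points_b)   # each point in B to its nearest in A
    return 0.5 * (d_ab.mean() + d_ba.mean())

# Placeholder point clouds sampled from a reconstructed mesh and a ground-truth scan.
recon = np.random.rand(10000, 3) * 100.0   # coordinates in cm
gt = np.random.rand(10000, 3) * 100.0
print(f"Chamfer distance: {chamfer_distance(recon, gt):.2f} cm")
```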
Theoretical Implications and Future Work
Theoretically, ReLoo paves the way for more sophisticated and nuanced human modeling techniques by modeling the dynamics of outer garments independently of the body. This challenges the traditional exclusive reliance on skeletal deformation and opens new avenues for modeling other complex deformations and interactions between the body and outer layers.
Future work could build on this foundation by exploring broader applications such as virtual try-ons in e-commerce, more complex multi-layer garment reconstruction, and integrating additional sensory inputs (e.g., depth sensors) to further enhance reconstruction fidelity. Integrating unsupervised or semi-supervised learning techniques might further reduce the reliance on annotated data, extending the model's generalizability.
Practical Implications
From a practical standpoint, this method can significantly enhance applications requiring realistic human avatars, such as in virtual reality, gaming, film production, and telepresence systems. By accommodating a wider variety of clothing and motions, ReLoo addresses a critical need in the democratization of high-quality virtual human representations from easily obtainable video data.
Conclusion
In conclusion, the ReLoo methodology offers substantial improvements in handling the unique challenges posed by loose garments in 3D human reconstruction. By employing a layered neural representation, non-hierarchical virtual bone deformations, and a sophisticated volume rendering approach, it sets a new benchmark in achieving realistic, high-fidelity reconstructions from monocular video in the wild.