DressRecon: Freeform 4D Human Reconstruction from Monocular Video (2409.20563v2)

Published 30 Sep 2024 in cs.CV

Abstract: We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires calibrated multi-view captures or personalized template scans which are costly to collect at scale. Our key insight for high-quality yet flexible reconstruction is the careful combination of generic human priors about articulated body shape (learned from large-scale training data) with video-specific articulated "bag-of-bones" deformation (fit to a single video via test-time optimization). We accomplish this by learning a neural implicit model that disentangles body versus clothing deformations as separate motion model layers. To capture subtle geometry of clothing, we leverage image-based priors such as human body pose, surface normals, and optical flow during optimization. The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians for high-fidelity interactive rendering. On datasets with highly challenging clothing deformations and object interactions, DressRecon yields higher-fidelity 3D reconstructions than prior art. Project page: https://jefftan969.github.io/dressrecon/

Summary

  • The paper introduces a hierarchical bag-of-bones model that disentangles body and clothing deformations to accurately reconstruct dynamic human avatars.
  • It employs neural implicit representations combined with test-time optimization using image-based priors to capture complex non-rigid motions.
  • Empirical results show improved 3D reconstruction fidelity and rendering quality, highlighting advancements over traditional monocular reconstruction methods.

Freeform 4D Human Reconstruction from Monocular Video with DressRecon

The paper "DressRecon: Freeform 4D Human Reconstruction from Monocular Video" introduces a novel approach for reconstructing dynamic human avatars, particularly focusing on scenarios with loose clothing and object interactions. This task has traditionally been challenging due to the intricate deformations and the monocular nature of video inputs. The authors address these complexities by proposing a method that effectively disentangles body and clothing deformations using a hierarchical bag-of-bones deformation model.

Methodological Overview

The core of DressRecon's approach is a hierarchical, compositional motion model that captures body and clothing movements separately. It introduces two distinct layers of Gaussian-based deformation, one for the body and one for clothing, allowing precise motion representation. This separation is vital for handling scenes with loose garments and handheld objects, which many existing methods cannot address because they are restricted to tight clothing or require calibrated multi-view capture.
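
To make the layered design concrete, the sketch below shows one plausible form of a Gaussian "bag-of-bones" layer: each bone is a Gaussian whose center and scale define soft skinning weights, and each frame assigns every bone a rigid transform. The class name `DeformationLayer`, the bone counts, and the sequential composition of body and clothing warps are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only; names and structure are assumptions, not the
# paper's implementation.
import torch

def axis_angle_to_matrix(aa: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: (..., 3) axis-angle -> (..., 3, 3) rotation."""
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = aa / theta
    K = torch.zeros(*aa.shape[:-1], 3, 3)
    K[..., 0, 1], K[..., 0, 2] = -k[..., 2], k[..., 1]
    K[..., 1, 0], K[..., 1, 2] = k[..., 2], -k[..., 0]
    K[..., 2, 0], K[..., 2, 1] = -k[..., 1], k[..., 0]
    I = torch.eye(3).expand_as(K)
    th = theta.unsqueeze(-1)
    return I + th.sin() * K + (1 - th.cos()) * (K @ K)

class DeformationLayer(torch.nn.Module):
    """One 'bag-of-bones' layer: Gaussians give soft skinning weights,
    and each frame stores a rigid transform per bone."""
    def __init__(self, n_bones: int, n_frames: int):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(n_bones, 3) * 0.1)
        self.log_scales = torch.nn.Parameter(torch.zeros(n_bones))
        self.rot = torch.nn.Parameter(torch.zeros(n_frames, n_bones, 3))
        self.trans = torch.nn.Parameter(torch.zeros(n_frames, n_bones, 3))

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        """Warp canonical points x (N, 3) to their frame-t positions."""
        rel = x[:, None] - self.centers[None]                    # (N, B, 3)
        w = torch.softmax(
            -(rel ** 2).sum(-1) / self.log_scales.exp()[None] ** 2, -1)
        R = axis_angle_to_matrix(self.rot[t])                    # (B, 3, 3)
        moved = torch.einsum("bij,nbj->nbi", R, rel) \
            + self.centers[None] + self.trans[t][None]           # (N, B, 3)
        return (w[..., None] * moved).sum(dim=1)                 # (N, 3)

# Compose coarse body motion with a finer clothing layer (a simplification:
# this clothing layer re-evaluates its weights in body-posed space).
body, cloth = DeformationLayer(25, 100), DeformationLayer(64, 100)
x_canonical = torch.rand(1024, 3) * 2 - 1
x_t = cloth(body(x_canonical, t=0), t=0)
```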

DressRecon represents the canonical shape as a neural signed distance field and applies a time-varying deformation field, combining generic human priors with video-specific articulations that are fitted via test-time optimization, ensuring both flexibility and high-fidelity reconstruction. Because the deformation is represented implicitly and in layers, the method can disentangle the complex motion of loose clothing and accessories from the underlying body, enabling accurate portrayal of human interactions with varied objects.
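
Since the canonical geometry is a signed distance field, time-consistent meshes follow naturally: run marching cubes once in canonical space, then push the fixed vertex set through the frame-t deformation. The sketch below is hedged: it reuses the illustrative `DeformationLayer` objects from above, the `CanonicalSDF` architecture is an assumption, and it presumes an already-optimized field (a freshly initialized network may not contain a zero level set).

```python
# Hedged sketch: extract a canonical mesh once, then warp it per frame.
# Assumes `body` and `cloth` from the previous sketch and an optimized SDF.
import torch
from skimage import measure  # pip install scikit-image

class CanonicalSDF(torch.nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3, hidden), torch.nn.Softplus(),
            torch.nn.Linear(hidden, hidden), torch.nn.Softplus(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, x):  # (N, 3) -> (N,) signed distances
        return self.net(x).squeeze(-1)

sdf = CanonicalSDF()  # in practice: optimized against the input video
res = 64
axis = torch.linspace(-1.0, 1.0, res)
grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), -1)
with torch.no_grad():
    values = sdf(grid.reshape(-1, 3)).reshape(res, res, res).numpy()
# Marching cubes on the zero level set of the canonical field.
verts, faces, _, _ = measure.marching_cubes(values, level=0.0)
verts = torch.from_numpy(verts.copy()).float() / (res - 1) * 2 - 1
with torch.no_grad():
    verts_t = cloth(body(verts, t=0), t=0)  # same topology at every frame
```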

Technical Contributions

  1. Hierarchical Bag-of-Bones Deformation Model: A central contribution is a hierarchical deformation model that systematically separates body and clothing motion, a clear advantage in scenarios with extreme non-rigid deformation. The model initializes the Gaussian motion parameters from pretrained body pose models, providing a robust starting point that speeds convergence during optimization.
  2. Optimization with Image-Based Priors: The optimization is supervised by priors estimated with off-the-shelf image models, including body pose, surface normals, optical flow, and segmentation masks (see the loss sketch after this list). These signals stabilize test-time optimization and help resolve the depth ambiguities inherent in monocular 3D shape recovery.
  3. 3D Gaussian Refinement: To improve rendering quality after reconstruction, DressRecon converts the implicit neural fields into explicit 3D Gaussians and refines them, improving fidelity and enabling high-quality interactive rendering.
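
As a rough illustration of how the image-based priors in item 2 might enter the objective, the sketch below combines reconstruction terms against off-the-shelf estimates. The dictionary keys, loss forms, and weights are assumptions for exposition, and `pred` stands in for the output of a differentiable renderer of the deformed model; the paper's actual terms and weighting differ in detail.

```python
# Hypothetical loss weighting; terms and weights are illustrative only.
import torch
import torch.nn.functional as F

def prior_losses(pred: dict, obs: dict,
                 w_sil=1.0, w_normal=0.1, w_flow=0.5, w_pose=0.1):
    """pred/obs: rendered vs. off-the-shelf estimates for one frame.
    silhouette: (H, W), normals: (H, W, 3), flow: (H, W, 2), keypoints: (K, 2).
    """
    loss = w_sil * F.mse_loss(pred["silhouette"], obs["silhouette"])
    # Normal agreement (cosine distance), counted only on the subject.
    cos = (pred["normals"] * obs["normals"]).sum(dim=-1)       # (H, W)
    loss = loss + w_normal * ((1.0 - cos) * obs["silhouette"]).mean()
    loss = loss + w_flow * F.l1_loss(pred["flow"], obs["flow"])
    loss = loss + w_pose * F.mse_loss(pred["keypoints"], obs["keypoints"])
    return loss
```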

Evaluation and Implications

The empirical evaluation demonstrates DressRecon's advantage over existing methods, particularly in scenarios with dynamic clothing and object interactions. It consistently outperforms prior approaches on 3D reconstruction and produces higher-fidelity results in both shape and appearance, with clear gains in metrics such as 3D Chamfer distance and rendering accuracy, highlighting the method's robustness and effectiveness.
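
For reference, the 3D Chamfer distance between a predicted and a ground-truth surface is typically computed on sampled point clouds as below; DressRecon's exact evaluation protocol (sampling density, alignment) is not reproduced here.

```python
# Standard symmetric Chamfer distance on point clouds; the paper's
# evaluation details (sampling, alignment) may differ.
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (N, 3), b: (M, 3) points sampled from the two surfaces."""
    d2 = torch.cdist(a, b) ** 2        # (N, M) squared pairwise distances
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()
```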

From a theoretical perspective, DressRecon advances our understanding of disentangled motion representation in neural fields and underscores the utility of leveraging hierarchical structures in handling complex motion data. Practically, the method opens avenues for more accessible and scalable human reconstruction, potentially impacting fields such as virtual reality, gaming, and digital content creation.

Future Developments

Looking ahead, there are opportunities to further explore the interaction physics between humans, clothing, and objects in more depth. This could involve integrating physical simulation models to enhance realism, particularly for reanimation purposes. Additionally, extending applications to diverse settings, such as multi-person interactions and various environmental contexts, would broaden the utility and generalization of this work.

In conclusion, DressRecon represents a significant step forward in 4D human reconstruction from monocular video inputs, providing an effective solution to longstanding issues within the domain. The authors' focus on separation of body and clothing deformations and the incorporation of advanced image priors sets a promising direction for future research and application development in AI-driven human modeling.