- The paper introduces MoCo-NeRF, which leverages radiance residual fields to model both rigid and non-rigid human motions effectively.
- The paper employs a single multiresolution hash encoder to jointly learn canonical T-pose and dynamic pose-induced radiance residuals, reducing training complexity.
- The paper demonstrates superior rendering quality and efficiency on the ZJU-MoCap and MonoCap datasets, achieving high PSNR and SSIM scores with significantly faster training.
Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling
The paper introduces MoCo-NeRF, a framework for free-viewpoint rendering of humans captured in monocular video. It targets the difficulty of modeling dynamic clothed humans, whose appearance is shaped both by skeletal articulation and by non-rigid motions such as cloth wrinkles. Traditional approaches struggle with non-rigid motion because of its high learning complexity and the absence of direct supervision for it.
Key Contributions
- Radiance Residual Field Modeling: MoCo-NeRF models pose-dependent appearance changes as radiance residuals. Where conventional deformation-based methods predict unbounded 3D spatial offsets, MoCo-NeRF predicts color discrepancies (residuals) directly in radiance space. This sidesteps the difficulty of learning unbounded offsets and aligns naturally with RGB pixel-level supervision, improving learning efficiency (see the first sketch after this list).
- Single Multiresolution Hash Encoding (MHE): A single MHE concurrently learns the canonical T-pose radiance field (rigid transformations) and the radiance residual field (non-rigid transformations). The canonical T-pose provides a static reference frame, while the residual field captures pose-induced variations, keeping non-rigid motion modeling efficient and coherent; the first sketch below also reflects this shared-encoder layout.
- Pose-Embedded Implicit Feature: A learnable base code is modulated by body-pose information through cross-attention, which injects high-frequency, pose-adaptive implicit features into radiance-residual learning. The resulting features are more discriminative for each pose, improving rendering quality (see the second sketch below).
- Scalable Multi-Subject Learning: MoCo-NeRF extends to concurrent training of multiple subjects through a combination of global and local MHEs, along with learnable identity codes that individualize the outputs of the shared MLP decoders. The architecture keeps memory usage and training time efficient while rendering multiple subjects with distinct characteristics (see the third sketch below).
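Below is a minimal PyTorch sketch of the residual composition; all module names and dimensions are assumptions, not the paper's actual implementation. `feat` stands in for features queried from the single shared multiresolution hash encoding at canonical-space points, and `pose_feat` for a pose embedding. The key point is that the non-rigid branch outputs a bounded color residual added to the canonical radiance, so supervision flows directly from the photometric loss on rendered pixels rather than from unobservable 3D offsets.

```python
import torch
import torch.nn as nn

class SharedFieldHeads(nn.Module):
    """Canonical radiance head plus a radiance-residual head, both reading
    from one shared feature encoder (hypothetical sketch)."""
    def __init__(self, feat_dim=32, pose_dim=32, hidden=64):
        super().__init__()
        # Rigid branch: radiance and density in the static T-pose frame.
        self.canonical = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 4))
        # Non-rigid branch: a bounded, pose-conditioned color residual
        # instead of an unbounded 3D offset.
        self.residual = nn.Sequential(
            nn.Linear(feat_dim + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Tanh())

    def forward(self, feat, pose_feat):
        # feat: (N, feat_dim) features from the single shared hash encoding;
        # pose_feat: (N, pose_dim) pose-conditioned implicit feature.
        rgb_sigma = self.canonical(feat)
        rgb_c = torch.sigmoid(rgb_sigma[..., :3])      # canonical color
        sigma = torch.relu(rgb_sigma[..., 3:])         # density
        delta = self.residual(torch.cat([feat, pose_feat], dim=-1))
        rgb = (rgb_c + delta).clamp(0.0, 1.0)          # compose in radiance space
        return rgb, sigma                              # volume-rendered, RGB-supervised
```

Because both heads read the same encoder output, the canonical and residual fields stay spatially aligned without duplicating hash tables, which is what keeps training complexity low.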
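The pose-embedded implicit feature can be sketched as a small cross-attention block in which a learnable base code queries per-joint pose tokens. The token count, dimensions, and pose parameterization below are assumptions for illustration; the paper's exact module may differ.

```python
import torch
import torch.nn as nn

class PoseCrossAttention(nn.Module):
    """Learnable base code attending to pose tokens (hypothetical sketch)."""
    def __init__(self, n_tokens=8, dim=32, n_heads=4):
        super().__init__()
        # Base code: a small set of learnable query tokens shared across poses.
        self.base = nn.Parameter(torch.randn(n_tokens, dim))
        # Projects per-joint pose parameters (e.g., axis-angle rotations)
        # into key/value tokens.
        self.pose_proj = nn.Linear(3, dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, pose):
        # pose: (B, n_joints, 3) per-joint rotations for the current frame.
        kv = self.pose_proj(pose)                        # (B, n_joints, dim)
        q = self.base.unsqueeze(0).expand(pose.size(0), -1, -1)
        feat, _ = self.attn(q, kv, kv)                   # base code modulated by pose
        return feat                                      # pose-adaptive implicit feature
```

The attended output varies with pose while the base code stores pose-agnostic structure, which is one plausible reading of how the modulation yields discriminative, high-frequency features per pose.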
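For multi-subject training, the identity-code mechanism can be sketched as an embedding lookup concatenated into a shared decoder. Again, names and sizes are illustrative assumptions; per the paper, this sits downstream of the combined global and per-subject local MHEs.

```python
import torch
import torch.nn as nn

class MultiSubjectDecoder(nn.Module):
    """Shared MLP decoder individualized by per-subject identity codes
    (hypothetical sketch)."""
    def __init__(self, n_subjects, feat_dim=64, id_dim=16, hidden=64):
        super().__init__()
        # One learnable identity code per subject.
        self.id_codes = nn.Embedding(n_subjects, id_dim)
        # A single decoder shared by all subjects.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + id_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                        # (RGB, density)

    def forward(self, feat, subject_id):
        # feat: (N, feat_dim) concatenation of global and per-subject local
        # hash-grid features; subject_id: (N,) integer subject index.
        code = self.id_codes(subject_id)
        return self.mlp(torch.cat([feat, code], dim=-1))
```

Sharing the decoder keeps parameter count and memory nearly flat as subjects are added, which matches the paper's report of only a marginal increase in multi-subject training time.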
Results and Comparisons
The performance of MoCo-NeRF has been evaluated extensively on the ZJU-MoCap and MonoCap datasets. These datasets are benchmarks for dynamic human modeling tasks, allowing for robust comparisons:
- ZJU-MoCap: The framework achieved a PSNR of 31.06, SSIM of 0.9734, and LPIPS* of 28.83 (LPIPS scaled by 10^3; lower is better), surpassing HumanNeRF, Instant-NVR, and GauHuman on most metrics. This indicates superior visual quality and perceptual accuracy, especially in capturing detailed non-rigid motion.
- MonoCap: The system similarly outperformed the baseline methods, confirming its applicability across datasets and capture setups.
The paper also demonstrates substantial practical advantages. MoCo-NeRF trains far faster than HumanNeRF, requiring only about 2 hours on a single RTX 3090 GPU for one subject, and it scales to multiple subjects with only a marginal increase in training time.
Practical and Theoretical Implications
MoCo-NeRF has clear practical value. In applications such as virtual reality, gaming, and remote communication, where lifelike human rendering from limited camera input is crucial, it offers a robust and efficient solution. Its reduced computational cost and training time bring real-time use closer and broaden access to high-quality dynamic human modeling.
Theoretically, MoCo-NeRF's approach to decomposing and learning radiance fields based on motion types can be extended to other domains of computer vision and graphics. This compositional learning framework provides a blueprint for tackling complex non-rigid transformations and can be adapted to handle other dynamic objects beyond human figures.
Future Directions
The research opens new avenues for further exploration in dynamic scene rendering and AI-driven content creation. Potential future developments may include:
- Enhanced Scene Complexity: Extending the compositional approach to handle more intricate scenes involving multiple interacting dynamic objects.
- Real-time Adaptations: Further optimizing the model for real-time performance in edge computing environments to facilitate applications in augmented reality.
- Generalization to Unseen Motions: Investigating methods to improve the model's generalization capabilities for unseen poses and actions, potentially leveraging transfer learning techniques.
In conclusion, this paper presents MoCo-NeRF as a comprehensive solution for the complex task of rendering dynamic humans from monocular videos. Its innovative compositional approach and efficiency in handling non-rigid motions significantly advance the field of neural human rendering, setting a new paradigm for future research and applications.