Monocular Total Capture: Posing Face, Body, and Hands in the Wild (1812.01598v1)

Published 4 Dec 2018 in cs.CV and cs.GR

Abstract: We present the first method to capture the 3D total motion of a target person from a monocular view input. Given an image or a monocular video, our method reconstructs the motion from body, face, and fingers represented by a 3D deformable mesh model. We use an efficient representation called 3D Part Orientation Fields (POFs), to encode the 3D orientations of all body parts in the common 2D image space. POFs are predicted by a Fully Convolutional Network (FCN), along with the joint confidence maps. To train our network, we collect a new 3D human motion dataset capturing diverse total body motion of 40 subjects in a multiview system. We leverage a 3D deformable human model to reconstruct total body pose from the CNN outputs by exploiting the pose and shape prior in the model. We also present a texture-based tracking method to obtain temporally coherent motion capture output. We perform thorough quantitative evaluations including comparison with the existing body-specific and hand-specific methods, and performance analysis on camera viewpoint and human pose changes. Finally, we demonstrate the results of our total body motion capture on various challenging in-the-wild videos. Our code and newly collected human motion dataset will be publicly shared.

Citations (317)

Summary

  • The paper introduces 3D Part Orientation Fields (POFs), which encode the 3D orientations of body parts in 2D image space, enabling full 3D human pose capture from monocular video.
  • It contributes a new multiview dataset of diverse total body motion from 40 subjects, used to train a Fully Convolutional Network that jointly handles face, body, and hands.
  • The method demonstrates robust performance in challenging in-the-wild scenarios, reducing the complexity of 3D motion capture.

Monocular Total Capture: Posing Face, Body, and Hands in the Wild

This paper presents a novel approach to capturing the full 3D motion of the human body, face, and hands from a single camera viewpoint, termed "monocular total capture". The technique moves beyond traditional methods that require elaborate multi-camera setups and instead relies on a monocular image or video, reconstructing motion with a 3D deformable mesh model. Central to the approach is a representation called 3D Part Orientation Fields (POFs), which encodes the 3D orientation of each body part in the common 2D image space; the POFs, together with joint confidence maps, are predicted by a Fully Convolutional Network (FCN).
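
To make the POF representation concrete, the sketch below illustrates the core idea for a single body part: every pixel near the part's 2D projection stores the unit 3D vector pointing from the parent joint to the child joint. This is a simplified illustration under stated assumptions, not the authors' implementation; the function name, the segment-distance rasterization, and the `radius` parameter are choices made here for clarity.

```python
import numpy as np

def part_orientation_field(joint3d_a, joint3d_b, joint2d_a, joint2d_b,
                           height, width, radius=3.0):
    """Rasterize a single 3D Part Orientation Field (simplified sketch).

    Pixels within `radius` of the 2D segment between the projected
    joints store the part's unit 3D direction; all others stay zero.
    """
    pof = np.zeros((height, width, 3), dtype=np.float32)

    # Unit 3D orientation of the part: the value every covered pixel stores.
    direction = (np.asarray(joint3d_b, dtype=np.float32)
                 - np.asarray(joint3d_a, dtype=np.float32))
    direction /= (np.linalg.norm(direction) + 1e-8)

    # Distance of every pixel to the 2D segment between the projected joints.
    ys, xs = np.mgrid[0:height, 0:width]
    p = np.stack([xs, ys], axis=-1).astype(np.float32)
    a = np.asarray(joint2d_a, dtype=np.float32)
    b = np.asarray(joint2d_b, dtype=np.float32)
    ab = b - a
    t = np.clip(((p - a) @ ab) / (ab @ ab + 1e-8), 0.0, 1.0)
    closest = a + t[..., None] * ab
    dist = np.linalg.norm(p - closest, axis=-1)

    pof[dist <= radius] = direction
    return pof
```

Stacking one such 3-channel map per body part yields a fixed-size image-space target that an FCN can regress alongside the joint confidence maps, which is what makes the representation convenient for fully convolutional prediction.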

Key to the approach is a new dataset of 3D human motion, captured from 40 subjects performing diverse poses and movements in a multiview system. This dataset is used to train the network, while a 3D deformable human model supplies pose and shape priors that allow total body pose to be recovered from the FCN outputs. A texture-based tracking mechanism is also introduced to produce temporally coherent motion capture output, mitigating jitter and related artifacts.
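
The fitting stage can be pictured as a standard analysis-by-synthesis optimization over the model's pose parameters. The sketch below is a hedged illustration, not the paper's solver: the `forward_kinematics` and `project` helpers, the residual weights, and the quadratic pose prior are all assumptions made here, whereas the actual method fits a full deformable mesh model with its learned pose and shape priors.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_pose(theta0, keypoints2d, conf, pof_dirs, parts,
             forward_kinematics, project, w_pof=1.0, w_prior=0.01):
    """Fit kinematic pose parameters to CNN outputs (simplified sketch).

    keypoints2d : (J, 2) 2D joint estimates from the confidence maps
    conf        : (J,) per-joint confidences used as residual weights
    pof_dirs    : (P, 3) unit 3D part directions read off the POFs
    parts       : list of (parent, child) joint index pairs
    forward_kinematics(theta) -> (J, 3) 3D joints (assumed helper)
    project(points3d) -> (J, 2) camera projection (assumed helper)
    """
    def residuals(theta):
        joints3d = forward_kinematics(theta)

        # 2D reprojection residuals, weighted by per-joint confidence.
        r_2d = (conf[:, None] * (project(joints3d) - keypoints2d)).ravel()

        # Part orientation residuals against the POF directions.
        r_pof = []
        for (pa, ch), d in zip(parts, pof_dirs):
            v = joints3d[ch] - joints3d[pa]
            v = v / (np.linalg.norm(v) + 1e-8)
            r_pof.append(w_pof * (v - d))
        r_pof = np.concatenate(r_pof)

        # Quadratic prior keeping the pose near the model's rest pose.
        r_prior = w_prior * theta

        return np.concatenate([r_2d, r_pof, r_prior])

    return least_squares(residuals, theta0).x
```

The POF term is what distinguishes this setup from plain 2D keypoint lifting: it anchors each limb's 3D direction directly, resolving much of the depth ambiguity inherent in monocular reprojection alone.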

The method undergoes rigorous quantitative evaluation, where it is compared against existing body-specific and hand-specific methods and exhibits comparable accuracy. Performance is further analyzed across camera viewpoints and human pose variations, underscoring its robustness in challenging real-world conditions. Demonstrations on various challenging "in-the-wild" videos illustrate its applicability to practical scenarios.

The implications are significant for domains such as entertainment, sports analysis, and sociological research, where understanding human dynamics without extensive capture setups opens new avenues. By publicly releasing both the code and the newly collected dataset, the authors provide a foundation for further advances at the intersection of computer vision, machine learning, and AI-driven interaction. Future work could pursue real-time operation and scaling to multi-person scenes, considerably expanding the method's utility.

This work marks an important step toward reducing the complexity and cost of 3D motion capture, presenting a versatile tool for the many applications that rely on accurate human motion reconstruction.