
Learning Motion Priors for 4D Human Body Capture in 3D Scenes (2108.10399v1)

Published 23 Aug 2021 in cs.CV and cs.AI

Abstract: Recovering high-quality 3D human motion in complex scenes from monocular videos is important for many applications, ranging from AR/VR to robotics. However, capturing realistic human-scene interactions, while dealing with occlusions and partial views, is challenging; current approaches are still far from achieving compelling results. We address this problem by proposing LEMO: LEarning human MOtion priors for 4D human body capture. By leveraging the large-scale motion capture dataset AMASS, we introduce a novel motion smoothness prior, which strongly reduces the jitters exhibited by poses recovered over a sequence. Furthermore, to handle contacts and occlusions occurring frequently in body-scene interactions, we design a contact friction term and a contact-aware motion infiller obtained via per-instance self-supervised training. To prove the effectiveness of the proposed motion priors, we combine them into a novel pipeline for 4D human body capture in 3D scenes. With our pipeline, we demonstrate high-quality 4D human body capture, reconstructing smooth motions and physically plausible body-scene interactions. The code and data are available at https://sanweiliti.github.io/LEMO/LEMO.html.

Citations (96)

Summary

  • The paper presents LEMO, a framework that leverages learned motion priors to enhance 4D human body capture from monocular videos.
  • It employs a multi-stage optimization pipeline and a novel contact-aware motion infiller to address occlusions and ensure smooth, physically plausible motions.
  • Empirical results using metrics like PSKL, 2DJE, and non-collision scores show significant performance improvements over existing methods such as PROX.

Learning Motion Priors for 4D Human Body Capture in 3D Scenes

The paper "Learning Motion Priors for 4D Human Body Capture in 3D Scenes" advances human motion capture in complex 3D environments. It targets the recovery of high-quality 3D human motion from monocular videos, a task made notoriously difficult by occlusions and partial views that degrade the capture of human-scene interactions.

Central to the paper is LEMO, a framework that learns human motion priors to improve 4D body capture. The key innovation is leveraging the large-scale motion capture dataset AMASS to derive a motion smoothness prior that markedly reduces jitter in poses recovered over a sequence. In addition, the authors develop a contact-aware motion infiller, trained per instance in a self-supervised manner, to handle the occlusions that arise frequently during body-scene interactions.
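The effect of a smoothness prior can be illustrated with a generic acceleration penalty. The sketch below is not the paper's formulation (LEMO learns its prior in a latent space trained on AMASS); it is a minimal stand-in, and `smoothness_energy` is a hypothetical helper that penalizes second-order differences of a per-frame sequence:

```python
import numpy as np

def smoothness_energy(latent_seq):
    """Penalize second-order differences (accelerations) of a motion
    sequence -- a generic stand-in for a learned smoothness prior,
    which in LEMO operates in a latent space trained on AMASS.
    latent_seq: (T, D) array of per-frame codes."""
    accel = latent_seq[2:] - 2.0 * latent_seq[1:-1] + latent_seq[:-2]
    return float(np.sum(accel ** 2))

# A jittery trajectory should incur higher energy than a smooth one.
t = np.linspace(0.0, 1.0, 30)[:, None]   # 30 frames, 1-D code
smooth = t ** 2                           # smooth quadratic motion
jitter = smooth + 0.05 * np.random.RandomState(0).randn(30, 1)
```

Minimizing such an energy during pose optimization discourages frame-to-frame jitter; the learned latent-space version additionally keeps the sequence close to the manifold of natural human motion.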

The framework combines the learned motion priors with a physics-inspired contact friction term in a multi-stage optimization pipeline. This integration yields smooth, physically plausible human motions, which are essential for applications ranging from AR/VR to robotics. The gain in reconstruction quality is especially clear against existing methods such as PROX, whose results exhibit unnatural dynamics like foot skating and jitter.
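The intuition behind a contact friction term can be sketched as a penalty on tangential sliding of contact vertices between consecutive frames, which discourages foot-skating artifacts. This is a hedged illustration, not the paper's exact energy; `contact_friction_energy`, the vertex arrays, and the contact mask are all illustrative assumptions:

```python
import numpy as np

def contact_friction_energy(verts_t, verts_t1, contact_mask, normals):
    """Sketch of a contact friction penalty: for vertices flagged as
    in contact with the scene, penalize displacement tangential to the
    surface between frames t and t+1 (normal-direction motion is free).
    verts_*: (N, 3); contact_mask: (N,) bool; normals: (N, 3) unit normals."""
    disp = verts_t1 - verts_t
    # Project out the normal component, keeping only tangential sliding.
    normal_comp = np.sum(disp * normals, axis=1, keepdims=True) * normals
    tangential = disp - normal_comp
    sliding = np.sum(tangential ** 2, axis=1)
    return float(np.sum(sliding[contact_mask]))
```

In an optimization pipeline, such a term would be weighted and summed with data, smoothness, and collision terms; a vertex lifting straight off the ground incurs no penalty, while one sliding along it does.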

For empirical validation, LEMO outperforms baselines on Power Spectrum KL divergence (PSKL), 2D Joint Error (2DJE), and non-collision scores. Together these metrics show that the smoothness and motion-infilling priors not only improve temporal smoothness but also preserve the naturalness and physical plausibility of the reconstructed motions, even under occlusion.
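To give a sense of what PSKL measures, the sketch below compares normalized power spectra of joint accelerations from two motion sequences via a KL divergence, so that jittery motion (excess high-frequency power) scores far from smooth ground truth. The exact formulation and normalization used in the paper may differ; `pskl` and its arguments are illustrative:

```python
import numpy as np

def pskl(seq_a, seq_b, eps=1e-8):
    """Rough sketch of a Power Spectrum KL (PSKL) style metric:
    KL divergence between normalized acceleration power spectra of two
    motion sequences. seq_*: (T, J) arrays of joint values over time."""
    def spectrum(seq):
        accel = np.diff(seq, n=2, axis=0)               # per-frame acceleration
        power = np.abs(np.fft.rfft(accel, axis=0)) ** 2 # frequency-domain power
        power = power.sum(axis=1)                       # aggregate over joints
        return power / (power.sum() + eps)              # normalize to a distribution
    p, q = spectrum(seq_a), spectrum(seq_b)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Because KL divergence is asymmetric, such a metric is often reported in both directions (prediction vs. ground truth and vice versa).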

The theoretical implication is that training latent motion models on large-scale motion data yields priors more robust than hand-crafted heuristics or purely physics-based formulations. Practically, integrating such priors promises motion capture systems that remain accurate in unstructured environments, where many traditional systems fail because they depend on precise multi-camera or sensor configurations.

Looking forward, the paper opens several avenues for future research: extending the framework with more comprehensive physics-based motion models could further narrow the gap between markerless systems and commercial motion capture setups, while deeper exploration of self-supervised learning could refine per-instance adaptation and broaden the flexibility of human motion capture across diverse scenarios.

In summary, the paper provides compelling evidence of the benefits of integrating data-driven motion priors into human motion capture, setting a high benchmark for future work in this area of computer vision research.