Motion Guided 3D Pose Estimation from Videos (2004.13985v1)

Published 29 Apr 2020 in cs.CV

Abstract: We propose a new loss function, called motion loss, for the problem of monocular 3D human pose estimation from 2D pose. In computing motion loss, a simple yet effective representation for keypoint motion, called pairwise motion encoding, is introduced. We design a new graph convolutional network architecture, U-shaped GCN (UGCN). It captures both short-term and long-term motion information to fully leverage the additional supervision from the motion loss. We experiment with training UGCN with the motion loss on two large-scale benchmarks: Human3.6M and MPI-INF-3DHP. Our model surpasses other state-of-the-art models by a large margin. It also demonstrates strong capacity in producing smooth 3D sequences and recovering keypoint motion.

Citations (160)

Summary

Motion Guided 3D Pose Estimation from Videos

The paper "Motion Guided 3D Pose Estimation from Videos" by Jingbo Wang et al. introduces a novel approach for monocular 3D human pose estimation by exploiting both short-term and long-term motion information from videos. The lack of depth information poses significant challenges in estimating 3D pose from 2D projections in videos, a problem the authors address by leveraging motion-based supervision.

A cornerstone of this work is the introduction of a new loss function, termed "motion loss," which complements the traditional Minkowski-distance losses, such as the ℓ1-loss and ℓ2-loss, typically used for this task. The motion loss is designed to capture the temporal dependencies within 3D pose sequences, overcoming a key limitation of standard pointwise error computations, which score each frame independently and thus ignore temporal structure and dependencies across frames.
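For concreteness, below is a minimal sketch of such a pointwise loss on a pose sequence. The tensor shapes (T frames, J joints, 3 coordinates) and the function name are illustrative assumptions, not the paper's interface.

```python
import torch

def pointwise_loss(pred, gt, p=2):
    """Minkowski-distance loss on a 3D pose sequence.

    pred, gt: tensors of shape (T, J, 3) -- T frames, J joints, xyz
    (shapes are illustrative assumptions, not the paper's interface).
    p=1 gives the l1-loss, p=2 the l2-loss. Each frame is scored
    independently, so temporal structure across frames is ignored.
    """
    return torch.norm(pred - gt, p=p, dim=-1).mean()
```

Because every frame contributes separately, a jittery prediction can score as well as a smooth one; this is the gap the motion loss targets.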

To facilitate the computation of motion loss, the authors propose an encoding termed "pairwise motion encoding," which reflects both short-term and long-term dynamics. Pairs of keypoint coordinates taken at different time intervals along each joint trajectory are combined by a differentiable operator, such as subtraction, the inner product, or the cross product. The objective is to uphold not only spatial but also temporal consistency, enhancing the smoothness and realism of the predicted 3D motion sequences.
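A minimal sketch of this idea follows. The interval set, the default operator, and the l1 penalty used to compare encodings are illustrative assumptions; the paper fixes its own configuration.

```python
import torch

def motion_encoding(seq, intervals=(1, 2, 4), op="subtract"):
    """Pairwise motion encoding (sketch): pair each frame with the frame
    `dt` steps ahead and combine the two with a differentiable operator.

    seq: (T, J, 3) pose sequence. Returns one tensor per interval.
    Small `dt` captures short-term dynamics, large `dt` long-term ones.
    """
    encodings = []
    for dt in intervals:
        a, b = seq[:-dt], seq[dt:]              # paired frames, (T - dt, J, 3)
        if op == "subtract":                    # displacement over dt frames
            encodings.append(b - a)
        elif op == "inner":                     # per-joint inner product
            encodings.append((a * b).sum(dim=-1, keepdim=True))
        elif op == "cross":                     # per-joint cross product
            encodings.append(torch.cross(a, b, dim=-1))
        else:
            raise ValueError(f"unknown operator: {op}")
    return encodings

def motion_loss(pred, gt, **kwargs):
    """Penalize discrepancies between the motion encodings of the
    predicted and ground-truth sequences (here with an l1 penalty)."""
    pairs = zip(motion_encoding(pred, **kwargs), motion_encoding(gt, **kwargs))
    terms = [(ep - eg).abs().mean() for ep, eg in pairs]
    return sum(terms) / len(terms)
```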

For modeling, a novel architecture named U-shaped Graph Convolutional Network (UGCN) is introduced. The model builds upon ST-GCN, initially designed for skeleton-based action recognition, and adopts a U-shaped structure akin to those found in semantic segmentation and object detection networks. This design is crucial for capturing information at both local and global temporal scales, which aligns well with the proposed motion-based supervision: the downsampling and upsampling stages allow the network to incorporate and retain context across multiple temporal resolutions.
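The paper's exact layer configuration is not reproduced here; the following is a minimal sketch of the U-shaped temporal design only, using plain 1D convolutions over time as stand-ins for the ST-GCN blocks, with hypothetical channel sizes (17 joints, 2D inputs, 3D outputs).

```python
import torch
import torch.nn as nn

class UShapedTemporalNet(nn.Module):
    """Sketch of the U-shaped idea: downsample along time to gather
    long-term context, upsample back, and fuse with a skip connection
    that preserves short-term detail. UGCN itself uses spatio-temporal
    graph convolutions and more stages; this is a single-level stand-in.
    """
    def __init__(self, in_ch=17 * 2, out_ch=17 * 3, width=64):
        super().__init__()
        self.enc = nn.Conv1d(in_ch, width, kernel_size=3, padding=1)
        self.down = nn.Conv1d(width, width, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv1d(width, width, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose1d(width, width, kernel_size=4, stride=2, padding=1)
        self.dec = nn.Conv1d(2 * width, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                        # x: (B, in_ch, T), T even
        h = torch.relu(self.enc(x))              # full temporal resolution
        low = torch.relu(self.mid(torch.relu(self.down(h))))  # half resolution
        up = torch.relu(self.up(low))            # back to full resolution
        return self.dec(torch.cat([h, up], dim=1))  # skip connection + head
```

Since the motion loss is described as additional supervision, a natural training objective under these assumptions is a weighted sum of a pointwise term and the motion loss sketched above.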

Experimentally, the proposed model demonstrates state-of-the-art performance on two large-scale benchmarks: Human3.6M and MPI-INF-3DHP. Notably, the paper reports substantial improvements in mean per joint position error (MPJPE) over baseline methods as well as significant enhancements in the smoothness of the resulting 3D pose sequences. Specific contributions from this research also include halving the velocity error compared to baseline models, underscoring the utility of motion-based supervision in improving dynamic quality.
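Both headline quantities are standard and easy to state. A minimal sketch of each, again assuming (T, J, 3) tensors, is given below; the paper's exact evaluation protocol (alignment, frame sampling) is not reproduced here.

```python
import torch

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth joints, in the units of the inputs
    (typically millimetres on Human3.6M)."""
    return torch.norm(pred - gt, dim=-1).mean()

def velocity_error(pred, gt):
    """Mean per-joint error of frame-to-frame displacements (velocities);
    lower values indicate smoother, more temporally consistent output."""
    return torch.norm((pred[1:] - pred[:-1]) - (gt[1:] - gt[:-1]), dim=-1).mean()
```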

In conclusion, this paper introduces motion loss as an effective supervision method for 3D pose estimation. By integrating temporal consistency within the loss function and using the UGCN architecture to accommodate extensive temporal information, this approach provides compelling numerical results and sets a direction for future research in 3D human pose estimation. The implications suggest broader applications in human motion analysis and potentially inspire advancements in related fields, such as motion synthesis and action recognition. Future investigations may explore the integration of motion loss in alternative architectures or further refine the encoding strategies employed therein.
