- The paper introduces a novel method for direct 3D body pose prediction from motion-compensated monocular video sequences, overcoming limitations of frame-by-frame analysis.
- The methodology involves motion compensation using CNNs, extracting spatio-temporal features like 3D HOG, and using deep networks for regression to the 3D pose.
- Empirical evaluations on datasets like Human3.6m demonstrate significant performance improvements, achieving up to a 23% reduction in mean error for certain actions compared to baseline methods.
An Evaluation of Direct Prediction of 3D Body Poses from Motion Compensated Sequences
The paper introduces a novel methodology to improve the prediction of 3D human poses using monocular video sequences. The authors propose a system that leverages motion information from multiple video frames, aiming to overcome the challenges inherent in 3D pose recovery from 2D video due to occlusions and depth ambiguities.
Methodology
The proposed method differs fundamentally from previous approaches, which either predicted 3D poses for individual frames independently or relied on post-processing to establish temporal consistency across frames. Instead, this method extracts spatio-temporal information from a sequence and directly computes the 3D pose for the central frame of that sequence. The process involves the following steps:
- Motion Compensation: The authors utilize Convolutional Neural Networks (CNNs) to center the person in bounding boxes across frames, forming a "rectified spatio-temporal volume" (RSTV). This ensures that the spatio-temporal features are stable across the frames.
- Feature Extraction: Spatio-temporal features are derived using the 3D Histogram of Oriented Gradients (3D HOG) rather than purely spatial or purely temporal descriptors. This representation captures appearance and motion jointly, which is crucial for robust prediction.
- Regression to 3D Pose: Various regression techniques were considered, including Kernel Ridge Regression (KRR), Kernel Dependency Estimation (KDE), and Deep Networks (DN). Among these, the use of DNs provided the most promising results in mapping spatio-temporal features to 3D poses.
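The steps above can be sketched in simplified form. The sketch below is illustrative only: it aligns crops around a known subject center (standing in for the CNN-based motion compensation), pools a crude gradient-orientation histogram over the whole volume (a rough stand-in for 3D HOG, which in the paper is computed over cells), and fits a ridge regressor via the normal equations in place of the paper's deep network. All function names and the fixed crop size are assumptions, not the authors' implementation.

```python
import numpy as np

def rectify_volume(frames, centers, crop=32):
    # Simplified motion compensation: re-crop each frame so the subject's
    # (assumed known) center is aligned across time, forming the RSTV.
    half = crop // 2
    volume = []
    for frame, (cy, cx) in zip(frames, centers):
        padded = np.pad(frame, half, mode="edge")
        volume.append(padded[cy:cy + crop, cx:cx + crop])
    return np.stack(volume)  # shape: (T, crop, crop)

def hog3d_features(volume, n_bins=8):
    # Rough stand-in for 3D HOG: a single histogram of spatial gradient
    # orientations, weighted by spatio-temporal gradient magnitude and
    # pooled over the whole volume (the real descriptor uses local cells).
    gt, gy, gx = np.gradient(volume.astype(float))
    mag = np.sqrt(gx**2 + gy**2 + gt**2).ravel()
    ori = np.arctan2(gy, gx).ravel()
    bins = ((ori + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, weights=mag, minlength=n_bins)
    return hist / (hist.sum() + 1e-8)

def fit_regressor(X, Y, lam=1e-3):
    # Placeholder for the deep-network regressor: ridge regression from
    # pooled features to the central frame's 3D pose vector.
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    return lambda x: x @ W
```

A usage pass would stack feature vectors from many training sequences into `X`, the corresponding central-frame poses into `Y`, and call `fit_regressor(X, Y)` to obtain a predictor.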
Empirical Evaluation
The method is evaluated on several standard datasets, including Human3.6m, HumanEva-I/II, and KTH Multiview Football II, with significant performance improvements noted in the results. Notably, on the Human3.6m dataset, the proposed approach achieved a marked reduction in mean error compared to baseline methods, with relative error reductions of up to 23% for certain action types. Significant improvements in 3D joint position estimation were likewise recorded on the HumanEva and KTH Multiview Football II datasets, demonstrating the versatility of the approach across diverse environments.
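Errors on these benchmarks are conventionally reported as mean per-joint position error (MPJPE), the average Euclidean distance between predicted and ground-truth 3D joints, and improvements as a relative reduction of that error. A minimal sketch (the function names are ours, and the numbers in the usage note are hypothetical):

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint position error: average Euclidean distance between
    # predicted and ground-truth joints; inputs have shape (n_joints, 3).
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def relative_error_reduction(err_baseline, err_method):
    # Relative reduction in mean error, in percent, as in the ~23%
    # improvement quoted for some Human3.6m actions.
    return 100.0 * (err_baseline - err_method) / err_baseline
```

For instance, a baseline error of 100 mm reduced to 77 mm corresponds to `relative_error_reduction(100, 77) == 23.0`.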
Implications and Future Directions
This work provides a compelling case for integrating temporal sequence information earlier in the 3D pose estimation process. By treating the task as a spatio-temporal feature extraction and regression problem, the authors can better handle inherent challenges such as self-occlusion and mirroring ambiguities.
From a practical standpoint, this approach could lead to more robust applications in fields such as surveillance, sports analytics, and human-computer interaction, where accurate body pose estimation is imperative. Theoretically, this work may stimulate further research into leveraging deep learning and spatio-temporal processing for other applications involving articulated motion capture.
Future research could focus on extending this approach to manage more complex, real-world scenarios with variable lighting and environmental conditions, and incorporating additional modalities such as multi-view setups or depth cameras for cases where they are available. This would potentially enhance the robustness and accuracy of 3D pose estimation systems further, thereby broadening their applicability in various domains.