Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop (1909.12828v1)

Published 27 Sep 2019 in cs.CV

Abstract: Model-based human pose estimation is currently approached through two different paradigms. Optimization-based methods fit a parametric body model to 2D observations in an iterative manner, leading to accurate image-model alignments, but are often slow and sensitive to the initialization. In contrast, regression-based methods, that use a deep network to directly estimate the model parameters from pixels, tend to provide reasonable, but not pixel accurate, results while requiring huge amounts of supervision. In this work, instead of investigating which approach is better, our key insight is that the two paradigms can form a strong collaboration. A reasonable, directly regressed estimate from the network can initialize the iterative optimization making the fitting faster and more accurate. Similarly, a pixel accurate fit from iterative optimization can act as strong supervision for the network. This is the core of our proposed approach SPIN (SMPL oPtimization IN the loop). The deep network initializes an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network. Our approach is self-improving by nature, since better network estimates can lead the optimization to better solutions, while more accurate optimization fits provide better supervision for the network. We demonstrate the effectiveness of our approach in different settings, where 3D ground truth is scarce, or not available, and we consistently outperform the state-of-the-art model-based pose estimation approaches by significant margins. The project website with videos, results, and code can be found at https://seas.upenn.edu/~nkolot/projects/spin.

Summary

The paper presents a hybrid SPIN framework that fuses deep regression with iterative SMPL model-fitting to improve 3D human pose and shape estimation.
It achieves state-of-the-art performance, demonstrating a mean reconstruction error of 41.1mm on Human3.6M, outperforming previous methods.
The approach effectively leverages 2D keypoint annotations, enabling robust training even when 3D ground truth data is scarce.

Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop

The paper "Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop," authored by Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis, explores an integrated approach to 3D human pose and shape estimation by combining optimization-based methods with regression-based methods. The solution presented is referred to as SPIN (SMPL oPtimization IN the loop).

Core Approach

Model-based human pose estimation is currently divided into optimization-based and regression-based methods. Optimization-based methods iteratively fit a parametric body model to 2D keypoints, achieving accurate image-model alignment, but they are slow and sensitive to initialization. Regression-based methods use a deep network to predict model parameters directly from images, offering speed but requiring large amounts of supervision and usually falling short of pixel-level accuracy.

SPIN leverages the strengths of both methodologies. A deep neural network initializes the iterative optimization, which fits a parametric body model to 2D joints during the training loop. Conversely, the iterative optimization provides refined model fits that serve as strong supervision signals for the neural network.

Model Architecture

SPIN uses the SMPL body model as its parametric representation. For each training image, the network regresses SMPL parameters ( $\Theta_{reg}$ ), covering both pose and shape. These estimates initialize an iterative fitting routine, similar to SMPLify, that fits the body model to the ground-truth 2D keypoints. The optimized parameters ( $\Theta_{opt}$ ) then serve as explicit supervision for the network.
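The data flow described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: `project`, `fit_to_joints`, `regress`, and `train` are hypothetical names, a two-parameter linear model stands in for SMPL plus the camera, and a 2x2 linear map stands in for the deep network.

```python
import random


def project(theta):
    """Hypothetical stand-in for 'body model + camera': parameters -> 2D joints."""
    a, b = theta
    return [a + b, a - b, 2.0 * a]


def fit_to_joints(theta_init, joints_2d, steps=20, lr=0.2):
    """Stand-in for the iterative fit: gradient descent on the squared
    2D reprojection error, started from the network's estimate."""
    a, b = theta_init
    for _ in range(steps):
        r = [p - j for p, j in zip(project((a, b)), joints_2d)]
        a -= lr * (r[0] + r[1] + 2.0 * r[2])  # d/da of 0.5 * sum(r^2)
        b -= lr * (r[0] - r[1])               # d/db of 0.5 * sum(r^2)
    return (a, b)


def regress(W, x):
    """Stand-in for the network: a 2x2 linear map from features to parameters."""
    return (W[0][0] * x[0] + W[0][1] * x[1],
            W[1][0] * x[0] + W[1][1] * x[1])


def train(num_iters=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    W_true = [[1.0, -0.5], [0.3, 0.8]]   # unknown mapping that generates the data
    W = [[0.0, 0.0], [0.0, 0.0]]         # regressor weights, trained below
    for _ in range(num_iters):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        joints_2d = project(regress(W_true, x))   # only 2D annotations are observed
        theta_reg = regress(W, x)                 # 1. network estimate
        theta_opt = fit_to_joints(theta_reg, joints_2d)  # 2. fit initialized by it
        for i in range(2):                        # 3. supervise the network with the fit
            err = theta_reg[i] - theta_opt[i]
            for k in range(2):
                W[i][k] -= lr * err * x[k]        # SGD step on 0.5 * err^2
    return W, W_true


if __name__ == "__main__":
    W, W_true = train()
    print("learned:", W)
    print("target: ", W_true)
```

Because this toy objective is convex, the fit converges from any starting point. The sketch therefore shows only how estimates and supervision circulate, not the benefit of a good initialization that matters for real body-model fitting.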

Noteworthy Contributions and Results

The authors highlight key contributions from their approach:

Integration of Regression and Optimization: SPIN creates a self-improving training mechanism wherein initial regression estimates help the optimization routine, and the optimized fits strengthen the regression network.
Independence from 3D Ground Truth: The approach demonstrates robustness even when 3D ground truth data is scarce or unavailable, relying primarily on 2D keypoint annotations to train the model effectively.
Superior Model-based Supervision: The use of full 3D model supervision as opposed to weaker 2D reprojection errors significantly enhances regression performance.
State-of-the-Art Performance: SPIN outperforms existing model-based approaches across multiple challenging datasets, including Human3.6M, MPI-INF-3DHP, LSP, and 3DPW.
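One way to see why 2D reprojection error is the weaker signal: two 3D configurations that differ only in depth can project to the same 2D keypoints. The sketch below assumes a simplified orthographic camera and uses 3D joint positions as a stand-in for full model supervision; the function names and joint coordinates are made up for illustration.

```python
def orthographic_project(joints_3d):
    """Drop depth: (x, y, z) -> (x, y). A simplified camera, assumed for illustration."""
    return [(x, y) for x, y, _ in joints_3d]


def mean_sq_error(a, b):
    """Mean squared distance between two equally sized lists of points."""
    return sum(sum((p - q) ** 2 for p, q in zip(u, v)) for u, v in zip(a, b)) / len(a)


# A three-joint "arm" reaching toward the camera, and a depth-flipped prediction.
target_3d = [(0.0, 0.0, 0.0), (0.0, 0.3, 0.1), (0.2, 0.5, 0.3)]
predicted_3d = [(0.0, 0.0, 0.0), (0.0, 0.3, -0.1), (0.2, 0.5, -0.3)]

loss_2d = mean_sq_error(orthographic_project(predicted_3d),
                        orthographic_project(target_3d))
loss_3d = mean_sq_error(predicted_3d, target_3d)

print("2D reprojection loss:", loss_2d)  # cannot tell the two poses apart
print("3D loss:", loss_3d)               # penalizes the depth flip
```

A network trained only on the 2D loss receives no gradient that would correct the depth flip, whereas a 3D target, such as a fitted model, does provide one.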

Empirical results show consistent improvements over other leading techniques. For example, SPIN reports a mean reconstruction error of 41.1mm on the Human3.6M dataset, compared with 56.8mm reported for the earlier HMR method.
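For readers unfamiliar with the metric, reconstruction error on Human3.6M is conventionally the mean per-joint position error after rigid (Procrustes) alignment of the prediction to the ground truth. The notation below is an assumption introduced here for illustration: $\hat{x}_i$ and $x_i$ are the predicted and ground-truth 3D positions of joint $i$, $N$ is the number of joints, and $(s, R, t)$ is the best-fitting scale, rotation, and translation.

```latex
(s^*, R^*, t^*) = \arg\min_{s,\,R,\,t} \sum_{i=1}^{N} \left\| s R \hat{x}_i + t - x_i \right\|_2^2
\qquad
E_{\text{rec}} = \frac{1}{N} \sum_{i=1}^{N} \left\| s^* R^* \hat{x}_i + t^* - x_i \right\|_2

% Relative reduction implied by the two figures quoted above:
\frac{56.8 - 41.1}{56.8} \approx 0.276 \quad (\text{about } 27.6\%)
```

Because the alignment removes global scale, rotation, and translation, this metric isolates errors in articulated pose rather than errors in global placement.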

Practical and Theoretical Implications

Practically, SPIN's hybrid approach provides a robust solution for applications requiring precise human body estimation, ranging from virtual reality to human-computer interaction. Its ability to function well with limited 3D ground truth data expands its usability in various deployment scenarios where acquiring precise 3D annotations is challenging.

Theoretically, this work advocates for the synergy between classical optimization techniques and modern deep learning models, highlighting the mutual benefits of such a collaboration. This opens avenues for further research in hybrid techniques for other computer vision tasks where analogous dual paradigms exist.

Future Directions

Possible future developments include enhancing SPIN to handle scenarios involving multiple people in close proximity or incorporating more expressive human body models that account for face and hand detail. Investigations into optimizing other parts of the training loop to reduce computational overhead or further exploiting the cyclic benefits of self-improving models can also provide substantial improvements.

Conclusion

The SPIN approach presents a methodologically sound and empirically validated strategy for 3D human pose and shape estimation, establishing a precedent for future research in hybrid model architectures. The work leverages the complementary strengths of optimization and regression to provide an efficient and accurate solution, thereby pushing the field forward in both theoretical understanding and practical application.

For further information, videos, and code associated with this paper, see the project website at https://seas.upenn.edu/~nkolot/projects/spin.
