CanonPose: Self-Supervised Monocular 3D Human Pose Estimation in the Wild (2011.14679v1)

Published 30 Nov 2020 in cs.CV

Abstract: Human pose estimation from single images is a challenging problem in computer vision that requires large amounts of labeled training data to be solved accurately. Unfortunately, for many human activities (e.g., outdoor sports) such training data does not exist and is hard or even impossible to acquire with traditional motion capture systems. We propose a self-supervised approach that learns a single image 3D pose estimator from unlabeled multi-view data. To this end, we exploit multi-view consistency constraints to disentangle the observed 2D pose into the underlying 3D pose and camera rotation. In contrast to most existing methods, we do not require calibrated cameras and can therefore learn from moving cameras. Nevertheless, in the case of a static camera setup, we present an optional extension to include constant relative camera rotations over multiple views into our framework. Key to the success are new, unbiased reconstruction objectives that mix information across views and training samples. The proposed approach is evaluated on two benchmark datasets (Human3.6M and MPII-INF-3DHP) and on the in-the-wild SkiPose dataset.

Authors (5)

Bastian Wandt (30 papers)
Marco Rudolph (12 papers)
Petrissa Zell (3 papers)
Helge Rhodin (54 papers)
Bodo Rosenhahn (95 papers)

Citations (94)

Summary

Estimating 3D human pose from a single image remains difficult, in large part because accurate estimators need extensive labeled training data. This paper introduces "CanonPose," a self-supervised approach that avoids this dependency by training only on unlabeled multi-view data. It thereby addresses the difficulty of acquiring labels for diverse human activities, especially outdoor ones, where traditional motion capture systems are hard or impossible to use.

CanonPose uses multi-view consistency constraints to decompose each observed 2D pose into an underlying 3D pose and a camera rotation. It does not require calibrated cameras, so it can be trained on footage from moving cameras. For static camera setups, an optional extension adds the constraint that the relative rotation between cameras stays constant across views.
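The static-camera constraint can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`rot_x`, `rot_y`, `relative_rotation`) are hypothetical, and each per-view rotation is assumed to map a shared canonical pose frame into that camera's frame. Under that assumption, the relative rotation between two fixed cameras does not change when the person turns.

```python
import numpy as np

def rot_x(angle):
    """Rotation matrix about the x-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(angle):
    """Rotation matrix about the y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def relative_rotation(R_a, R_b):
    """Rotation taking camera-a coordinates to camera-b coordinates.

    Assumes R_a and R_b both map the same canonical frame to their camera.
    """
    return R_b @ R_a.T

# Two fixed cameras (hypothetical orientations).
cam_a = rot_y(0.3) @ rot_x(0.2)
cam_b = rot_y(-1.1) @ rot_x(-0.4)

relative = []
for person_angle in (0.0, 0.7, 2.5):
    # The person turns, so the canonical-to-camera rotation changes per frame.
    R_person = rot_y(person_angle)
    R_a = cam_a @ R_person
    R_b = cam_b @ R_person
    relative.append(relative_rotation(R_a, R_b))

# Largest deviation of any frame's relative rotation from the first frame's.
max_dev = max(np.abs(r - relative[0]).max() for r in relative)
```

Here `max_dev` stays at numerical noise level, which is the property a training loss for static rigs could penalize deviations from.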

Central to CanonPose are its reconstruction objectives, which mix information across views and training samples. The authors describe these objectives as unbiased, in contrast to methods that rely on simplifications or approximations of multi-view consistency. The aim is more accurate 3D pose estimation from monocular inputs, including in challenging "in-the-wild" scenarios.
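To make the idea of mixing views concrete, here is a minimal NumPy sketch of a cross-view reprojection loss. Several details are assumptions made for illustration and are not taken from the text above: orthographic projection (dropping depth), scale normalization of 2D poses, an L1 error, and the function names. The point it shows is that a canonical 3D pose predicted from one view, rotated by the camera rotation predicted from another view, should reproject onto that other view's 2D observation.

```python
import numpy as np

def rot_y(angle):
    """Rotation matrix about the y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(pose_3d):
    """Orthographic projection (assumed): keep x and y, drop depth."""
    return pose_3d[:2]

def normalize(pose_2d):
    """Center a (2, J) pose and scale it to unit Frobenius norm."""
    centered = pose_2d - pose_2d.mean(axis=1, keepdims=True)
    return centered / np.linalg.norm(centered)

def reprojection_error(canonical_pose, R_cam, observed_2d):
    """L1 distance between the normalized reprojection and observation."""
    reprojected = project(R_cam @ canonical_pose)
    return float(np.abs(normalize(reprojected) - normalize(observed_2d)).sum())

def mixed_view_loss(canonical_poses, rotations, observations):
    """Sum the error over every (pose from view i, camera of view j) pair."""
    loss = 0.0
    for pose in canonical_poses:
        for R_cam, obs in zip(rotations, observations):
            loss += reprojection_error(pose, R_cam, obs)
    return loss

rng = np.random.default_rng(0)
true_pose = rng.normal(size=(3, 17))        # 17 joints, canonical frame
rotations = [rot_y(0.4), rot_y(-1.2)]       # one rotation per view
observations = [project(R @ true_pose) for R in rotations]

# Perfect predictions from both views reproject exactly in every pairing.
perfect = mixed_view_loss([true_pose, true_pose], rotations, observations)

# A wrong pose from view 0 is penalized in both views, not just its own.
noisy_pose = true_pose + 0.1 * rng.normal(size=true_pose.shape)
noisy = mixed_view_loss([noisy_pose, true_pose], rotations, observations)
```

Because every pose is checked against every view, a pose that only explains its own 2D input (for example, one with wrong depth) still incurs error in the other views.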

The approach is evaluated on two benchmark datasets, Human3.6M and MPII-INF-3DHP, and on the in-the-wild SkiPose dataset. CanonPose is reported to improve on existing self-supervised methods and to be competitive with fully supervised approaches.

CanonPose makes the following contributions:

It provides a self-supervised framework for training 3D pose estimators from unlabeled images, with no need for dataset-specific 2D or 3D annotations.
It requires no prior knowledge of the scene or camera calibration, which simplifies practical use.
It uses multi-view data directly, avoiding laborious preprocessing steps such as scene geometry estimation or camera calibration.
It incorporates confidence scores from the 2D joint detector into the training pipeline, which makes the method more robust.
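One plausible way to use detector confidences, sketched below in NumPy, is to weight each joint's reprojection error by its confidence so that unreliable detections contribute less. This is an assumption for illustration; the summary does not specify how the confidences enter the pipeline, and the function name and the toy values are hypothetical.

```python
import numpy as np

def weighted_reprojection_error(reprojected_2d, observed_2d, confidence):
    """Confidence-weighted L1 error over joints.

    reprojected_2d, observed_2d: arrays of shape (2, J).
    confidence: array of shape (J,), one detector score per joint.
    """
    per_joint = np.abs(reprojected_2d - observed_2d).sum(axis=0)
    return float((confidence * per_joint).sum())

# Toy example with 4 joints: the detector misplaces joint 2 badly
# and reports a low confidence for it.
reprojected = np.zeros((2, 4))
observed = reprojected.copy()
observed[:, 2] += 5.0
confidence = np.array([0.9, 0.8, 0.1, 0.95])

uniform = weighted_reprojection_error(reprojected, observed, np.ones(4))
weighted = weighted_reprojection_error(reprojected, observed, confidence)
```

With the weights applied, the misdetected joint contributes a tenth of the error it would under uniform weighting, so a single bad detection pulls the 3D estimate less.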

From a theoretical standpoint, CanonPose demonstrates that 3D human pose and camera rotation can be disentangled from 2D observations using multi-view consistency alone. This shifts the focus away from data-heavy supervised methods toward self-supervised alternatives that exploit the information already present in multi-view footage.

Looking forward, the framework could inform further work on learning without labels, particularly for fast-changing settings such as sports. By reducing reliance on labeled datasets, it also widens the range of settings in which 3D pose estimation can be applied.

CanonPose thus has both practical and theoretical implications: it highlights the potential of self-supervised learning to address long-standing challenges in 3D human pose estimation and to make accurate pose estimators more accessible in real-world scenarios.