Cross View Fusion for 3D Human Pose Estimation (1909.01203v1)

Published 3 Sep 2019 in cs.CV

Abstract: We present an approach to recover absolute 3D human poses from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of 3D pose with affordable computational cost. We test our method on two public datasets H36M and Total Capture. The Mean Per Joint Position Errors on the two datasets are 26mm and 29mm, which outperforms the state-of-the-arts remarkably (26mm vs 52mm, 29mm vs 35mm). Our code is released at \url{https://github.com/microsoft/multiview-human-pose-estimation-pytorch}.

Authors (5)
  1. Haibo Qiu (18 papers)
  2. Chunyu Wang (43 papers)
  3. Jingdong Wang (236 papers)
  4. Naiyan Wang (65 papers)
  5. Wenjun Zeng (130 papers)
Citations (201)

Summary

  • The paper presents a novel cross-view fusion scheme that leverages CNNs for 2D pose estimation and a Recursive Pictorial Structure Model for iterative 3D refinement.
  • It achieves a significant reduction in joint localization error, improving MPJPE from 77mm to 26mm on the H36M dataset.
  • By releasing the code, the study advances multi-view pose estimation applications in motion capture, surveillance, and other real-world scenarios.

Cross View Fusion for 3D Human Pose Estimation: An Expert Overview

The paper "Cross View Fusion for $3$D Human Pose Estimation" presents a novel framework for estimating absolute $3$D human poses from multi-view images. The authors propose an approach that leverages multi-view geometric priors through a two-step process: first, estimating $2$D poses in multiple views, followed by recovering $3$D poses from the computed $2$D poses. The key contributions of this work include the introduction of a cross-view fusion scheme integrated into convolutional neural networks (CNNs) for $2$D pose estimation and a Recursive Pictorial Structure Model (RPSM) for $3$D pose recovery.

The proposed cross-view fusion enables the 2D pose CNN to combine information from multiple viewpoints, so that each view's estimate benefits from complementary evidence in the other views. This addresses challenges such as occlusion and motion blur that typically degrade pose estimation performance in single-view systems.
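To make the fusion idea concrete, here is a minimal, hypothetical PyTorch sketch, not the authors' released implementation: it augments each view's flattened joint heatmaps with learned linear combinations of the other views' heatmaps, loosely mirroring the paper's observation that such fusion weights, trained end-to-end for a fixed camera setup, come to concentrate along epipolar lines.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Hypothetical sketch of cross-view heatmap fusion.

    Each view's flattened heatmap is augmented with a learned linear
    combination of every other view's heatmap. Trained for a fixed camera
    configuration, the weights implicitly encode the geometric relations
    (epipolar lines) between views.
    """

    def __init__(self, num_views: int, height: int, width: int):
        super().__init__()
        n = height * width
        # One learnable (n x n) mapping per ordered pair of views (s -> u).
        self.weights = nn.Parameter(torch.zeros(num_views, num_views, n, n))

    def forward(self, heatmaps: torch.Tensor) -> torch.Tensor:
        # heatmaps: (batch, views, joints, height, width)
        b, v, j, h, w = heatmaps.shape
        flat = heatmaps.reshape(b, v, j, h * w)
        fused = flat.clone()
        for u in range(v):        # view being refined
            for s in range(v):    # view contributing evidence
                if s != u:
                    # (b, j, n) @ (n, n) -> (b, j, n)
                    fused[:, u] = fused[:, u] + flat[:, s] @ self.weights[s, u]
        return fused.reshape(b, v, j, h, w)
```

A full implementation would fold this layer into the pose network's training loop; the repository linked above contains the authors' actual version.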

Once the 2D poses are estimated, the RPSM is employed to refine the 3D pose. The RPSM extends the traditional Pictorial Structure Model (PSM) by iteratively improving the pose estimation accuracy. Unlike PSM, which is burdened by quantization errors from a one-shot discretization of the 3D space, RPSM recursively refines joint locations through a multi-stage process. This recursive approach achieves fine-grained spatial resolution without incurring prohibitive computational costs, improving 3D joint localization error on the H36M dataset from 77mm to 26mm, a significant reduction compared to state-of-the-art methods.
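The core recursive idea can be sketched in a few lines. The snippet below is a simplified, hypothetical illustration for a single joint: `score_fn` is an assumed stand-in for the paper's likelihood (projecting a candidate 3D location into every view and reading the fused 2D heatmaps), and the full RPSM additionally enforces limb-length priors and infers all joints jointly, which is omitted here.

```python
import numpy as np

def recursive_refine(score_fn, center, size, levels=10, grid=2):
    """Hypothetical sketch of RPSM-style coarse-to-fine 3D localization.

    score_fn(candidates) -> scores for an (m, 3) array of 3D points,
    e.g. the sum of heatmap responses at each point's 2D projections.
    At every level the search cube shrinks around the best candidate,
    so quantization error falls geometrically instead of requiring one
    huge fine-grained grid up front.
    """
    center = np.asarray(center, dtype=float)
    for _ in range(levels):
        # Candidate grid inside the current cube around `center`.
        offsets = np.linspace(-size / 2.0, size / 2.0, grid)
        xs, ys, zs = np.meshgrid(offsets, offsets, offsets, indexing="ij")
        candidates = center + np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
        # Keep the best-scoring candidate, then shrink the cube.
        center = candidates[int(np.argmax(score_fn(candidates)))]
        size /= grid
    return center
```

With a 2x2x2 grid and ten levels, the effective resolution improves by a factor of 2^10 over the initial cube at a cost of only 80 score evaluations.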

The paper reports a Mean Per Joint Position Error (MPJPE) of 26mm and 29mm on the H36M and Total Capture datasets, respectively, showcasing a substantial improvement over existing approaches with errors of 52mm and 35mm. Such performance enhancement underscores the efficacy of joint CNN-based feature fusion and recursive optimization in the RPSM framework.
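For reference, MPJPE is the Euclidean distance between estimated and ground-truth 3D joint positions, averaged over all joints and frames; a minimal sketch:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error, in the units of the inputs (e.g. mm).

    pred, gt: (frames, joints, 3) arrays of 3D joint locations.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```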

By releasing the code, the authors have facilitated replication and further exploration, making a valuable contribution to the field of 3D human pose estimation. Practically, this research enables more accurate human pose detection in applications that rely on multi-camera systems, such as motion capture and surveillance. Theoretically, it provides insights into multi-view learning and spatial reasoning in deep learning contexts.

Future developments may focus on adapting this framework to more complex scenes with dynamic backgrounds and extended testing on a diverse set of subjects and actions to assess its generalization capabilities. Furthermore, exploring adaptations that eliminate the need for camera calibration could enhance the system's applicability in less controlled environments. Integrating these advancements into commercial and industrial applications has the potential to revolutionize sectors relying on human pose analysis, such as entertainment, healthcare, and sports.