Single-Stage Multi-Person Pose Machines (1908.09220v1)

Published 24 Aug 2019 in cs.CV

Abstract: Multi-person pose estimation is a challenging problem. Existing methods are mostly two-stage based--one stage for proposal generation and the other for allocating poses to corresponding persons. However, such two-stage methods generally suffer low efficiency. In this work, we present the first single-stage model, Single-stage multi-person Pose Machine (SPM), to simplify the pipeline and lift the efficiency for multi-person pose estimation. To achieve this, we propose a novel Structured Pose Representation (SPR) that unifies person instance and body joint position representations. Based on SPR, we develop the SPM model that can directly predict structured poses for multiple persons in a single stage, and thus offer a more compact pipeline and attractive efficiency advantage over two-stage methods. In particular, SPR introduces the root joints to indicate different person instances and human body joint positions are encoded into their displacements w.r.t. the roots. To better predict long-range displacements for some joints, SPR is further extended to hierarchical representations. Based on SPR, SPM can efficiently perform multi-person poses estimation by simultaneously predicting root joints (location of instances) and body joint displacements via CNNs. Moreover, to demonstrate the generality of SPM, we also apply it to multi-person 3D pose estimation. Comprehensive experiments on benchmarks MPII, extended PASCAL-Person-Part, MSCOCO and CMU Panoptic clearly demonstrate the state-of-the-art efficiency of SPM for multi-person 2D/3D pose estimation, together with outstanding accuracy.

Citations (205)

Summary

  • The paper presents a novel single-stage framework that unifies person detection and joint localization.
  • It introduces a Structured Pose Representation with root joints and displacement maps to streamline estimation.
  • Empirical results on multiple benchmarks demonstrate state-of-the-art accuracy with significant speed improvements.

An Essay on Single-Stage Multi-Person Pose Machines

The paper "Single-Stage Multi-Person Pose Machines" presents a significant advancement in multi-person pose estimation by consolidating the estimation process into a single stage. This work diverges from conventional two-stage methodologies, which first generate person proposals and then assign poses to individual instances. While accurate, such two-stage approaches often suffer from inefficiency due to the added complexity of their sequential pipelines.

The authors propose a Single-stage multi-person Pose Machine (SPM) that simplifies the pose estimation process and improves computational efficiency. The core innovation is the Structured Pose Representation (SPR), which unifies person instance and body joint position information. This unification allows the poses of multiple persons to be predicted in a single computational step, eliminating the intermediate stages that traditionally separate person detection from joint localization. SPR uses "root joints" to indicate distinct person instances, and body joint locations are encoded as displacements relative to these roots. The representation is further refined into a hierarchical variant to better handle long-range displacements, which is crucial for estimating distal joints such as wrists and ankles.
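The hierarchical displacement encoding can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the joint names and the root→neck→shoulder→elbow→wrist chain below are assumed for the example, whereas the paper defines its own joint hierarchy over the full skeleton.

```python
import numpy as np

# Hypothetical hierarchy for illustration: each joint stores its
# displacement w.r.t. its parent; first-level joints hang off the root.
PARENT = {
    "neck": "root",
    "shoulder": "neck",
    "elbow": "shoulder",
    "wrist": "elbow",
}

def encode_spr(root, joints, parent=PARENT):
    """Encode absolute joint positions as displacements w.r.t. each
    joint's hierarchical parent (the root for first-level joints)."""
    positions = {j: np.asarray(p, dtype=float) for j, p in joints.items()}
    positions["root"] = np.asarray(root, dtype=float)
    return {j: positions[j] - positions[p] for j, p in parent.items()}

def decode_spr(root, displacements, parent=PARENT):
    """Recover absolute joint positions by accumulating displacements
    outward from the root along the hierarchy."""
    positions = {"root": np.asarray(root, dtype=float)}

    def resolve(joint):
        if joint not in positions:
            positions[joint] = resolve(parent[joint]) + displacements[joint]
        return positions[joint]

    for joint in parent:
        resolve(joint)
    return positions
```

Encoding and decoding are exact inverses, so the hierarchical form loses no information while keeping each individual displacement short, which is the motivation for the hierarchical extension of SPR.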

SPM leverages Convolutional Neural Networks (CNNs) built on the Hourglass architecture for end-to-end pose estimation. The network predicts root joint confidence maps together with dense body joint displacement maps, allowing it to robustly identify and localize the joints of multiple person instances within a single image, even in challenging scenarios with significant pose variation, occlusion, and cluttered backgrounds.
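To make the inference step concrete, the following sketch decodes poses from such predictions. The `(K, 2, H, W)` layout for the displacement maps and the simple 3x3 local-maximum test are illustrative assumptions; the paper's actual peak extraction and hierarchical displacement accumulation are more elaborate.

```python
import numpy as np

def decode_poses(root_map, disp_maps, thresh=0.5):
    """Greedy decoding sketch: treat local maxima of the root confidence
    map as person instances, then read each joint's (dx, dy) displacement
    from the dense displacement maps at the root location.

    root_map:  (H, W) root joint confidence map
    disp_maps: (K, 2, H, W) per-joint displacement maps (assumed layout)
    Returns a list of (root, joints) pairs in (x, y) coordinates.
    """
    H, W = root_map.shape
    K = disp_maps.shape[0]
    poses = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            patch = root_map[y - 1:y + 2, x - 1:x + 2]
            # A root is a confident local maximum of the confidence map.
            if root_map[y, x] >= thresh and root_map[y, x] == patch.max():
                root = np.array([x, y], dtype=float)
                # Each joint is the root shifted by its predicted displacement.
                joints = [root + disp_maps[k, :, y, x] for k in range(K)]
                poses.append((root, joints))
    return poses
```

Because grouping falls out of reading displacements at each root, no separate person-detection or joint-association stage is needed, which is the source of SPM's efficiency advantage.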

Empirical evaluations on the MPII, extended PASCAL-Person-Part, MSCOCO, and CMU Panoptic benchmarks substantiate the model's efficacy. On MPII, SPM achieves an mAP of 78.5%, outperforming prior methods while reducing inference time to 0.058 seconds per image. On the extended PASCAL-Person-Part dataset, SPM sets a new state of the art with an mAP of 46.1%. On MSCOCO, SPM maintains competitive accuracy with an AP of 0.669. Finally, on the CMU Panoptic dataset for 3D pose estimation, SPM delivers promising results with a 3D-PCK@150mm score of 77.8%, showing that the approach extends naturally to three-dimensional scenarios.

The implication of this research is multifaceted. Practically, it offers a more compact and computationally efficient framework for multi-person pose estimation that can be crucial for real-time applications in domains such as video surveillance, human-computer interaction, and virtual reality. Theoretically, the introduction of structured and hierarchical pose representations paves the way for further exploration into refining pose estimation models to handle more nuanced human body configurations and interactions within collective scenes.

Looking toward future developments, this single-stage methodology could be extended to other aspects of human activity recognition and scene understanding, potentially integrating other sensory modalities or accommodating greater complexity, such as rapidly changing scenes or more intricate human-object interactions. Ongoing improvements in the underlying algorithms and hardware could further broaden the applicability of such models in real-world settings.