Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation (2308.07313v1)

Published 14 Aug 2023 in cs.CV

Abstract: In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR. We present a simple yet effective transformer approach, named Group Pose. We simply regard $K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint positions, each from a keypoint query, as well as representing each pose with an instance query for scoring $N$ pose predictions. Motivated by the intuition that the interaction, among across-instance queries of different types, is not directly helpful, we make a simple modification to decoder self-attention. We replace single self-attention over all the $N\times(K+1)$ queries with two subsequent group self-attentions: (i) $N$ within-instance self-attention, with each over $K$ keypoint queries and one instance query, and (ii) $(K+1)$ same-type across-instance self-attention, each over $N$ queries of the same type. The resulting decoder removes the interaction among across-instance type-different queries, easing the optimization and thus improving the performance. Experimental results on MS COCO and CrowdPose show that our approach without human box supervision is superior to previous methods with complex decoders, and even is slightly better than ED-Pose that uses human box supervision. Paddle (https://github.com/Michel-liu/GroupPose-Paddle) and PyTorch (https://github.com/Michel-liu/GroupPose) code are available.
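
As a rough illustration of what the self-attention change does, the query-key pairs can be counted directly from the quantities $N$ and $K$ in the abstract. This is a back-of-the-envelope count, not a figure reported by the authors, and it makes no claim about measured speed; the paper motivates the change by easier optimization.

$$
\underbrace{\big(N(K+1)\big)^2}_{\text{single self-attention}}
\quad\text{vs.}\quad
\underbrace{N(K+1)^2}_{\text{within-instance}} + \underbrace{(K+1)\,N^2}_{\text{same-type across-instance}}
= N(K+1)(N+K+1)
$$

For example, with the 17 COCO keypoints ($K=17$) and an illustrative $N=4$, single self-attention covers $72^2 = 5184$ pairs, while the two group self-attentions cover $1296 + 288 = 1584$.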

Summary

The paper presents a transformer decoder that uses group self-attention to directly predict keypoint positions for end-to-end multi-person pose estimation.
It reports AP scores on MS COCO of 72.0 with ResNet-50 and 74.8 with Swin-Large, outperforming methods with more complex decoders.
By removing interactions between queries that differ in both instance and type, the simplified decoder is easier to optimize.

Summary of "Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation"

The paper "Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation" presents a transformer-based method for end-to-end multi-person pose estimation. The authors argue that existing state-of-the-art approaches rely on complex decoders and introduce a simpler alternative, termed "Group Pose", that improves accuracy while keeping the decoder design simple.

Core Methodology

The Group Pose framework departs from the more complex decoders of recent models such as PETR and ED-Pose. It treats $K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint positions, each from its own keypoint query, and adds one instance query per pose to score the $N$ pose predictions. The key change is in decoder self-attention: instead of a single self-attention over all $N\times(K+1)$ queries, the decoder applies two group self-attentions in sequence: (i) within-instance self-attention, run $N$ times, each over the $K$ keypoint queries and one instance query of a single pose, and (ii) same-type across-instance self-attention, run $K+1$ times, each over the $N$ queries of one type. This removes interactions between queries that differ in both instance and type, which the authors consider not directly helpful, and which they report eases optimization and improves performance.
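
The two groupings can be made concrete with a minimal sketch that builds them as boolean attention masks. This is not the authors' implementation: the function names are hypothetical, and the query layout (queries stored instance by instance, with the instance query first in each block) is an assumption made only for illustration.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when query i may attend to query j


def within_instance_mask(num_instances: int, num_keypoints: int) -> Mask:
    """Queries attend only to the K+1 queries of their own instance.

    Assumed layout: query index = instance * (K + 1) + type,
    where type 0 is the instance query and types 1..K are keypoints.
    """
    group = num_keypoints + 1
    size = num_instances * group
    return [[i // group == j // group for j in range(size)] for i in range(size)]


def same_type_mask(num_instances: int, num_keypoints: int) -> Mask:
    """Queries attend only to the N queries of the same type across instances."""
    group = num_keypoints + 1
    size = num_instances * group
    return [[i % group == j % group for j in range(size)] for i in range(size)]


def allowed_pairs(mask: Mask) -> int:
    """Count the query-key pairs a mask permits."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, k = 4, 17  # illustrative values: 4 poses, 17 keypoints
    within = within_instance_mask(n, k)
    across = same_type_mask(n, k)
    print("within-instance pairs:", allowed_pairs(within))
    print("same-type across-instance pairs:", allowed_pairs(across))
    # A pair that differs in both instance and type is blocked by both masks.
    i, j = 0, (k + 1) + 1  # instance 0 / type 0 vs. instance 1 / type 1
    print("cross-instance, cross-type pair allowed:", within[i][j] or across[i][j])
```

In a real decoder the same effect is usually obtained by reshaping the query tensor so that each group becomes its own batch entry, rather than by masking a full attention matrix; the mask form is used here only because it makes the removed interactions easy to inspect.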

Key Results

The proposed method outperforms previous end-to-end methods that use more complex decoders. On the MS COCO benchmark, Group Pose achieves an average precision (AP) of 72.0 with a ResNet-50 backbone and 74.8 with Swin-Large. It does this without human box supervision, yet is slightly better than ED-Pose, which does use human box supervision. Results are also reported on CrowdPose. Beyond the benchmark numbers, the method copes with difficult cases such as scale variations, motion blur, crowded scenes, and pose deformations.

Implications

Group Pose offers a competitive alternative for end-to-end multi-person pose estimation and points toward simpler model architectures. Removing unnecessary complexity from the decoder eases optimization and may also benefit training and inference efficiency. This matters for applications such as real-time video analysis and robotics, where fast and accurate human pose estimation is crucial.

Future Directions

The authors acknowledge some limitations, particularly in scenarios involving small parts of human instances. Future research could focus on integrating more sophisticated feature extraction techniques or dynamic attention mechanisms to address these edge cases. Additionally, the flexibility of the Group Pose framework provides a foundation for adaptation and integration with other tasks such as action recognition or multi-camera tracking in 3D spaces, broadening its applicability in advanced AI systems.

Overall, "Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation" is a solid step forward in pose estimation. It stresses the balance between simplicity and performance, which becomes more important as models scale to increasingly complex datasets and environments.
