Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation (2302.01593v1)

Published 3 Feb 2023 in cs.CV

Abstract: This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, where it unifies the contextual learning between human-level (global) and keypoint-level (local) information. Different from previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. It can provide a good initialization for the latter keypoint detection, making the training process converge fast. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem to learn both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. In general, ED-Pose is conceptually simple without post-processing and dense heatmap supervision. It demonstrates its effectiveness and efficiency compared with both two-stage and one-stage methods. Notably, explicit box detection boosts the pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with a L1 regression loss, ED-Pose surpasses heatmap-based Top-down methods under the same backbone by 1.2 AP on COCO and achieves the state-of-the-art with 76.6 AP on CrowdPose without bells and whistles. Code is available at https://github.com/IDEA-Research/ED-Pose.

Summary

The paper introduces ED-Pose, reformulating multi-person pose estimation as explicit box detection, unifying human detection and pose estimation for improved training convergence.
It employs a dual-decoder design that integrates global human features with local keypoint cues, eliminating the need for traditional post-processing steps like NMS.
Empirical results show that explicit box detection adds 4.5 AP on COCO and 9.9 AP on CrowdPose, and that ED-Pose reaches a state-of-the-art 76.6 AP on CrowdPose.

Overview of "Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation"

In "Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation," the authors present ED-Pose, a framework that reformulates multi-person pose estimation as two explicit box detection processes: one for humans and one for keypoints. Both share a unified box representation and regression supervision, so global human-level and local keypoint-level information are learned within a single end-to-end model. The stated goals are faster training convergence and more accurate pose estimation.
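To make the unified representation concrete, the following minimal sketch treats a human and a keypoint as the same kind of box and scores both with the same L1 regression loss. The `Box` class, the normalized (cx, cy, w, h) layout, the choice to read the keypoint position from the box center, and all numeric values are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical box in normalized (cx, cy, w, h) form, used for humans and keypoints alike."""
    cx: float
    cy: float
    w: float
    h: float

    def as_list(self) -> List[float]:
        return [self.cx, self.cy, self.w, self.h]


def l1_box_loss(pred: List[Box], target: List[Box]) -> float:
    """Mean absolute error over all box coordinates (plain L1 regression)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not pred:
        return 0.0
    total = 0.0
    for p, t in zip(pred, target):
        total += sum(abs(a - b) for a, b in zip(p.as_list(), t.as_list()))
    return total / (4 * len(pred))


def keypoint_position(box: Box) -> Tuple[float, float]:
    """Read the keypoint location off the box center (an assumed convention)."""
    return (box.cx, box.cy)


# Illustrative values only: one human box and one keypoint box, same format.
human_pred = Box(0.50, 0.50, 0.30, 0.80)
human_gt = Box(0.52, 0.50, 0.30, 0.76)
nose_pred = Box(0.50, 0.15, 0.05, 0.05)
nose_gt = Box(0.51, 0.14, 0.05, 0.05)

# The same loss function supervises both detection tasks.
human_loss = l1_box_loss([human_pred], [human_gt])
keypoint_loss = l1_box_loss([nose_pred], [nose_gt])
```

Because both tasks share one representation, no separate heatmap decoding step is needed to turn a prediction into a keypoint coordinate in this sketch; the position is read directly from the regressed box.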

Key Contributions

Unified Box Representation: The authors propose a method where both human detection and pose estimation are treated as explicit box detection tasks. This unification allows for consistent regression supervision, simplifying the training process and eliminating the reliance on post-processing steps such as Non-Maximum Suppression (NMS) and keypoint grouping.
Human Detection Decoder: A human detection decoder extracts global, human-level features from the encoded tokens. Its output gives the subsequent keypoint detection a good initialization, and this hierarchical decoding makes training converge faster.
Human-to-Keypoint Detection Decoder: This decoder aligns the detection tasks of human and keypoint features by implementing an interactive learning strategy. This approach aggregates global and local features in a differentiated manner, addressing the challenges of complex human poses, occlusions, and varying body part scales in crowded scenes.
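The global-to-local flow described in these contributions can be sketched roughly as follows: keep the highest-scoring human candidates, then seed one keypoint box per joint from each selected human box before a second decoder would refine them. The function names, the center-based seeding, the `scale` factor, and the candidate values are hypothetical; the actual model works with learned queries and attention layers, which this sketch does not attempt to reproduce.

```python
from typing import List, Tuple

BoxT = Tuple[float, float, float, float]  # (cx, cy, w, h), normalized (assumed layout)


def select_humans(boxes: List[BoxT], scores: List[float], top_k: int) -> List[BoxT]:
    """Keep the top_k human candidates by confidence score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in order[:top_k]]


def init_keypoint_boxes(human: BoxT, num_keypoints: int, scale: float = 0.1) -> List[BoxT]:
    """Seed keypoint boxes at the human box center with a scaled-down size.

    This seeding rule is an assumption for illustration only.
    """
    cx, cy, w, h = human
    return [(cx, cy, w * scale, h * scale) for _ in range(num_keypoints)]


def build_queries(boxes: List[BoxT], scores: List[float],
                  top_k: int, num_keypoints: int) -> List[List[BoxT]]:
    """Return one group per selected human: [human box, keypoint box 1, ..., keypoint box K]."""
    groups = []
    for human in select_humans(boxes, scores, top_k):
        groups.append([human] + init_keypoint_boxes(human, num_keypoints))
    return groups


# Three hypothetical human candidates; COCO annotates 17 keypoints per person.
candidates = [(0.3, 0.5, 0.2, 0.6), (0.7, 0.5, 0.2, 0.7), (0.5, 0.5, 0.9, 0.9)]
confidences = [0.92, 0.85, 0.10]
groups = build_queries(candidates, confidences, top_k=2, num_keypoints=17)
```

Each group holds one human box followed by its keypoint boxes, which mirrors the idea that human and keypoint features are refined together rather than grouped afterward.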

Empirical Results

The authors report results on COCO and CrowdPose. Explicit box detection improves pose estimation accuracy by 4.5 AP on COCO and 9.9 AP on CrowdPose. Trained as a fully end-to-end framework with an L1 regression loss, ED-Pose also surpasses heatmap-based top-down methods under the same backbone by 1.2 AP on COCO, and it reaches a state-of-the-art 76.6 AP on CrowdPose.

Theoretical and Practical Implications

ED-Pose presents a significant shift in paradigm by simplifying the pose estimation process into a unified end-to-end framework that eschews traditional heatmap supervision. The theoretical implication lies in demonstrating that global-to-local feature alignment can be efficiently achieved through interactive learning, challenging the conventional separation of detection and pose estimation tasks.

Practically, the authors report that ED-Pose is efficient as well as effective compared with both two-stage and one-stage methods. The human detection decoder gives keypoint detection a good initialization, which speeds up training convergence, and the model needs neither post-processing nor dense heatmap supervision. These properties may make ED-Pose a candidate for latency-sensitive applications in augmented reality (AR), virtual reality (VR), and human-computer interaction (HCI).

Future Directions

ED-Pose opens several avenues for future exploration. The methodology can potentially adapt to other domains where feature aggregation from global and local contexts is paramount. Furthermore, exploring different architectures or augmentations that enhance the interactive learning modules and the scalability of ED-Pose to 3D pose estimation could provide additional insights and improvements.

In summary, ED-Pose represents a meaningful advancement in the field of multi-person pose estimation by proposing a novel method that effectively unifies detection and pose estimation tasks within a streamlined and efficient framework.

GitHub

IDEA-Research/ED-Pose: [ICLR 2023] Official implementation of the paper "Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation" (176 stars)