BoIR: Box-Supervised Instance Representation for Multi-Person Pose Estimation (2309.14072v2)

Published 25 Sep 2023 in cs.CV

Abstract: Single-stage multi-person human pose estimation (MPPE) methods have shown great performance improvements, but existing methods fail to disentangle features by individual instances under crowded scenes. In this paper, we propose a bounding box-level instance representation learning called BoIR, which simultaneously solves instance detection, instance disentanglement, and instance-keypoint association problems. Our new instance embedding loss provides a learning signal on the entire area of the image with bounding box annotations, achieving globally consistent and disentangled instance representation. Our method exploits multi-task learning of bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding learning, without additional computational cost during inference. BoIR is effective for crowded scenes, outperforming state-of-the-art on COCO val (0.8 AP), COCO test-dev (0.5 AP), CrowdPose (4.9 AP), and OCHuman (3.5 AP). Code will be available at https://github.com/uyoung-jeong/BoIR

Summary

The paper presents BoIR, a single-stage method that uses bounding box-level contrastive instance embeddings to improve multi-person pose estimation in crowded scenes.
It trains bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding jointly, using a new Bbox Mask Loss to disentangle features per instance.
Reported gains over the prior state of the art are 0.8 AP on COCO val, 0.5 AP on COCO test-dev, 4.9 AP on CrowdPose, and 3.5 AP on OCHuman.

The paper "BoIR: Box-Supervised Instance Representation for Multi-Person Pose Estimation" proposes a bounding box-level instance representation learning scheme to improve multi-person human pose estimation (MPPE) in crowded scenes. The method, BoIR, addresses instance detection, instance disentanglement, and instance-keypoint association within a single-stage framework. It targets crowded scenes, where existing single-stage methods fail to disentangle features by individual instance.

Methodological Contributions

Bounding Box-Level Instance Representation (BoIR): The authors add a bounding box-level contrastive instance embedding loss to the MPPE framework. Because the loss is defined from bounding box annotations, it supplies a learning signal over the entire image area, which yields a globally consistent and disentangled instance representation.
Multi-Task Learning: The model is trained jointly on bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding learning. The additional tasks add no computational cost at inference.
Bbox Mask Loss: A contrastive loss, termed Bbox Mask Loss, disentangles features using bounding box annotations. It combines three terms: in-box pull, out-box push, and cross-instance push.
Enhanced Network Architecture: The network adds ASPP (atrous spatial pyramid pooling) for multi-resolution features and auxiliary task heads, which aid feature learning without degrading performance on the primary task.
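
The three terms named for the Bbox Mask Loss can be illustrated with a small NumPy sketch. This is not the paper's formulation: the summary gives only the term names, so the use of cosine similarity, the mean in-box embedding as each instance's reference vector, the hinge at zero, and the helper names `box_mask`, `l2_normalize`, and `bbox_contrastive_terms` are all assumptions made for illustration.

```python
import numpy as np


def box_mask(shape, box):
    """Boolean (H, W) mask, True inside box = (x0, y0, x1, y1), end-exclusive."""
    h, w = shape
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def bbox_contrastive_terms(embed, boxes):
    """Illustrative pull/push terms; NOT the formulation from the paper.

    embed: (H, W, D) per-pixel embedding map.
    boxes: list of non-empty integer boxes (x0, y0, x1, y1).
    """
    terms = {"in_box_pull": 0.0, "out_box_push": 0.0, "cross_instance_push": 0.0}
    if not boxes:
        return terms

    embed = l2_normalize(embed)
    h, w, _ = embed.shape
    refs = []
    for box in boxes:
        inside = box_mask((h, w), box)
        # Assumption: an instance is represented by its mean in-box embedding.
        ref = l2_normalize(embed[inside].mean(axis=0))
        refs.append(ref)
        sim = embed @ ref  # (H, W) cosine similarity to this instance
        # In-box pull: pixels inside the box should match the instance.
        terms["in_box_pull"] += float(np.mean(1.0 - sim[inside]))
        # Out-box push: pixels outside the box should not match it.
        outside = ~inside
        if outside.any():
            terms["out_box_push"] += float(np.mean(np.maximum(sim[outside], 0.0)))

    n = len(boxes)
    terms["in_box_pull"] /= n
    terms["out_box_push"] /= n
    if n > 1:
        # Cross-instance push: different instances should not match each other.
        gram = np.stack(refs) @ np.stack(refs).T
        off_diag = gram[~np.eye(n, dtype=bool)]
        terms["cross_instance_push"] = float(np.mean(np.maximum(off_diag, 0.0)))
    return terms


# Toy check: two side-by-side boxes on a 4x6 map with 2-D embeddings.
boxes = [(0, 0, 3, 4), (3, 0, 6, 4)]
separated = np.zeros((4, 6, 2))
separated[:, :3] = [1.0, 0.0]  # left instance
separated[:, 3:] = [0.0, 1.0]  # right instance
entangled = np.zeros((4, 6, 2))
entangled[:, :] = [1.0, 0.0]   # every pixel shares one embedding

print(bbox_contrastive_terms(separated, boxes))
print(bbox_contrastive_terms(entangled, boxes))
```

In the toy check, the map whose two boxes carry orthogonal embeddings scores close to zero on all three terms, while the map where every pixel shares one embedding is penalized by both push terms. Terms of this kind are evaluated only during training, which is consistent with the claim that the method adds no computational cost at inference.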

Experimental Results

Performance on Benchmark Datasets: BoIR outperforms prior state-of-the-art methods on COCO, CrowdPose, and OCHuman. The gains are largest on the crowded benchmarks: 0.8 AP on COCO val and 0.5 AP on COCO test-dev, compared with 4.9 AP on CrowdPose test and 3.5 AP on OCHuman.
Comparison with Existing Methods: Compared with alternatives such as CID and ED-Pose, BoIR shows higher AP and AR, especially in crowded scenes, which supports the bounding box-level instance representation approach.

Implications and Future Work

BoIR's use of bounding boxes for instance representation has practical and theoretical implications. Practically, it suits real-world applications such as surveillance and autonomous driving, where people are often occluded or overlapping. Theoretically, it adds to the understanding of how multi-task learning can be combined with contrastive losses.

Future work could explore further auxiliary tasks within the framework or incorporate other data modalities to learn richer feature representations. Representation learning on smaller datasets remains a limitation and an opportunity for improvement.

Overall, BoIR is a meaningful step forward for MPPE in complex, crowded scenes: it combines bounding box supervision with embedding-level learning to improve instance representation and prediction accuracy.
