- The paper presents a two-stage distillation process that enhances whole-body pose estimation, first transferring knowledge from a pre-trained teacher model and then refining the student through self-knowledge distillation.
- The first stage leverages weight-decay regulated supervision to align intermediate features and final logits for precise localization of body, hand, face, and foot keypoints.
- Evaluation on datasets like COCO-WholeBody shows notable AP improvements across all keypoint areas, underlining the method’s efficiency and broad applicability.
Effective Whole-body Pose Estimation with Two-stages Distillation
The paper "Effective Whole-body Pose Estimation with Two-stages Distillation" introduces a novel approach, named DWPose, to improving the accuracy and efficiency of whole-body pose estimation. The method integrates a two-stage distillation process into pose estimation, focusing on whole-body keypoint localization. It addresses challenges such as multi-scale body parts, fine-grained localization, and data scarcity, which complicate accurate estimation of human poses comprising body, hand, face, and foot keypoints.
Two-stage Distillation Method
- First-stage Distillation: In the first stage, a pre-trained teacher model supervises a student model by transferring knowledge through both intermediate features and final logits, covering visible and invisible keypoints alike. A weight-decay strategy gradually reduces the distillation weight over the course of training, balancing the distillation signal against the student's own task learning and thereby improving accuracy without sacrificing training efficiency.
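The first-stage objective described above can be sketched as a task loss plus decayed feature- and logit-distillation terms. This is an illustrative, minimal sketch: the function names, the linear decay schedule, and the use of mean-squared error are assumptions for clarity, not the authors' exact implementation.

```python
def distill_weight(epoch, total_epochs):
    """Weight-decay schedule (assumed linear): the distillation
    weight fades from 1.0 to 0.0 as training progresses."""
    return 1.0 - epoch / total_epochs

def mse(a, b):
    """Mean squared error between two flat sequences of values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def first_stage_loss(task_loss, student_feats, teacher_feats,
                     student_logits, teacher_logits, epoch, total_epochs):
    """Task loss plus decayed distillation terms. The logit term
    supervises all keypoints, visible and invisible alike."""
    gamma = distill_weight(epoch, total_epochs)
    fea_loss = mse(student_feats, teacher_feats)      # intermediate features
    logit_loss = mse(student_logits, teacher_logits)  # final logits
    return task_loss + gamma * (fea_loss + logit_loss)
```

At the final epoch `gamma` reaches zero, so the student finishes training on the task loss alone, which is the "balanced learning" effect the weight-decay mechanism aims for.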
- Second-stage Distillation: In this phase, the student refines itself via logit-based self-knowledge distillation, without any additional labeled data. Only the model's head is trained while the backbone parameters remain fixed, which makes the stage notably efficient: it requires merely 20% of the typical training duration.
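The head-only self-distillation step can be illustrated with a toy one-parameter model: the student's own earlier logits act as the teacher, the backbone weight is held constant, and a gradient step is taken only on the head weight. All names and the scalar setup here are illustrative assumptions, not the paper's architecture.

```python
def backbone(x, w_frozen):
    """Frozen feature extractor: w_frozen is never updated in stage two."""
    return [w_frozen * v for v in x]

def head(feats, w_head):
    """Trainable head mapping features to keypoint logits."""
    return [w_head * f for f in feats]

def self_distill_step(x, w_frozen, w_head, teacher_logits, lr=0.01):
    """One gradient step on the mean-squared logit-matching loss,
    updating only the head weight (the backbone stays fixed)."""
    feats = backbone(x, w_frozen)
    logits = head(feats, w_head)
    # Analytic gradient of mean((logits - teacher_logits)^2) w.r.t. w_head
    grad = sum(2 * (l - t) * f
               for l, t, f in zip(logits, teacher_logits, feats)) / len(feats)
    return w_head - lr * grad
```

Repeating this step drives the head to reproduce the teacher logits; because gradients never touch the backbone, each iteration is cheap, consistent with the stage's short training budget.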
Dataset Utilization and Performance
The paper also incorporates the UBody dataset, whose real-life scenes with diverse facial expressions and hand gestures make it particularly valuable for practical applications. Evaluated on datasets such as COCO-WholeBody, distillation notably improves whole-body Average Precision (AP) over the baseline: for instance, RTMPose-l (large) improves from 64.8% to 66.5% AP, surpassing the teacher model's performance. Gains are likewise observed across body, foot, face, and hand keypoints, demonstrating the efficacy of the distillation and the contribution of the additional data to fine-grained pose detection.
Implications and Future Scope
This research posits implications for various human-centric generation and understanding tasks, such as virtual content creation, AR/VR applications, and human-object interaction. The introduction of models spanning different sizes caters to the varied requirements of downstream applications. The findings can influence future trends in AI by highlighting the efficacy of distillation techniques in model compression and performance optimization.
By releasing stronger pose estimation models, the paper establishes a foundation for future methodologies that could integrate more sophisticated distillation techniques or explore alternative datasets to further enhance pose estimation. Additionally, DWPose can benefit vision pipelines for pose-conditioned human image generation, which often rely on skeleton-guided generation. With the models and code publicly released, the work also lowers the barrier to further exploration and advancement within the AI research community.