AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

Published 26 Mar 2024 in cs.CV (arXiv:2403.17934v1)

Abstract: Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves estimating the human body, hands, and facial expression. Most existing methods tackle this task in a two-stage manner, first detecting each person with an off-the-shelf detection model and then inferring the different body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) introduced distractions, and 3) a lack of inter-association among different persons and body parts, inevitably causing performance degradation, especially in crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multiple expressive human pose and shape recovery without an additional human detection step. Specifically, our method is built upon DETR, which treats the multi-person whole-body mesh recovery task as a progressive set prediction problem with sequential detections. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer blocks. Then, we introduce joint-related tokens to probe human joints in the image and encode fine-grained local features, which collaborate with the global features to regress the whole-body mesh. This straightforward but effective model outperforms previous state-of-the-art methods, with a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a 3% reduction in PVE on EgoBody.


Summary

  • The paper introduces AiOS, a single-stage framework that leverages a progressive detection and decoding strategy to integrate global and local features for expressive human pose and shape estimation.
  • The paper achieves state-of-the-art performance on benchmarks such as AGORA by significantly improving NMVE, PVE, and other key metrics without using ground truth bounding boxes.
  • The paper opens new avenues for future research in EHPS and generative AI by addressing limitations of multi-stage methods and improving accuracy in crowded scenes.

All-in-One-Stage Expressive Human Pose and Shape Estimation (AiOS)

Introduction

Expressive Human Pose and Shape Estimation (EHPS), which extends beyond conventional human pose and shape estimation to include hand gestures and facial expressions, has seen substantial advancements, but not without challenges. Standard approaches rely on multi-stage methods, which first detect humans and subsequently infer poses and shapes. While delivering impressive results, these techniques suffer from drawbacks such as loss of context from cropping, the introduction of distractions, and a lack of inter- and intra-person correlations, especially in crowded scenes. Addressing these issues, this paper introduces an all-in-one-stage framework, named AiOS, which eliminates the need for an additional human detection step and pioneers a single-stage approach for multi-person EHPS.

Historically, EHPS has relied on parametric models like SMPL-X, with traditional methods adopting a multi-stage framework: body parts are first detected and then inferred separately. However, these methods face challenges in complexity, accuracy, and the integration of body parts. One-stage methods, while previously proposed for human pose and shape estimation (HPS), have not fully addressed the EHPS task because their reliance on global features leaves insufficient capacity for accurate part-wise regression. AiOS is designed to fill this gap, incorporating both global and local human features for EHPS within a one-stage model.

AiOS Framework

Built upon the DETR architecture, AiOS combines a CNN backbone with a transformer encoder-decoder. It employs a progressive detection and decoding strategy with three main stages:

  • Body localization stage: Utilizes a human token to encode global features and provide coarse locations.
  • Body refinement stage: Introduces joint-related tokens for encoding fine-grained local features.
  • Whole-body refinement stage: Further refines features for accurate whole-body mesh regression.

This progressive approach, leveraging both global and local features, demonstrated superior performance in experiments, achieving state-of-the-art (SOTA) results on multiple benchmarks without relying on ground truth bounding boxes.
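As a rough illustration of this progressive token scheme, the NumPy sketch below mimics the three stages with a toy single-head cross-attention. The token counts, feature dimensions, and the final linear "regression head" are illustrative assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(queries, feats):
    """Toy single-head cross-attention: each query token gathers a
    softmax-weighted sum of the image feature tokens."""
    scores = queries @ feats.T / np.sqrt(queries.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ feats

d, n_feats, n_humans, n_joints = 32, 196, 4, 17
feats = rng.standard_normal((n_feats, d))        # flattened encoder feature tokens

# Stage 1 -- body localization: human tokens probe coarse, global per-instance features.
human_tokens = rng.standard_normal((n_humans, d))
global_feat = attend(human_tokens, feats)                        # (n_humans, d)

# Stage 2 -- body refinement: per-human joint tokens probe fine-grained local features.
joint_tokens = rng.standard_normal((n_humans * n_joints, d))
local_feat = attend(joint_tokens, feats).reshape(n_humans, n_joints, d)

# Stage 3 -- whole-body refinement: fuse global and local features, then
# regress parameters (a stand-in for the whole-body mesh regression head).
fused = np.concatenate(
    [np.broadcast_to(global_feat[:, None], (n_humans, n_joints, d)), local_feat],
    axis=-1,
).reshape(n_humans, -1)                                          # (n_humans, n_joints * 2d)
head = rng.standard_normal((n_joints * 2 * d, 10))               # toy regression head
params = fused @ head                                            # (n_humans, 10)
print(params.shape)
```

The point of the sketch is the data flow: coarse per-person features from stage 1 are broadcast to every joint token and fused with the local features before regression, mirroring how the global and local features collaborate.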

Experiments and Results

AiOS was evaluated extensively across several benchmarks, outperforming previous methods on NMVE, PVE, and other metrics on datasets including AGORA, EHF, ARCTIC, and EgoBody. Particularly notable is its performance on the AGORA benchmark, where the accuracy of AiOS's own predicted bounding boxes significantly improved results compared to conventional two-stage methods that depend on a separate detector. These results underscore AiOS's ability to handle crowded scenes and complex interactions more accurately than existing models.
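For reference, the reported metrics can be sketched as follows. This is a minimal illustration that assumes predicted and ground-truth meshes are already matched and aligned (the real benchmarks additionally handle detection matching and alignment); NMVE follows AGORA's convention of dividing the mean vertex error by the detection F1 score.

```python
import numpy as np

def pve(pred_vertices, gt_vertices):
    """Per-Vertex Error: mean Euclidean distance between predicted and
    ground-truth mesh vertices (typically reported in mm)."""
    return np.linalg.norm(pred_vertices - gt_vertices, axis=-1).mean()

def nmve(mve, detection_f1):
    """Normalized Mean Vertex Error (AGORA): mean vertex error divided by
    the detection F1 score, penalizing missed or spurious detections."""
    return mve / detection_f1

gt = np.zeros((6890, 3))               # an SMPL body mesh has 6890 vertices
pred = gt + np.array([3.0, 0.0, 4.0])  # constant 5 mm offset per vertex
print(pve(pred, gt))                   # 5.0
print(nmve(100.0, 0.8))                # 125.0
```

Dividing by F1 means a method cannot improve its NMVE by simply skipping hard (e.g., occluded) people, which is exactly where a detection-free, one-stage method has an advantage.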

Implications and Future Directions

The introduction of AiOS as an all-in-one-stage framework for expressive human pose and shape estimation provides not just a methodological advancement but also opens up broader implications for the field of generative AI. By addressing the limitations of traditional multi-stage methods, including the handling of crowded scenes and the integration of local features for accurate pose and shape estimation, AiOS sets a new benchmark for future research. Looking ahead, there's potential for further exploration into extending this approach to include dynamic scenes, improve the model's efficiency, and investigate other applications of DETR-based frameworks in generative AI tasks.

Conclusion

AiOS represents a significant step forward in EHPS, providing an effective one-stage method that incorporates both global and local features for expressive human pose and shape estimation. With its demonstrated proficiency across multiple benchmarks, AiOS heralds a new direction for research in human understanding systems, offering a blueprint for future developments in AI-based human modeling and interaction analysis.
