AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

Published 26 Mar 2024 in cs.CV (arXiv:2403.17934v1)

Abstract: Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves estimating the human body, hands, and facial expression. Most existing methods tackle this task in a two-stage manner, first detecting each person with an off-the-shelf detection model and then inferring the different body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) introduced distractions, and 3) a lack of inter-association among different persons and body parts, inevitably causing performance degradation, especially in crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multiple expressive human pose and shape recovery without an additional human detection step. Specifically, our method is built upon DETR, which treats the multi-person whole-body mesh recovery task as a progressive set prediction problem with sequential detections. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer blocks. Then, we introduce joint-related tokens to probe human joints in the image and encode fine-grained local features, which collaborate with the global features to regress the whole-body mesh. This straightforward but effective model outperforms previous state-of-the-art methods, with a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a 3% reduction in PVE on EgoBody.


Summary

  • The paper introduces AiOS, a single-stage framework that leverages a progressive detection and decoding strategy to integrate global and local features for expressive human pose and shape estimation.
  • The paper achieves state-of-the-art performance on benchmarks such as AGORA by significantly improving NMVE, PVE, and other key metrics without using ground truth bounding boxes.
  • The paper opens new avenues for future research in EHPS and generative AI by addressing limitations of multi-stage methods and improving accuracy in crowded scenes.

All-in-One-Stage Expressive Human Pose and Shape Estimation (AiOS)

Introduction

Expressive Human Pose and Shape Estimation (EHPS), which extends beyond conventional human pose and shape estimation to include hand gestures and facial expressions, has seen substantial advancements, but not without challenges. Standard approaches rely on multi-stage methods, which first detect humans and subsequently infer poses and shapes. While delivering impressive results, these techniques suffer from drawbacks such as loss of context from cropping, the introduction of distractions, and a lack of inter- and intra-person correlations, especially in crowded scenes. Addressing these issues, this paper introduces an all-in-one-stage framework, named AiOS, which eliminates the need for an additional human detection step and pioneers a single-stage approach for multi-person EHPS.

Historically, EHPS has relied on parametric models like SMPL-X, with traditional methods adopting a multi-stage framework: body parts are first detected and then inferred separately. However, these methods face challenges in complexity, accuracy, and the integration of body parts. One-stage methods, while previously proposed for human pose and shape estimation (HPS), have not fully addressed the EHPS task because their reliance on global features leaves insufficient capacity for accurate part-wise regression. AiOS is designed to fill this gap, incorporating both global and local human features for EHPS within a one-stage model.

AiOS Framework

Built upon the DETR architecture, AiOS combines a CNN backbone with a transformer encoder-decoder. It employs a progressive detection and decoding strategy with three main stages:

  • Body localization stage: Utilizes a human token to encode global features and provide coarse locations.
  • Body refinement stage: Introduces joint-related tokens for encoding fine-grained local features.
  • Whole-body refinement stage: Further refines features for accurate whole-body mesh regression.

This progressive approach, leveraging both global and local features, demonstrated superior performance in experiments, achieving state-of-the-art (SOTA) results on multiple benchmarks without relying on ground truth bounding boxes.
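As a rough illustration of this progressive token scheme, the NumPy sketch below mimics the three stages with a toy single-head cross-attention. The token counts, feature dimensions, and the final linear "regression head" are illustrative assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(queries, feats):
    """Toy single-head cross-attention: each query token gathers a
    softmax-weighted sum of the image feature tokens."""
    scores = queries @ feats.T / np.sqrt(queries.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ feats

d, n_feats, n_humans, n_joints = 32, 196, 4, 17
feats = rng.standard_normal((n_feats, d))        # flattened encoder feature tokens

# Stage 1 -- body localization: human tokens probe coarse, global per-instance features.
human_tokens = rng.standard_normal((n_humans, d))
global_feat = attend(human_tokens, feats)                        # (n_humans, d)

# Stage 2 -- body refinement: per-human joint tokens probe fine-grained local features.
joint_tokens = rng.standard_normal((n_humans * n_joints, d))
local_feat = attend(joint_tokens, feats).reshape(n_humans, n_joints, d)

# Stage 3 -- whole-body refinement: fuse global and local features, then
# regress parameters (a stand-in for the whole-body mesh regression head).
fused = np.concatenate(
    [np.broadcast_to(global_feat[:, None], (n_humans, n_joints, d)), local_feat],
    axis=-1,
).reshape(n_humans, -1)                                          # (n_humans, n_joints * 2d)
head = rng.standard_normal((n_joints * 2 * d, 10))               # toy regression head
params = fused @ head                                            # (n_humans, 10)
print(params.shape)
```

The point of the sketch is the data flow: coarse per-person features from stage 1 are broadcast to every joint token and fused with the local features before regression, mirroring how the global and local features collaborate.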

Experiments and Results

AiOS was evaluated extensively across several benchmarks, outperforming previous methods on NMVE, PVE, and other metrics on datasets including AGORA, EHF, ARCTIC, and EgoBody. Particularly notable is its performance on the AGORA benchmark, where the accuracy of AiOS's own predicted bounding boxes significantly improved results compared to conventional two-stage methods that depend on a separate detector. These results underscore AiOS's ability to handle crowded scenes and complex interactions more accurately than existing models.
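For reference, the reported metrics can be sketched as follows. This is a minimal illustration that assumes predicted and ground-truth meshes are already matched and aligned (the real benchmarks additionally handle detection matching and alignment); NMVE follows AGORA's convention of dividing the mean vertex error by the detection F1 score.

```python
import numpy as np

def pve(pred_vertices, gt_vertices):
    """Per-Vertex Error: mean Euclidean distance between predicted and
    ground-truth mesh vertices (typically reported in mm)."""
    return np.linalg.norm(pred_vertices - gt_vertices, axis=-1).mean()

def nmve(mve, detection_f1):
    """Normalized Mean Vertex Error (AGORA): mean vertex error divided by
    the detection F1 score, penalizing missed or spurious detections."""
    return mve / detection_f1

gt = np.zeros((6890, 3))               # an SMPL body mesh has 6890 vertices
pred = gt + np.array([3.0, 0.0, 4.0])  # constant 5 mm offset per vertex
print(pve(pred, gt))                   # 5.0
print(nmve(100.0, 0.8))                # 125.0
```

Dividing by F1 means a method cannot improve its NMVE by simply skipping hard (e.g., occluded) people, which is exactly where a detection-free, one-stage method has an advantage.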

Implications and Future Directions

The introduction of AiOS as an all-in-one-stage framework for expressive human pose and shape estimation provides not just a methodological advancement but also opens up broader implications for the field of generative AI. By addressing the limitations of traditional multi-stage methods, including the handling of crowded scenes and the integration of local features for accurate pose and shape estimation, AiOS sets a new benchmark for future research. Looking ahead, there's potential for further exploration into extending this approach to include dynamic scenes, improve the model's efficiency, and investigate other applications of DETR-based frameworks in generative AI tasks.

Conclusion

AiOS represents a significant step forward in EHPS, providing an effective one-stage method that incorporates both global and local features for expressive human pose and shape estimation. With its demonstrated proficiency across multiple benchmarks, AiOS heralds a new direction for research in human understanding systems, offering a blueprint for future developments in AI-based human modeling and interaction analysis.
