Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot (2402.14654v2)

Published 22 Feb 2024 in cs.CV

Abstract: We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on $448{\times}448$ images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.

Citations (7)

Summary

  • The paper introduces a novel single-shot model that integrates a Vision Transformer and cross-attention for efficient multi-person mesh recovery.
  • It leverages the SMPL-X model and the new CUFFS dataset to enhance whole-body pose estimation, including detailed hand poses and facial expressions.
  • The approach optionally adapts to camera intrinsics and scales to real-time processing, setting a new state of the art in human pose estimation.

Multi-HMR Advances Multi-Person Whole-Body Human Mesh Recovery

Introduction

Multi-person whole-body human mesh recovery from a single RGB image is challenging: it requires capturing expressive body poses, handling a variable number of people per scene, estimating each person's 3D location, and adapting to camera-specific information when it is available. Multi-HMR introduces a robust single-shot model that addresses these challenges. Using the SMPL-X parametric model, it predicts whole-body poses, including facial expressions and hand poses, directly from a single image, setting a new state of the art in human mesh recovery.
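To make the SMPL-X output representation concrete, here is a minimal sketch of how SMPL-X parameters map to a whole-body mesh, using the public `smplx` Python package. The parameter shapes below are the package defaults and serve only as an illustration; Multi-HMR's exact parameterization may differ, and a local directory of SMPL-X model files is assumed.

```python
# Hedged sketch: decoding SMPL-X parameters into a whole-body mesh with the
# public `smplx` package (https://github.com/vchoutas/smplx).
import torch
import smplx

model = smplx.create(
    model_path="models",     # assumed directory containing the SMPL-X model files
    model_type="smplx",
    gender="neutral",
    use_pca=False,           # full hand articulation instead of PCA components
    batch_size=1,
)

output = model(
    betas=torch.zeros(1, 10),              # body shape coefficients
    global_orient=torch.zeros(1, 3),       # root rotation (axis-angle)
    body_pose=torch.zeros(1, 21 * 3),      # 21 body joints, axis-angle
    left_hand_pose=torch.zeros(1, 15 * 3),
    right_hand_pose=torch.zeros(1, 15 * 3),
    jaw_pose=torch.zeros(1, 3),
    expression=torch.zeros(1, 10),         # facial expression coefficients
    return_verts=True,
)
vertices = output.vertices   # (1, 10475, 3) mesh vertices in the model frame
joints = output.joints       # corresponding 3D joint locations
```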

Architecture and Method

Multi-HMR uses a Vision Transformer (ViT) backbone to take advantage of recent advances in large-scale self-supervised pretraining. On top of it, the Human Prediction Head (HPH), a cross-attention module, improves on earlier regression-based approaches. Unlike previous methods that split human detection and mesh recovery across multiple stages, Multi-HMR handles both in a single forward pass. The methodology includes the following key features (a schematic code sketch of the forward pass follows the list):

  • Single-Shot Detection and Regression: Multi-HMR detects multiple humans from coarse 2D heatmaps of person locations and regresses whole-body mesh parameters with a cross-attention mechanism, improving both efficiency and accuracy.
  • CUFFS Dataset Integration: The new CUFFS dataset (Close-Up Frames of Full-Body Subjects) contains humans close to the camera with diverse hand poses, and its inclusion in training particularly improves hand pose prediction.
  • Adaptation to Camera Intrinsics: When camera intrinsics are available, Multi-HMR encodes camera ray directions for each image token, adapting to different camera settings without compromising generalization.
  • Scalability and Real-Time Processing: The design scales with input resolution and backbone size; a ViT-S backbone at $448{\times}448$ already yields a fast, competitive model, while larger backbones and higher resolutions reach state-of-the-art accuracy.
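The sketch below illustrates the single-shot flow described above: ViT patch features are scored with a per-token person heatmap, the detected tokens become person queries, and a cross-attention head regresses SMPL-X parameters and a 3D location per person. Module names, feature dimensions, the detection threshold, and the parameter-vector size are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HumanPredictionHeadSketch(nn.Module):
    """Illustrative stand-in for Multi-HMR's detection + HPH stage (not the official code).

    `patch_feats` are ViT backbone features of shape (batch, tokens, dim),
    optionally already augmented with camera-ray embeddings (see below).
    """

    def __init__(self, dim=768, n_smplx_params=185, n_heads=8):
        super().__init__()
        # Coarse per-token "person center" score, used as a 2D detection heatmap.
        self.score_head = nn.Linear(dim, 1)
        # One query per detected person attends to the full set of image tokens.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Whole-body SMPL-X pose/shape/expression vector; exact size is an assumption.
        self.param_head = nn.Linear(dim, n_smplx_params)
        # 3D location of each person in camera coordinates.
        self.loc_head = nn.Linear(dim, 3)

    def forward(self, patch_feats, det_threshold=0.5):
        scores = self.score_head(patch_feats).squeeze(-1).sigmoid()   # (B, N) heatmap
        keep = scores > det_threshold                                 # detected person tokens
        outputs = []
        for b in range(patch_feats.shape[0]):
            queries = patch_feats[b][keep[b]].unsqueeze(0)            # (1, P, dim)
            if queries.shape[1] == 0:
                outputs.append(None)                                  # no detections in this image
                continue
            attended, _ = self.cross_attn(                            # HPH cross-attention
                query=queries,
                key=patch_feats[b].unsqueeze(0),
                value=patch_feats[b].unsqueeze(0),
            )
            outputs.append({
                "smplx_params": self.param_head(attended).squeeze(0),  # (P, n_smplx_params)
                "location": self.loc_head(attended).squeeze(0),        # (P, 3)
            })
        return outputs

# Dummy usage: random patch features for a batch of two images, 784 tokens each.
hph = HumanPredictionHeadSketch()
preds = hph(torch.randn(2, 784, 768))
```

For the optional camera-intrinsics input, one plausible realization of "encoding camera ray directions for each image token" is to back-project each patch center through the inverse intrinsics and add a projection of the resulting unit ray to that token's feature; the helper below is a hedged sketch of that idea, not the paper's exact encoding.

```python
import torch

def ray_direction_embedding(K, grid_h, grid_w, patch_size):
    """Unit camera-ray direction per patch token, given intrinsics K (3x3)."""
    ys, xs = torch.meshgrid(
        (torch.arange(grid_h) + 0.5) * patch_size,   # pixel coordinates of patch centers
        (torch.arange(grid_w) + 0.5) * patch_size,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # homogeneous pixel coords
    rays = pix @ torch.linalg.inv(K).T                          # back-project through K^-1
    rays = rays / rays.norm(dim=-1, keepdim=True)               # normalize to unit directions
    return rays.reshape(-1, 3)                                  # (grid_h * grid_w, 3), one per token
```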

Theoretical Implications

The successful integration of a cross-attention mechanism within the Human Prediction Head represents a novel approach to human mesh recovery. This leap forward suggests potential for further exploration of attention-based architectures in human pose estimation tasks. Additionally, employing the SMPL-X model within a single-shot, multi-person framework highlights the versatility and effectiveness of parametric models in capturing complex, whole-body human dynamics.

Future Directions

Despite the strong performance, areas for further research and improvement have been identified. The challenge of detecting and accurately reconstructing partially occluded or truncated humans presents an opportunity for future work, potentially involving more advanced occlusion handling techniques or adaptive detection thresholds. Moreover, exploring alternative human pose representations may yield additional gains in accuracy and model robustness. Finally, the rapid advancements in self-supervised and transformer-based models offer promising avenues for enhancing backbone architectures, with implications for efficiency and scalability in multi-person human mesh recovery.

Conclusion

Multi-HMR, with its single-shot, multi-person approach, sets a new benchmark in whole-body human mesh recovery. By effectively addressing key challenges in the field, it offers significant improvements in terms of efficiency, adaptability, and accuracy. Future iterations of the model, incorporating further enhancements and optimizations, are poised to push the boundaries of what is achievable in human pose and shape estimation technology.