- The paper introduces SMPLer-X, a generalist foundation model that scales human pose and shape estimation with extensive data and model scaling.
- It leverages Vision Transformer backbones and systematically benchmarks 32 diverse EHPS datasets to identify the most effective training sources.
- SMPLer-X achieves state-of-the-art results, including a 107.2 mm NMVE on AGORA, demonstrating robust transferability to unseen environments.
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation
The paper "SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation" provides a detailed exploration of the field of Expressive Human Pose and Shape Estimation (EHPS) by scaling up both the data and the model architecture. This research is noteworthy for its attempt to construct the first generalist foundation model in this domain, named SMPLer-X, leveraging Vision Transformers (ViTs) as backbones and extensive training on a diverse dataset assembly.
Key Contributions
- Data-Driven Enhancement: The paper undertakes a systematic evaluation of 32 EHPS datasets, encompassing a wide range of scenarios, to identify the most impactful datasets for use in training. Through this comprehensive analysis, they highlight significant discrepancies among individual datasets which underscore the complexities inherent in EHPS tasks.
- Model Scaling Exploration: By employing Vision Transformers of increasing sizes, the paper investigates the scaling laws pertinent to model sizes within the EHPS context. They strategically fine-tune SMPLer-X, transitioning it into specialist models, achieving remarkable improvements in performance metrics.
- Benchmarking and Transferability: SMPLer-X exhibits exceptional performance across various benchmarks, achieving state-of-the-art results. Notably, SMPLer-X reached 107.2 mm NMVE on AGORA, a significant leap over previous benchmarks, highlighting its robustness and transferability to unseen environments.
Methodology
SMPLer-X’s architecture is minimalistic yet highly efficient, designed to be scalable and adaptable. It includes:
- Backbone: A Vision Transformer that processes images into feature tokens.
- Neck: Intermediate layers that predict region proposals for hands and face.
- Heads: Separate modules to estimate parameters for the body, hands, and face.
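The backbone–neck–heads pipeline above can be sketched in code. This is a minimal, illustrative mock-up of the data flow only: the dimensions, function names, and random projections are assumptions for clarity, not the authors' actual implementation (the real model uses trained ViT weights and learned regression heads).

```python
import numpy as np

rng = np.random.default_rng(0)

def vit_backbone(image, patch=16, dim=64):
    """Stand-in ViT backbone: split the image into patches, one feature token each.
    (Illustrative linear embedding; a real ViT adds positional encodings and
    transformer blocks.)"""
    h, w, c = image.shape
    tokens = image.reshape(h // patch, patch, w // patch, patch, c)
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    embed = rng.standard_normal((patch * patch * c, dim)) * 0.01
    return tokens @ embed  # (num_tokens, dim) feature tokens

def neck(tokens):
    """Neck: predict coarse region proposals (boxes) for the hands and face."""
    pooled = tokens.mean(axis=0)
    proj = rng.standard_normal((pooled.size, 3 * 4)) * 0.01
    boxes = (pooled @ proj).reshape(3, 4)  # each row: (x, y, w, h)
    return dict(zip(["left_hand", "right_hand", "face"], boxes))

def heads(tokens):
    """Heads: separate regression modules for body, hand, and face parameters.
    Output sizes follow the standard SMPL-X parameterization."""
    pooled = tokens.mean(axis=0)
    out_dims = {"body_pose": 63, "hand_pose": 90, "jaw_pose": 3,
                "shape": 10, "expression": 10}
    params = {}
    for name, d in out_dims.items():
        w = rng.standard_normal((pooled.size, d)) * 0.01
        params[name] = pooled @ w
    return params

image = rng.standard_normal((256, 192, 3))   # a 256x192 RGB crop
tokens = vit_backbone(image)                 # (192, 64)
proposals = neck(tokens)
smplx_params = heads(tokens)
```

The appeal of this minimalist design is that scaling the model largely reduces to swapping in a bigger ViT backbone, leaving the neck and heads unchanged.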
Data Scaling
The systematic investigation encompassed datasets across vastly different environments, poses, and visibility scenarios. Noteworthy insights include:
- Quantity vs. Quality: Beyond roughly 100K instances per dataset, additional data yields diminishing returns; diversity of scenes and capture conditions matters more than sheer volume.
- Synthetic Data Utilization: Synthetic datasets are surprisingly effective, with images closely mimicking real-world scenarios, thus efficiently bridging domain gaps.
- Pseudo SMPL-X Labels: These are beneficial when ground truth annotations are unavailable, though care must be taken due to differences in parameter spaces (especially between SMPL and SMPL-X).
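One practical consequence of these insights is weighted sampling across the selected datasets during training. The sketch below shows one simple way to draw mixed batches; the dataset names and weights are hypothetical placeholders, not the paper's actual selection or sampling scheme.

```python
import random

# Hypothetical training mix: name -> (num_instances, sampling_weight).
# Weights reflect the ranking insight that diverse/synthetic sources can
# outweigh raw dataset size.
datasets = {
    "synthetic_multiperson": (1_000_000, 3.0),
    "real_studio_mocap":     (500_000,   2.0),
    "in_the_wild_pseudo":    (750_000,   1.0),  # pseudo SMPL-X labels
}

def sample_batch(datasets, batch_size, seed=0):
    """Draw a batch whose dataset composition follows the per-dataset weights."""
    rng = random.Random(seed)
    names = list(datasets)
    weights = [datasets[n][1] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)

batch = sample_batch(datasets, batch_size=8)
```

A weighted sampler like this lets small but diverse datasets contribute meaningfully per epoch instead of being drowned out by the largest sources.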
Model Scaling
Through the use of ViT-Small, Base, Large, and Huge backbones, SMPLer-X was trained progressively with datasets scaling from 0.75M to 4.5M instances. The larger models demonstrated reduced mean primary errors and faster convergence, albeit with diminishing returns beyond certain sizes.
Experimental Results
Benchmark Performance
The performance of SMPLer-X was validated across multiple benchmarks:
- AGORA: SMPLer-X achieved state-of-the-art results with 107.2 mm NMVE.
- UBody: NMJE improved notably, reflecting gains in both detection and joint accuracy.
- EgoBody: Despite scene-specific complexity, SMPLer-X performed remarkably well in occlusion-heavy scenarios.
- 3DPW and EHF: Showcased strong generalization capabilities, with consistent performance boosts.
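For context on the AGORA numbers above: NMVE (Normalized Mean Vertex Error) divides the mean per-vertex reconstruction error by the detection F1 score, so a method is penalized for missed detections as well as for reconstruction error. The formula follows the AGORA benchmark's definition; the example numbers below are purely illustrative.

```python
def nmve(mean_vertex_error_mm, detection_f1):
    """Normalized Mean Vertex Error, per the AGORA benchmark: MVE / F1."""
    return mean_vertex_error_mm / detection_f1

# Illustrative: an MVE of 100 mm with a detection F1 of 0.8 normalizes
# to a higher (worse) 125 mm NMVE.
result = nmve(100.0, 0.8)  # -> 125.0
```

This normalization is why NMVE rewards methods that are both accurate and reliable at finding people in crowded scenes.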
Transferability
- ARCTIC and DNA-Rendering: The foundation model achieved superior results on these held-out datasets, underscoring effective knowledge transfer to unseen scenarios.
Implications and Future Directions
The implications of this work are twofold:
- Practical Applications: The generalist foundation model SMPLer-X significantly simplifies the deployment of EHPS systems across various industries, such as animation, gaming, and virtual try-on applications.
- Theoretical Advancements: The insights gained from data and model scaling provide a robust baseline for future research, urging the field towards more generalized, data-efficient models.
Future developments may involve deeper explorations into architectural innovations beyond ViTs, adaptation strategies for even larger synthetic datasets, and the integration of multimodal data (e.g., combining RGB with depth sensors).
Conclusion
This paper is a comprehensive examination of the prospects opened up by scaling EHPS models and data. By constructing SMPLer-X, the first generalist foundation model in EHPS, the research not only demonstrates superior benchmark performance but also paves the way for scalable, transferable pose and shape estimation frameworks. The findings enrich the current understanding and provide a solid foundation upon which future advancements can be built.