- The paper introduces ViTPose, a novel method using plain vision transformers for human pose estimation without domain-specific modifications.
- It scales from about 100 million to 1 billion parameters while retaining high parallelism, achieving competitive performance with a deliberately simple design.
- The approach also transfers knowledge from large models to smaller ones through a learnable knowledge token, improving the smaller variants at little extra cost.
ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
The paper presents ViTPose, an approach that applies plain vision transformers to human pose estimation and examines what such models can achieve without domain-specific architectural adjustments. ViTPose uses non-hierarchical vision transformers as backbones, paired with a streamlined and efficient decoder for heatmap-based pose estimation.
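To make the decoder's role concrete, below is a minimal PyTorch sketch of the kind of lightweight decoder the paper describes: a couple of deconvolution blocks followed by a prediction layer. The class name, channel widths, and layer counts are illustrative assumptions rather than the authors' released code.

```python
import torch.nn as nn

class SimpleHeatmapDecoder(nn.Module):
    """Sketch of a lightweight decoder: upsample ViT features, predict keypoint heatmaps."""

    def __init__(self, in_channels=768, num_keypoints=17):
        super().__init__()
        # Two deconvolution blocks upsample the 1/16-resolution feature map by 4x.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
        )
        # A 1x1 convolution maps the upsampled features to one heatmap per keypoint.
        self.head = nn.Conv2d(256, num_keypoints, kernel_size=1)

    def forward(self, feats):
        # feats: (B, C, H/16, W/16), i.e. the ViT patch tokens reshaped into a 2D grid.
        return self.head(self.upsample(feats))
```

Because the backbone is a plain ViT, the decoder only ever sees a single-scale feature map; the paper also reports that an even simpler bilinear-upsampling head works with only a modest drop in accuracy.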
The ViTPose model demonstrates several noteworthy attributes: simplicity, scalability, flexibility, and transferability. Using plain vision transformers keeps the structural complexity low and requires only minimal task-specific adjustments. Scalability is evident in the model's ability to grow from about 100 million to 1 billion parameters while maintaining high parallelism, which gives it a favorable trade-off between throughput and accuracy.
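For a sense of scale, the backbones behind the 100M-to-1B range follow the standard ViT recipes; the rough parameter counts below are approximate, and the dictionary format is purely illustrative.

```python
# Approximate backbone configurations behind the 100M-to-1B scaling claim (illustrative).
BACKBONE_CONFIGS = {
    "ViTPose-B": {"depth": 12, "embed_dim": 768,  "heads": 12, "approx_params": "86M"},
    "ViTPose-L": {"depth": 24, "embed_dim": 1024, "heads": 16, "approx_params": "307M"},
    "ViTPose-H": {"depth": 32, "embed_dim": 1280, "heads": 16, "approx_params": "632M"},
    "ViTPose-G": {"backbone": "ViTAE-G", "approx_params": "~1B"},
}
```

Scaling up is therefore mostly a matter of swapping the backbone; the decoder and training recipe stay essentially unchanged.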
Key experimental highlights include ViTPose reaching 80.9 AP on the MS COCO test-dev set with its largest variant, surpassing prior published methods at the time. Another standout result is that smaller models can be improved by transferring knowledge from larger variants through a learnable knowledge token, sketched below.
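The paper describes this transfer as a two-stage procedure: a learnable token is first optimized against a frozen large model, then appended to the smaller model's tokens and kept frozen during its training. The sketch below illustrates the first stage; the teacher's `extra_tokens` interface, the loss choice, and the training-loop details are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def learn_knowledge_token(teacher, loader, embed_dim=1024, steps=1000, lr=1e-3, device="cuda"):
    """Stage 1 (sketch): optimize a single learnable token against a frozen teacher model."""
    token = torch.zeros(1, 1, embed_dim, device=device, requires_grad=True)
    torch.nn.init.trunc_normal_(token, std=0.02)
    for p in teacher.parameters():          # the large model stays frozen
        p.requires_grad_(False)
    opt = torch.optim.AdamW([token], lr=lr)
    for _, (images, gt_heatmaps) in zip(range(steps), loader):
        # The teacher is assumed to accept extra tokens alongside its patch tokens
        # and to return keypoint heatmaps; only the token receives gradients.
        pred = teacher(images.to(device), extra_tokens=token.expand(images.size(0), -1, -1))
        loss = F.mse_loss(pred, gt_heatmaps.to(device))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return token.detach()

# Stage 2 (sketch): concatenate the learned token to the smaller model's patch tokens
# and keep it frozen while the smaller model is trained with the usual pose loss.
```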
Experimental Insights
The paper elaborates on ViTPose's adaptability across multiple facets:
- Pre-Training Data Flexibility: ViTPose showcases its versatility by yielding competitive results even when pre-trained on datasets considerably smaller than ImageNet, such as MS COCO.
- Resolution and Attention Mechanism Adaptability: Performance improves as the input resolution grows (higher resolutions are commonly handled by resizing the position embeddings, as sketched after this list), and the backbone can use full, windowed, or mixed attention to reduce memory cost with little loss in accuracy.
- Partial Finetuning and Multi-Dataset Training: Finetuning only part of the backbone, for example keeping the attention modules frozen (see the freezing sketch after this list), retains competitive accuracy while cutting the number of trainable parameters, and joint training on multiple pose datasets further improves results.
- Transferability: The paper underscores the potential of using a learnable knowledge token for efficient transfer learning, a novel approach meriting further exploration.
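On the resolution point, a common recipe for adapting a plain ViT to a larger input size is to interpolate its position embeddings to the new patch grid. The helper below is a generic sketch of that recipe with illustrative tensor shapes; it is not taken from the ViTPose codebase.

```python
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Resize ViT position embeddings to a new patch grid (generic sketch).

    pos_embed: (1, old_h * old_w, C) embeddings for the patch tokens; any class
    token is assumed to have been stripped beforehand.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    channels = pos_embed.shape[-1]
    # (1, N, C) -> (1, C, H, W) so the embeddings can be interpolated spatially.
    grid = pos_embed.reshape(1, old_h, old_w, channels).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # Back to (1, new_h * new_w, C) for the transformer.
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, channels)
```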
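For partial finetuning, the paper's observation that freezing the attention modules and updating only the feed-forward layers preserves most of the accuracy can be expressed as a simple parameter filter. The attribute name matched below (`.attn.`) follows common timm-style ViT block naming and is an assumption about the backbone implementation.

```python
def freeze_attention_modules(vit_backbone):
    """Partial finetuning sketch: freeze self-attention weights, keep FFN layers trainable."""
    for name, param in vit_backbone.named_parameters():
        if ".attn." in name:
            param.requires_grad_(False)   # freeze multi-head self-attention weights
        # patch embedding, FFN (`.mlp.`) layers, norms, and the decoder remain trainable
```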
Comparative Analysis
ViTPose is compared against strong prior methods, including convolutional networks and other recent transformer-based pose estimators. It achieves superior results on standard benchmarks such as MS COCO and MPII without elaborate architectural changes.
Theoretical and Practical Implications
The implications of ViTPose are multi-faceted. Theoretically, it questions whether domain-specific architectural modifications are necessary for transformer-based pose estimation. Practically, the work provides a new baseline that combines simplicity with strong performance, paving the way for further scaling of vision transformer architectures across diverse visual tasks.
Future Directions
This paper opens avenues for research into how plain vision transformers can be optimized for tasks beyond pose estimation. Exploring richer token-interaction mechanisms and making further use of combined datasets could improve the versatility of transformers in real-world applications.
Overall, ViTPose is poised to influence future developments in both model architecture simplicity and enhanced performance capabilities within the field of computer vision.