ViTPose-S: Efficient Pose Estimation
- The paper introduces ViTPose-S, a model using a plain ViT backbone and a lightweight decoder to achieve 75.8 AP at 944 fps on COCO.
- ViTPose-S eliminates hand-crafted priors, demonstrating that vanilla transformer architectures can effectively capture spatial pose information.
- The approach utilizes MAE pretraining and advanced fine-tuning protocols, ensuring scalability, flexibility, and high inference speed.
ViTPose-S denotes the small-scale configuration of the ViTPose framework, a pure vision transformer (ViT)-based baseline designed for human pose estimation. Eschewing hand-crafted domain priors, ViTPose-S uses a plain, non-hierarchical vision transformer as its backbone and a lightweight, task-agnostic decoder. It achieves a strong balance between model efficiency, inference speed, and pose estimation accuracy, establishing a new Pareto front compared to prior approaches, and demonstrates the scalability, flexibility, and transferability of vanilla transformer architectures for pose estimation on benchmark datasets (Xu et al., 2022).
1. Model Architecture and Configuration
ViTPose-S employs a standard ViT-B backbone without pose-specific inductive biases. The architecture is composed as follows:
- Backbone: Plain Vision Transformer (ViT-B)
- Number of transformer layers: 12
- Token (hidden) dimension: 768
- Multi-head self-attention (MHSA) heads: 12
- Feed-forward network (MLP) dimension: 3072
- Patch size: $16 \times 16$ (downsampling ratio $1/16$)
- Total parameters: ~86 million
- Decoder: Two-Step Lightweight Head
- Receives the last transformer layer's output, reshaped into a 2D feature map
- Two consecutive deconvolution blocks:
- Each: Transposed convolution (stride 2, kernel size 4), BatchNorm, ReLU
- Recovers spatial resolution: from $1/16$ to $1/4$ of the input size
- Final $1 \times 1$ convolution maps to $K$ keypoint heatmaps ($K = 17$ for COCO)
- Output: heatmaps of shape $K \times H/4 \times W/4$, i.e., $17 \times 64 \times 48$ for a $256 \times 192$ input
All ViTPose models employ an identical decoder head; the decoder's simplicity highlights the strength of the backbone features. A further simplified decoder (single-step bilinear upscaling plus a $3 \times 3$ convolution) yields less than a $0.3$ Average Precision (AP) drop.
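The resolution bookkeeping above can be checked with a few lines of shape arithmetic. The sketch below (plain Python, no framework) assumes the standard $256 \times 192$ COCO input; the function name is illustrative, not from the paper:

```python
def vitpose_shapes(img_h=256, img_w=192, patch=16, deconv_blocks=2, keypoints=17):
    """Trace spatial sizes through the ViTPose-S pipeline (shape arithmetic only)."""
    # Patch embedding: each 16x16 patch becomes one token -> 1/16 feature map.
    feat_h, feat_w = img_h // patch, img_w // patch
    num_tokens = feat_h * feat_w
    # Each deconv block (stride-2 transposed conv) doubles spatial resolution.
    out_h, out_w = feat_h * 2 ** deconv_blocks, feat_w * 2 ** deconv_blocks
    # Final 1x1 conv maps channels to K keypoint heatmaps.
    return num_tokens, (keypoints, out_h, out_w)

tokens, heatmap = vitpose_shapes()
print(tokens, heatmap)  # 192 (17, 64, 48)
```

The $17 \times 64 \times 48$ output matches the standard quarter-resolution heatmap convention for COCO top-down pose estimation.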
| Component | Specification | Value |
|---|---|---|
| Transformer | Layers, Hidden, Heads, MLP | 12, 768, 12, 3072 |
| Patch Size | Downsampling | $16 \times 16$, $1/16$ |
| Decoder | Deconv Blocks, Output | 2, Heatmaps |
| Parameters | Model Size | 86M |
| Inference FPS | NVIDIA A100, $256 \times 192$ input | 944 |
2. Training Regime and Optimization
ViTPose-S is pretrained and fine-tuned as follows:
- Pretraining: Initialize transformer with Masked Autoencoder (MAE) weights pretrained on ImageNet-1K.
- Finetuning and Task-Specific Training:
- Dataset: COCO Keypoints
- Input size: $256 \times 192$
- Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay $0.1$)
- Learning rate: $5 \times 10^{-4}$ base, decayed by a factor of 10 at epochs 170 and 200
- Layer-wise learning-rate decay: $0.75$
- Stochastic drop-path rate: $0.30$
- Batch size: $512$
- Training epochs: $210$
- Data augmentation: Scale, rotation, flip (MMPose default)
- Postprocessing: UDP coordinate decoding
This setup preserves the generality of the transformer backbone and maximizes flexibility regarding pretraining sources and finetuning recipes.
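The layer-wise learning-rate decay can be made concrete with a short sketch. The grouping below (patch embedding as the shallowest group, decoder head at the full base rate) follows the common BEiT/MAE-style convention; the exact grouping in the official implementation is an assumption here, and the base rate of $5 \times 10^{-4}$ is the standard ViTPose recipe value:

```python
def layerwise_lrs(base_lr=5e-4, decay=0.75, num_layers=12):
    """Per-parameter-group learning rates under layer-wise lr decay.

    Group 0 is the patch embedding, groups 1..num_layers are the transformer
    blocks, and the last group (the decoder head) trains at the full base_lr.
    """
    total = num_layers + 2
    return [base_lr * decay ** (total - 1 - i) for i in range(total)]

lrs = layerwise_lrs()
# The head keeps the base rate; the patch embedding is scaled by 0.75**13,
# so early layers move very slowly and mostly preserve the MAE pretraining.
```

In practice each group would be passed to AdamW as a separate parameter group with its own `lr`.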
3. Inference, Throughput, and Decoder Analysis
Evaluation is conducted with $256 \times 192$ inputs at inference and achieves approximately 944 frames per second on an NVIDIA A100 GPU, highlighting significant efficiency.
The two-block decoder is empirically justified:
- Consistent performance across all ViTPose configurations (S, B, L, H, G)
- "Single-step" decoder (bilinear upsampling + $3 \times 3$ conv): less than $0.3$ AP drop, suggesting transformer-extracted features are inherently suitable for keypoint localization
Inference is performed without architecture specialization for model scale, further demonstrating the backbone’s adaptability.
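To see why the decoder choice barely matters for capacity, a rough weight count for the two heads is instructive. The intermediate channel width of 256 in the deconvolution head is an assumption (the standard SimpleBaseline-style value), not stated in the text, and bias/BatchNorm parameters are omitted:

```python
def two_deconv_head_params(d=768, ch=256, k=17):
    # Two 4x4 stride-2 transposed convs (d -> ch -> ch), then a 1x1 conv
    # to K heatmaps; weights only, BatchNorm/bias terms omitted.
    return d * ch * 4 * 4 + ch * ch * 4 * 4 + ch * k

def single_step_head_params(d=768, k=17):
    # Bilinear upsampling is parameter-free; only the 3x3 conv has weights.
    return d * k * 3 * 3

print(two_deconv_head_params(), single_step_head_params())  # 4198656 117504
```

The single-step head has roughly 35x fewer weights, yet costs under 0.3 AP, which supports the claim that the backbone carries nearly all the representational load.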
4. Performance Benchmarks
On the MS COCO Human Keypoint Detection benchmark, ViTPose-S achieves the following results:
| Metric | Value |
|---|---|
| AP | 75.8 |
| AP$_{50}$ | 90.7 |
| AR | 81.1 |
| AR$_{50}$ | 94.6 |
| Throughput | 944 fps |
Comparative context:
- SimpleBaseline (ResNet-152): 73.5 AP @ 829 fps
- HRFormer-B (43M): 75.6 AP @ 158 fps
ViTPose-S outperforms or matches prior small- and efficient-model baselines in accuracy and speed (Xu et al., 2022). This establishes a new Pareto front for throughput and accuracy in the pose estimation literature for transformers.
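A Pareto front here means no listed model beats ViTPose-S on both accuracy and speed simultaneously. A minimal check over the figures quoted above (the helper function is illustrative):

```python
def pareto_front(models):
    """Return names of models not dominated in both AP and throughput (fps)."""
    front = []
    for name, ap, fps in models:
        dominated = any(
            (ap2 >= ap and fps2 >= fps) and (ap2 > ap or fps2 > fps)
            for _, ap2, fps2 in models
        )
        if not dominated:
            front.append(name)
    return front

models = [
    ("ViTPose-S", 75.8, 944),
    ("SimpleBaseline-R152", 73.5, 829),
    ("HRFormer-B", 75.6, 158),
]
print(pareto_front(models))  # ['ViTPose-S']
```

Among these three, ViTPose-S strictly dominates both baselines, so it is the entire front.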
5. Key Mechanisms and Equations
The model’s core computations:
- Transformer Layer Update (pre-norm):
  $X' = X + \mathrm{MHSA}(\mathrm{LN}(X))$, $\quad X'' = X' + \mathrm{FFN}(\mathrm{LN}(X'))$,
  where $X \in \mathbb{R}^{N \times d}$ is the token sequence, $N = (H/16) \times (W/16)$, and $d = 768$.
- Decoder Head (Two-stage): $K = \mathrm{Conv}_{1 \times 1}(\mathrm{Deconv}_2(\mathrm{Deconv}_1(F)))$, where $F$ is the reshaped backbone feature map and $K$ the predicted heatmaps.
- Decoder Head (Single-step): $K = \mathrm{Conv}_{3 \times 3}(\mathrm{Bilinear}_{\times 4}(F))$
- Knowledge Token Distillation:
  $t^* = \arg\min_{t} \mathrm{MSE}(T(\{t; X\}), K_{gt})$, $\quad \mathcal{L}_{td} = \mathrm{MSE}(S(\{t^*; X\}), K_{gt})$
  - Teacher $T$, student $S$, ground-truth heatmaps $K_{gt}$; the learned token $t^*$ is frozen and appended to the student's input.
  - Optionally, combined with output distillation: $\mathcal{L} = \mathcal{L}_{td} + \mathrm{MSE}(S(X), T(X))$.
These formulations underscore the backbone’s modularity and the decoder’s minimalism.
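The pre-norm layer update can be rendered in a few lines of NumPy. This is a toy sketch with made-up dimensions (the real model uses $d = 768$, 12 heads, and GELU rather than the ReLU used here); all function and variable names are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def mhsa(x, wq, wk, wv, wo, heads):
    # Multi-head self-attention over n tokens of width d.
    n, d = x.shape
    dh = d // heads
    q, k, v = [(x @ w).reshape(n, heads, dh).transpose(1, 0, 2) for w in (wq, wk, wv)]
    att = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)                       # softmax over keys
    return (att @ v).transpose(1, 0, 2).reshape(n, d) @ wo  # merge heads, project

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU stand-in for GELU

def vit_layer(x, params, heads):
    x = x + mhsa(layer_norm(x), *params["attn"], heads)  # X' = X + MHSA(LN(X))
    x = x + ffn(layer_norm(x), *params["ffn"])           # X'' = X' + FFN(LN(X'))
    return x

# Toy smoke run: 6 tokens, width 24, 4 heads.
rng = np.random.default_rng(0)
d, n, heads = 24, 6, 4
params = {
    "attn": [rng.standard_normal((d, d)) * 0.1 for _ in range(4)],
    "ffn": (rng.standard_normal((d, 4 * d)) * 0.1,
            rng.standard_normal((4 * d, d)) * 0.1),
}
y = vit_layer(rng.standard_normal((n, d)), params, heads)
```

The residual structure is what makes the backbone "plain": stacking 12 such layers, with no pose-specific modules, is the entire feature extractor.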
6. Implications and Significance
ViTPose-S illustrates that a straightforward, unmodified ViT backbone suffices for state-of-the-art human pose estimation when paired with a generic, lightweight decoder and modern training protocols. The architecture demonstrates:
- No domain-specific inductive bias is required for strong keypoint localization performance.
- Scaling up or down preserves architectural regularity; the same backbone and head can address diverse pose estimation settings.
- Pretrained transformer representations, particularly from self-supervised MAE on large datasets, transfer well to vision tasks requiring spatial detail.
A plausible implication is that future work in pose estimation can further leverage plain transformer backbones and focus optimization efforts on data, pretraining, and transfer strategies, rather than on architectural specialization (Xu et al., 2022).