ViTPose-S: Efficient Pose Estimation
- The paper introduces ViTPose-S, a model using a plain ViT backbone and a lightweight decoder to achieve 75.8 AP at 944 fps on COCO.
- ViTPose-S eliminates hand-crafted priors, demonstrating that vanilla transformer architectures can effectively capture spatial pose information.
- The approach utilizes MAE pretraining and advanced fine-tuning protocols, ensuring scalability, flexibility, and high inference speed.
ViTPose-S denotes the small-scale configuration of the ViTPose framework, a pure vision transformer (ViT)-based baseline designed for human pose estimation. Eschewing hand-crafted domain priors, ViTPose-S uses a plain, non-hierarchical vision transformer as its backbone and a lightweight, task-agnostic decoder. It achieves a strong balance between model efficiency, inference speed, and pose estimation accuracy, establishing a new Pareto front compared to prior approaches, and demonstrates the scalability, flexibility, and transferability of vanilla transformer architectures for pose estimation on benchmark datasets (Xu et al., 2022).
1. Model Architecture and Configuration
ViTPose-S employs a standard ViT-B backbone without pose-specific inductive biases. The architecture is composed as follows:
- Backbone: Plain Vision Transformer (ViT-B)
- Number of transformer layers: 12
- Token (hidden) dimension: 768
- Multi-head self-attention (MHSA) heads: 12
- Feed-forward network (MLP) dimension: 3072
- Patch size: $16 \times 16$ (downsampling ratio $1/16$)
- Total parameters: ~86 million
- Decoder: Two-Step Lightweight Head
- Receives the last transformer layer's output, reshaped into a 2D feature map
- Two consecutive deconvolution blocks:
- Each: Transposed convolution (stride 2, kernel size 4), BatchNorm, ReLU
- Recovers spatial resolution: from $1/16$ to $1/4$ of the input size
- Final $1 \times 1$ convolution maps to $K$ keypoint heatmaps ($K = 17$ for COCO)
- Output: heatmaps of shape $K \times H/4 \times W/4$, i.e., $17 \times 64 \times 48$ for a $256 \times 192$ input
All ViTPose models employ an identical decoder head; the decoder's simplicity highlights the strength of the backbone features. A further simplified decoder (single-step bilinear upscaling plus a $3 \times 3$ convolution) yields less than a $0.3$ Average Precision (AP) drop.
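The resolution bookkeeping above can be checked with a few lines of shape arithmetic. The sketch below (plain Python, no framework) assumes the standard $256 \times 192$ COCO input; the function name is illustrative, not from the paper:

```python
def vitpose_shapes(img_h=256, img_w=192, patch=16, deconv_blocks=2, keypoints=17):
    """Trace spatial sizes through the ViTPose-S pipeline (shape arithmetic only)."""
    # Patch embedding: each 16x16 patch becomes one token -> 1/16 feature map.
    feat_h, feat_w = img_h // patch, img_w // patch
    num_tokens = feat_h * feat_w
    # Each deconv block (stride-2 transposed conv) doubles spatial resolution.
    out_h, out_w = feat_h * 2 ** deconv_blocks, feat_w * 2 ** deconv_blocks
    # Final 1x1 conv maps channels to K keypoint heatmaps.
    return num_tokens, (keypoints, out_h, out_w)

tokens, heatmap = vitpose_shapes()
print(tokens, heatmap)  # 192 (17, 64, 48)
```

The $17 \times 64 \times 48$ output matches the standard quarter-resolution heatmap convention for COCO top-down pose estimation.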
| Component | Specification | Value |
|---|---|---|
| Transformer | Layers, Hidden, Heads, MLP | 12, 768, 12, 3072 |
| Patch Size | Downsampling | $16 \times 16$, $1/16$ |
| Decoder | Deconv Blocks, Output | 2, Heatmaps |
| Parameters | Model Size | 86M |
| Inference FPS | NVIDIA A100, $256 \times 192$ input | 944 |
2. Training Regime and Optimization
ViTPose-S is pretrained and fine-tuned as follows:
- Pretraining: Initialize transformer with Masked Autoencoder (MAE) weights pretrained on ImageNet-1K.
- Finetuning and Task-Specific Training:
- Dataset: COCO Keypoints
- Input size: $256 \times 192$
- Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay $0.1$)
- Learning rate: $5 \times 10^{-4}$ base, decayed by a factor of 10 at epochs 170 and 200
- Layer-wise learning-rate decay: $0.75$
- Stochastic drop-path rate: $0.30$
- Batch size: $512$
- Training epochs: $210$
- Data augmentation: Scale, rotation, flip (MMPose default)
- Postprocessing: UDP coordinate decoding
This setup preserves the generality of the transformer backbone and maximizes flexibility regarding pretraining sources and finetuning recipes.
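The layer-wise learning-rate decay can be made concrete with a short sketch. The grouping below (patch embedding as the shallowest group, decoder head at the full base rate) follows the common BEiT/MAE-style convention; the exact grouping in the official implementation is an assumption here, and the base rate of $5 \times 10^{-4}$ is the standard ViTPose recipe value:

```python
def layerwise_lrs(base_lr=5e-4, decay=0.75, num_layers=12):
    """Per-parameter-group learning rates under layer-wise lr decay.

    Group 0 is the patch embedding, groups 1..num_layers are the transformer
    blocks, and the last group (the decoder head) trains at the full base_lr.
    """
    total = num_layers + 2
    return [base_lr * decay ** (total - 1 - i) for i in range(total)]

lrs = layerwise_lrs()
# The head keeps the base rate; the patch embedding is scaled by 0.75**13,
# so early layers move very slowly and mostly preserve the MAE pretraining.
```

In practice each group would be passed to AdamW as a separate parameter group with its own `lr`.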
3. Inference, Throughput, and Decoder Analysis
Evaluation is conducted with $256 \times 192$ inputs at inference and achieves approximately 944 frames per second on an NVIDIA A100 GPU, highlighting significant efficiency.
The two-block decoder is empirically justified:
- Consistent performance across all ViTPose configurations (S, B, L, H, G)
- "Single-step" decoder (bilinear upsampling + $3 \times 3$ conv): less than $0.3$ AP drop, suggesting transformer-extracted features are inherently suitable for keypoint localization
Inference is performed without architecture specialization for model scale, further demonstrating the backbone’s adaptability.
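To see why the decoder choice barely matters for capacity, a rough weight count for the two heads is instructive. The intermediate channel width of 256 in the deconvolution head is an assumption (the standard SimpleBaseline-style value), not stated in the text, and bias/BatchNorm parameters are omitted:

```python
def two_deconv_head_params(d=768, ch=256, k=17):
    # Two 4x4 stride-2 transposed convs (d -> ch -> ch), then a 1x1 conv
    # to K heatmaps; weights only, BatchNorm/bias terms omitted.
    return d * ch * 4 * 4 + ch * ch * 4 * 4 + ch * k

def single_step_head_params(d=768, k=17):
    # Bilinear upsampling is parameter-free; only the 3x3 conv has weights.
    return d * k * 3 * 3

print(two_deconv_head_params(), single_step_head_params())  # 4198656 117504
```

The single-step head has roughly 35x fewer weights, yet costs under 0.3 AP, which supports the claim that the backbone carries nearly all the representational load.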
4. Performance Benchmarks
On the MS COCO Human Keypoint Detection benchmark, ViTPose-S achieves the following results:
| Metric | Value |
|---|---|
| AP | 75.8 |
| AP$_{50}$ | 90.7 |
| AR | 81.1 |
| AR$_{50}$ | 94.6 |
| Throughput | 944 fps |
Comparative context:
- SimpleBaseline (ResNet-152): 73.5 AP @ 829 fps
- HRFormer-B (43M): 75.6 AP @ 158 fps
ViTPose-S outperforms or matches prior small- and efficient-model baselines in accuracy and speed (Xu et al., 2022). This establishes a new Pareto front for throughput and accuracy in the pose estimation literature for transformers.
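A Pareto front here means no listed model beats ViTPose-S on both accuracy and speed simultaneously. A minimal check over the figures quoted above (the helper function is illustrative):

```python
def pareto_front(models):
    """Return names of models not dominated in both AP and throughput (fps)."""
    front = []
    for name, ap, fps in models:
        dominated = any(
            (ap2 >= ap and fps2 >= fps) and (ap2 > ap or fps2 > fps)
            for _, ap2, fps2 in models
        )
        if not dominated:
            front.append(name)
    return front

models = [
    ("ViTPose-S", 75.8, 944),
    ("SimpleBaseline-R152", 73.5, 829),
    ("HRFormer-B", 75.6, 158),
]
print(pareto_front(models))  # ['ViTPose-S']
```

Among these three, ViTPose-S strictly dominates both baselines, so it is the entire front.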
5. Key Mechanisms and Equations
The model’s core computations:
- Transformer Layer Update (pre-norm):
  $X' = X + \mathrm{MHSA}(\mathrm{LN}(X))$, $\quad X'' = X' + \mathrm{FFN}(\mathrm{LN}(X'))$,
  where $X \in \mathbb{R}^{N \times d}$ is the token sequence, $N = (H/16) \times (W/16)$, and $d = 768$.
- Decoder Head (Two-stage): $K = \mathrm{Conv}_{1 \times 1}(\mathrm{Deconv}_2(\mathrm{Deconv}_1(F)))$, where $F$ is the reshaped backbone feature map and $K$ the predicted heatmaps.
- Decoder Head (Single-step): $K = \mathrm{Conv}_{3 \times 3}(\mathrm{Bilinear}_{\times 4}(F))$
- Knowledge Token Distillation:
  $t^* = \arg\min_{t} \mathrm{MSE}(T(\{t; X\}), K_{gt})$, $\quad \mathcal{L}_{td} = \mathrm{MSE}(S(\{t^*; X\}), K_{gt})$
  - Teacher $T$, student $S$, ground-truth heatmaps $K_{gt}$; the learned token $t^*$ is frozen and appended to the student's input.
  - Optionally, combined with output distillation: $\mathcal{L} = \mathcal{L}_{td} + \mathrm{MSE}(S(X), T(X))$.
These formulations underscore the backbone’s modularity and the decoder’s minimalism.
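The pre-norm layer update can be rendered in a few lines of NumPy. This is a toy sketch with made-up dimensions (the real model uses $d = 768$, 12 heads, and GELU rather than the ReLU used here); all function and variable names are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def mhsa(x, wq, wk, wv, wo, heads):
    # Multi-head self-attention over n tokens of width d.
    n, d = x.shape
    dh = d // heads
    q, k, v = [(x @ w).reshape(n, heads, dh).transpose(1, 0, 2) for w in (wq, wk, wv)]
    att = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)                       # softmax over keys
    return (att @ v).transpose(1, 0, 2).reshape(n, d) @ wo  # merge heads, project

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU stand-in for GELU

def vit_layer(x, params, heads):
    x = x + mhsa(layer_norm(x), *params["attn"], heads)  # X' = X + MHSA(LN(X))
    x = x + ffn(layer_norm(x), *params["ffn"])           # X'' = X' + FFN(LN(X'))
    return x

# Toy smoke run: 6 tokens, width 24, 4 heads.
rng = np.random.default_rng(0)
d, n, heads = 24, 6, 4
params = {
    "attn": [rng.standard_normal((d, d)) * 0.1 for _ in range(4)],
    "ffn": (rng.standard_normal((d, 4 * d)) * 0.1,
            rng.standard_normal((4 * d, d)) * 0.1),
}
y = vit_layer(rng.standard_normal((n, d)), params, heads)
```

The residual structure is what makes the backbone "plain": stacking 12 such layers, with no pose-specific modules, is the entire feature extractor.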
6. Implications and Significance
ViTPose-S illustrates that a straightforward, unmodified ViT backbone suffices for state-of-the-art human pose estimation when paired with a generic, lightweight decoder and modern training protocols. The architecture demonstrates:
- No domain-specific inductive bias is required for strong keypoint localization performance.
- Scaling up or down preserves architectural regularity; the same backbone and head can address diverse pose estimation settings.
- Pretrained transformer representations, particularly from self-supervised MAE on large datasets, transfer well to vision tasks requiring spatial detail.
A plausible implication is that future work in pose estimation can further leverage plain transformer backbones and focus optimization efforts on data, pretraining, and transfer strategies, rather than on architectural specialization (Xu et al., 2022).