
ViTPose-S: Efficient Pose Estimation

Updated 27 January 2026
  • The paper introduces ViTPose-S, a model using a plain ViT backbone and lightweight decoder to achieve 75.8 AP with 944 fps on COCO.
  • ViTPose-S eliminates hand-crafted priors, demonstrating that vanilla transformer architectures can effectively capture spatial pose information.
  • The approach utilizes MAE pretraining and advanced fine-tuning protocols, ensuring scalability, flexibility, and high inference speed.

ViTPose-S denotes the small-scale configuration of the ViTPose framework, a pure vision transformer (ViT)-based baseline designed for human pose estimation. Eschewing hand-crafted domain priors, ViTPose-S utilizes a plain, non-hierarchical vision transformer as its backbone and a lightweight, task-agnostic decoder. It achieves a strong balance between model efficiency, inference speed, and pose estimation accuracy, establishing a new Pareto front compared to prior approaches. The approach demonstrates the scalability, flexibility, and transferability of vanilla transformer architectures for pose estimation tasks on benchmark datasets (Xu et al., 2022).

1. Model Architecture and Configuration

ViTPose-S employs a standard ViT-B backbone without pose-specific inductive biases. The architecture is composed as follows:

  • Backbone: Plain Vision Transformer (ViT-B)
    • Number of transformer layers: 12
    • Token (hidden) dimension: $C = 768$
    • Multi-head self-attention (MHSA) heads: 12
    • Feed-forward network (MLP) dimension: $4 \times C = 3072$
    • Patch size: $16 \times 16$ (downsampling ratio $d = 16$)
    • Total parameters: $\approx 86$ million
  • Decoder: Two-Step Lightweight Head
    • Receives the last transformer output $F_{\text{out}} \in \mathbb{R}^{H/d \times W/d \times C}$
    • Two consecutive deconvolution blocks, each consisting of a transposed convolution (stride 2, kernel size 4), BatchNorm, and ReLU
    • Recovers spatial resolution: $H/d \rightarrow H/(d/2) \rightarrow H/(d/4) = H/4$
    • Final $1 \times 1$ convolution maps to $N_k$ keypoint heatmaps ($N_k = 17$ for COCO)
    • Output: $K = \text{Conv}_{1\times 1}\left(\text{Deconv}\left(\text{Deconv}(F_{\text{out}})\right)\right)$, with $K \in \mathbb{R}^{H/4 \times W/4 \times N_k}$

All ViTPose models employ an identical decoder head; the decoder's simplicity highlights the strength of the backbone features. A further simplified decoder (single-step bilinear upsampling plus a $3 \times 3$ convolution) yields a drop of less than $0.3$ Average Precision (AP).
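The decoder's resolution recovery can be checked with simple shape arithmetic. The following sketch traces feature-map sizes through the head using the configuration listed above (the function name and defaults are illustrative):

```python
def decoder_output_shape(H=256, W=192, d=16, n_deconv=2, n_keypoints=17):
    """Trace spatial shapes through the ViTPose-S two-deconv decoder head."""
    # Patch embedding leaves an (H/d) x (W/d) grid of tokens.
    h, w = H // d, W // d          # 16 x 12 for a 256 x 192 input
    # Each stride-2 transposed convolution doubles the spatial resolution.
    for _ in range(n_deconv):
        h, w = 2 * h, 2 * w
    # The final 1x1 convolution maps channels to one heatmap per keypoint.
    return h, w, n_keypoints
```

For the default COCO configuration this gives a $64 \times 48 \times 17$ heatmap tensor, i.e. quarter input resolution, matching $H/4 \times W/4 \times N_k$.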

| Component | Specification | Value |
|---|---|---|
| Transformer | Layers / Hidden dim / Heads / MLP dim | 12 / 768 / 12 / 3072 |
| Patch size | Downsampling | $16 \times 16$ ($d = 16$) |
| Decoder | Deconv blocks / Output | 2 / $N_k$ heatmaps |
| Parameters | Model size | $\sim$86M |
| Inference | FPS (NVIDIA A100, $256 \times 192$) | 944 |

2. Training Regime and Optimization

ViTPose-S is pretrained and fine-tuned as follows:

  • Pretraining: Initialize transformer with Masked Autoencoder (MAE) weights pretrained on ImageNet-1K.
  • Finetuning and Task-Specific Training:
    • Dataset: COCO Keypoints
    • Input size: $256 \times 192$
    • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay $0.1$)
    • Learning rate: $5 \times 10^{-4}$ (base, with 10× decay at epochs 170 and 200)
    • Layer-wise learning-rate decay: $0.75$
    • Stochastic drop-path rate: $0.30$
    • Batch size: $512$
    • Training epochs: $210$
    • Data augmentation: Scale, rotation, flip (MMPose default)
    • Postprocessing: UDP coordinate decoding

This setup preserves the generality of the transformer backbone and maximizes flexibility regarding pretraining sources and finetuning recipes.
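Layer-wise learning-rate decay assigns progressively smaller learning rates to earlier transformer blocks. A minimal sketch of the usual scheme follows; the exact depth indexing used by ViTPose's training code is an assumption here:

```python
def layerwise_lrs(base_lr=5e-4, decay=0.75, n_layers=12):
    """Per-depth learning rates: the head trains at base_lr,
    the patch embedding at the smallest rate."""
    # depth 0 = patch embedding, depths 1..n_layers = transformer blocks,
    # depth n_layers + 1 = decoder head
    return [base_lr * decay ** (n_layers + 1 - i) for i in range(n_layers + 2)]
```

With the paper's settings, the decoder head trains at the full $5 \times 10^{-4}$ while the patch embedding receives $0.75^{13}$ of that, encouraging finetuning to preserve low-level pretrained features.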

3. Inference, Throughput, and Decoder Analysis

Evaluation is conducted with $256 \times 192$ inputs at inference, achieving approximately 944 frames per second on an NVIDIA A100 GPU, highlighting significant efficiency.

The two-block decoder is empirically justified:

  • Consistent performance across all ViTPose configurations (S, L, H, G)
  • “Single-step” decoder (bilinear upsampling + $3 \times 3$ conv): $< 0.3$ AP drop, suggesting transformer-extracted features are inherently suitable for keypoint regression

Inference is performed without architecture specialization for model scale, further demonstrating the backbone’s adaptability.
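The decoder's lightness can be made concrete by counting its weights. Assuming 256 deconvolution channels, as in SimpleBaseline-style heads (an assumption; the paper's exact channel width may differ), the head adds only a few million parameters next to the $\approx$86M backbone:

```python
def decoder_params(c_in=768, c_mid=256, k=4, n_keypoints=17):
    """Weight counts (biases omitted) for a two-deconv heatmap head.
    The 256-channel width is an assumed SimpleBaseline-style default."""
    deconv1 = c_in * c_mid * k * k    # 768 -> 256, 4x4 transposed conv
    deconv2 = c_mid * c_mid * k * k   # 256 -> 256, 4x4 transposed conv
    head = c_mid * n_keypoints        # 1x1 conv to 17 keypoint heatmaps
    return deconv1 + deconv2 + head
```

Under these assumptions the head totals roughly 4.2M weights, about 5% of the backbone's parameter count.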

4. Performance Benchmarks

On the MS COCO Human Keypoint Detection benchmark, ViTPose-S achieves the following results:

| Metric | Value |
|---|---|
| AP | 75.8 |
| AP$_{50}$ | 90.7 |
| AR | 81.1 |
| AR$_{50}$ | 94.6 |
| Throughput | $\approx$944 fps |

Comparative context:

  • SimpleBaseline (ResNet-152): 73.5 AP @ 829 fps
  • HRFormer-B (43M): 75.6 AP @ 158 fps

ViTPose-S outperforms or matches prior small- and efficient-model baselines in accuracy and speed (Xu et al., 2022). This establishes a new Pareto front for throughput and accuracy in the pose estimation literature for transformers.

5. Key Mechanisms and Equations

The model’s core computations:

  • Transformer Layer Update:

$F_{(i+1)}' = F_{i} + \text{MHSA}(\text{LN}(F_{i}))$

$F_{(i+1)} = F_{(i+1)}' + \text{FFN}(\text{LN}(F_{(i+1)}'))$

where $\text{MHSA}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d})V$ and $\text{FFN}(x) = W_2\,\text{GELU}(W_1 x)$.
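The pre-norm update above can be sketched in NumPy. This is a toy single-example version (row-vector convention, illustrative weight shapes), not the model's implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(x, Wq, Wk, Wv, Wo, n_heads):
    """softmax(Q K^T / sqrt(d)) V, computed per head, then mixed by Wo."""
    T, C = x.shape
    d = C // n_heads
    split = lambda W: (x @ W).reshape(T, n_heads, d).transpose(1, 0, 2)
    q, k, v = split(Wq), split(Wk), split(Wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    return (attn @ v).transpose(1, 0, 2).reshape(T, C) @ Wo

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def vit_block(x, Wq, Wk, Wv, Wo, W1, W2, n_heads=12):
    """F' = F + MHSA(LN(F));  F_next = F' + FFN(LN(F'))."""
    x = x + mhsa(layer_norm(x), Wq, Wk, Wv, Wo, n_heads)
    return x + gelu(layer_norm(x) @ W1) @ W2
```

Stacking 12 such blocks over the $16 \times 12$ token grid (flattened to a sequence) reproduces the ViTPose-S backbone computation at a high level.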

  • Decoder Head (Two-stage):

$K = \text{Conv}_{1 \times 1}(\text{Deconv}(\text{Deconv}(F_{\text{out}})))$

  • Decoder Head (Single-step):

$K = \text{Conv}_{3 \times 3}(\text{Bilinear}(\text{ReLU}(F_{\text{out}})))$

  • Knowledge Token Distillation:
    • Teacher $T$, student $S$, ground-truth heatmaps $K_{\text{gt}}$, input tokens $X$.

$t^* = \arg\min_{t} \| T([t; X]) - K_{\text{gt}} \|^2$

$L_{td} = \| S([t^*; X]) - K_{\text{gt}} \|^2$

  • Optionally, combining token and output distillation (with $K_t$ the teacher's predicted heatmaps):

$L_{tod} = \| S([t^*; X]) - K_t \|^2 + \| S([t^*; X]) - K_{\text{gt}} \|^2$

These formulations underscore the backbone’s modularity and the decoder’s minimalism.
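Once the knowledge token $t^*$ is learned, the distillation objectives reduce to squared errors over heatmaps. A minimal sketch, where the reduction to a plain sum over heatmap elements is an illustrative assumption:

```python
import numpy as np

def distill_losses(S_out, K_t, K_gt):
    """L_td and L_tod for a student prediction S_out = S([t*; X]).
    K_t: teacher heatmaps, K_gt: ground-truth heatmaps."""
    l_td = np.sum((S_out - K_gt) ** 2)         # match ground truth
    l_tod = np.sum((S_out - K_t) ** 2) + l_td  # additionally match the teacher
    return l_td, l_tod
```

In practice the losses are computed over $H/4 \times W/4 \times N_k$ heatmap tensors and averaged over the batch.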

6. Implications and Significance

ViTPose-S illustrates that a straightforward, unmodified ViT backbone suffices for state-of-the-art human pose estimation when paired with a generic, lightweight decoder and modern training protocols. The architecture demonstrates:

  • No domain-specific inductive bias is required for strong keypoint localization performance.
  • Scaling up or down preserves architectural regularity; the same backbone and head can address diverse pose estimation settings.
  • Pretrained transformer representations, particularly from self-supervised MAE on large datasets, transfer well to vision tasks requiring spatial detail.

A plausible implication is that future work in pose estimation can further leverage plain transformer backbones and focus optimization efforts on data, pretraining, and transfer strategies, rather than on architectural specialization (Xu et al., 2022).

References

Xu, Y., Zhang, J., Zhang, Q., & Tao, D. (2022). ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. Advances in Neural Information Processing Systems (NeurIPS).
