
EViTPose: Efficient ViT 2D Pose Estimation

Updated 11 May 2026
  • The paper introduces a novel transformer-based architecture that integrates patch tokens with learnable joint tokens to enable selective patch processing.
  • It achieves significant FLOPs reduction (30–44%) with minimal performance loss by dynamically pruning less salient patches based on joint-token attention.
  • A unified skeletal representation across datasets supports robust cross-dataset generalization and offers flexible accuracy–efficiency trade-offs.

EViTPose is a ViT-based framework for efficient, accurate, and robust 2D human pose estimation that introduces learnable joint tokens for selective patch processing, enabling a substantial reduction in computational complexity while maintaining high accuracy. It provides a flexible architecture to trade off between inference cost and performance and supports cross-dataset generalization through a unified skeletal representation. EViTPose advances pose estimation by leveraging transformer-based global context along with tailored attention and decoding strategies to outperform or match leading alternatives across multiple human pose benchmarks (Kinfu et al., 28 Feb 2025).

1. Architectural Foundations

EViTPose builds upon the Vision Transformer (ViT) paradigm, diverging from conventional full-attention token processing by incorporating a dual-token architecture:

  • Patch Tokens: An input image $X \in \mathbb{R}^{H \times W \times 3}$ is divided into non-overlapping $16 \times 16$ patches, each flattened and projected to a $C$-dimensional embedding, with positional encodings added as in standard ViT.
  • Joint Tokens: For each target anatomical joint, a learnable token is randomly initialized and trained alongside the network, giving $\mathrm{J} \in \mathbb{R}^{J \times C}$ for $J$ joints. This design explicitly encodes pose-structural priors, as each joint token aggregates features and attention pertaining solely to its corresponding landmark.
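
As a rough illustration of the token budget, the sketch below counts patch tokens for a 256×192 input (the COCO resolution quoted later in this article) and appends the joint tokens. The function name `token_counts` and the choice of 17 joints (the number of COCO keypoints) are assumptions made for the example, not values fixed by the paper.

```python
def token_counts(height, width, patch=16, num_joints=17):
    """Return (N, N + J) for a non-overlapping patch x patch tiling.

    N is the number of patch tokens, J the number of joint tokens.
    num_joints=17 is an illustrative default, not a value from the paper.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    n_patches = (height // patch) * (width // patch)
    return n_patches, n_patches + num_joints


n, seq_len = token_counts(256, 192)
print(n, seq_len)  # 192 209
```

With these numbers the encoder sees 209 tokens per block, of which only the 192 patch tokens are candidates for pruning.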

The transformer encoder is applied to the concatenated set $[\mathrm{P}; \mathrm{J}]$, with each block comprising multi-head self-attention (MSA) and a two-layer MLP with GELU activation. Decoding is conducted via two heads: a heatmap decoder (a small CNN that outputs per-joint Gaussian maps) and a keypoint regressor (LayerNorm plus a fully connected layer) that maps each joint token to keypoint coordinates $(x_j, y_j)$.
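
The keypoint regressor described above can be sketched in a few lines. This is a minimal, dependency-free illustration, assuming a LayerNorm without learnable scale and shift and a single linear layer with a 2 × C weight matrix; the names `layer_norm` and `regress_keypoint` and all numeric values are hypothetical, and the paper's actual head may differ in detail.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no affine terms)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def regress_keypoint(joint_token, weight, bias):
    """Map one C-dimensional joint token to (x, y).

    weight is a 2 x C list of lists and bias has length 2; both are
    placeholders for learned parameters.
    """
    h = layer_norm(joint_token)
    return tuple(
        sum(w * v for w, v in zip(row, h)) + b
        for row, b in zip(weight, bias)
    )


# Toy values, chosen only to exercise the code path.
token = [0.5, -1.0, 2.0, 0.0]
W = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]
b = [0.5, 0.5]
x, y = regress_keypoint(token, W, b)
print(x, y)
```

Because each joint token is decoded independently, the regressor's cost does not depend on how many patches were kept in earlier blocks.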

2. Joint Token Selective Patch Processing

The central innovation in EViTPose is selective patch processing based on joint-token attention. Rather than propagating all N spatial patches through every transformer block (at $O((N+J)^2)$ cost per block), EViTPose dynamically selects the most salient L patches ($L \ll N$) at each layer, guided by joint-token attention. The patch selection mechanism operates as follows:

  • At each transformer block (except the last), the full attention matrix $A \in \mathbb{R}^{(N+J) \times (N+J)}$ is computed.
  • For each patch $l$, the mean attention from all joint tokens is aggregated:

$$W_l = \frac{1}{J} \sum_{j=1}^{J} A_{N+j,\,l}$$

  • The importance score for each patch combines the mean joint-token attention with the norm of the patch feature.
  • The top-L patches with the highest importance scores are retained for the next block; the others are pruned but refined using cross-attention with joint tokens for use in the heatmap decoder.

This process is repeated blockwise, with the final transformer block restoring the full set of tokens for maximum accuracy.
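
The selection step can be sketched as follows. This is a simplified illustration under stated assumptions: the attention matrix is taken as already averaged over heads, its first N rows and columns are patches and its last J are joint tokens, and patches are ranked by the mean joint-token attention alone (the patch feature norm that also enters the importance score is omitted). The function name `select_patches` is hypothetical.

```python
def select_patches(attn, n_patches, n_joints, keep):
    """Return (kept_indices, scores) for joint-token-guided patch selection.

    attn: (N+J) x (N+J) attention matrix as a list of lists, with patches
          first and joint tokens last (an assumed layout).
    keep: number of patches L to retain.
    """
    size = n_patches + n_joints
    if len(attn) != size or any(len(row) != size for row in attn):
        raise ValueError("attention matrix must be (N+J) x (N+J)")
    joint_rows = attn[n_patches:]
    # Mean attention each patch receives from the J joint tokens.
    scores = [
        sum(row[l] for row in joint_rows) / n_joints
        for l in range(n_patches)
    ]
    ranked = sorted(range(n_patches), key=lambda l: scores[l], reverse=True)
    return sorted(ranked[:keep]), scores


# Toy example: 4 patches, 2 joint tokens; patch rows are uniform.
uniform = [1 / 6] * 6
attn = [
    uniform, uniform, uniform, uniform,
    [0.10, 0.40, 0.10, 0.20, 0.10, 0.10],  # joint token 1
    [0.05, 0.50, 0.05, 0.30, 0.05, 0.05],  # joint token 2
]
kept, scores = select_patches(attn, n_patches=4, n_joints=2, keep=2)
print(kept)  # [1, 3]
```

The kept indices are returned in spatial order so that positional information stays aligned when the surviving tokens are gathered for the next block.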

| Variant | Patch Selection Method | GFLOPs | COCO mAP |
|---|---|---|---|
| EViTPose-B/JT | Joint tokens (default) | 13.7 | 76.5 |
| EViTPose-B/S | Skeleton intersections | 13.3 | 75.0 |
| EViTPose-B/N | Neighbor-based | 11.1 | 74.1 |
| EViTPose-B (full) | None (all patches) | 19.8 | 77.6 |

When L ≪ N and J ≪ N, the attention cost per block drops from $O((N+J)^2)$ to $O((L+J)^2)$, achieving a 30–44% FLOPs reduction with a penalty of at most 3.5 mAP across leading benchmarks.
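
The reported range can be checked directly against the variant figures in this section. The short calculation below uses only the GFLOPs and COCO mAP values quoted in this article; the helper name `savings` is an illustrative choice.

```python
# (name, GFLOPs, COCO mAP) as reported in this article.
FULL = ("EViTPose-B (full)", 19.8, 77.6)
PRUNED = [
    ("EViTPose-B/JT", 13.7, 76.5),
    ("EViTPose-B/S", 13.3, 75.0),
    ("EViTPose-B/N", 11.1, 74.1),
]


def savings(variant, full=FULL):
    """Return (name, % FLOPs reduction, mAP drop) relative to the full model."""
    name, gflops, m_ap = variant
    reduction = 100.0 * (full[1] - gflops) / full[1]
    drop = full[2] - m_ap
    return name, round(reduction, 1), round(drop, 1)


for variant in PRUNED:
    print(savings(variant))
# ('EViTPose-B/JT', 30.8, 1.1)
# ('EViTPose-B/S', 32.8, 2.6)
# ('EViTPose-B/N', 43.9, 3.5)
```

The joint-token variant sits at the low end of the savings range but also has the smallest accuracy drop, which is the trade-off the section describes.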

3. Unified Skeleton Representation and Cross-Dataset Training

EViTPose adopts a unified skeletal representation, aggregating all unique joint labels across multiple datasets (COCO, MPII, AI-Challenger, JRDB, CrowdPose), resulting in J ≈ 20–21 joints. The model is trained with per-joint weighted losses to accommodate missing annotations:

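  • Regression loss: the coordinate error of each joint is scaled by a per-joint weight.
  • Heatmap loss: the heatmap error of each joint is scaled by a per-joint weight in the same way.
  • Total loss: the regression and heatmap terms are combined into a single training objective.

The per-joint weight is zero if the joint is not labeled in the dataset.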

Unified training facilitates robustness to occlusion, scale, and illumination variation, yielding improved cross-benchmark performance. For example, unified training boosts COCO mAP from 76.1 to 78.0.
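
One way to picture the unified skeleton and the handling of missing annotations is sketched below. It is a toy illustration under explicit assumptions: the joint names and the two "datasets" are invented, the regression error is taken to be a squared distance, and the loss is normalized by the number of labeled joints. None of these choices is confirmed by the article; only the idea of zeroing out unlabeled joints comes from the text.

```python
def unified_joints(datasets):
    """Union of joint names across datasets, in first-seen order."""
    seen = []
    for names in datasets.values():
        for name in names:
            if name not in seen:
                seen.append(name)
    return seen


def joint_weights(unified, labeled):
    """1.0 for joints the dataset annotates, 0.0 for the rest."""
    return [1.0 if name in labeled else 0.0 for name in unified]


def masked_regression_loss(pred, target, weights):
    """Weighted mean squared distance; unlabeled joints contribute nothing.

    Squared error and the normalization are assumptions of this sketch.
    """
    total = sum(
        w * ((px - tx) ** 2 + (py - ty) ** 2)
        for (px, py), (tx, ty), w in zip(pred, target, weights)
    )
    norm = sum(weights)
    return total / norm if norm else 0.0


# Hypothetical label sets, not the real dataset definitions.
toy = {"A": ["nose", "left_wrist"], "B": ["head_top", "left_wrist"]}
joints = unified_joints(toy)            # ['nose', 'left_wrist', 'head_top']
w_b = joint_weights(joints, toy["B"])   # [0.0, 1.0, 1.0]

pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
target = [(9.0, 9.0), (1.0, 2.0), (2.0, 2.0)]
print(masked_regression_loss(pred, target, w_b))  # 0.5
```

The large error on the first joint has no effect on the loss because dataset "B" does not annotate it, which is the behavior a zero per-joint weight is meant to produce.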

4. Efficiency–Accuracy Trade-Offs and Comparative Analysis

EViTPose offers explicit control over the accuracy–efficiency frontier via the L parameter. Main metrics include:

  • On MS-COCO (256×192): Full EViTPose-B achieves 77.6 mAP at 19.8 GFLOPs; pruning to 80 patches (EViTPose-B/JT) reduces cost to 13.7 GFLOPs with only 1.1 mAP loss.
  • On OCHuman: 19.8→13.7 GFLOPs (–31%) at 93.0→92.3 mAP (–0.7).
  • Varying L from 50 to 150 yields a monotonic trade-off curve for mAP vs. FLOPs.
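
In practice this trade-off is used to pick an operating point under a compute budget. The sketch below does so over the four EViTPose-B configurations whose GFLOPs and COCO mAP are reported in this article, since per-L values are not listed here; the function name `best_under_budget` is hypothetical.

```python
# name -> (GFLOPs, COCO mAP), as reported in this article.
OPERATING_POINTS = {
    "EViTPose-B (full)": (19.8, 77.6),
    "EViTPose-B/JT": (13.7, 76.5),
    "EViTPose-B/S": (13.3, 75.0),
    "EViTPose-B/N": (11.1, 74.1),
}


def best_under_budget(budget_gflops, points=OPERATING_POINTS):
    """Return the most accurate configuration within the budget, or None."""
    feasible = {
        name: stats for name, stats in points.items()
        if stats[0] <= budget_gflops
    }
    if not feasible:
        return None
    return max(feasible, key=lambda name: feasible[name][1])


print(best_under_budget(14.0))  # EViTPose-B/JT
print(best_under_budget(12.0))  # EViTPose-B/N
print(best_under_budget(10.0))  # None
```

The same lookup would apply to a sweep over L if the corresponding measurements were available.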

The choice of decoder also affects the speed–accuracy trade-off:

  • Heatmap CNN decoder achieves 78.0 AP at 18.1 GFLOPs (COCO).
  • Pixel-shuffle decoder produces 76.2 AP at 10.0 GFLOPs.

5. Experimental Benchmarks and Ablations

EViTPose performance was established on six human pose estimation benchmarks, with direct comparisons to competitive methods:

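| Model | GFLOPs | Params | COCO AP | MPII PCKh | JRDB AP | CrowdPose AP | OCHuman AP |
|---|---|---|---|---|---|---|---|
| HRNet-W48 | 14.6 | 64 M | 75.1 | 90.1 | 42.4 | – | – |
| ViTPose-B | 18.0 | 90 M | 77.1 | 93.3 | – | 32.0 | 87.3 |
| EViTPose-B | 19.8 | 90 M | 77.6 | 92.4 | 73.9 | 36.6 | 93.0 |
| EViTPose-B/JT | 13.7 | 90 M | 76.5 | 92.5 | 73.9 | 35.0 | 92.3 |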

Comparison with alternative selection strategies showed that joint-token selection gives the best accuracy/FLOPs trade-off among the methods tested. Ablations on local–global attention (used in UniTransPose) and on decoder type further characterized how each component affects accuracy and cost.

6. Relation to UniTransPose and Prior ViT-Based Estimators

While EViTPose targets patch selection and computational scaling, UniTransPose, introduced in the same work, employs Joint-Aware Global-Local (JAGL) attention to capture local and global context at multiple scales. JAGL divides attention heads between local stripe-wise and global joint-based mechanisms, whereas EViTPose reduces cost through adaptive patch pruning. JAGL achieves higher accuracy (78.0 AP on COCO) but offers less direct control over the cost–performance trade-off. EViTPose's explicit selection process differentiates it from prior ViT-based pose estimators, which typically do not exploit learnable joint tokens for active patch selection (Kinfu et al., 28 Feb 2025).

7. Limitations and Prospective Extensions

EViTPose is designed for static frame-based pose estimation and does not, in its default configuration, model temporal or video-based context. The approach assumes annotated joint locations are available for training and relies on the alignment of joint labels across datasets, with per-joint weights mitigating mismatches. A plausible implication is that future extensions could generalize the joint-token mechanism to temporal transformers or video-based architectures, leveraging sequential dependencies for further accuracy or robustness gains.

The methodology opens opportunities for broader applications in pose estimation domains where computational efficiency and patch-level interpretability are paramount and establishes a template for learnable, dynamic attention targeting anatomically meaningful regions in transformer networks for structured prediction tasks (Kinfu et al., 28 Feb 2025).
