EViTPose: Efficient ViT 2D Pose Estimation
- The paper introduces a novel transformer-based architecture that integrates patch tokens with learnable joint tokens to enable selective patch processing.
- It achieves significant FLOPs reduction (30–44%) with minimal performance loss by dynamically pruning less salient patches based on joint-token attention.
- A unified skeletal representation across datasets supports robust cross-dataset generalization and offers flexible accuracy–efficiency trade-offs.
EViTPose is a ViT-based framework for efficient, accurate, and robust 2D human pose estimation that introduces learnable joint tokens for selective patch processing, enabling a substantial reduction in computational complexity while maintaining high accuracy. It provides a flexible architecture to trade off between inference cost and performance and supports cross-dataset generalization through a unified skeletal representation. EViTPose advances pose estimation by leveraging transformer-based global context along with tailored attention and decoding strategies to outperform or match leading alternatives across multiple human pose benchmarks (Kinfu et al., 28 Feb 2025).
1. Architectural Foundations
EViTPose builds upon the Vision Transformer (ViT) paradigm, diverging from conventional full-attention token processing by incorporating a dual-token architecture:
- Patch Tokens: An input image is divided into non-overlapping patches, each flattened and projected to a C-dimensional embedding, with positional encodings added as in standard ViT.
- Joint Tokens: For each target anatomical joint, a learnable token is randomly initialized and trained alongside the network. This design explicitly encodes pose-structural priors, as each joint token aggregates features and attention pertaining solely to its corresponding landmark.
The transformer encoder is applied to the concatenated set of patch tokens and joint tokens, with each block comprising multi-head self-attention (MSA) and a two-layer MLP with GELU activation. Decoding is conducted via two heads: a heatmap decoder (a small CNN that outputs per-joint Gaussian maps) and a keypoint regressor (LayerNorm plus a fully connected layer) that maps each joint token to keypoint coordinates.
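The dual-token input described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `build_token_sequence`, the patch size of 16, the embedding width of 64, and the 17 joints are all assumptions chosen for the example, and the random arrays stand in for parameters that would be learned.

```python
import numpy as np

def build_token_sequence(image, patch, proj, pos_embed, joint_tokens):
    """Embed non-overlapping patches and append the joint tokens.

    image: (H, W, ch), proj: (patch*patch*ch, C), pos_embed: (N, C),
    joint_tokens: (J, C). Returns an (N + J, C) token sequence.
    """
    H, W, ch = image.shape
    gh, gw = H // patch, W // patch
    # Cut the image into a gh x gw grid and flatten each patch into one row.
    patches = (image[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch, ch)
               .transpose(0, 2, 1, 3, 4)
               .reshape(gh * gw, patch * patch * ch))
    patch_tokens = patches @ proj + pos_embed          # (N, C)
    # Joint tokens are placed after the patch tokens (an ordering assumption).
    return np.concatenate([patch_tokens, joint_tokens], axis=0)

rng = np.random.default_rng(0)
H, W, P, C, J = 256, 192, 16, 64, 17                   # illustrative sizes only
N = (H // P) * (W // P)
image = rng.standard_normal((H, W, 3))
proj = rng.standard_normal((P * P * 3, C)) * 0.02
pos_embed = rng.standard_normal((N, C)) * 0.02
joint_tokens = rng.standard_normal((J, C)) * 0.02      # learned in a real model
tokens = build_token_sequence(image, P, proj, pos_embed, joint_tokens)
print(tokens.shape)  # (209, 64)
```

With these assumed sizes a 256×192 input yields 192 patch tokens, so the encoder would see 209 tokens in total before any pruning.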
2. Joint Token Selective Patch Processing
The central innovation in EViTPose is selective patch processing based on joint-token attention. Rather than propagating all N spatial patches through every transformer block (self-attention cost $O(N^2)$ per block), EViTPose dynamically selects the L most salient patches ($L < N$) at each layer, guided by joint-token attention. The patch selection mechanism operates as follows:
- At each transformer block (except the last), the full attention matrix is computed.
- For each patch $i$, the attention it receives from the $J$ joint tokens is averaged: $a_i = \frac{1}{J} \sum_{j=1}^{J} A_{j,i}$, where $A_{j,i}$ is the attention from joint token $j$ to patch $i$.
- The importance score $s_i$ for each patch incorporates both this attention and the patch feature norm.
- The top-L patches with the highest $s_i$ are retained for the next block; the others are pruned but refined using cross-attention with joint tokens for use in the heatmap decoder.
This process is repeated blockwise, with the final transformer block restoring the full set of tokens for maximum accuracy.
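The selection steps above can be sketched for a single block as follows. The source states only that the score incorporates joint-token attention and the patch feature norm; multiplying the two is an assumption made here for illustration, as are the function name `select_patches`, the head-averaged attention matrix, the token ordering (patches first, joint tokens last), and all sizes.

```python
import numpy as np

def select_patches(attn, patch_feats, num_joints, keep):
    """Split patch indices into kept and pruned sets for one block.

    attn: (N + J, N + J) row-normalized attention, joint tokens last.
    patch_feats: (N, C) patch features entering the next block.
    """
    n = patch_feats.shape[0]
    # Mean attention that the J joint tokens (rows) pay to each patch (columns).
    joint_to_patch = attn[n:n + num_joints, :n].mean(axis=0)      # (N,)
    # ASSUMPTION: attention and feature norm are combined by multiplication.
    scores = joint_to_patch * np.linalg.norm(patch_feats, axis=1)
    order = np.argsort(-scores, kind="stable")                    # best first
    kept = np.sort(order[:keep])
    pruned = np.sort(order[keep:])
    return kept, pruned, scores

rng = np.random.default_rng(0)
N, J, C, L = 192, 17, 32, 80                                      # illustrative sizes
logits = rng.standard_normal((N + J, N + J))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True) # row-wise softmax
feats = rng.standard_normal((N, C))
kept, pruned, scores = select_patches(attn, feats, J, L)
print(len(kept), len(pruned))  # 80 112
```

Returning the pruned indices alongside the kept ones reflects the description that pruned patches are not discarded: they would still be refined through cross-attention with the joint tokens for the heatmap decoder, a step this sketch omits.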
| Variant | Patch Selection Method | GFLOPs | COCO mAP |
|---|---|---|---|
| EViTPose-B/JT | Joint tokens (default) | 13.7 | 76.5 |
| EViTPose-B/S | Skeleton intersections | 13.3 | 75.0 |
| EViTPose-B/N | Neighbor-based | 11.1 | 74.1 |
| EViTPose-B (full) | None (all patches) | 19.8 | 77.6 |
When L ≪ N and J ≪ N, the per-block cost falls well below that of full attention over all N patches, achieving 30–44% FLOPs reduction with ≤3.5 mAP penalty across leading benchmarks.
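The 30–44% and ≤3.5 mAP figures can be checked directly against the table. The short script below uses only the GFLOPs and COCO mAP values listed there; the dictionary layout and the helper name `tradeoff` are illustrative.

```python
FULL = {"gflops": 19.8, "map": 77.6}                 # EViTPose-B (full)
VARIANTS = {
    "EViTPose-B/JT": {"gflops": 13.7, "map": 76.5},
    "EViTPose-B/S": {"gflops": 13.3, "map": 75.0},
    "EViTPose-B/N": {"gflops": 11.1, "map": 74.1},
}

def tradeoff(variant, full=FULL):
    """Return (percent GFLOPs saved, mAP points lost) relative to the full model."""
    saving_pct = 100.0 * (full["gflops"] - variant["gflops"]) / full["gflops"]
    map_drop = full["map"] - variant["map"]
    return round(saving_pct, 1), round(map_drop, 1)

results = {name: tradeoff(v) for name, v in VARIANTS.items()}
for name, (saving, drop) in results.items():
    print(f"{name}: {saving}% fewer GFLOPs, {drop} mAP lower")
```

The joint-token variant sits at the low end of the range (about 31% saved for 1.1 mAP), while the neighbor-based variant sits at the high end (about 44% saved for 3.5 mAP).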
3. Unified Skeleton Representation and Cross-Dataset Training
EViTPose adopts a unified skeletal representation, aggregating all unique joint labels across multiple datasets (COCO, MPII, AI-Challenger, JRDB, CrowdPose), resulting in J ≈ 20–21 joints. The model is trained with per-joint weighted losses to accommodate missing annotations:
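- Regression loss: a per-joint term on the predicted keypoint coordinates, scaled by a per-joint weight $w_j$.
- Heatmap loss: a per-joint term on the predicted heatmaps, likewise weighted per joint.
- Total loss: the combination of the regression and heatmap terms.
Here $w_j = 0$ if joint $j$ is not labeled in the dataset.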
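The per-joint weighting described above can be sketched as a masked loss. The exact loss functions are not recoverable from this text, so the L1 coordinate term, the mean-squared heatmap term, the balance factor `lam`, the normalization by the number of labeled joints, and the name `pose_loss` are all assumptions for illustration. The one property taken from the source is that unlabeled joints receive weight zero.

```python
import numpy as np

def pose_loss(pred_xy, gt_xy, pred_hm, gt_hm, weights, lam=1.0):
    """Weighted regression + heatmap loss over a unified joint set.

    pred_xy, gt_xy: (J, 2); pred_hm, gt_hm: (J, H, W); weights: (J,).
    """
    # ASSUMPTION: L1 error on coordinates, mean squared error on heatmaps.
    reg_per_joint = np.abs(pred_xy - gt_xy).sum(axis=1)            # (J,)
    hm_per_joint = ((pred_hm - gt_hm) ** 2).mean(axis=(1, 2))      # (J,)
    denom = max(float(weights.sum()), 1.0)                         # avoid divide-by-zero
    reg = float((weights * reg_per_joint).sum()) / denom
    hm = float((weights * hm_per_joint).sum()) / denom
    return reg + lam * hm

rng = np.random.default_rng(0)
J, H, W = 21, 64, 48                                 # illustrative sizes
weights = np.ones(J)
weights[[17, 18, 19, 20]] = 0.0                      # hypothetical joints missing in this dataset
pred_xy, gt_xy = rng.random((J, 2)), rng.random((J, 2))
pred_hm, gt_hm = rng.random((J, H, W)), rng.random((J, H, W))
base = pose_loss(pred_xy, gt_xy, pred_hm, gt_hm, weights)

# Perturbing an unlabeled joint leaves the loss unchanged.
shifted = pred_xy.copy()
shifted[20] += 5.0
same = pose_loss(shifted, gt_xy, pred_hm, gt_hm, weights)

# Perturbing a labeled joint increases it.
shifted_labeled = pred_xy.copy()
shifted_labeled[0] += 5.0
worse = pose_loss(shifted_labeled, gt_xy, pred_hm, gt_hm, weights)
print(np.isclose(base, same), worse > base)  # True True
```

Zeroing the weight rather than dropping the joint keeps every batch the same shape regardless of which dataset a sample comes from, which is what makes mixed-dataset training straightforward.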
Unified training facilitates robustness to occlusion, scale, and illumination variation, yielding improved cross-benchmark performance. For example, unified training boosts COCO mAP from 76.1 to 78.0.
4. Efficiency–Accuracy Trade-Offs and Comparative Analysis
EViTPose offers explicit control over the accuracy–efficiency frontier via the L parameter. Main metrics include:
- On MS-COCO (256×192): Full EViTPose-B achieves 77.6 mAP at 19.8 GFLOPs; pruning to 80 patches (EViTPose-B/JT) reduces cost to 13.7 GFLOPs with only 1.1 mAP loss.
- On OCHuman: 19.8→13.7 GFLOPs (–31%) at 93.0→92.3 mAP (–0.7).
- Varying L from 50 to 150 yields a monotonic trade-off curve for mAP vs. FLOPs.
Decoder choices additionally affect the speed–accuracy trade-off:
- Heatmap CNN decoder achieves 78.0 AP at 18.1 GFLOPs (COCO).
- Pixel-shuffle decoder produces 76.2 AP at 10.0 GFLOPs.
5. Experimental Benchmarks and Ablations
EViTPose performance was established on six human pose estimation benchmarks, with direct comparisons to competitive methods:
| Model | GFLOPs | Params | COCO AP | MPII PCKh | JRDB AP | CrowdPose AP | OCHuman AP |
|---|---|---|---|---|---|---|---|
| HRNet-W48 | 14.6 | 64 M | 75.1 | 90.1 | 42.4 | – | – |
| ViTPose-B | 18.0 | 90 M | 77.1 | 93.3 | – | 32.0 | 87.3 |
| EViTPose-B | 19.8 | 90 M | 77.6 | 92.4 | 73.9 | 36.6 | 93.0 |
| EViTPose-B/JT | 13.7 | 90 M | 76.5 | 92.5 | 73.9 | 35.0 | 92.3 |
Comparison with alternative selection strategies showed that joint-token selection gives the highest accuracy among the pruned variants (76.5 COCO mAP at 13.7 GFLOPs). Ablations on local–global attention (used in UniTransPose) and on decoder type examine how these design choices affect performance.
6. Comparison to Related Architectures
While EViTPose targets patch selection and computational scaling, UniTransPose, introduced in the same work, employs Joint-Aware Global-Local (JAGL) attention to capture local and global context at multiple scales. JAGL divides attention heads between local stripe-wise and global joint-based mechanisms, in contrast to EViTPose's adaptive patch pruning. JAGL achieves higher accuracy (78.0 AP on COCO) but offers less direct control over the cost–performance trade-off. EViTPose's explicit selection process differentiates it from prior ViT-based pose estimators, which typically do not exploit learnable joint tokens for active patch selection (Kinfu et al., 28 Feb 2025).
7. Limitations and Prospective Extensions
EViTPose is designed for static frame-based pose estimation and does not, in its default configuration, model temporal or video-based context. The approach assumes annotated joint locations are available for training and relies on the alignment of joint labels across datasets, with per-joint weights mitigating mismatches. A plausible implication is that future extensions could generalize the joint-token mechanism to temporal transformers or video-based architectures, leveraging sequential dependencies for further accuracy or robustness gains.
The methodology opens opportunities for broader applications in pose estimation domains where computational efficiency and patch-level interpretability are paramount. It also establishes a template for learnable, dynamic attention that targets anatomically meaningful regions in transformer networks for structured prediction tasks (Kinfu et al., 28 Feb 2025).