Category-Agnostic Pose Estimation (CAPE)

Updated 24 November 2025
  • CAPE is a pose estimation paradigm that predicts semantic keypoints for objects from unseen categories using minimal support information.
  • It employs advanced architectures such as transformer-based attention, graph-structured priors, and meta-point proposals for robust keypoint localization.
  • Recent approaches integrate text-based support and multimodal reasoning to generalize efficiently across diverse, open-world visual scenarios.

Category-Agnostic Pose Estimation (CAPE) is a research paradigm within computer vision that addresses the problem of keypoint localization for objects from arbitrary, potentially unseen, categories. The task is defined by the ability to predict a set of semantic keypoints on a query image using minimal support information, where the set of keypoints and object category can vary episode by episode. A CAPE model must generalize across highly diverse visual instances and keypoint definitions, eschewing reliance on category-specific models, category labels, or extensive retraining, thus enabling pose estimation in genuinely open-world scenarios.

1. Problem Definition and Foundations

Category-Agnostic Pose Estimation is formalized as follows: given a query image $I_q$ of an object belonging to some class $c$, together with a support specification (classically, $M$ annotated support images $S = \{(I_s^i, H_s^i)\}_{i=1}^M$ with $K$ keypoints each, or, more recently, text-based descriptions of the keypoints), predict the 2D (or, in some variants, 3D) coordinates $P \in \mathbb{R}^{K \times 2}$ of those same $K$ semantic keypoints in the query image. A CAPE method must (i) support arbitrary keypoint sets per episode and (ii) enable generalization to novel categories without category-specific fine-tuning (Xu et al., 2022).
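
To make the episodic interface concrete, the following minimal Python sketch spells out the shapes involved; the class and function names are hypothetical, not from any cited codebase.

```python
from dataclasses import dataclass

import torch


@dataclass
class CapeEpisode:
    """One CAPE episode: M annotated supports plus a query of the same class.

    K (the number of keypoints) may change from episode to episode, and the
    object class is disjoint from the training classes at test time.
    """
    query_image: torch.Tensor        # (3, H, W)
    support_images: torch.Tensor     # (M, 3, H, W)
    support_keypoints: torch.Tensor  # (M, K, 2), normalized (x, y) in [0, 1]


def predict_keypoints(model: torch.nn.Module, ep: CapeEpisode) -> torch.Tensor:
    """Return P, the (K, 2) predicted keypoint coordinates on the query."""
    return model(ep.query_image, ep.support_images, ep.support_keypoints)
```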

CAPE expands on traditional pose estimation, which is category- (e.g., human body, animal) or instance-specific, by enforcing an open-world protocol: disjoint sets of categories are used for training and testing, and evaluation focuses on the ability to handle objects and part definitions never seen during model development. The canonical MP-100 dataset (Xu et al., 2022), containing 100 diverse categories and up to 68 keypoints per instance, is widely used as the benchmark for 2D CAPE.

Key performance metrics include Probability of Correct Keypoint (PCK) at a normalized threshold, reflecting a pose prediction’s spatial accuracy relative to the object’s scale.
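
As a concrete reference, PCK can be computed as in the sketch below, assuming pixel coordinates and a bounding-box-based normalization scale (one common convention; exact MP-100 evaluation details may differ).

```python
import torch


def pck(pred: torch.Tensor, gt: torch.Tensor, scale: torch.Tensor,
        alpha: float = 0.2, visible: torch.Tensor | None = None) -> float:
    """Probability of Correct Keypoint.

    pred, gt: (N, K, 2) keypoint coordinates in pixels.
    scale:    (N,) per-instance normalization (e.g. longest bbox side).
    alpha:    threshold as a fraction of the scale (PCK@0.2 -> alpha=0.2).
    visible:  optional (N, K) boolean mask; invisible keypoints are skipped.
    """
    dist = torch.linalg.norm(pred - gt, dim=-1)   # (N, K) pixel distances
    correct = dist <= alpha * scale[:, None]      # (N, K) hits per keypoint
    if visible is not None:
        correct = correct[visible]
    return correct.float().mean().item()
```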

2. Model Architectures and Methodological Advances

CAPE methods have evolved across several axes:

  • Feature Matching Baselines: Early approaches such as POMNet (Xu et al., 2022) pose CAPE as a feature-matching problem. Visual features are extracted at annotated support keypoints and compared across the query image, with a transformer-based Keypoint Interaction Module enabling support–query alignment and inter-keypoint reasoning; a simplified sketch of this matching step appears after this list.
  • Self-Attention and End-to-End Regression: SCAPE (Liang et al., 18 Jul 2024) dispenses with explicit similarity heads and heatmap supervision, instead using pure multi-head self-attention (transformer interactors) followed by direct regression of keypoint coordinates through an MLP. This single-stage approach improves both speed and parameter efficiency, with accuracy gains attributed to better attention quality and global feature integration.
  • Graph-Based Structural Modeling: Recent works (GraphCape (Hirschorn et al., 2023), EdgeCape (Hirschorn et al., 25 Nov 2024)) treat the keypoints as nodes in a pose graph, with edges reflecting semantic or anatomical relations. Graph-based networks inject explicit structural priors, propagating information from visible to occluded parts and breaking symmetry for ambiguous instances. EdgeCape generalizes the standard pose graph by learning instance-specific edge weights (adjacency refinement) and integrating Markovian structural bias into decoder-layer self-attention, further enhancing symmetry-breaking and global context reasoning.
  • Meta-Point Proposal and Assignment: MetaPoint (Chen et al., 20 Mar 2024) introduces learnable, support-free meta-point proposals capturing universal part priors, which are subsequently assigned and refined to match user-specified keypoints through deformable point decoders.
  • Text-Based Support and Multimodal Integration: CapeX (Rusanovsky et al., 1 Jun 2024), CapeLLM (Kim et al., 11 Nov 2024), and CapeNext (Zhu et al., 17 Nov 2025) replace visual support with textual descriptions of each keypoint (and, in CapeNext, class-level and image-level context). These frameworks embed language descriptions and employ cross-modal fusion (including CLIP-based encoders and LLM reasoning) to localize category-agnostic keypoints, achieving robustness to visual variability and further decoupling pose estimation from visual support acquisition.
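
The matching step shared by the visual-support methods above can be sketched as follows: sample backbone descriptors at the annotated support keypoints and correlate them with the query feature map to obtain per-keypoint similarity heatmaps. This is a simplified illustration, not POMNet's full pipeline (which adds the transformer-based interaction module).

```python
import torch
import torch.nn.functional as F


def match_keypoints(support_feats: torch.Tensor, support_kpts: torch.Tensor,
                    query_feats: torch.Tensor) -> torch.Tensor:
    """Correlate support keypoint descriptors against a query feature map.

    support_feats: (C, Hs, Ws) backbone features of one support image.
    support_kpts:  (K, 2) keypoints normalized to [-1, 1] as (x, y).
    query_feats:   (C, Hq, Wq) backbone features of the query image.
    Returns:       (K, Hq, Wq) cosine-similarity heatmaps, one per keypoint.
    """
    # Bilinearly sample a C-dimensional descriptor at each support keypoint.
    grid = support_kpts.view(1, 1, -1, 2)                  # (1, 1, K, 2)
    desc = F.grid_sample(support_feats[None], grid,
                         align_corners=False)              # (1, C, 1, K)
    desc = desc[0, :, 0].t()                               # (K, C)

    # Cosine similarity between each descriptor and every query location.
    q = F.normalize(query_feats.flatten(1), dim=0)         # (C, Hq*Wq)
    d = F.normalize(desc, dim=1)                           # (K, C)
    return (d @ q).view(-1, *query_feats.shape[1:])        # (K, Hq, Wq)
```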

The following table summarizes representative model families:

Approach    Support Input                  Key Architectural Innovation
POMNet      Support images                 Transformer KIM, feature matching
SCAPE       Support images                 Pure self-attention, global context
GraphCape   Support images                 Graph-FFN, skeleton prior
EdgeCape    Support images                 Learned edge weights, Markovian bias
MetaPoint   Support images                 Meta-point proposals, deformable decoders
CapeX       Text (keypoint descriptions)   Text encoder + graph-aware decoder
CapeLLM     Text (rich descriptions)       Multimodal LLM reasoning
CapeNext    Text + image/class context     HCMI, DSFR dynamic fusion

3. Key Technical Mechanisms

Feature Extraction and Attention

Most CAPE pipelines rely on deep visual backbones (ResNet, ViT, Swin, DINOv2), with transformer-based attention mechanisms aligning support and query features. In SCAPE (Liang et al., 18 Jul 2024), initial self-attention layers are replaced with cross-attention to augment keypoint tokens with global support image semantics (Global Keypoint Feature Perceptor, GKP), while Keypoint Attention Refiner (KAR) filters noisy attention patterns between tokens.
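
The token-to-image cross-attention used by such perceptor modules can be sketched with standard PyTorch primitives; this is an illustrative reconstruction in the spirit of GKP, not SCAPE's exact implementation.

```python
import torch
import torch.nn as nn


class KeypointCrossAttention(nn.Module):
    """Enrich K keypoint tokens with global image context via cross-attention.

    Keypoint tokens act as queries over flattened image features (keys and
    values); the residual-plus-norm structure is an illustrative choice.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, kpt_tokens: torch.Tensor,
                img_feats: torch.Tensor) -> torch.Tensor:
        # kpt_tokens: (B, K, C); img_feats: (B, H*W, C)
        attended, _ = self.attn(kpt_tokens, img_feats, img_feats)
        return self.norm(kpt_tokens + attended)
```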

Graph-Based Priors

Graph-FFN modules (Hirschorn et al., 2023) update keypoint nodes using adjacency matrices reflecting category-specific skeletons, with normalized message passing enabling structural bias and occlusion robustness. EdgeCape (Hirschorn et al., 25 Nov 2024) generalizes this by learning residual adjacency (edge weight prediction) and incorporating Markovian bias to modulate attention according to $k$-hop graph distances. This structure enables robust symmetry-breaking and information flow in ambiguous or occluded cases.
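
A Graph-FFN-style node update can be sketched as symmetrically normalized message passing over the skeleton adjacency; the GCN-style normalization and residual structure below are illustrative assumptions, not the published module verbatim.

```python
import torch
import torch.nn as nn


class GraphFFN(nn.Module):
    """Feed-forward keypoint update mixed with skeleton message passing.

    x:   (B, K, C) keypoint node features.
    adj: (K, K) binary skeleton adjacency for the episode's category.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, a standard
        # GCN-style choice used here for illustration.
        a = adj + torch.eye(adj.size(0), device=adj.device)
        deg_rsqrt = a.sum(-1).rsqrt()
        a_norm = a * deg_rsqrt[:, None] * deg_rsqrt[None, :]
        msg = torch.einsum('ij,bjc->bic', a_norm, x)  # propagate over edges
        return x + self.ffn(msg)                      # residual update
```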

Meta-Point and Deformable Decoders

MetaPoint (Chen et al., 20 Mar 2024) dispenses with local support cues by learning universal meta-embeddings, which are refined into sparse meta-points through multi-scale deformable attention. Assignment to user-desired keypoints employs slacked regression and bipartite matching, and a separate deformable decoder produces precise localization.
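
The assignment step can be illustrated with standard Hungarian matching over a pairwise cost, as below; the single distance-based cost is a simplified stand-in for MetaPoint's combined localization and semantic terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_meta_points(meta_pts: np.ndarray,
                       target_kpts: np.ndarray) -> np.ndarray:
    """Match each desired keypoint to one meta-point proposal.

    meta_pts:    (P, 2) proposed meta-point locations.
    target_kpts: (K, 2) desired keypoint locations, with K <= P.
    Returns:     (K,) index of the meta-point assigned to each target.
    """
    # Pairwise Euclidean cost; MetaPoint combines localization and semantic
    # terms, collapsed here into a single distance for illustration.
    cost = np.linalg.norm(target_kpts[:, None] - meta_pts[None], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # rows come back sorted 0..K-1
    return cols
```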

Textual Conditioning and Multimodal Reasoning

Recent advances leverage language representations (CLIP, GTE, LLaMA) to encode keypoint semantics, fusing these with visual features via multi-modal transformers or LLMs (CapeLLM (Kim et al., 11 Nov 2024)). These models enable inference without support images and provide increased robustness to occlusion, visual diversity, and semantic ambiguity.
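
As a minimal sketch of textual conditioning, keypoint descriptions can be embedded with an off-the-shelf CLIP text encoder to produce per-keypoint tokens that stand in for visual support features; the model choice and fusion step here are illustrative, not the exact CapeX/CapeLLM pipelines.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Hypothetical usage: encode keypoint descriptions into per-keypoint tokens
# that replace visual support features; real CAPE systems use various
# encoders (CLIP, GTE, LLaMA) and fusion schemes.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

descriptions = ["left eye of the animal",
                "right eye of the animal",
                "tip of the nose"]
inputs = tokenizer(descriptions, padding=True, return_tensors="pt")
with torch.no_grad():
    kpt_tokens = text_encoder(**inputs).pooler_output  # (K, 512)

# These K tokens can now play the role of support keypoint features and be
# fused with query image features via cross-attention, as sketched earlier
# in this section.
```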

Recurrent and Structure-Aware Feature Mining

FMMP (Chen et al., 27 Mar 2025) introduces recurrent deformable attention modules for fine-grained feature mining, structuring support and query estimation with skeleton-guided offsets and mixup keypoint padding to supply denser, structure-aware supervision across episodes with varying keypoint numbers.
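
Since $K$ varies per episode, batched training requires padding; the sketch below shows one plausible reading of mixup-style keypoint padding, where padded slots are convex mixes of annotated keypoints rather than zeros, flagged by a mask so losses can treat them separately. This is an assumption-laden illustration, not FMMP's exact scheme.

```python
import torch


def mixup_pad_keypoints(kpts: torch.Tensor, k_max: int):
    """Pad (K, 2) keypoints to (k_max, 2) using convex mixes of real points.

    Returns the padded keypoints and a boolean mask marking which slots
    carry real annotations, so losses can down-weight the padded ones.
    """
    k = kpts.size(0)
    mask = torch.zeros(k_max, dtype=torch.bool)
    mask[:k] = True                              # True = annotated keypoint
    if k >= k_max:
        return kpts[:k_max], mask
    pad = k_max - k
    i = torch.randint(k, (pad,))
    j = torch.randint(k, (pad,))
    lam = torch.rand(pad, 1)                     # mixing coefficients
    mixed = lam * kpts[i] + (1 - lam) * kpts[j]  # stays near the object
    return torch.cat([kpts, mixed], dim=0), mask
```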

4. Training Regimes, Losses, and Datasets

CAPE models are trained episodically, mirroring few-shot protocols: in each episode, support and query items are sampled from disjoint categories. Losses are dominated by normalized $\ell_1$ coordinate regression on predicted keypoints, often supplemented by heatmap losses and auxiliary attention or adjacency supervision (as in EdgeCape (Hirschorn et al., 25 Nov 2024)):

$\mathcal{L}_\text{total} = \mathcal{L}_\text{offset} + \lambda_\text{adj} \mathcal{L}_\text{adj} + \cdots$

For text-based models, cross-entropy is used to supervise coordinate token generation.
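
A representative composite objective might be implemented as below; the $\ell_1$ offset term follows the formula above, while the adjacency supervision target and weighting are illustrative.

```python
import torch
import torch.nn.functional as F


def cape_loss(pred_kpts, gt_kpts, visible=None,
              pred_adj=None, gt_adj=None, lambda_adj: float = 0.1):
    """L_total = L_offset + lambda_adj * L_adj (+ further auxiliary terms).

    pred_kpts, gt_kpts: (B, K, 2) coordinates normalized to [0, 1].
    visible:            optional (B, K) mask zeroing unlabeled keypoints.
    pred_adj, gt_adj:   optional (K, K) adjacency for EdgeCape-style
                        structure supervision (illustrative target).
    """
    l1 = (pred_kpts - gt_kpts).abs().sum(-1)  # (B, K) per-keypoint l1
    if visible is not None:
        l1 = l1 * visible
    loss = l1.mean()
    if pred_adj is not None and gt_adj is not None:
        loss = loss + lambda_adj * F.mse_loss(pred_adj, gt_adj)
    return loss
```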

The MP-100 benchmark (Xu et al., 2022) is central for evaluation, with rigorous category splits to enforce generalization. The dataset spans up to 20,000 images across 100 categories, with 8–68 keypoints per category and explicit skeleton graphs.

5. Quantitative Results and Comparative Insights

Performance is reported as PCK@0.2 or mPCK across all unseen categories on MP-100, with $K \in [8, 68]$ keypoints per category.

State-of-the-art models demonstrate robust improvements from graph-structural priors, Markovian bias, text-based conditioning, and dynamic cross-modal support (Hirschorn et al., 25 Nov 2024, Kim et al., 11 Nov 2024, Zhu et al., 17 Nov 2025).

6. Frontier Directions, Practical Considerations, and Open Challenges

  • Efficiency and Scalability: End-to-end, single-stage regression (SCAPE) and lightweight graph modules provide high FPS with low parameter counts. Mixup keypoint padding and recurrent feature mining offer dense supervision and structural adaptation (Chen et al., 27 Mar 2025).
  • Support-Free Inference: Textual or multimodal support (CapeX, CapeLLM, CapeNext) eliminates the need for curated support images and enables broader applicability in dynamic, real-world settings.
  • Generalization and Robustness: Graph/edge-weight learning, cross-modal refinement, and meta-point priors enhance symmetry reasoning, occlusion handling, and consistency across rare and diverse object shapes.
  • Limitations: Models remain susceptible to polysemy and subtle visual variations absent from training, especially under completely open-vocabulary or 3D extension scenarios.
  • Future Work: Open avenues include joint 3D/2D reasoning, multi-instance/multi-object scenes, improved structure learning under weak supervision or open-vocabulary semantics, and further synergy with large language/vision models for richer multimodal priors (Zhu et al., 17 Nov 2025, Liang et al., 18 Jul 2024). Benchmarking on real-time inference, domain transfer, and novel part specifications continues to drive progress.

Category-Agnostic Pose Estimation has become a core challenge at the intersection of representation learning, vision–language modeling, and open-world visual understanding, enabling generalized keypoint localization with increasing accuracy, flexibility, and interpretability. For comprehensive technical details and latest comparisons, see (Liang et al., 18 Jul 2024, Hirschorn et al., 25 Nov 2024, Kim et al., 11 Nov 2024, Zhu et al., 17 Nov 2025, Chen et al., 20 Mar 2024, Xu et al., 2022).
