Skeleton-Agnostic Representations

Updated 23 June 2026

Skeleton-agnostic representations are structured frameworks that abstract varied skeletal topologies, sensor modalities, and species differences for universal motion analysis.
They employ techniques like prompt-based encoding, graph tokenization, image-based rendering, and self-supervised learning to decouple motion features from fixed skeletal templates.
Empirical results in action recognition, motion transfer, and 3D reconstruction highlight improved cross-dataset robustness, though challenges remain in fine-grained motion detail and computational efficiency.

Skeleton-agnostic representations are structured approaches and embedding schemes designed to be invariant to the underlying skeletal topology, joint count, data modality, or pose definition across datasets, sensors, or species. They aim to enable universal analysis, recognition, synthesis, and transfer of articulated motion or action, circumventing the traditional requirement that models be trained and evaluated on a fixed, format-specific skeleton structure. Research in this domain spans action recognition, animation, motion understanding, and 3D reconstruction, and leverages advancements in deep learning, graph processing, self-supervision, and differentiable rendering.

1. The Need for Skeleton-Agnostic Representations

Diversity in skeletal data arises from multiple sources: heterogeneous capture setups and subjects (e.g., Kinect v1/v2 with 20/25 joints, motion capture rigs, BVH-based skeletons, animal morphologies), varying coordinate definitions (2D/3D), different joint connectivity, and extraction errors. Traditional methods assume a fixed skeleton format: their GCNs, transformers, or other graph-based backbones explicitly encode the node topology and edge structure, which hampers cross-dataset transfer, species generalization, and integration of novel data sources. Skeleton-agnostic representations are motivated by the demand for:

Robust action recognition across datasets, or in few-shot/zero-shot settings where the skeletal template may be mismatched (Wang et al., 4 Jun 2025, Xu et al., 6 Feb 2026).
Multi-modal applications involving sensors such as IMU, WiFi, or even textual instruction (Li et al., 17 Mar 2025).
Animation and motion retargeting across species, morphologies, or surface types (Xu et al., 6 Feb 2026, Wang et al., 9 Apr 2026).
Universal self-supervised or multi-modal learning on skeleton-like data without explicit architectural adaptation (Yang et al., 6 Mar 2026, Wang et al., 18 Mar 2026).
3D reconstruction of arbitrary articulated objects without predefined skeletal priors (Zhang et al., 2024).

2. Approaches to Skeleton-Agnostic Representation

Multiple principled frameworks have been introduced to support invariance to skeleton structure. The main strategies are:

2.1. Prompt-Based and Unified Topology Encoding

A dominant approach is to map all skeletons into a shared format with a "master topology" of maximal joint count. Inputs with fewer joints are filled using optimized "prompt" tokens. For example, Wang et al. (4 Jun 2025) pad each input to 30 joints using learned skeleton-specific prompts, followed by spatial normalization and a common embedding backbone. Trainable prompts adaptively stand in for missing or non-existent degrees of freedom, outperform naive zero-padding, and allow a single model to process arbitrary $J_i \times D_i$ input skeletons.
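The padding step can be illustrated with a minimal sketch. It is not the authors' implementation: the names `make_prompts` and `pad_to_master`, the format labels, and the fixed 3D coordinate dimension are assumptions made for illustration, and the prompt rows are random placeholders where a real model would hold trainable parameters.

```python
import random

MASTER_JOINTS = 30  # master-topology size quoted in the text
COORD_DIM = 3       # assumption: inputs are already expressed in 3D


def make_prompts(format_names, seed=0):
    """Create one prompt table (MASTER_JOINTS x COORD_DIM) per skeleton format.

    Placeholder values only: in a trained model these rows would be learned.
    """
    rng = random.Random(seed)
    return {
        name: [[rng.gauss(0.0, 0.02) for _ in range(COORD_DIM)]
               for _ in range(MASTER_JOINTS)]
        for name in format_names
    }


def pad_to_master(frame, fmt, prompts):
    """Pad one frame (J_i joints x COORD_DIM) up to MASTER_JOINTS rows.

    Real joints are kept unchanged; the missing slots are filled with the
    prompt rows of the given format instead of zeros.
    """
    n_real = len(frame)
    if n_real > MASTER_JOINTS:
        raise ValueError("frame has more joints than the master topology")
    filler = prompts[fmt][n_real:]
    return [list(joint) for joint in frame] + [list(row) for row in filler]


# Hypothetical formats with 20 and 25 joints (the Kinect v1/v2 joint counts).
prompts = make_prompts(["kinect_v1", "kinect_v2"])
frame_v1 = [[0.0, 0.0, 0.0] for _ in range(20)]
frame_v2 = [[1.0, 1.0, 1.0] for _ in range(25)]
padded_v1 = pad_to_master(frame_v1, "kinect_v1", prompts)
padded_v2 = pad_to_master(frame_v2, "kinect_v2", prompts)
print(len(padded_v1), len(padded_v2))
```

Both frames leave the function with the same shape, so a single backbone can consume them; only the filler rows differ by format.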

2.2. Explicit Topology-Invariance via Tokenization and Graph Priors

NECromancer (Xu et al., 6 Feb 2026) introduces two key elements: (i) an Ontology-aware Skeletal Graph Encoder (OwO) that computes per-joint embeddings conditioned on textual semantics, rest-pose offsets, and arbitrary skeletal graphs, and (ii) a Topology-Agnostic Tokenizer (TAT) that aggregates all joint and edge features, injects a virtual "CLS"-style token, and applies residual vector quantization only on the topology-independent token. Decoding onto any target skeleton uses only per-format structural priors, fully decoupling moment-to-moment motion embedding from the native skeletal graph.
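Residual vector quantization of a single topology-independent token can be sketched as follows. This is a generic RVQ illustration under stated assumptions, not the NECromancer code: the codebooks are random rather than learned, and the dimensions, stage count, and function names are hypothetical.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(token, codebooks):
    """Quantize stage by stage; each stage encodes the residual left so far.

    Returns the chosen code indices and the final unexplained residual.
    """
    residual = list(token)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Sum the selected code vectors to approximate the original token."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Assumed toy sizes; later stages use smaller codes to refine earlier ones.
rng = random.Random(0)
DIM, STAGES, CODES = 4, 3, 8
codebooks = [
    [[rng.uniform(-1.0, 1.0) / (2 ** stage) for _ in range(DIM)]
     for _ in range(CODES)]
    for stage in range(STAGES)
]
motion_token = [0.3, -0.7, 0.1, 0.5]  # stands in for the virtual CLS-style token
codes, leftover = rvq_encode(motion_token, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes)
```

Because only this one token is quantized, the discrete code sequence has the same length regardless of how many joints the source skeleton had.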

2.3. Image-Based and Differentiable Rendering Strategies

Skeleton-to-image (S2I) (Yang et al., 6 Mar 2026) and DrAction (Wang et al., 18 Mar 2026) transform the structured ($T \times J \times D$) skeleton sequence into a 2D grid (or a sequence of rendered images using learned, format-invariant Gaussians), allowing direct ingestion by vision transformers or MLLMs. S2I uses semantic body-part reordering, temporal-joint "flattening," normalization, and resizing to standard image dimensions. DrAction adapts the number and connectivity of 3D Gaussians per input skeleton, projects them differentiably through learned camera and rendering stages, and enables end-to-end task-driven optimization regardless of source topology, marker count, or modality (2D/3D/rich MoCap).
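A rough sketch of the skeleton-to-image idea is given below, assuming 3D joints and omitting the final resize to a standard image size. The function name `skeleton_to_image`, the `joint_order` argument, and the 8-bit colour scaling are illustrative assumptions rather than details taken from the S2I paper.

```python
def skeleton_to_image(seq, joint_order):
    """Turn a T x J x 3 sequence into a J x T grid of 8-bit RGB pixels.

    Rows follow joint_order (a body-part grouping), columns follow time,
    and the x/y/z coordinates become the three colour channels.
    """
    # One min/max per channel over the whole clip, so all frames share a scale.
    lo = [min(frame[j][c] for frame in seq for j in joint_order) for c in range(3)]
    hi = [max(frame[j][c] for frame in seq for j in joint_order) for c in range(3)]
    image = []
    for j in joint_order:
        row = []
        for frame in seq:
            pixel = []
            for c in range(3):
                span = hi[c] - lo[c]
                value = 0 if span == 0 else round(255 * (frame[j][c] - lo[c]) / span)
                pixel.append(value)
            row.append(tuple(pixel))
        image.append(row)
    return image


# Toy clip: 2 frames, 3 joints. The reordering below is a hypothetical grouping.
seq = [
    [[0.0, 0.0, 0.0], [1.0, 2.0, 4.0], [2.0, 4.0, 8.0]],
    [[2.0, 4.0, 8.0], [1.0, 2.0, 4.0], [0.0, 0.0, 0.0]],
]
image = skeleton_to_image(seq, joint_order=[2, 0, 1])
print(len(image), len(image[0]))
```

The output grid no longer records which joints are connected, which is the information loss discussed later in this article.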

2.4. Self-Supervised Skeleton-Invariant Metric Learning

SKELAR (Li et al., 17 Mar 2025) pretrains representations through a self-supervised coarse-angle reconstruction objective. By predicting discretized, quantized joint rotation angles under random joint dropout, it inherently enforces invariance to subject, viewpoint, scale, and skeleton definition. Resulting per-action embeddings are highly robust, enabling domain transfer and direct matching with signals from heterogeneous HAR modalities (IMU, WiFi), even in the absence of skeletons, by leveraging synthetic data.
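The flavour of a coarse-angle target with joint dropout can be sketched as follows. This is a simplified stand-in, not SKELAR's actual objective: it uses the angle between two adjacent bones instead of full joint rotations, and the bin count, function names, and dropout scheme are assumptions.

```python
import math
import random


def joint_angle(parent, joint, child):
    """Angle in radians at `joint` between the bones to parent and to child."""
    u = [p - j for p, j in zip(parent, joint)]
    v = [c - j for c, j in zip(child, joint)]
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # degenerate bone: no meaningful angle
    cos = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    return math.acos(max(-1.0, min(1.0, cos)))


def coarse_bin(angle, n_bins=8):
    """Map an angle in [0, pi] to one of n_bins equal-width classes."""
    return min(int(angle / math.pi * n_bins), n_bins - 1)


def drop_joints(frame, drop_prob, rng):
    """Randomly mask joints (as None) to mimic skeletons with fewer joints."""
    return [None if rng.random() < drop_prob else joint for joint in frame]


shoulder, elbow = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
straight = coarse_bin(joint_angle(shoulder, elbow, (2.0, 0.0, 0.0)))
bent = coarse_bin(joint_angle(shoulder, elbow, (1.0, 1.0, 0.0)))
masked = drop_joints([shoulder, elbow, (2.0, 0.0, 0.0)], 0.5, random.Random(0))
print(straight, bent, masked)
```

The bin index depends only on relative bone directions, so it is unchanged by translating, rotating, or uniformly scaling the skeleton, which is one reason angle-based targets transfer across subjects and viewpoints.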

2.5. Implicit, Physics-Driven Structure Discovery

In category-agnostic articulated object reconstruction, strategies such as LIMR (Zhang et al., 2024) forgo category templates entirely, instead learning the skeleton graph, skinning weights, rigidity coefficients, and per-bone transforms via an EM-style iterative optimization. Regularization via ARAP-like penalties and motion cues (e.g., optical flow) guide the emergence of topology, maintaining the ability to generalize across arbitrary articulated classes.
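To give a feel for how a rigidity-style regularizer behaves, the sketch below penalizes changes in edge length between a rest shape and a deformed shape. It is a deliberately simplified proxy: a true ARAP energy also fits a local rotation per vertex, and the function name and the per-edge `rigidity` weights are hypothetical rather than taken from LIMR.

```python
import math


def edge_length_penalty(rest, deformed, edges, rigidity=None):
    """Sum of rigidity-weighted squared changes in edge length.

    rest, deformed: lists of 3D points with matching indices.
    edges: list of (i, j) index pairs.
    rigidity: optional per-edge weights; higher means stiffer.
    """
    total = 0.0
    for k, (i, j) in enumerate(edges):
        weight = 1.0 if rigidity is None else rigidity[k]
        change = math.dist(deformed[i], deformed[j]) - math.dist(rest[i], rest[j])
        total += weight * change ** 2
    return total


rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
edges = [(0, 1), (1, 2)]
translated = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in rest]
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
print(edge_length_penalty(rest, translated, edges))
print(edge_length_penalty(rest, stretched, edges))
print(edge_length_penalty(rest, stretched, edges, rigidity=[0.5, 1.0]))
```

Rigid motion costs nothing while stretching is penalized, so an optimizer minimizing such a term is pushed toward grouping points that move together into the same bone.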

2.6. Gaussian Splatting and Skeletonization

Gaussianimate/Skelebones (Wang et al., 9 Apr 2026) leverages 3D Gaussian Splatting to represent deformable surfaces, followed by motion-based clustering and Smooth Skinning Decomposition with Rigid Bones (SSDR). Skeletons are extracted via mean curvature skeletonization and gradient analysis of skinning weights, yielding kinematic trees that are topologically correct and adapted to the observed motion. This supports skeletonization across arbitrary animatable categories without species- or model-specific priors.
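SSDR fits a linear blend skinning model in which every bone carries a rigid transform. As a reminder of that underlying model only (not of the decomposition or skeleton extraction themselves), a minimal sketch with hypothetical function names:

```python
def rigid_apply(rotation, translation, point):
    """Apply a rigid transform: rotation (3x3 nested list) then translation."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]


def blend_skin(point, weights, bones):
    """Linear blend skinning: weighted sum of the point moved by each bone.

    weights: one non-negative weight per bone, expected to sum to 1.
    bones: list of (rotation, translation) pairs.
    """
    out = [0.0, 0.0, 0.0]
    for weight, (rotation, translation) in zip(weights, bones):
        moved = rigid_apply(rotation, translation, point)
        out = [o + weight * m for o, m in zip(out, moved)]
    return out


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bones = [
    (IDENTITY, [0.0, 0.0, 0.0]),  # bone 0 stays in place
    (IDENTITY, [2.0, 0.0, 0.0]),  # bone 1 slides along x
]
skinned = blend_skin([1.0, 1.0, 1.0], [0.5, 0.5], bones)
print(skinned)
```

A point shared equally between the two bones moves half as far as the sliding bone; where these weights change sharply across the surface is a natural cue for joint locations.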

3. Architecture and Training Protocols

Skeleton-agnostic models integrate several standardized architectural choices:

Unified embedding backbones: Shared transformers with spatial and temporal self-attention (e.g., as in (Wang et al., 4 Jun 2025)) process format-invariant input tensors, fed by joint embeddings, semantic cues, and trainable prompts.
Self-supervised objectives: Consistency losses (cross-format, cross-modal), variance-covariance regularization (VICReg) (Wang et al., 4 Jun 2025), masked modeling (MAE/DiffMAE) (Yang et al., 6 Mar 2026), contrastive semantics (CLIP-space alignment (Xu et al., 6 Feb 2026)), or quantization losses (vector quantization of virtual tokens).
Task-driven end-to-end learning: Vision encoders and joint-image renderers (S2I, DrAction) are trained jointly with or under the guidance of large pretrained MLLMs, with curriculum schedules for alignment, discriminative learning, and causal distillation (Wang et al., 18 Mar 2026).
Auxiliary harmonization modules: Networks lifting 2D skeletons to 3D (Wang et al., 4 Jun 2025), or CLIP-based semantic encoders for motion direction, facilitate unification of datasets and modalities.
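Of the objectives listed above, VICReg is compact enough to sketch directly. The version below follows the usual three-term form (invariance, variance, covariance) in plain Python; the coefficient defaults and the exact normalization are assumptions and may differ from what any cited paper uses.

```python
import math


def covariance(batch):
    """Unbiased covariance matrix of a batch (n rows, d columns, n >= 2)."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[k] for row in batch) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in batch]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def variance_term(batch, gamma=1.0, eps=1e-4):
    """Hinge that pushes each dimension's standard deviation up to gamma."""
    cov = covariance(batch)
    d = len(cov)
    return sum(max(0.0, gamma - math.sqrt(cov[k][k] + eps)) for k in range(d)) / d


def covariance_term(batch):
    """Penalize off-diagonal covariance so dimensions stay decorrelated."""
    cov = covariance(batch)
    d = len(cov)
    return sum(cov[a][b] ** 2 for a in range(d) for b in range(d) if a != b) / d


def invariance_term(view_a, view_b):
    """Mean squared distance between paired embeddings of the two views."""
    pairs = zip(view_a, view_b)
    return sum(sum((x - y) ** 2 for x, y in zip(a, b)) for a, b in pairs) / len(view_a)


def vicreg_loss(view_a, view_b, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of the three terms (weights are placeholder defaults)."""
    return (lam * invariance_term(view_a, view_b)
            + mu * (variance_term(view_a) + variance_term(view_b))
            + nu * (covariance_term(view_a) + covariance_term(view_b)))


spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
print(vicreg_loss(spread, spread), vicreg_loss(collapsed, collapsed))
```

In the cross-format setting, the two views would be embeddings of the same motion expressed in two different skeleton formats; the variance term is what prevents the trivial solution of mapping every input to one point.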

4. Applications and Empirical Performance

Skeleton-agnostic representations have been validated in multiple domains:

Action Recognition: Universal backbones achieve state-of-the-art or near-SOTA accuracy across NTU-60/120, PKU-MMD, and cross-format transfer settings (e.g., Wang et al., 4 Jun 2025 report 87.8%/93.7% on NTU-60 x-sub/x-view; Yang et al., 6 Mar 2026 report 85.8% on NTU-60 x-sub with a 3s-S2I linear probe).
Motion Transfer and Animation: NECromancer supports arbitrary cross-species transfer, preserving motion semantics and reconstructing unseen skeletons with high fidelity (Xu et al., 6 Feb 2026). Skelebones/PartMM yields >21% RMSE improvements over LBS in garment/animal reanimation (Wang et al., 9 Apr 2026).
HAR Across Modalities: SKELAR demonstrates robustness in both full-shot and few-shot settings, outperforming label and text embedding baselines by up to 8 percentage points on novel modalities and scenario transfer (Li et al., 17 Mar 2025).
3D Articulation Estimation: LIMR reconstructs previously unseen articulated object categories, achieving 8.3%–13% improvements in keypoint or Chamfer metrics over previous category-dependent architectures (Zhang et al., 2024).
Universal Multi-modal Understanding: SkeletonLLM demonstrates cross-format action recognition, motion captioning, and reasoning (including fine-grained and temporal QA) without per-format engineering, outperforming text-alignment and image-projection baselines by significant margins (Wang et al., 18 Mar 2026).

5. Limitations and Open Challenges

Topology Gaps: While models like NECromancer and Skelebones cover large topological variations, extreme morphology gaps (e.g., from humans to multi-limbed creatures) still present quality limitations (Xu et al., 6 Feb 2026, Wang et al., 9 Apr 2026).
Information Loss: Flattening or imageification discards explicit kinematic tree information (see S2I; (Yang et al., 6 Mar 2026)), potentially reducing the granularity of spatial reasoning compared to graph-based approaches. Future methods may combine image-based encodings with learned or positional adjacency.
Compute Overhead: Topology-agnostic encoders (OwO, TAT, RVQ) and large-scale transformers may incur substantial computational costs (Xu et al., 6 Feb 2026).
Precision on Fine Motions: Representations built on coarse inputs (joints, bones) may underperform on actions that depend on fine-grained morphological cues (e.g., finger or hand gestures), unless extended with additional streams (RGB, optical flow) (Wang et al., 2022).
Motion-Adaptive Instability: Systems that self-refine structure over time (e.g., Skelebones, LIMR) can experience instability or require large temporal context to stabilize skeleton extraction.

6. Future Directions

Major research trajectories in skeleton-agnostic representation include:

Cross-modal Mimicry: Extending frameworks to simultaneously incorporate, transfer, or co-train with non-skeletal streams (heatmaps, depth, RGB, inertial signals) for enhanced universality (Wang et al., 2022, Li et al., 17 Mar 2025).
Adaptive curriculum and dynamics: Dynamic adjustment of mimicry weights, temperature schedules, or curriculum strategy to stabilize and accelerate early-stage learning and transfer (Wang et al., 2022).
Scaling Real-World and Synthetic Data: Leveraging large-scale synthetic data (e.g., text-to-motion pipelines) to fill gaps in skeleton availability, achieving robust domain adaptation (Li et al., 17 Mar 2025).
Integration with high-level reasoning: Deeper fusion of skeleton-agnostic encodings with MLLMs for rich temporal–causal reasoning, explainability, and diverse output tasks (Wang et al., 18 Mar 2026).
Lightweight and real-time inference: Architectures and distillation for deployment on resource-constrained platforms and for real-time animation (Xu et al., 6 Feb 2026).
Beyond kinematic models: Cross-category, non-human articulation and deformable surface modeling without restricting priors, as in implicit representation frameworks (Zhang et al., 2024, Wang et al., 9 Apr 2026).

7. Comparative Summary Table

Approach/Class	Key Methodology (Abbr.)	Primary Skeleton-Agnostic Mechanism
Unified Prompting	Prompted master topology (Wang et al., 4 Jun 2025)	Prompt tokens + fusion; supports arbitrary $J_i, D_i$
Tokenization/Graph	OwO+TAT+RVQ (Xu et al., 6 Feb 2026)	Per-skeleton priors, virtual joint token, topology-agnostic quantization
Imageification	S2I (Yang et al., 6 Mar 2026), DrAction (Wang et al., 18 Mar 2026)	2D reformatting or visual rendering; processed by ViTs or MLLMs
Self-supervised metric	Coarse-angle/feature consistency (Li et al., 17 Mar 2025)	Quantized joint-angle recon, joint dropout, domain invariance
Implicit structure	Iterative optimization (Zhang et al., 2024, Wang et al., 9 Apr 2026)	Joint/bone discovery, ARAP/rigidity, mean curvature skeleton

This progression demonstrates the evolution from hand-crafted data harmonization and per-skeleton architectures towards fundamentally skeleton-agnostic paradigms, supporting universal inference, motion analysis, and synthesis across both previously seen and novel morphologies.