Category-Agnostic Motion Capture

Updated 12 December 2025
  • Category-Agnostic Motion Capture (CAMoCap) is a methodology for reconstructing 3D articulated motion of arbitrary objects using flexible, category-free skeleton models.
  • Recent systems integrate prompt-driven asset encoding, video feature extraction, and constraint-aware inverse kinematics to produce temporally coherent, per-joint pose estimates.
  • Applications span VFX, gaming, robotics, and accessibility, while challenges include handling non-normative morphologies and ensuring physical plausibility in reconstructed motions.

Category-Agnostic Motion Capture (CAMoCap) refers to the set of methods and systems that reconstruct the 3D articulated motion of arbitrary, previously unseen objects (including humans, animals, robots, and more abstract or non-standard morphologies) from input signals such as monocular video, multi-view imagery, or wearable sensors. In contrast to traditional motion capture approaches that rely on fixed, category-specific kinematic templates or pre-determined skeletal models, CAMoCap methods generalize to arbitrary skeletal topologies and morphologies, enabling broad inter-species, inter-modal, and non-normative motion reconstruction (Gong et al., 11 Dec 2025). Recent advances formalize CAMoCap as the problem of mapping a dynamic visual sequence (e.g., monocular video) and a user-supplied rigged asset (mesh plus arbitrary skeleton) to temporally coherent skeletal pose sequences suitable for direct animation.

1. Problem Formalization and Scope

Category-Agnostic Motion Capture is defined as follows: given a monocular RGB video $V = \{I_t\}_{t=1}^{T}$ and a user-supplied rigged asset $A = (\mathcal{M}, \mathcal{S}, \mathcal{I}_A)$, where $\mathcal{M}$ is the 3D mesh, $\mathcal{S} = (\mathcal{J}, \mathcal{E}, o)$ is the skeleton consisting of joints $\mathcal{J}$, kinematic edges $\mathcal{E}$, and rest-pose offsets $o_{i \to j}$, and $\mathcal{I}_A$ is a set of reference renders or images, the CAMoCap task is to output a per-frame sequence of joint rotations $\{R_t = \{R_{t,j} \in SO(3)\}_{j \in \mathcal{J}}\}_{t=1}^{T}$ such that applying $R_t$ to $A$ animates the asset in accordance with the input video (Gong et al., 11 Dec 2025).
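
To make this input/output contract concrete, the following minimal Python sketch spells out the data structures the formalization implies. All class and function names are illustrative assumptions, not interfaces from any released CAMoCap codebase.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Skeleton:
    joints: list[str]               # the J joint names
    edges: list[tuple[int, int]]    # kinematic edges E as (parent, child) index pairs
    offsets: np.ndarray             # (J, 3) rest-pose offsets o_{i -> j}

@dataclass
class RiggedAsset:
    mesh_verts: np.ndarray              # (V, 3) vertices of the mesh M
    mesh_faces: np.ndarray              # (F, 3) triangle indices
    skeleton: Skeleton                  # S = (J, E, o)
    reference_images: list[np.ndarray]  # I_A: reference renders of the asset

def camocap(video: np.ndarray, asset: RiggedAsset) -> np.ndarray:
    """Map a (T, H, W, 3) video and a rigged asset to joint rotations of
    shape (T, J, 3, 3), one SO(3) matrix per frame and per joint."""
    raise NotImplementedError  # placeholder for a full CAMoCap pipeline
```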

The central distinction from traditional approaches is that neither the input skeleton nor the motion class is fixed or assumed to belong to a known species or category. The mapping must accommodate arbitrary skeleton topology, arbitrary mesh geometry, and be robust to a wide variety of observed motions, limb configurations, and non-standard morphologies, including mobility aids and assistive devices (Hilton et al., 11 Jul 2025).

2. Foundational Architectures and Methodologies

Recent state-of-the-art CAMoCap frameworks employ various forms of prompt-driven, modular, and factorized architectures. The representative "MoCapAnything" system (Gong et al., 11 Dec 2025) comprises three key learnable modules and a constraint-aware inverse kinematics (IK) stage:

  1. Reference Prompt Encoder: This module encodes the arbitrary rigged asset's geometry and topology into per-joint query embeddings $Q = \{q_j\}$. It assimilates joint metadata (rest-pose coordinates, one-hot names), mesh samples, and optionally rendered image tokens. Attention mechanisms propagate geometric and topological context across skeleton and mesh.
  2. Video Feature Extractor: Extracts dense visual descriptors from video frames (using a frozen DINOv2 ViT) and reconstructs a deforming 3D mesh stream using monocular image-to-3D backbones. Both the per-frame image tokens and the point-cloud embeddings serve as input to the motion decoder.
  3. Unified Motion Decoder: Fuses the asset-specific joint queries and temporally windowed video features (visual and geometric) through stacked attention stages: graph-based self-attention respecting skeleton topology, cross-attention to image and geometry tokens, and temporal self-attention per joint. These stages produce per-joint 3D trajectory estimates $\hat{x}_{t,j}$.
  4. Constraint-Aware Inverse Kinematics: Converts predicted 3D trajectories to per-joint rotations by geometric initialization (axis–angle and Procrustes alignment), temporal smoothing, and optimization under bone-length and twist constraints:

$$\mathcal{L}_t = \frac{1}{N} \sum_i \left\| P_{t,i}(\theta_t) - \hat{x}_{t,i} \right\|^2 + \lambda_{\mathrm{prior}} \frac{1}{N} \sum_i \left\| \theta_{t,i} - \theta^{\mathrm{geo}}_{t,i} \right\|^2 + \lambda_{\mathrm{twist}} \frac{1}{N} \sum_i \left( \alpha_{t,i} \, (\hat{a}_{t,i} \cdot u_i) \right)^2$$
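
As a hedged illustration of how this objective can be evaluated, the PyTorch sketch below computes the three terms for a single frame. The forward-kinematics callable `fk`, the twist quantities, and the loss weights are interface assumptions rather than the authors' released code; bone lengths are preserved implicitly because `fk` poses the asset's fixed rest-pose offsets.

```python
import torch

def ik_frame_loss(theta, theta_geo, x_hat, fk, alpha, a_hat, u,
                  lam_prior=0.1, lam_twist=0.01):
    """Constraint-aware IK objective for one frame (illustrative sketch).

    theta     : (N, 3) per-joint axis-angle parameters being optimized
    theta_geo : (N, 3) geometric initialization (axis-angle / Procrustes)
    x_hat     : (N, 3) decoder-predicted 3D joint positions
    fk        : callable mapping theta -> (N, 3) joint positions via forward
                kinematics over fixed rest offsets (bone lengths constant)
    alpha     : (N,) per-joint rotation magnitudes
    a_hat     : (N, 3) unit rotation axes extracted from theta
    u         : (N, 3) unit bone directions (twist axes)
    """
    data = ((fk(theta) - x_hat) ** 2).sum(dim=-1).mean()     # match predicted trajectories
    prior = ((theta - theta_geo) ** 2).sum(dim=-1).mean()    # stay near geometric init
    twist = (alpha * (a_hat * u).sum(dim=-1)).pow(2).mean()  # suppress twist about the bone
    return data + lam_prior * prior + lam_twist * twist
```

In practice each $\theta_t$ would then be optimized per frame, for example with Adam or L-BFGS, after temporal smoothing of the initialization.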

Alternative approaches leverage fully implicit representations with learned skeletons and deformation anchors extracted from monocular videos (Kuai et al., 2023), or disentangled blob-based models where pose and identity parameters are partitioned and manipulated independently for articulated 3D shape control (He et al., 26 May 2025).

3. Canonical Representations: Skeletons, Deformation, and Rigging

CAMoCap systems exploit explicit and implicit kinematic representations:

  • Explicit Skeletons: Models such as MoCapAnything use skeleton graphs with variable topology, with the joint and edge configuration taken directly from the rigged asset (Gong et al., 11 Dec 2025).
  • Implicit Kinematic Chains: The CAMM approach (Kuai et al., 2023) learns a canonical implicit surface (a signed distance function), infers a kinematic tree (via RigNet on the extracted mesh), and assigns deformation anchors associated with skeleton links, enabling linear blend skinning (LBS) for animation.
  • Blob-Based Deformation: In CANOR (He et al., 26 May 2025), objects are represented as a sparse set of feature-embedded "blobs." Each blob's parameters (center, rotation, size, activation, local feature) disentangle pose from instance identity. Surface occupancy is predicted by voxelizing and blending these blobs through attention and MLPs.

All paradigms ultimately support pose editing, either by manipulating blob positions (He et al., 26 May 2025), reposing kinematic chains (Kuai et al., 2023), or driving arbitrary skeletons via decoded joint rotations (Gong et al., 11 Dec 2025).
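
Because these paradigms ultimately drive the surface through linear blend skinning (or a blob-wise analogue of it), a compact NumPy sketch of LBS clarifies the shared mechanism. The array shapes and the homogeneous 4x4 transform convention here are illustrative assumptions.

```python
import numpy as np

def linear_blend_skinning(verts, weights, bone_transforms):
    """Deform rest-pose vertices by a weighted blend of bone transforms.

    verts           : (V, 3) rest-pose vertex positions
    weights         : (V, J) skinning weights, each row summing to 1
    bone_transforms : (J, 4, 4) transforms mapping each bone's rest pose
                      to its current pose
    returns         : (V, 3) deformed vertex positions
    """
    V = verts.shape[0]
    verts_h = np.concatenate([verts, np.ones((V, 1))], axis=1)   # homogeneous coords
    # Blend the 4x4 transforms per vertex: (V, J) @ (J, 16) -> (V, 4, 4)
    blended = (weights @ bone_transforms.reshape(-1, 16)).reshape(V, 4, 4)
    deformed = np.einsum('vij,vj->vi', blended, verts_h)         # apply per-vertex transform
    return deformed[:, :3]
```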

4. Category-Agnostic Calibration and Body-Agnostic Skeleton Estimation

Wearable-based CAMoCap solutions must support arbitrary, potentially non-normative body and object structures. The EqualMotion system (Hilton et al., 11 Jul 2025) realizes this by:

  • Allowing arbitrary IMU placement analogous to a node-based, topology-agnostic skeleton, including user-customized labels (e.g., "left crutch," "wheel hub").
  • Supporting calibration from any comfortable posture (not just T-/A-poses), inferring sensor-to-segment alignment and joint centers through least-squares or sphere-fitting over motion trajectories; a minimal sphere-fit sketch appears at the end of this section.
  • Representing skeletons as mutable directed graphs, with parent-child edge inference driven by observed movement correlation, rather than fixed segment lists.

These principles enable inclusive use across diverse anatomies, prosthetics, and mobility aids, validated in workflows involving wheelchairs, crutches, and non-standard joint structures (Hilton et al., 11 Jul 2025).
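
The sphere-fitting calibration mentioned above reduces to a linear least-squares problem: samples of a sensor rigidly attached to a segment rotating about an approximately fixed joint center lie on a sphere around that center. Below is a minimal NumPy sketch under that assumption; the names are hypothetical, not EqualMotion's API.

```python
import numpy as np

def fit_joint_center(points):
    """Estimate a joint center by least-squares sphere fitting.

    points : (N, 3) trajectory of a sensor on a segment rotating about
             an approximately fixed joint center.
    returns: (center (3,), radius float)

    Uses the algebraic identity ||p - c||^2 = r^2, which is linear in
    (c, k) with k = r^2 - c.c:  2 p.c + k = p.p  for every sample p.
    """
    A = np.concatenate([2.0 * points, np.ones((points.shape[0], 1))], axis=1)
    b = (points ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)   # solve for [c, k]
    center, k = sol[:3], sol[3]
    radius = np.sqrt(k + center @ center)
    return center, radius
```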

5. Training, Optimization, and Evaluation Protocols

CAMoCap training objectives emphasize composite reconstruction quality, in terms of both geometric accuracy and temporal/structural plausibility:

  • Reconstruction Losses: Minimization of per-frame $L_1$ error between predicted and ground-truth joint positions, or binary cross-entropy between predicted and ground-truth occupancy grids in blob-based methods (Gong et al., 11 Dec 2025, He et al., 26 May 2025).
  • Temporal Coherence: Losses enforcing smooth evolution of blob centers and orientations, or temporal consistency of joint trajectories (He et al., 26 May 2025, Gong et al., 11 Dec 2025); a simple second-order form is sketched after this list.
  • Feature Matching: Canonical per-point features in implicit models are matched to pretrained 2D descriptors (e.g., DINO-ViT) via correspondence and reprojection losses (Kuai et al., 2023), promoting cross-modal alignment.
  • IK and Rotation Constraints: Loss terms penalize deviation from geometric IK initialization, excessive twist, and constraint violations in the asset-specific skeleton (Gong et al., 11 Dec 2025).
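
Of these, the temporal-coherence term admits a particularly simple form. The PyTorch sketch below penalizes second-order finite differences (accelerations) of the predicted joint trajectories; the exact formulation varies across the cited papers, so treat this as one plausible instantiation.

```python
import torch

def temporal_coherence_loss(traj):
    """Penalize frame-to-frame acceleration of joint trajectories.

    traj : (T, J, 3) predicted joint positions for one asset
    """
    accel = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]  # second-order differences
    return accel.pow(2).sum(dim=-1).mean()
```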

Evaluation is conducted on diverse datasets encompassing both synthetic and real-world motions. Relevant metrics include mean per-joint position error (MPJPE), mean per-joint velocity error (MPJVE), symmetric Chamfer distance (CD) between reconstructed and ground-truth keypoints, intersection-over-union (IoU) for mesh accuracy, and F-score at various thresholds (Gong et al., 11 Dec 2025, Kuai et al., 2023, He et al., 26 May 2025). Specialized datasets such as Truebones Zoo enable cross-species, cross-asset validation (Gong et al., 11 Dec 2025).
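
For concreteness, the standard position, velocity, and point-set metrics can be computed as below. This is a generic sketch of MPJPE, MPJVE, and symmetric Chamfer distance, not code from any of the cited papers.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over (T, J, 3) pose sequences."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjve(pred, gt, fps=30.0):
    """Mean per-joint velocity error via finite-difference velocities."""
    v_pred = np.diff(pred, axis=0) * fps          # (T-1, J, 3)
    v_gt = np.diff(gt, axis=0) * fps
    return np.linalg.norm(v_pred - v_gt, axis=-1).mean()

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```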

Method             DeformingThings4D (IoU ↑)   FaMoS (IoU ↑)
KeypointDeformer   0.536                       0.923
NeuralDefGraph     0.875                       0.800
SkeRig             0.802                       0.790
CANOR              0.937                       0.960

CANOR demonstrates state-of-the-art mesh reconstruction and deformation fidelity across diverse categories (He et al., 26 May 2025).

6. Applications, Strengths, and Limitations

CAMoCap advances have broad implications:

  • Prompt-Driven Animation: Enables animating arbitrary 3D assets—across species, morphology, or device—directly from monocular video or sparse sensors, with intended applications in VFX, gaming, avatar prototyping, and accessibility (Gong et al., 11 Dec 2025, Hilton et al., 11 Jul 2025).
  • Generalization Beyond Templates: Category-agnostic pipelines sidestep restrictive template priors. Systems such as CAMM and EqualMotion operate without shape templates or fixed skeletal priors, generalizing to robots, animals, and impaired/augmented bodies (Kuai et al., 2023, Hilton et al., 11 Jul 2025).
  • Handling Non-Normative Morphologies: Inclusive calibration and skeletal modeling in EqualMotion support full expressivity for disabled or non-normative users, including integration of mobility aids as first-class kinematic segments (Hilton et al., 11 Jul 2025).
  • Direct Pose Manipulation: Outputs are not limited to tracking; users can explicitly re-pose or edit skeletons and rigs, facilitating downstream motion synthesis and editing (Kuai et al., 2023, Gong et al., 11 Dec 2025).

Key limitations include dependence on the quality of video-derived cues (optical flow, masks, root pose estimates), potential nonphysical poses in learned skeletons, and limited generalization to motions far outside observed training trajectories (Kuai et al., 2023). CAMoCap’s flexibility may result in skeletons or rigs that admit self-penetration or implausible joint configurations unless further physically inspired priors are incorporated.

7. Future Directions and Open Challenges

Open research challenges and future directions articulated across the field include:

  • Physics- and Collision-Integrated Models: Incorporation of plausible motion priors, collision avoidance, and articulated-body dynamics to prevent nonphysical poses and improve robustness (Kuai et al., 2023).
  • End-to-End Joint Topology Learning: Automated inference of optimal skeleton structure and connectivity from video or sensor input, without manual initialization or intervention (Kuai et al., 2023).
  • Enhanced Deformation Models: Extending beyond linear blend skinning to richer deformation bases (e.g., blend shapes, displacement fields), supporting greater expression and out-of-distribution pose generalization (Kuai et al., 2023).
  • Background and Multi-Object Reasoning: Generalizing CAMoCap to dynamic, cluttered scenes, multi-object interactions, and deformable/non-articulated entities (cloth, fluids) (Kuai et al., 2023).
  • Benchmark Datasets: Expansion of standardized, cross-category datasets such as Truebones Zoo to support more challenging, real-world, and assisted-mobility scenarios (Gong et al., 11 Dec 2025).

CAMoCap continues to evolve at the intersection of computer vision, graphics, and biomechanics, providing the core infrastructure for scalable, inclusive, and asset-agnostic 3D motion reconstruction.
