MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
Abstract: Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
Explain it Like I'm 14
Overview
This paper is about a new way to turn a regular video of a person or animal moving into a 3D animation that can drive any character’s skeleton (like a game character, a robot, a cartoon animal). The method, called MoCapAnything V2, focuses on capturing motion from a single camera view and producing “animation-ready” joint rotations for many different skeleton types, not just humans.
What problem are they trying to solve?
When you animate a character, you need to tell every joint (like shoulders, knees, wings) how to rotate over time. Many older methods first guess where joints are in 3D (positions), then use math to convert positions into rotations. But there’s a big catch:
- The same joint positions can come from different rotations depending on how the skeleton is defined. Think of it like different characters having different ideas of what “up” and “forward” mean for each joint. Without knowing a joint’s local “compass,” some rotation details (like twisting along a bone) are impossible to recover from positions alone.
These older methods also often use a separate step called inverse kinematics (IK) that isn’t learnable end-to-end, so the system can’t fully adjust itself to produce the best final animation. Some pipelines even rely on predicting a 3D mesh (surface) first, which can add noise and slow things down.
Key idea and goals, in simple terms
The researchers wanted to:
- Make a single, learnable pipeline that goes from video to joint positions to joint rotations, all trained together.
- Remove the need for heavy 3D meshes in the middle to make the system faster and more stable.
- Solve the “rotation is ambiguous” problem by giving the model a clear sense of each skeleton’s local directions.
Their main trick: give the model one “reference example” from the target character—a frame where both the joint positions and the correct rotations are known—plus the character’s rest pose (like a T-pose). This single reference anchors the local axes for each joint, turning a confusing problem into a well-defined one the model can learn.
How does it work? (Methods explained with everyday analogies)
Think of the process as teaching a smart puppet controller:
- Step 1: Video to Pose
- The model watches the input video and learns to predict where each joint is in 3D over time (positions).
- It uses a “reference frame” from the target character to understand the layout of the skeleton (which joints exist and how they’re connected).
- No predicted mesh is used—just visual features and skeleton information—making it simpler and faster.
- Step 2: Pose to Rotation
- Now the model needs the exact joint rotations that will make the target character move like the video.
- This is the tricky part: positions alone don’t tell you twist. To fix this, the model also looks at:
- The rest pose: the static pose defining where bones start from.
- The reference pair: one example frame for this skeleton that shows “these positions correspond to these rotations.”
- With these, the model learns a consistent way to map positions to rotations, even for skeletons it has never seen before.
To think about “local axes,” imagine each joint has its own tiny compass. The rest pose sets the compass’s location; the reference pair tells the model which way the compass points. Then the model can correctly figure out rotations, including twist.
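The twist ambiguity described above can be checked numerically. The following is a minimal sketch, not the paper's code: the bone offset, the angles, and the helper `axis_angle_matrix` are all illustrative assumptions. It shows two different parent rotations that place the child joint at exactly the same position, which is why positions alone cannot determine twist.

```python
import numpy as np

def axis_angle_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    x, y, z = axis
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# Hypothetical bone: the child joint sits 0.3 units along the parent's local x axis.
bone_offset = np.array([0.3, 0.0, 0.0])

# A parent rotation that swings the bone by 40 degrees about z.
swing = axis_angle_matrix([0.0, 0.0, 1.0], np.deg2rad(40.0))

# The same swing combined with a 90-degree twist about the bone's own axis.
twist = axis_angle_matrix(bone_offset, np.deg2rad(90.0))
swing_with_twist = swing @ twist

child_a = swing @ bone_offset
child_b = swing_with_twist @ bone_offset

same_position = bool(np.allclose(child_a, child_b))       # child lands in the same place
same_rotation = bool(np.allclose(swing, swing_with_twist))  # but the rotations differ
print(same_position, same_rotation)
```

A position-only IK solver sees `child_a` and `child_b` as identical, so it needs extra information (here, the rest pose and the reference pair) to pick the right rotation.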
Under the hood, the model uses an attention mechanism that understands skeleton structure:
- Local attention follows kinematic chains (like shoulder → elbow → wrist) to reason about limbs.
- Global attention looks across the whole body (or wings, tail, etc.) to coordinate bigger motions. This combination helps it handle very different skeletons, from humanoids to birds or quadrupeds.
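One way to picture the local and global attention patterns is as boolean masks built from the skeleton's parent list. The sketch below is an assumption for illustration only: the summary does not specify how GL-GMHA defines its local neighborhoods, so the hop-distance rule, the `max_hops` parameter, and the function names are hypothetical.

```python
import numpy as np

def hop_distances(parents):
    """All-pairs hop distance on a skeleton tree given parent indices (-1 = root)."""
    n = len(parents)
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0.0)
    for child, parent in enumerate(parents):
        if parent >= 0:
            dist[child, parent] = dist[parent, child] = 1.0
    # Floyd-Warshall; adequate for skeletons with at most a few hundred joints.
    for k in range(n):
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist

def attention_masks(parents, max_hops=2):
    """Return (local_mask, global_mask); True means joint i may attend to joint j."""
    dist = hop_distances(parents)
    local_mask = dist <= max_hops          # assumed rule: nearby joints on the tree
    global_mask = np.ones_like(local_mask, dtype=bool)  # every joint sees every joint
    return local_mask, global_mask

# Toy skeleton: 0 root, 1 shoulder, 2 elbow, 3 wrist, 4 hip, 5 knee.
parents = [-1, 0, 1, 2, 0, 4]
local_mask, global_mask = attention_masks(parents, max_hops=2)

# The wrist attends to itself, the elbow, and the shoulder locally,
# and reaches the knee only through the global mask.
print(local_mask[3].astype(int), global_mask[3].astype(int))
```

Because the masks depend only on the parent list, the same code applies to a biped, a quadruped, or a bird rig without modification.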
They also use “end-to-end training,” meaning the whole pipeline is trained together. If the rotation part needs better poses, it can nudge the pose part to improve in the right way. This joint learning avoids the mismatches that happen when parts are trained separately.
Main results and why they matter
The researchers tested their method on:
- Truebones Zoo: lots of animal motions, including species the model did not see during training (unseen).
- Objaverse: a collection of varied 3D assets, also unseen.
Key outcomes:
- Rotation accuracy improved a lot. Older pipelines often had average rotation errors around 17–20 degrees. MoCapAnything V2 reduced that to about 10 degrees, and down to 6.54 degrees on unseen skeletons.
- Fewer visual glitches. The new method avoids artifacts like joints “spinning” or limbs “flipping” because it correctly recovers twist and uses the reference anchor.
- Speed-up. By removing the mesh step, the system is roughly 20 times faster than mesh-based pipelines.
- Strong generalization. Thanks to the explicit pose intermediate and the reference anchoring, it works well even on skeletons and assets it hasn’t seen before.
Why this is important
- For animation and games: It makes it much easier to drive any rigged character—human, animal, fantasy creature—from plain videos, with clean, stable joint rotations ready for production.
- For VR/AR and film: Faster, more reliable motion capture from single-camera footage can cut costs and speed up workflows.
- For robotics and research: Better understanding of motion across different body structures can help control diverse robots or analyze animal movement.
In short, MoCapAnything V2 shows that giving the model a single example of how a particular skeleton interprets motion (the reference pair), plus training everything together end-to-end, solves a long-standing problem: turning video into high-quality, animation-ready rotations for arbitrary skeletons quickly and reliably.
Knowledge Gaps
Below is a concise, actionable list of the paper’s unresolved gaps, limitations, and open questions that future work could address:
- Dependency on a single reference pose-rotation pair: how to select this frame optimally; how many anchors are minimally sufficient; and when does one anchor fail to resolve axis/twist ambiguity (e.g., joints that do not move in the reference pose).
- Robustness to noisy or incorrect anchors: quantify sensitivity to errors in the reference rotations, mis-specified rest poses, or small rigging mistakes; develop anchor validation/outlier detection.
- Multiple anchors or an anchor bank: investigate learned retrieval of the most relevant reference(s) per frame (pose-similarity driven) vs. fixed single-anchor conditioning, and the compute–accuracy trade-off.
- Reliance on per-joint semantic names (T5 embeddings): assess sensitivity to missing/incorrect/inconsistent naming (different languages, abbreviations), and design topology-only or learned-correspondence alternatives.
- Automatic joint correspondence: when semantic labels are absent or unreliable, infer joint mapping between canonical and asset skeletons and quantify the effect on rotation accuracy.
- Handling rigs with constrained DOFs, joint limits, or unusual axis orders: ensure predicted rotations respect per-joint ranges and per-joint DOFs (1/2-DOF hinges), and evaluate limit-violation rates.
- Non-tree and production rigs: support for constraints (e.g., IK handles), cyclic graphs, auxiliary bones, stretchy bones, or multiple roots commonly found in DCC/game pipelines.
- Static-joint handling: define a test-time method to detect position-static and rotation-static joints without ground-truth sequences and quantify the impact of misclassification.
- Root trajectory and world scale: recover absolute root translation/orientation and scale under moving cameras; currently normalized to a cube, which limits real-world deployment.
- Camera motion and calibration: disentangle camera from subject motion, estimate intrinsics/extrinsics, and assess performance on handheld/egocentric footage.
- Contact and physics: integrate contact-aware or physics-based priors (e.g., foot-ground constraints) to reduce foot sliding and improve physical plausibility; add contact metrics.
- Long-horizon temporal modeling: study performance beyond T=48 frames, drift over minutes-long sequences, memory scaling, and whether the model can operate causally for streaming, real-time use.
- Occlusions and clutter: systematically evaluate robustness to heavy occlusions, motion blur, fast movements, and adverse lighting; introduce targeted augmentations or occlusion reasoning.
- Multi-person and interactions: extend to multiple actors, inter-body occlusions, and hand–object/foot–terrain contacts; evaluate on interaction-heavy datasets.
- High-DOF articulation: assess performance on fine-grained hands/face and highly segmented appendages (tails, tentacles, wings), and on skeletons exceeding 150 joints.
- Generalization to standard human benchmarks: quantify on Human3.6M, 3DPW, AMASS, etc., including hand/finger rotations and diverse activities (dance, sports).
- Domain gap to real videos: current supervision relies on asset-sourced rotations; explore weak/self-supervision (2D reprojection, cycle consistency) to reduce reliance on high-fidelity rotation labels.
- Visual encoder choices: compare frozen vs. fine-tuned DINOv2 and modern video backbones; study domain adaptation and its effect on pose/rotation robustness.
- Alternative structure encoders: benchmark GL-GMHA against other graph-transformers/GNNs; evaluate robustness to topology errors or missing edges in the skeleton graph.
- Uncertainty estimation: output per-joint confidence for rotations/twist to enable downstream fallbacks or human-in-the-loop correction; calibrate uncertainties.
- Perceptual/production metrics: supplement angle errors with animator preference studies, game-engine integration tests, edit-ability, and contact/constraint adherence scores.
- Runtime and memory profiling: report absolute fps, latency, and memory across skeleton sizes on commodity GPUs/CPUs; characterize scaling with sequence length and asset complexity.
- Rig-compatibility and parameterizations: ensure export to diverse DCC/game systems (Euler, quaternion, axis orders), with automatic projection to rig DOFs and constraint satisfaction.
- Theoretical characterization: formally state conditions under which a single reference pair uniquely determines local axes across a kinematic chain; identify provable failure cases and remedies.
Practical Applications
Immediate Applications
The findings and innovations in MoCapAnything V2 enable several practical use cases that can be deployed today. Below is a concise set of applications, each with sector linkage, potential tools/products/workflows, and feasibility notes.
- Media and entertainment: rapid video-to-animation retargeting for arbitrary rigs
- Sector: VFX, gaming, animation
- What: Drive any rigged asset (humanoid, quadruped, bird, fantasy creatures) directly from monocular video using the end-to-end V->P->R pipeline and reference pose-rotation conditioning.
- Tools/products/workflows:
- Blender/Maya/Unreal/Unity plugin that ingests a video and a rigged asset (with rest pose + one reference pose-rotation frame) and outputs animation-ready joint rotations (e.g., FBX/Quat/Euler conversion from predicted 6D rotations).
- Cloud API/service offering “any-rig motion capture from video” with export to common DCC/game-engine formats.
- Previsualization tools for directors to block scenes using quick video captures without marker suits.
- Assumptions/dependencies:
- Target asset must be rigged (tree-structured skeleton with single root) and provide a rest pose and at least one reference pose-rotation pair.
- Rotation error (~10° average; ~6.5° on unseen skeletons) is acceptable for production or can be refined via minor cleanup.
- Input videos with reasonable visibility; severe occlusion or extreme fast motion may reduce accuracy.
- AR/VR avatars from webcam/smartphone video
- Sector: AR/VR, social platforms
- What: Real-time or near-real-time avatar animation from a single camera feed, mapped to diverse user-selected avatars.
- Tools/products/workflows:
- Desktop/mobile app that animates avatars live in social VR or streaming platforms; uses the GL-GMHA backbone and reference conditioning to minimize artifacts (e.g., joint spinning).
- Integration into virtual meetings for expressive avatars.
- Assumptions/dependencies:
- Performance depends on hardware; the reported ~20× inference speedup is relative to mesh-based methods, not an absolute frame rate, so mobile deployment may need model distillation/optimization.
- Stable lighting, limited occlusion, and consistent framing improve reliability.
- Game modding and user-generated content (UGC)
- Sector: gaming, creator economy
- What: Allow players and creators to animate in-game characters (including nonhumanoid) from short smartphone videos.
- Tools/products/workflows:
- Modding toolkit that retargets motions to custom rigs; a creator uploads a rig and provides one reference pose-rotation frame.
- Marketplace feature to share motion clips retargeted to different assets.
- Assumptions/dependencies:
- Rig quality (hierarchy correctness, joint naming/semantics) impacts results; joint semantic labels are used via T5 embeddings.
- Sports and performance analysis (non-clinical)
- Sector: sports tech, coaching
- What: Quick motion capture for technique review from a single camera, including animals (e.g., equestrian/canine gait).
- Tools/products/workflows:
- Coaching app that reconstructs 3D joint trajectories and rotations for form assessment; exports analytics on angular velocity and coherence metrics (leveraging reduced angular velocity error).
- Assumptions/dependencies:
- Not medical grade; normalized scale in training/evaluation (to a 1 m³ cube) means absolute dimensions require calibration; suitable for qualitative/relative assessments rather than precise biomechanics.
- Wildlife and animal research data collection
- Sector: ecology, veterinary research
- What: Extract 3D kinematics of animals from field videos to study gait and behavior; supports arbitrary skeletons with species-specific rigs.
- Tools/products/workflows:
- Research pipeline that pairs species-specific rig templates with monocular recordings to generate motion datasets.
- Assumptions/dependencies:
- Requires curated skeletal templates and at least one pose-rotation reference per species; field conditions (occlusion, motion blur) may necessitate post-processing or multi-view augmentation.
- Post-production cleanup and IK enhancement tools
- Sector: animation tooling
- What: Use the learned Pose->Rotation module as a data-driven IK stage to resolve twist and reduce artifacts in traditional pipelines.
- Tools/products/workflows:
- DCC plugin that takes joint positions (from any source) and produces rotations conditioned on rest pose + reference anchor, reducing twist ambiguity without manual constraints.
- Assumptions/dependencies:
- Works best with correctly labeled joints and valid rest-pose offsets; rotation-static/position-static handling should be supported in the rig.
- Synthetic motion dataset generation
- Sector: software/ML
- What: Generate diverse motion sequences for training downstream models (e.g., control policies, animation synthesis) from easily collected videos.
- Tools/products/workflows:
- Batch processing pipeline that converts web-scale videos to standardized motion datasets across many skeletons.
- Assumptions/dependencies:
- Licensing/rights management for source videos; standardization of joint semantics across rigs is needed for aggregation.
- Academic benchmarking in arbitrary-skeleton motion capture
- Sector: academia (computer vision/graphics/ML)
- What: Establish baselines for end-to-end rotation recovery and cross-skeleton generalization; assess GL-GMHA and reference conditioning on new datasets.
- Tools/products/workflows:
- Open-source evaluation harness with the reported loss terms (Lpos, Lrot, Lrot_v, Lroot), mixed-pose training schedule, and ablations on rest/reference conditioning.
- Assumptions/dependencies:
- Access to representative datasets (e.g., Truebones Zoo, Objaverse) and GPU resources; reproducible joint semantic embeddings.
- Content moderation and rights verification for motion capture
- Sector: policy/compliance (platforms)
- What: Immediate policy guidance where platforms host user motion captures derived from public videos.
- Tools/products/workflows:
- Platform policies clarifying acceptable use of third-party video for motion extraction; flagging data lineage and consent requirements.
- Assumptions/dependencies:
- Legal frameworks vary by jurisdiction; implementable as terms-of-service updates without awaiting regulation.
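Several of the workflows above rely on exporting the predicted 6D rotations to formats that DCC tools and game engines accept. The following is a minimal sketch of that conversion, assuming the 6D vector stores the first two columns of the rotation matrix and is orthonormalized with Gram-Schmidt, in the style of the 6D representation of Zhou et al. 2019. The function names and the column layout are assumptions; the paper's exact convention is not given in this summary.

```python
import numpy as np

def rotation_6d_to_matrix(d6):
    """Gram-Schmidt: map a 6-vector (two 3D columns) to a 3x3 rotation matrix."""
    a1, a2 = np.asarray(d6[:3], float), np.asarray(d6[3:], float)
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                   # right-handed third axis
    return np.stack([b1, b2, b3], axis=1)   # b1, b2, b3 become the columns

def matrix_to_quaternion(R):
    """Convert a 3x3 rotation matrix to a unit quaternion ordered (w, x, y, z)."""
    trace = np.trace(R)
    q = np.zeros(4)
    if trace > 0.0:
        s = 2.0 * np.sqrt(trace + 1.0)
        q[0] = 0.25 * s
        q[1] = (R[2, 1] - R[1, 2]) / s
        q[2] = (R[0, 2] - R[2, 0]) / s
        q[3] = (R[1, 0] - R[0, 1]) / s
    else:
        # Pick the largest diagonal entry to keep the square root well-conditioned.
        i = int(np.argmax(np.diag(R)))
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + R[i, i] - R[j, j] - R[k, k])
        q[0] = (R[k, j] - R[j, k]) / s
        q[1 + i] = 0.25 * s
        q[1 + j] = (R[j, i] + R[i, j]) / s
        q[1 + k] = (R[k, i] + R[i, k]) / s
    return q / np.linalg.norm(q)

# A deliberately unnormalized, non-orthogonal 6D vector, as a network might output.
raw_6d = np.array([0.0, 2.0, 0.0, -1.0, 0.5, 0.0])
R = rotation_6d_to_matrix(raw_6d)
quat = matrix_to_quaternion(R)
print(R.round(3))
print(quat.round(4))
```

The quaternion ordering (w first) is also a choice made here; exporters differ, so the component order should be checked against the target tool before writing files.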
Long-Term Applications
These applications are compelling but require further research, engineering, scaling, or validation before widespread deployment.
- Real-time teleoperation and imitation for diverse robots
- Sector: robotics, industrial automation
- What: Map human or animal motions from monocular video onto robot joints with different kinematics in real time (imitation learning; teleoperation).
- Tools/products/workflows:
- Motion retargeter that translates predicted rotations to robot-specific joint limits, dynamics, and control policies.
- Assumptions/dependencies:
- Requires robust safety layers, absolute scale calibration, hardware constraints modeling, collision avoidance, and potentially multi-view or sensor fusion for reliability.
- Clinical-grade remote rehabilitation and gait analysis
- Sector: healthcare/medtech
- What: Remote motion assessment (post-stroke rehab, musculoskeletal disorders, fall-risk analysis) from a single camera.
- Tools/products/workflows:
- Medical software that provides quantitative kinematics and progression tracking; integrated with EMR systems.
- Assumptions/dependencies:
- Regulatory clearance, rigorous validation on clinical datasets, absolute metric calibration (beyond normalized scale), and bias/occlusion robustness.
- Live sports broadcast: monocular 3D motion for multiple actors
- Sector: media/sports analytics
- What: Real-time multi-person motion capture from limited camera angles for augmented broadcast insights.
- Tools/products/workflows:
- On-the-fly retargeting to standardized athlete skeletons; visualization overlays (angles, velocities).
- Assumptions/dependencies:
- Multi-actor tracking, occlusion handling, synchronization across cameras, latency budgets, and high-throughput inference.
- Wildlife digital twins and conservation simulators
- Sector: ecology, simulation
- What: Build high-fidelity digital twins of species with validated kinematics for conservation planning and behavioral simulation.
- Tools/products/workflows:
- Libraries of species-specific rigs with standardized semantics, long-tail motion datasets captured in the wild, simulation engines.
- Assumptions/dependencies:
- Extensive species-specific rig curation, long-duration motion capture in challenging environments, domain partnerships.
- Universal motion retargeter across ecosystems
- Sector: software platforms, creator economy
- What: A cross-ecosystem standard/service to retarget motion between any rig (movies, games, AR, robotics).
- Tools/products/workflows:
- Interoperable format and APIs for joint semantics, rest poses, reference anchors, and rotation conventions; automated rig mapping tools.
- Assumptions/dependencies:
- Industry alignment on schemas/standards and tooling for joint naming and axis conventions; open consortium or vendor cooperation.
- Mixed reality training and simulation for emergency response
- Sector: public safety, defense
- What: Scenario training with rapidly captured human motions retargeted to avatars in MR environments (e.g., crowd movement, evacuation drills).
- Tools/products/workflows:
- Curriculum creation tools that turn reference videos into dynamic simulations; analytics on motion patterns.
- Assumptions/dependencies:
- Multi-agent capture, scalability, privacy-preserving data handling, scenario validation.
- Edge/mobile deployment with on-device inference
- Sector: mobile/embedded
- What: Low-latency inference on smartphones/AR glasses for consumer-grade motion capture.
- Tools/products/workflows:
- Compressed/distilled versions of the DINOv2-based encoder and GL-GMHA modules; hardware-aware optimizations (NNAPI, Core ML, GPU).
- Assumptions/dependencies:
- Model optimization, quantization, energy constraints; maintaining rotation quality post-compression.
- Standards and governance for motion data provenance
- Sector: policy/standards
- What: Formal standards for motion data provenance, consent, and licensing across entertainment, sports, and research.
- Tools/products/workflows:
- Metadata schemas capturing source video rights, retargeting chain, skeleton semantics, and transformations; audit tools.
- Assumptions/dependencies:
- Multi-stakeholder coordination (studios, platforms, researchers), alignment with privacy laws.
- Animal–human motion translation for human–robot interaction research
- Sector: HRI/AI research
- What: Use cross-species generalization (e.g., quadruped ↔ humanoid) to study transferable locomotion and control.
- Tools/products/workflows:
- Research frameworks that leverage the explicit pose intermediate for learning invariant motion primitives across skeletons.
- Assumptions/dependencies:
- Broader datasets, robust cross-domain semantics, and validated mappings between dissimilar kinematics.
- Generative animation and motion editing powered by learned Pose->Rotation
- Sector: creative AI, tooling
- What: Combine learned rotation priors with generative models to synthesize or edit motions while respecting rig conventions.
- Tools/products/workflows:
- Motion editors that can “paint” desired positions and let the model infer plausible rotations (including twist), with temporal consistency (Lrot_v).
- Assumptions/dependencies:
- Further research on controllability, editing semantics, and user interfaces; expanded datasets for diverse styles.
Notes on cross-cutting assumptions:
- Rigging requirements: a consistent rest pose, valid bone offsets, and at least one reference pose-rotation pair are central to resolving coordinate ambiguity.
- Semantics and naming: joint labels inform generalization (via T5 embeddings); mismatches or missing semantics can degrade results.
- Scale and calibration: the training normalization to a 1 m³ cube necessitates calibration for applications needing absolute measurement (robotics, clinical).
- Input quality: severe occlusions, extreme lighting, or motion blur may require multi-view, additional sensors, or robustification.
- Compute: while the pipeline is ~20× faster than mesh-based baselines, practical throughput depends on hardware and model optimization for the target platform.
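The scale and calibration note can be made concrete with a small example. This is a sketch under stated assumptions: the summary says only that motion is normalized to a cube, so the centering rule, the choice of the largest extent as the scale, and the idea of calibrating from one bone of known length are illustrative, and both function names are hypothetical.

```python
import numpy as np

def normalize_to_unit_cube(joints):
    """Center joints of shape (T, J, 3) and scale so the largest extent equals 1."""
    flat = joints.reshape(-1, 3)
    lo, hi = flat.min(axis=0), flat.max(axis=0)
    center = 0.5 * (lo + hi)
    scale = float((hi - lo).max())
    return (joints - center) / scale, center, scale

def metric_scale_from_bone(normalized, joint_a, joint_b, known_length_m):
    """Estimate meters per normalized unit from one bone of known real length."""
    bone = normalized[:, joint_a] - normalized[:, joint_b]
    mean_length = np.linalg.norm(bone, axis=-1).mean()
    return known_length_m / mean_length

# Toy sequence in meters: 2 frames, 3 joints stacked 0.5 m apart, moving 2 m along x.
joints_m = np.array([
    [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 1.0, 0.0]],
    [[2.0, 0.0, 0.0], [2.0, 0.5, 0.0], [2.0, 1.0, 0.0]],
])
normalized, center, scale = normalize_to_unit_cube(joints_m)

# Suppose the bone between joints 0 and 1 is measured as 0.5 m on the real subject.
meters_per_unit = metric_scale_from_bone(normalized, 0, 1, known_length_m=0.5)
recovered = normalized * meters_per_unit + center
print(meters_per_unit, np.allclose(recovered, joints_m))
```

Absolute position (the `center` term) is not recoverable from a bone length alone; it is reused here only to show that the scale factor round-trips.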
Glossary
- 6D rotation representation: A continuous 6-parameter encoding of 3D rotations that avoids singularities of Euler angles. "each $r_t \in \mathbb{R}^{J \times 6}$ is parameterized as a 6D rotation representation [Zhou et al. 2019]"
- analytical inverse-kinematics (IK): A closed-form or iterative geometric method that recovers joint rotations from positions using predefined constraints. "an analytical inverse-kinematics (IK) stage recovers joint rotations."
- angular velocity error: A metric measuring the difference in rotational velocities between predicted and ground-truth joint rotations. "AngV Err (angular velocity error, °)."
- arbitrary-skeleton motion capture: Motion capture that targets diverse, user-specified rig structures rather than a fixed human model. "arbitrary-skeleton motion capture from monocular video"
- bone-axis twist: Rotation around a bone’s longitudinal axis that is not determined by joint positions alone. "since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous,"
- category-agnostic pose estimation (CAPE): Methods that predict keypoints for unseen object categories by matching support-query representations. "category-agnostic pose estimation (CAPE) methods"
- coordinate-system anchor: A reference signal that fixes the axes of the local coordinate frames for interpreting rotations. "this reference acts as an explicit coordinate-system anchor,"
- cross-skeleton generalization: The ability of a method to transfer across different skeleton structures without retraining. "This decomposition improves cross-skeleton generalization by leveraging pose as a shared intermediate representation."
- DINOv2: A vision transformer pre-trained in a self-supervised manner used here as a frozen feature encoder. "extracted by a frozen DINOv2 [Oquab et al. 2023] encoder"
- FiLM modulation: Feature-wise linear modulation that conditions a network on auxiliary signals via learned scale and shift. "via FiLM modulation [Perez et al. 2018]"
- forward kinematics (FK): Computing global joint positions from local joint rotations along the skeleton hierarchy. "applying forward kinematics (FK) with R on S reproduces the motion seen in the video."
- geodesic angular error: The shortest-angle distance on SO(3) between predicted and ground-truth rotations. "Lrot measures the geodesic angular error between predicted and ground-truth rotations"
- Global-Local Graph-guided Multi-Head Attention (GL-GMHA): An attention mechanism alternating between kinematic-chain-local and global joint connectivity, guided by the skeleton graph. "we introduce a skeleton-aware attention mechanism, Global-Local Graph-guided Multi-Head Attention (GL-GMHA),"
- Graph-guided Multi-Head Attention (GMHA): Multi-head attention biased by graph structure (e.g., joint connectivity and distances). "Building upon Graph-guided Multi-Head Attention (GMHA) [Gat et al. 2025], we incorporate graph-derived joint relations,"
- kinematic chains: Sequences of joints connected by bones forming articulated limbs used to model local dependencies. "Local layers restrict attention along kinematic chains to model intra-limb dependencies,"
- learnable inverse kinematics module: A neural, data-driven Pose-to-Rotation component that replaces analytical IK and supports end-to-end training. "The pose-to-rotation stage is formulated as a learnable inverse kinematics module,"
- local coordinate frames: Per-joint coordinate systems defining how rotations are expressed relative to the skeleton. "under different rest poses and local coordinate frames,"
- Mean Per Joint Position Error (MPJPE): The average Euclidean distance between predicted and ground-truth joint positions. "MPJPE (Mean Per Joint Position Error, cm)"
- Mean Per Joint Velocity Error (MPJVE): The average error in joint velocities, capturing temporal accuracy of motion. "MPJVE (Mean Per Joint Velocity Error, cm)"
- mesh intermediate: An intermediate mesh prediction step used to aid pose estimation, which can introduce error propagation. "we remove the mesh intermediate used in prior work [Gong et al. 2025]."
- mixed-pose training: A strategy that feeds a mix of ground-truth and predicted poses to the rotation module to bridge train-test gaps. "we employ a mixed-pose training strategy"
- monocular video: Single-camera video input, as opposed to multi-view or depth-sensor inputs. "from monocular video"
- per-joint semantic embeddings: Text-encoded joint identifiers used to inform the model about joint identities across skeletons. "Per-joint semantic embeddings, obtained by encoding joint names with the T5 [Raffel et al. 2020] text encoder,"
- pose-to-rotation (P->R): The mapping from 3D joint positions to local joint rotations in the target skeleton’s coordinate system. "the ill-posed nature of the P->R mapping"
- reference cross-attention: A decoding step where joint queries attend to reference features to retrieve rotation anchoring. "The reference cross-attention is applied in the first Lcross ≤ L decoder layers;"
- reference pose-rotation pair: A single frame’s joint positions and rotations from the target asset used to fix rotation axes. "Table 4 examines the contributions of the reference pose-rotation pair and the rest-pose encoding."
- rest pose: The skeleton’s neutral configuration (e.g., T-pose) that defines joint locations and serves as a coordinate origin. "the rest pose fixes the origin of each joint's local frame,"
- rigged skeleton: A hierarchical joint structure prepared for animation with defined bones and skinning. "We assume S is a tree-structured rigged skeleton with a single root;"
- Rotary Position Embedding (RoPE): A positional encoding technique that injects relative position in attention via rotation in embedding space. "with Rotary Position Embedding (RoPE) [Su et al. 2024] across frames."
- skeleton topology: The connectivity and hierarchy of joints (the structural layout of the skeleton). "diverse skeleton topologies."
- topology-agnostic: Designed to operate across different skeleton structures without being tied to a specific topology. "topology-agnostic skeleton sequence"
- Video-to-Pose (V->P): The module that predicts 3D joint positions from video frames. "A learned Video-to-Pose (V->P) network first predicts 3D joint positions,"
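Three of the glossary terms (forward kinematics, geodesic angular error, and MPJPE) fit together in a few lines of code. The sketch below is illustrative rather than the paper's evaluation code: the three-joint chain, the angles, and the function names are assumptions, and positions are in arbitrary units rather than the centimeters used in the reported metrics.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix for `deg` degrees about the z axis."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(parents, offsets, local_rots, root_pos):
    """Global joint positions from local rotations.

    parents:    parent index per joint, -1 for the root; parents precede children.
    offsets:    (J, 3) rest-pose offset of each joint from its parent.
    local_rots: (J, 3, 3) local rotation matrices.
    """
    num_joints = len(parents)
    global_rots = np.zeros((num_joints, 3, 3))
    positions = np.zeros((num_joints, 3))
    for j, p in enumerate(parents):
        if p < 0:
            global_rots[j] = local_rots[j]
            positions[j] = root_pos
        else:
            global_rots[j] = global_rots[p] @ local_rots[j]
            positions[j] = positions[p] + global_rots[p] @ offsets[j]
    return positions

def geodesic_angle_deg(R_pred, R_gt):
    """Shortest rotation angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy chain of three joints, each one unit along x from its parent in the rest pose.
parents = [-1, 0, 1]
offsets = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
identity = np.eye(3)

gt_rots = np.stack([rot_z(90.0), identity, identity])
pred_rots = np.stack([rot_z(80.0), identity, identity])

gt_pos = forward_kinematics(parents, offsets, gt_rots, np.zeros(3))
pred_pos = forward_kinematics(parents, offsets, pred_rots, np.zeros(3))

root_angle_error = geodesic_angle_deg(pred_rots[0], gt_rots[0])
position_error = mpjpe(pred_pos, gt_pos)
print(root_angle_error, position_error)
```

The example also shows why rotation errors near the root matter: a 10-degree error at the first joint displaces every joint further down the chain, and the displacement grows with distance from the root.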