Unified Motion-Action Modeling for Heterogeneous Robot Learning
Abstract: We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.
Explain it Like I'm 14
What this paper is about (big picture)
Robots learn best when they understand how objects move in the real world, not just what pixels look like. This paper introduces UMA (Unified Motion-Action), a single model that learns two things together:
- how objects move in 3D (like the path of a mug sliding on a table), and
- what the robot should do (its actions) to make that motion happen.
By using object motion as a “shared language,” UMA can learn from many kinds of data at once—robot demos with actions, human videos without action labels, and simulated data—and then use the same learned knowledge for different tasks like controlling a robot, predicting what will happen, or adapting to a new job.
The key questions (in simple terms)
The authors ask:
- Can one model learn both “what will happen if I do this?” and “what should I do to make this happen?” using the same training?
- Can it learn from mixed sources (robot data, human videos, and simulation) even if some data is missing action labels?
- Will that lead to better generalization (working in new situations) and faster adaptation to new tasks?
How UMA works (explained simply)
Think of UMA like a smart “fill-in-the-blank” system for robot behavior:
- Object motion as a common language:
- The model represents object movement as 3D paths of points on objects over time—imagine sticking tiny stickers on an object and tracing where each sticker goes. This is easier to compare across different cameras and robots than raw video pixels.
- Tokens and masks:
- The model turns everything—task intent, camera observations, object motion, and robot actions—into “tokens,” similar to words in a sentence.
- During training, it hides (masks) some tokens and learns to guess the missing parts. Depending on what’s hidden, UMA learns either to predict actions from motions, motions from actions, or both together.
- One “task code” for intent:
- UMA builds a small “task latent” (think: a short code describing the goal, like “push the block into the box”). It learns this code from an initial image and a small example of desired motion (a “reference motion”).
- Contrastive learning keeps this task code focused on the goal, not on camera angle or object placement. In everyday terms, it learns “what needs to happen” rather than “exactly where everything was in one video.”
- A single transformer that does both:
- UMA uses a transformer (a model that pays attention to what matters) with a diffusion-style training rule called “flow matching.” You can think of it as turning a noisy, rough guess into a clean, accurate prediction.
- Structured attention helps it focus properly: within one moment in time (linking actions to object movement), and across time (tracking how a point moves).
- Three ways to use UMA after training:
- Motion-conditioned control: Given the goal motion (like “move this towel like this”), UMA outputs robot actions to make it happen.
- Action-conditioned prediction: Given proposed robot actions, UMA predicts how objects will move, which is useful for planning and for checking whether a plan will work.
- Few-shot task adaptation: With just a handful of examples (even only human videos), UMA adjusts the task code instead of retraining the whole model, quickly adapting to new tasks.
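The link between "what is hidden" and "which mode you get" can be sketched in a few lines. This is only an illustration of the idea, not the paper's implementation: the token group names (`task`, `obs`, `motion`, `action`), the `MODES` table, and the `split_tokens` helper are all hypothetical, and `True` is assumed to mean "hidden, to be predicted."

```python
# Hypothetical token groups. True = masked (the model must predict it),
# False = visible (the model conditions on it). Names are illustrative only.
MODES = {
    "motion_conditioned_control": {
        "task": False, "obs": False, "motion": False, "action": True,
    },
    "action_conditioned_prediction": {
        "task": False, "obs": False, "motion": True, "action": False,
    },
    "joint_prediction": {
        "task": False, "obs": False, "motion": True, "action": True,
    },
}


def split_tokens(mode):
    """Return (conditioning groups, target groups) for the given mode."""
    mask = MODES[mode]
    conditioning = sorted(g for g, hidden in mask.items() if not hidden)
    targets = sorted(g for g, hidden in mask.items() if hidden)
    return conditioning, targets


for mode in MODES:
    cond, tgt = split_tokens(mode)
    print(f"{mode}: condition on {cond}, predict {tgt}")
```

The point of the sketch is that nothing about the model changes between modes; only the mask does. During pretraining, a similar mask would also decide which tokens receive a training signal, so an action-free human video would simply never ask for action tokens.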
What the experiments show (main results)
The team tested UMA on real tabletop tasks:
- Insertion (precise 6-DoF manipulation),
- Sweeping (tool use),
- Folding (deforming materials like cloth).
The training mix included publicly available robot demos, large human video datasets (no action labels), and simulation.
Key findings:
- Better zero-shot control: Without extra fine-tuning, UMA performed 20–25 percentage points better than strong baselines that specialize in either motion-conditioned actions or pixel-based joint models.
- Sharper dynamics prediction: When asked only to predict motion from actions, UMA had lower error than a specialized motion-prediction model (mean squared error improved from about 0.054 to 0.042).
- Strong few-shot adaptation: With only 25 examples of a new task, UMA matched or beat baselines—even when those baselines fine-tuned large parts of their models. UMA only tweaked the task code.
- Learning from varied data matters:
- Simulation was crucial for connecting actions to object motions; removing it made motion prediction much worse and reduced success in control.
- Human videos added task variety and helped generalize, especially for deformable tasks like folding.
- Why the task code works: The contrastive learning step (which keeps the task code focused on “what to do” rather than “exactly where”) was essential. Removing it caused big drops (30–60 percentage points) in few-shot performance.
- Failure analysis: Most errors came from execution (like small inaccuracies while moving), not from misunderstanding the task or choosing grasps. This suggests that further gains are likely to come from scaling up high-quality action–motion training pairs.
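The contrastive step behind the task code (the "Why the task code works" finding above) can be illustrated with a small loss function. This is a sketch under stated assumptions: it uses a one-directional InfoNCE loss, a simplified relative of the SimCLR-style objective the paper mentions, and the function names, temperature, and toy vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] and no other row."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy task codes: row i is task i seen under two different scene layouts.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # same task -> similar code
swapped = [[0.1, 0.9], [0.9, 0.1]]   # codes that follow the layout instead

print(info_nce(view_a, aligned) < info_nce(view_a, swapped))  # True
```

A low loss means codes for the same task stay close even when the scene changes, which is the invariance the summary describes as focusing on "what to do" rather than "exactly where."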
Why this is important
- One model, many uses: UMA closes the gap between two worlds—control (deciding actions) and dynamics (predicting outcomes)—using object motion as a bridge.
- Works with less supervision: Because it can learn from action-free videos (like YouTube-style clips), UMA doesn’t depend solely on expensive robot-labeled data.
- Faster adaptation: For new tasks, you can adjust a small task code from just a few demonstrations (even human-only videos), rather than retraining huge models.
- More robust generalization: Motion is a physics-grounded signal that transfers better across cameras, scenes, and robot types than raw pixels alone.
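The "adjust a small task code, leave the model alone" idea can be sketched with a toy example. Everything here is a stand-in: the frozen model is just a fixed 2x2 linear map called `FROZEN_W`, the demo targets are made-up numbers, and `tune_task_code` is a hypothetical helper. The real system would backpropagate through a frozen transformer, but the update pattern is the same.

```python
# Frozen "pretrained" weights (made-up numbers). They are never updated below.
FROZEN_W = [[2.0, 0.0], [0.0, 0.5]]


def predict(task_code):
    """Toy frozen model: a fixed linear map from task code to output."""
    return [sum(w * c for w, c in zip(row, task_code)) for row in FROZEN_W]


def tune_task_code(demo_targets, steps=200, lr=0.4):
    """Gradient descent on the task code only, fitting the mean demo target."""
    dim = len(FROZEN_W[0])
    count = len(demo_targets)
    target = [sum(t[j] for t in demo_targets) / count for j in range(len(FROZEN_W))]
    code = [0.0] * dim
    for _ in range(steps):
        residual = [p - t for p, t in zip(predict(code), target)]
        # Gradient of 0.5 * ||W c - target||^2 with respect to c is W^T residual.
        grad = [
            sum(FROZEN_W[i][j] * residual[i] for i in range(len(FROZEN_W)))
            for j in range(dim)
        ]
        code = [c - lr * g for c, g in zip(code, grad)]
    return code


demos = [[2.1, 0.9], [1.9, 1.1]]  # a "handful" of noisy demonstrations
print(tune_task_code(demos))       # close to [1.0, 2.0]
```

Only two numbers are learned here, which is why a few demonstrations are enough in the toy case; the same reasoning is the usual argument for why tuning a small latent needs far less data than fine-tuning a full model.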
Limitations and future impact
- New robots still need an “action head”: The model’s understanding of motions transfers well, but a new robot must still learn how its own motors map to actions.
- Camera calibration helps: Because UMA’s predictions live in a camera’s 3D frame, accurate camera-to-robot calibration is needed to execute actions precisely.
- Short visual context: UMA currently uses one observation per step; using longer visual histories could further improve accuracy.
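To see why calibration matters, here is a minimal sketch of moving predicted 3D points from the camera frame into the robot base frame with a 4x4 homogeneous transform. The matrix `T_BASE_CAM` and the helper `camera_to_base` are hypothetical; a real setup would obtain the extrinsics from a calibration procedure.

```python
def camera_to_base(points_cam, t_base_cam):
    """Map 3D points from the camera frame to the robot base frame.

    t_base_cam is a 4x4 homogeneous transform (rotation plus translation).
    """
    points_base = []
    for x, y, z in points_cam:
        p = (x, y, z, 1.0)
        points_base.append(
            tuple(sum(t_base_cam[r][c] * p[c] for c in range(4)) for r in range(3))
        )
    return points_base


# Hypothetical extrinsics: camera rotated 90 degrees about z,
# then offset 0.5 m along the base frame's x axis.
T_BASE_CAM = [
    [0.0, -1.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(camera_to_base([(1.0, 0.0, 0.2)], T_BASE_CAM))  # [(0.5, 1.0, 0.2)]
```

Any error in the translation column shifts every transformed point by the same amount, so a centimeter of calibration error becomes a centimeter of execution error, which matters for precise tasks like insertion.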
In short, UMA suggests a practical path toward more general robotic skills: learn object motion and robot action together, train on many kinds of data, and reuse the same model for planning, control, and quick task adaptation.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper leaves the following concrete gaps and open questions that future work could address:
- Embodiment transfer without action labels: Develop methods to deploy UMA on new robot embodiments without requiring supervised training of an embodiment-specific action head (e.g., via kinematic retargeting, inverse kinematics priors, or action-space adapters learned from unpaired data).
- Calibration-free deployment: Remove or robustify the requirement for calibrated camera-to-base extrinsics at test time (e.g., by self-calibration, online extrinsics estimation, or world-frame alignment learned from data).
- Leveraging temporal context: Extend UMA to condition on multi-step observation history and cross-frame correspondences during both pretraining and inference, and quantify the impact on long-horizon precision and robustness.
- Uncertainty-aware prediction and control: Incorporate and evaluate uncertainty estimates over motion and action predictions (e.g., stochastic rollouts, ensembles) to enable risk-aware MPC and safer execution.
- Motion extraction robustness: Quantify how noise, drift, and failures in 3D motion extraction from action-free videos (occlusions, non-Lambertian surfaces, scale ambiguity, rapid motion) affect pretraining and downstream control, and design noise-tolerant objectives or confidence-weighted supervision.
- Metric scale consistency in action-free videos: Ensure consistent metric 3D motion across heterogeneous human videos (monocular depth scale ambiguity) and evaluate the effect of scale errors on action learning.
- RGB-only deployment: Assess UMA’s dependence on RGB-D at inference and explore RGB-only operation (e.g., learned monocular depth priors, multi-view fusion) with quantified performance trade-offs.
- Pose-invariant task latent vs. absolute goals: Investigate when the pose-invariant task latent (c) hinders tasks requiring absolute global alignment, and develop mechanisms to encode absolute targets when needed without sacrificing invariance benefits.
- Data mixture optimization and scaling laws: Systematically characterize how performance scales with the quantities and ratios of simulated action-motion pairs, action-free human videos, and real robot data; learn mixture weights or curricula rather than using fixed blends.
- Masking strategy design: Replace random mask sampling with learned or curriculum-based masking schedules to target specific capabilities (e.g., long-horizon motion, sparse keypoints, partial actions) and evaluate gains.
- Long-horizon planning: Study UMA’s behavior for horizons beyond H, including compounding prediction errors, and integrate hierarchical planning or receding-horizon strategies calibrated to UMA’s rollout reliability.
- Precision in contact-rich manipulation: Address the dominant “execution failures” by integrating constraint-aware low-level controllers, force/tactile feedback, or contact-implicit objectives during pretraining, and quantify their effect on insertion and other high-precision tasks.
- Deformable object modeling: Evaluate whether point-trajectory representations sufficiently capture complex deformable dynamics; explore augmentations with mesh/particle states or differentiable simulators for cloth/fluids.
- Multi-object and cluttered scenes: Test UMA on tasks with simultaneous manipulation of multiple objects, heavy clutter, and inter-object interactions; extend tokenization and attention to robustly handle multi-object coupling.
- Multi-camera and viewpoint robustness: Assess performance under large viewpoint changes, moving cameras, and multi-camera setups; develop fusion mechanisms that exploit multi-view geometry during inference.
- Language-conditioned task specification: Move beyond the optional mention to implement and thoroughly evaluate language-to-task-latent conditioning, including cross-modal contrastive pretraining, out-of-domain language, and compositional instructions.
- Cross-embodiment generalization at test time: Rigorously test zero-shot transfer across robot arms, grippers, and action spaces, and explore universal action parameterizations (e.g., end-effector twists, motion primitives) enabling wider reuse.
- Real-time latency and resource constraints: Measure control-loop latency and compute/memory footprint under realistic hardware budgets; optimize UMA (e.g., distillation, token pruning, fewer denoising steps) for time-critical tasks.
- Safety and failure recovery: Incorporate safety constraints, collision prediction, and recovery behaviors into UMA’s control loop, and evaluate safety-critical metrics beyond task success (near misses, contact forces).
- Evaluation breadth and metrics: Expand evaluation beyond three tabletop tasks and MSE/success rate to include robustness to sensor noise, calibration errors, domain shifts, and ablations on extreme conditions.
- Keypoint selection and sparsity: Study how the number, distribution, and selection strategy of motion keypoints (K) affect performance, and develop adaptive or task-aware keypoint sampling policies.
- Soft prompt tuning efficiency: Characterize data efficiency and stability of soft prompt tuning across 1–25-shot regimes, analyze overfitting/forgetting dynamics, and compare against alternative parameter-efficient tuning methods.
- Planning with UMA dynamics: Provide a thorough head-to-head with strong MPC/planning baselines using UMA as the forward model (optimizers, costs, sampling strategies), and analyze when motion-conditioned vs action-conditioned rollouts are preferable.
- Robustness to extrinsic errors: Quantify UMA’s degradation under extrinsic calibration errors and propose online correction mechanisms (e.g., aligning predicted vs observed motion via SE(3) consistency checks).
- Open-source motion supervision pipeline: Release and benchmark the full 3D motion extraction stack across datasets, reporting failure modes and providing benchmarks to standardize motion-supervised pretraining.
Practical Applications
Immediate Applications
The following applications can be deployed now using the paper’s methods and findings, given the stated system constraints and available tooling.
- UMA-powered changeovers in manufacturing and assembly (Robotics, Industrial Automation)
- Use case: Reprogram cell tasks like insertions, fastening, cable routing, or part reorientation from a handful of in-situ demos or short human videos recorded on the line.
- Product/workflow: “Motion Prompt Tuner” (soft-prompt adaptation from 10–25 demos); ROS2 node that serves motion-conditioned actions; UMA Dynamics Scorer for pre-execution rollouts.
- Assumptions/dependencies: Calibrated camera-to-base extrinsics; embodiment-specific action head trained once per robot family; access to RGB-D; short adaptation time window; on-prem GPU for low-latency inference; tasks not far OOD from pretraining distribution.
- Rapid deployment of cleaning and logistics skills (Facilities, Warehousing, Hospitality)
- Use case: Sweeping, wiping, sorting, or bin pre-alignment tasks adapted from a few robot demos or human videos.
- Product/workflow: Mobile base with arm; closed-loop motion-conditioned policy that replans at 5–10 Hz; operator UI to load reference motion from a “golden” run.
- Assumptions/dependencies: Moderate scene variability; reliable 3D point tracking during pretraining; basic end-effectors (squeegees, brooms); safety zones and stops; calibrated cameras.
- Video-to-robot transfer to reduce labeling costs (Robotics Software, Data Ops, Academia)
- Use case: Leverage action-free human video corpora to broaden task coverage without collecting paired action labels.
- Product/workflow: “UMA Data Engine” to extract 3D motion tokens from raw videos using off-the-shelf trackers (e.g., TAPIP3D, MegaSAM + depth/pose) and feed UMA pretraining; dataset curation dashboards.
- Assumptions/dependencies: Rights-cleared videos; consistent camera intrinsics for depth or monocular depth models; acceptable tracker accuracy; storage and compute for large-scale pretraining.
- Safer execution via motion-based dynamics prediction (Robotics, Safety/QA)
- Use case: Score candidate action sequences against desired object motion before execution using UMA’s action-conditioned motion rollouts (MPC scoring).
- Product/workflow: UMA Dynamics Scorer embedded in a model-predictive controller; configurable thresholds to block high-risk trajectories; logs for auditability.
- Assumptions/dependencies: Real-time inference budget; reliable candidate action sampling; calibrated extrinsics; tasks within UMA’s learned dynamics regime.
- Teleoperation assistance and “auto-complete” (Manufacturing, Remote Ops)
- Use case: Operator provides short motion sketch or reference clip; UMA fills in low-level actions and stabilizes execution in closed loop.
- Product/workflow: Tablet/VR UI for selecting or sketching reference motion; UMA policy runs onboard; human-in-the-loop override.
- Assumptions/dependencies: Low-latency networking; operator training; motion-sketch to token conversion; clear safety interlocks.
- Few-shot curriculum building for labs and classrooms (Academia, Education)
- Use case: Teach manipulation skills (insertion, folding, tool use) by adapting only the task latent across new scenes; run ablations on object-centric representations.
- Product/workflow: “UMA Starter Kit” with pretrained checkpoint, ROS2 adapters, tutorial notebooks for soft-prompt tuning and evaluation; reproducible benchmarks using DROID-like setups.
- Assumptions/dependencies: Access to RGB-D cameras and a 6-DoF arm; adherence to dataset licenses; modest GPU.
- Simulation-to-real regularization for policy robustness (Robotics R&D)
- Use case: Pair dense simulated action–motion data with limited real demos to reduce execution failures and improve generalization.
- Product/workflow: Automated sim randomization pipeline that generates diverse motion-action pairs; periodic real-world validation.
- Assumptions/dependencies: Quality simulated assets, randomized physics; sim-to-real gaps managed by domain randomization; continuous calibration checks.
- Internal policy guidance on data use and safety (Policy, Compliance)
- Use case: Draft organizational SOPs for using action-free videos in training, documenting calibration, and instituting pre-execution rollouts for safety.
- Product/workflow: Data governance checklists; safety case templates including calibration records and rollout-based risk screens.
- Assumptions/dependencies: Legal review on video data rights and privacy; integration with existing safety standards (e.g., ISO 10218, ISO/TS 15066).
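The pre-execution screening that recurs in the items above (the "UMA Dynamics Scorer" workflow) follows a simple pattern: roll out each candidate action sequence with a forward model, compare the predicted object motion to the reference motion, and block execution if even the best candidate is too far off. The sketch below assumes a toy stand-in for the dynamics model (`toy_dynamics` just adds displacements); the function names and the threshold are hypothetical, and a real system would call the learned model instead.

```python
def toy_dynamics(start, actions):
    """Stand-in forward model: each action is a 3D displacement of one tracked point."""
    trajectory, point = [], list(start)
    for action in actions:
        point = [p + a for p, a in zip(point, action)]
        trajectory.append(tuple(point))
    return trajectory


def motion_error(predicted, reference):
    """Mean over time steps of the squared distance to the reference position."""
    total = sum(
        (p - r) ** 2
        for pred_pt, ref_pt in zip(predicted, reference)
        for p, r in zip(pred_pt, ref_pt)
    )
    return total / len(reference)


def pick_best(start, candidates, reference, max_error):
    """Return (index, error) of the best candidate, or (None, error) if blocked."""
    errors = [motion_error(toy_dynamics(start, c), reference) for c in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    if errors[best] > max_error:
        return None, errors[best]  # nothing is safe enough to execute
    return best, errors[best]


reference = [(0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
candidates = [
    [(0.0, 0.1, 0.0), (0.0, 0.1, 0.0)],  # pushes sideways
    [(0.1, 0.0, 0.0), (0.1, 0.0, 0.0)],  # follows the reference
]
print(pick_best((0.0, 0.0, 0.0), candidates, reference, max_error=0.01))  # (1, 0.0)
```

Returning `None` rather than the least-bad candidate is a deliberate choice for the safety use case: the controller can then resample or hand control to an operator, and the logged error supports the auditability mentioned above.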
Long-Term Applications
The following applications are feasible with additional research, scaling, or engineering to address current limitations (embodiment-specific action heads, calibration needs, single-frame conditioning, and OOD robustness).
- Cross-embodiment, plug-and-play manipulation across fleets (Robotics, Cloud Ops)
- Use case: Deploy one UMA backbone across heterogeneous robots with learned action adapters, minimizing robot-specific data.
- Product/workflow: Cloud-served UMA with automatic kinematic remapping; lightweight per-robot adapter learned from a handful of kinesthetic traces.
- Assumptions/dependencies: Reliable action-space unification or learned remapping; on-device safety monitors; robust network connectivity.
- Open-world household assistants learning from public videos (Consumer Robotics)
- Use case: Home robots learn chores (tidying, dish sorting, surface cleaning, laundry) from smartphone/YouTube videos without action labels.
- Product/workflow: “MotionSketch Studio” for selecting/annotating reference motions from videos; UMA-based home policy with continuous refinement.
- Assumptions/dependencies: Highly robust 3D motion extraction under occlusion, variable lighting, and handheld cameras; strong safety and privacy guarantees; certification/regulation.
- Hospital logistics and sterile manipulation (Healthcare, MedTech)
- Use case: Tasks that involve no patient contact (instrument staging, linen folding, cart loading) adapted from brief staff videos; strict safety oversight.
- Product/workflow: UMA integrated with hospital robots; pre-execution rollouts plus force/vision monitors; audit trails.
- Assumptions/dependencies: Regulatory approval; validated failure bounds; additional sensing (force/torque, tactile); robust decontamination workflows.
- Construction and field tool use (AEC, Utilities, Energy)
- Use case: Tool-centric tasks (sweeping debris, applying sealant, cable manipulation) guided by motion references; adaptation to outdoor variability.
- Product/workflow: Multi-camera UMA with history-aware conditioning and dense cross-frame correspondences; ruggedized hardware.
- Assumptions/dependencies: Multi-view calibration at scale; weather and dust-robust perception; improved OOD generalization.
- Autonomous factory lines with intent mining from video logs (Manufacturing, MES/PLM)
- Use case: Infer task latents from historical production videos to document processes, auto-generate specs, and bootstrap new cells.
- Product/workflow: “UMA Planner” module integrated with MES/PLM to extract task embeddings, simulate rollouts, and propose controllers.
- Assumptions/dependencies: Reliable retrieval and indexing of long-horizon video; multi-step observation history in UMA; traceability and versioning.
- Shared autonomy via real-time intent inference (Human–Robot Collaboration)
- Use case: Robots infer human intent from ambient video, aligning their actions on the fly without explicit commands.
- Product/workflow: On-device task-latent inference with continual update; proactive assistance cues; safety-certified interaction policies.
- Assumptions/dependencies: Fast, reliable intent estimation; social and physical safety layers; privacy-preserving video processing.
- Sector-wide standards for object-motion representations (Policy, Standards)
- Use case: Establish common formats and benchmarks for motion tokens, calibration metadata, and safety validation with rollout models.
- Product/workflow: NIST-style evaluation suites; data/format registries; compliance badges for motion-extraction pipelines.
- Assumptions/dependencies: Multi-stakeholder alignment (academia, vendors, regulators); testbeds spanning rigid/deformable/tool tasks.
- Tooling ecosystem around object-motion interfaces (Software, Integrators)
- Use case: Mature product line including MotionSketch Studio (task authoring), UMA SDK (ROS2/Python), and UMA Dynamics Scorer (planning).
- Product/workflow: End-to-end toolchain from video import to deployment and monitoring; integrations with Isaac/ROS MoveIt and commercial planners.
- Assumptions/dependencies: Vendor support; long-term maintenance; clear licensing for pretrained backbones and datasets.
- Energy infrastructure maintenance (Energy, Renewables)
- Use case: Robots learn panel cleaning, blade inspection prep, cable handling from tech videos; plan and verify motion with UMA rollouts.
- Product/workflow: Fleet management with cloud UMA; site-specific adaptation from a few human videos; MPC-based safety screening.
- Assumptions/dependencies: Outdoor robustness; remote calibration; compliance with utility safety regulations.
- ROI modeling and procurement strategies (Operations/Finance)
- Use case: Quantify savings from replacing action-labeled data collection with action-free video pretraining and few-shot adaptation.
- Product/workflow: Cost calculators bundled with UMA deployment proposals; dashboards tracking success rates and failure modes over time.
- Assumptions/dependencies: Access to historical costs; reliable KPIs (success %, time-to-changeover); consistent evaluation protocols.
Glossary
- 6-DoF: Six degrees of freedom describing a rigid body's 3D position and orientation (x, y, z, roll, pitch, yaw). "6-DoF manipulation (Insertion)"
- Adaptive layer norm: A variant of layer normalization whose parameters are modulated (e.g., by diffusion time) to condition the network. "condition on diffusion time via adaptive layer norm."
- Action-conditioned dynamics prediction: Forecasting future environment or object states given a candidate sequence of actions. "For action-conditioned dynamics prediction, used inside model predictive control, the model conditions on candidate future actions in and sets to the resulting object-motion rollout."
- Action-free video: Video data without corresponding action labels, used for learning from passive observations. "leaving action-free video data unused."
- Contrastive objective: A learning objective that pulls together representations of similar inputs and pushes apart dissimilar ones to enforce invariances. "a contrastive objective that disentangles task intent from scene geometry"
- Diffusion policies: Control policies trained via diffusion-style denoising objectives to generate actions. "standard diffusion policies~\citep{chi2025diffusion}"
- Diffusion Transformer (DiT): A transformer architecture used as the denoising backbone in diffusion models. "a masked diffusion transformer~\citep{Peebles2022DiT, dasari2024ditpi} with structured spatiotemporal attention"
- Embodiment-agnostic: Independent of a specific robot’s physical form or action space, enabling cross-robot generalization. "without an embodiment-agnostic intermediate"
- Extrinsics: Camera-to-robot or camera-to-world calibration parameters used to transform between coordinate frames. "transformed to the robot base frame at deployment via calibrated extrinsics."
- Flow matching: A training technique that learns continuous-time vector fields to transport noise to data, used as an alternative to standard diffusion losses. "trained with flow matching~\citep{lipman_flow_matching_2023}"
- Forward dynamics model: A model that predicts how the state of a system changes in response to actions. "particle- and point-based forward dynamics models predict scene response to candidate actions"
- Hindsight Experience Replay (HER): A relabeling strategy that treats outcomes achieved during exploration as goals for training, improving sample efficiency. "We further extend hindsight experience replay~\cite{andrychowicz2017her} from goal-conditioned to motion-conditioned training"
- Latent rollouts: Simulated future trajectories performed in a learned latent state space for planning or evaluation. "latent rollouts for planning~\cite{ebert2018visualforesight, hafner2019planet, hafner2019dreamer, hafner2020dreamerv2}"
- Masked autoencoding: Learning to reconstruct masked portions of the input, here over motion and action tokens, to enable self-supervised pretraining. "masked autoencoding over object-motion and action tokens."
- Masked DiT blocks: DiT (Diffusion Transformer) blocks adapted to apply different modulation to masked (predicted) versus unmasked (conditioning) tokens. "these Masked DiT blocks apply stronger denoising to targets while preserving unmodulated conditioning"
- Masked generative objective: A unified training objective that predicts masked tokens (e.g., motion or actions) conditioned on the unmasked context. "under a masked generative objective"
- Model predictive control (MPC): A control method that plans actions by repeatedly optimizing over predicted future trajectories. "supports model predictive control by scoring candidate actions against the target reference motion"
- Motion-conditioned visuomotor control: Generating robot actions conditioned on desired object motion trajectories given visual inputs. "motion-conditioned visuomotor control"
- PointNet++: A neural architecture for hierarchical feature extraction from point clouds. "via a PointNet++~\citep{qi2017pointnet++} backbone"
- Reference motion: A partial motion trajectory used to specify task intent for conditioning or encoding. "We call such a partial observation a reference motion"
- SE(3): The group of 3D rigid-body transformations (rotations and translations). "random SE(3) transformations applied jointly to and at the encoder input"
- SimCLR: A contrastive learning framework that uses augmented views to learn invariant representations. "a SimCLR-style contrastive objective ~\citep{chen2020simple}"
- Soft prompt tuning: Adapting a frozen model to new tasks by optimizing a small learnable prompt or latent vector instead of the full weights. "fast adaptation through soft prompt tuning at deployment."
- Spatiotemporal attention: Attention mechanisms structured to capture relationships across space and time in sequence data. "structured spatiotemporal attention~\citep{coil2025}"
- Task latent: A learned latent variable encoding task intent in a scene- and embodiment-invariant way. "and is a latent variable we call the task latent"
- Vision–Language–Action (VLA) models: Models that jointly process visual inputs, natural language, and action outputs for robotic control. "vision--language--action models~\cite{brohan_rt-2_2023, kim24openvla, pi0}"
- World models: Predictive models of environment dynamics, often learned from pixels or latent features, used for planning or policy learning. "world models~\cite{hafner2020dreamerv2, yang2023unisim, agarwal2025cosmos}"
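To make the "Flow matching" entry concrete, here is a minimal sketch of one common formulation: a straight-line path from noise (t = 0) to data (t = 1), with the model trained to predict the constant velocity along that path. The paper's exact parameterization is not given in this summary, so the linear path, the function names, and the single-point toy "dataset" are all assumptions.

```python
import random


def flow_matching_pair(data, noise, t):
    """Build one training pair on the straight path from noise (t=0) to data (t=1)."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target_velocity = [d - n for n, d in zip(noise, data)]
    return x_t, target_velocity


def flow_matching_loss(velocity_fn, data, noise, t):
    """Mean squared error between predicted and straight-line velocity."""
    x_t, target = flow_matching_pair(data, noise, t)
    predicted = velocity_fn(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(noise), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


DATA = [1.0, -2.0]  # toy dataset containing a single point


def oracle(x, t):
    """Exact velocity field when the dataset is the single point DATA (toy case)."""
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA, x)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in DATA]
print(flow_matching_loss(oracle, DATA, noise, t=0.3))  # essentially zero
print(euler_sample(oracle, noise))                      # lands near DATA
```

In training, `velocity_fn` would be the network and the loss would be averaged over random `t`, noise, and data samples; at inference, `euler_sample` is the "noisy guess to clean prediction" process described in the plain-language section, with fewer steps trading accuracy for speed.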