Unified Motion-Action Modeling for Heterogeneous Robot Learning

Published 15 Jun 2026 in cs.RO | (2606.16917v2)

Abstract: We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.

Summary

  • The paper presents UMA, a joint motion-action model that bridges visuomotor control and dynamics using a masked generative transformer framework.
  • It employs unified tokenization and contrastive learning to integrate heterogeneous data from action-free videos, simulated trajectories, and real robot demonstrations.
  • Empirical results show UMA surpasses baselines with 20–25 percentage point improvements in zero-shot control and notable reductions in motion prediction error.

Unified Motion-Action Modeling for Heterogeneous Robot Learning: An Authoritative Summary

Motivation and Prior Work

The paper addresses a fundamental challenge in robotic foundation models: bridging the gap between visuomotor control and dynamics modeling, especially in heterogeneous data settings that combine action-free human videos, real robot data, and simulated trajectories. Prior approaches have either focused on motion-conditioned policies, requiring rigid motion-action pairings, or separately modeled dynamics, often in raw pixels or states tied to specific embodiments. These approaches limit transferability and underutilize available action-free video data.

Object motion, specifically 3D trajectories extracted from object surfaces, is posited as a robust intermediate representation. Unlike image or pixel-based modalities, motion trajectories naturally align with robot actions, are transferable across camera views and embodiments, and can be extracted without explicit action supervision. However, previous methods have failed to exploit the representation's full potential, either by insisting on paired action-motion data or leaving action supervision unused.

UMA Model Architecture

The Unified Motion-Action (UMA) model is designed to jointly model object motion and robot actions as co-evolving variables within a masked generative sequence modeling objective. The architecture consists of:

  • Tokenization: All input modalities—visual observations, object motion, robot actions, and task intent—are mapped to a common token space. This enables flexible conditioning and prediction regimes within the shared transformer backbone.
  • Task Encoder: A PointNet++-based encoder ingests sparse reference motions and initial scene observations, producing a task latent invariant to scene geometry, keypoint density, and temporal sampling. This is further regularized via a SimCLR-style contrastive objective to enforce abstraction and disentanglement of task intent.
  • Masked Diffusion Transformer: The backbone employs Masked DiT blocks, applying dual adaptive layer normalization tuned separately for masked (target) and unmasked (conditioning) tokens. The transformer alternates spatial, temporal, and context attention to preserve local coupling, temporal coherence, and context flow.
  • Joint Denoising Objective: The model reconstructs masked motion and action tokens with flow-matching losses, allowing supervision from heterogeneous sources. Action-labeled robot demonstrations supervise both motion and action streams; action-free human videos supervise only motion.
  • Hindsight Relabeling: Hindsight experience replay is generalized from goal-conditioned training to spatiotemporal motion contexts, which enables multi-task pretraining without explicit task annotation.
  • Versatile Inference: UMA supports three modes at deployment: motion-conditioned visuomotor control, action-conditioned dynamics prediction, and soft prompt-based task adaptation, all using the same pretrained parameters.
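
The joint denoising objective above can be illustrated with a minimal NumPy sketch. It assumes a straight-line (rectified-flow style) noise-to-data path, which this summary does not confirm as the paper's exact parameterization, and `predict_velocity` is a hypothetical stand-in for the transformer. The point it shows is only that the loss is averaged over tokens that actually carry a label, so an action-free video contributes motion supervision and nothing else.

```python
import numpy as np

def masked_flow_matching_loss(predict_velocity, x1, supervised, rng):
    """Flow-matching loss averaged over supervised tokens only.

    x1:         (T, D) clean target tokens (motion and/or action).
    supervised: (T,) boolean array, True where a ground-truth label exists.
    """
    if not supervised.any():
        return 0.0
    x0 = rng.standard_normal(x1.shape)      # noise endpoint of the path
    t = rng.uniform()                       # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1            # point on the straight noise-to-data path
    target_v = x1 - x0                      # velocity of that straight path
    pred_v = predict_velocity(xt, t)        # model's velocity estimate, shape (T, D)
    per_token = ((pred_v - target_v) ** 2).mean(axis=-1)
    return float(per_token[supervised].mean())

# Toy usage: 4 motion tokens followed by 2 action tokens, 3 dims each.
tokens = np.random.default_rng(0).standard_normal((6, 3))
zero_model = lambda xt, t: np.zeros_like(xt)   # hypothetical placeholder network

robot_demo = np.ones(6, dtype=bool)                      # motion and action labels
human_video = np.array([1, 1, 1, 1, 0, 0], dtype=bool)   # motion labels only

loss_robot = masked_flow_matching_loss(zero_model, tokens, robot_demo, np.random.default_rng(1))
loss_human = masked_flow_matching_loss(zero_model, tokens, human_video, np.random.default_rng(1))
```

With the `human_video` mask, whatever sits in the two action rows has no effect on the loss, which is the behavior the bullet on heterogeneous supervision describes.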

Cross-Domain Training and Supervision

UMA enables simultaneous pretraining across action-free videos, simulated robot data, and real robot demonstrations. The masking regime determines which modalities contribute supervision in each episode, accommodating missing action labels and varying embodiment-specific action spaces. Contrastive learning over task latents aligns task intent irrespective of modality, object pose, or scene geometry, further enhancing generalization.
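
As a rough illustration of contrastive learning over task latents, the sketch below implements the standard SimCLR NT-Xent loss in NumPy. It assumes that two latents of the same task (for example, the same intent seen in two different scenes) form a positive pair; how the paper actually constructs pairs, and the temperature value, are assumptions here rather than details taken from the source.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """SimCLR-style NT-Xent loss.

    z_a, z_b: (N, D) arrays; row i of z_a and row i of z_b are a positive pair.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = z @ z.T / temperature                        # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own negative
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())

# Toy usage: 8 tasks, each encoded twice with a small perturbation.
rng = np.random.default_rng(0)
latents_scene_1 = rng.standard_normal((8, 16))
latents_scene_2 = latents_scene_1 + 0.01 * rng.standard_normal((8, 16))

loss_aligned = nt_xent_loss(latents_scene_1, latents_scene_2)
loss_mismatched = nt_xent_loss(latents_scene_1, latents_scene_2[::-1])
```

The loss is low when matching tasks map to nearby latents and high when the pairing is scrambled, which is the pressure that pushes the encoder to ignore scene-specific detail.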

Numerical Results and Empirical Analysis

UMA is evaluated on real-world manipulation tasks, including rigid object insertion, tool use, and deformable folding. Key quantitative findings:

  • Zero-shot Visuomotor Control: UMA outperforms COIL (Cao et al., 5 Dec 2025) and UVA (Li et al., 28 Feb 2025) baselines by 20–25 percentage points in task success rate, demonstrating the advantage of joint motion-action modeling over conditioning alone.
  • Dynamics Prediction: UMA achieves a lower motion-prediction MSE (0.042) than the state-of-the-art PointWorld (Huang et al., 7 Jan 2026) (0.054). Removing simulation data causes a fivefold increase in MSE, indicating its critical role in regularizing motion-action coupling.
  • Few-shot Adaptation: With only 25 demonstrations—either action- or motion-supervised—UMA matches or exceeds success rates of models finetuned with LoRA (Intelligence et al., 22 Apr 2025), with improvements of 10–25 percentage points in more complex tool use and folding tasks.
  • Data Ablation: Simulated robot data primarily drives motion-action coupling, while human video contributes task diversity for transfer, especially in deformable and tool-use scenarios.
  • Contrastive Objective: Removing contrastive training reduces few-shot adaptation performance by up to 60 percentage points, confirming its necessity for robust, transferable task latents.

Failure analysis shows that execution errors, rather than grasping or intent-inference errors, dominate. This suggests that additional paired motion-action data is needed to close the remaining performance gap.
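
To put the reported numbers in perspective, the short calculation below converts the MSE figures into a relative reduction and shows how a gain in percentage points differs from a relative gain. The 50% baseline success rate is purely hypothetical, chosen for illustration; the summary does not state the baselines' absolute success rates.

```python
def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

# Figures reported above: PointWorld 0.054, UMA 0.042.
mse_reduction = relative_reduction(0.054, 0.042)
print(f"relative MSE reduction: {mse_reduction:.1%}")   # prints "relative MSE reduction: 22.2%"

# Percentage points vs. relative change, using a HYPOTHETICAL baseline.
hypothetical_baseline = 0.50   # assumed for illustration only
gain_points = 0.20             # a 20 percentage point improvement
relative_gain = gain_points / hypothetical_baseline      # 0.4, i.e. 40% relative
```

A 20 percentage point gain is therefore a larger relative improvement the lower the baseline starts, which is why the two units should not be read interchangeably.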

Theoretical and Practical Implications

UMA's central theoretical contribution lies in the generalizability of object-centric 3D motion as a unified interface for robot learning and control. This bridges pixel-based approaches and rigid state-space dynamics, enabling multimodal pretraining and inference without requiring embodiment-specific retraining. The architecture demonstrates that joint modeling of motion and action, paired with masked generative objectives and spatially invariant task latents, can reliably transfer manipulation skills across domains, scenes, and tasks.

Practically, UMA enables efficient adaptation from action-free demonstration videos, reducing the reliance on costly robot data collection. It also integrates seamlessly with alternative task specifications (e.g., language instructions, goal images), enhancing the flexibility and scalability of robot foundation models.

The scalability analysis suggests both data and model scale independently contribute to performance, with no obvious plateau observed at the largest tested configurations. Masked diffusion transformers with structured attention emerge as especially effective for multimodal token processing.

Future Directions

The notable limitations include:

  • Dependence on embodiment-specific action heads for unseen robot types.
  • Requirement for calibrated extrinsics, since all representations live in the camera frame.
  • The current design conditions on single-step observations, leaving multi-step histories and dense point correspondences underexplored.

Future work may address these by introducing cross-embodiment transfer mechanisms, leveraging multi-view or history-based conditioning, and scaling motion-action pretraining further.
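
The calibration requirement can be made concrete with a small sketch of the camera-to-base transform. The frame names, the yaw-only rotation, and the numbers are illustrative assumptions, not values from the paper; the example only shows how a camera-frame point is mapped into the robot base frame and how a small extrinsic error shifts the result.

```python
import numpy as np

def make_transform(yaw_deg, translation):
    """4x4 homogeneous transform: rotation about z by yaw_deg, plus a translation."""
    a = np.deg2rad(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a),  np.cos(a), 0.0],
                 [0.0,        0.0,       1.0]]
    T[:3, 3] = translation
    return T

def camera_to_base(points_cam, T_base_cam):
    """Map (N, 3) camera-frame points into the base frame."""
    R, t = T_base_cam[:3, :3], T_base_cam[:3, 3]
    return points_cam @ R.T + t

# Hypothetical calibrated extrinsics and one predicted keypoint 1 m along the camera x-axis.
T_base_cam = make_transform(90.0, [0.5, 0.0, 0.3])
point_cam = np.array([[1.0, 0.0, 0.0]])
point_base = camera_to_base(point_cam, T_base_cam)   # approximately [[0.5, 1.0, 0.3]]

# The same point under a 1 degree yaw calibration error.
T_miscalibrated = make_transform(91.0, [0.5, 0.0, 0.3])
error_m = np.linalg.norm(camera_to_base(point_cam, T_miscalibrated) - point_base)
```

In this toy setup a 1 degree yaw error moves the target by roughly 17 mm at 1 m range, which is already significant for a tight insertion.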

Conclusion

UMA provides a strong foundation for generalizable, cross-domain robot learning by unifying motion and action modeling through masked diffusion transformers and spatially invariant task representations. Its empirical results validate the model's superiority over specialized baselines across zero-shot, few-shot, and heterogeneous data regimes. The work suggests scaling motion-centric pretraining and further enriching motion-action paired data are promising vectors for achieving broad manipulation generalization (2606.16917).

Explain it Like I'm 14

What this paper is about (big picture)

Robots learn best when they understand how objects move in the real world, not just what pixels look like. This paper introduces UMA (Unified Motion-Action), a single model that learns two things together:

  • how objects move in 3D (like the path of a mug sliding on a table), and
  • what the robot should do (its actions) to make that motion happen.

By using object motion as a “shared language,” UMA can learn from many kinds of data at once—robot demos with actions, human videos without action labels, and simulated data—and then use the same learned knowledge for different tasks like controlling a robot, predicting what will happen, or adapting to a new job.

The key questions (in simple terms)

The authors ask:

  • Can one model learn both “what will happen if I do this?” and “what should I do to make this happen?” using the same training?
  • Can it learn from mixed sources (robot data, human videos, and simulation) even if some data is missing action labels?
  • Will that lead to better generalization (working in new situations) and faster adaptation to new tasks?

How UMA works (explained simply)

Think of UMA like a smart “fill-in-the-blank” system for robot behavior:

  • Object motion as a common language:
    • The model represents object movement as 3D paths of points on objects over time—imagine sticking tiny stickers on an object and tracing where each sticker goes. This is easier to compare across different cameras and robots than raw video pixels.
  • Tokens and masks:
    • The model turns everything—task intent, camera observations, object motion, and robot actions—into “tokens,” similar to words in a sentence.
    • During training, it hides (masks) some tokens and learns to guess the missing parts. Depending on what’s hidden, UMA learns either to predict actions from motions, motions from actions, or both together.
  • One “task code” for intent:
    • UMA builds a small “task latent” (think: a short code describing the goal, like “push the block into the box”). It learns this code from an initial image and a small example of desired motion (a “reference motion”).
    • Contrastive learning keeps this task code focused on the goal, not on camera angle or object placement. In everyday terms, it learns “what needs to happen” rather than “exactly where everything was in one video.”
  • A single transformer that does both:
    • UMA uses a transformer (a model that pays attention to what matters) with a diffusion-style training rule called “flow matching.” You can think of it as turning a noisy, rough guess into a clean, accurate prediction.
    • Structured attention helps it focus properly: within one moment in time (linking actions to object movement), and across time (tracking how a point moves).
  • Three ways to use UMA after training:
    1) Motion-conditioned control: Given the goal motion (like “move this towel like this”), UMA outputs robot actions to make it happen.
    2) Action-conditioned prediction: Given proposed robot actions, UMA predicts how objects will move, which is useful for planning and checking if a plan will work.
    3) Few-shot task adaptation: With just a handful of examples (even only human videos), UMA adjusts the task code instead of retraining the whole model, quickly adapting to new tasks.
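
The "fill-in-the-blank" idea can be sketched in a few lines of Python. The token layout and the pattern names below are made up for illustration and are not taken from the paper; the sketch only shows that choosing which group of tokens to hide is what switches the model between guessing actions, guessing motion, or guessing both.

```python
import numpy as np

# Which token groups are hidden (and therefore predicted) in each pattern.
# Names and grouping are illustrative assumptions.
MASKED_GROUPS = {
    "actions_from_motion": {"action"},
    "motion_from_actions": {"motion"},
    "joint": {"motion", "action"},
}

def build_mask(token_groups, pattern):
    """Return a boolean array: True means the token is hidden and must be predicted."""
    hidden = MASKED_GROUPS[pattern]
    return np.array([group in hidden for group in token_groups])

# A tiny sequence: one task token, one observation token, two motion and two action tokens.
token_groups = ["task", "observation", "motion", "motion", "action", "action"]

control_mask = build_mask(token_groups, "actions_from_motion")
dynamics_mask = build_mask(token_groups, "motion_from_actions")
joint_mask = build_mask(token_groups, "joint")
```

In every pattern the task and observation tokens stay visible, so the model always knows the goal and the current scene while it fills in the rest.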

What the experiments show (main results)

The team tested UMA on real tabletop tasks:

  • Insertion (precise 6-DoF manipulation),
  • Sweeping (tool use),
  • Folding (deforming materials like cloth).

The training mix included publicly available robot demos, large human video datasets (no action labels), and simulation.

Key findings:

  • Better zero-shot control: Without extra fine-tuning, UMA performed 20–25 percentage points better than strong baselines that specialize in either motion-conditioned actions or pixel-based joint models.
  • Sharper dynamics prediction: When asked only to predict motion from actions, UMA had lower error than a specialized motion-prediction model (mean squared error improved from about 0.054 to 0.042).
  • Strong few-shot adaptation: With only 25 examples of a new task, UMA matched or beat baselines—even when those baselines fine-tuned large parts of their models. UMA only tweaked the task code.
  • Learning from varied data matters:
    • Simulation was crucial for connecting actions to object motions; removing it made motion prediction much worse and reduced success in control.
    • Human videos added task variety and helped generalize, especially for deformable tasks like folding.
  • Why the task code works: The contrastive learning step (which keeps the task code focused on “what to do” rather than “exactly where”) was essential. Removing it caused big drops (30–60 percentage points) in few-shot performance.
  • Failure analysis: Most errors came from execution (like small inaccuracies while moving), not from misunderstanding the task or choosing grasps. This hints that even better performance will come from scaling up high-quality action–motion training pairs.

Why this is important

  • One model, many uses: UMA closes the gap between two worlds—control (deciding actions) and dynamics (predicting outcomes)—using object motion as a bridge.
  • Works with less supervision: Because it can learn from action-free videos (like YouTube-style clips), UMA doesn’t depend solely on expensive robot-labeled data.
  • Faster adaptation: For new tasks, you can adjust a small task code from just a few demonstrations (even human-only videos), rather than retraining huge models.
  • More robust generalization: Motion is a physics-grounded signal that transfers better across cameras, scenes, and robot types than raw pixels alone.
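
The "adjust a small task code" idea can be illustrated with a toy example. Everything here is a stand-in: the frozen "model" is just a fixed matrix `W`, the demonstrations are synthetic, and plain gradient descent replaces whatever optimizer the authors use. What the sketch does show is the general shape of soft prompt tuning, where the big model stays untouched and only a short vector is learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "pretrained model": a fixed linear map from task code to output.
W = rng.standard_normal((6, 4))
W /= np.linalg.norm(W, 2)        # scale to spectral norm 1 so a step size of 0.5 is stable
W_before = W.copy()

# 25 synthetic demonstrations generated from an unknown task code plus small noise.
c_true = rng.standard_normal(4)
demos = W @ c_true + 0.01 * rng.standard_normal((25, 6))

def loss(c):
    """Mean squared error between the frozen model's output and the demonstrations."""
    return float(np.mean(np.sum((W @ c - demos) ** 2, axis=1)))

c = np.zeros(4)                  # the task code: the only trainable quantity
initial_loss = loss(c)
for _ in range(200):
    grad = 2.0 * W.T @ (W @ c - demos.mean(axis=0))   # gradient with respect to c only
    c -= 0.5 * grad
final_loss = loss(c)
```

After tuning, `W` is unchanged and only the four numbers in `c` have moved, which is why this kind of adaptation needs so little data compared with retraining the whole model.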

Limitations and future impact

  • New robots still need an “action head”: The model’s understanding of motions transfers well, but a new robot must still learn how its own motors map to actions.
  • Camera calibration helps: Because UMA’s predictions live in a camera’s 3D frame, accurate camera-to-robot calibration is needed to execute actions precisely.
  • Short visual context: UMA currently uses one observation per step; using longer visual histories could further improve accuracy.

In short, UMA suggests a practical path toward more general robotic skills: learn object motion and robot action together, train on many kinds of data, and reuse the same model for planning, control, and quick task adaptation.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper leaves the following concrete gaps and open questions that future work could address:

  • Embodiment transfer without action labels: Develop methods to deploy UMA on new robot embodiments without requiring supervised training of an embodiment-specific action head (e.g., via kinematic retargeting, inverse kinematics priors, or action-space adapters learned from unpaired data).
  • Calibration-free deployment: Remove or robustify the requirement for calibrated camera-to-base extrinsics at test time (e.g., by self-calibration, online extrinsics estimation, or world-frame alignment learned from data).
  • Leveraging temporal context: Extend UMA to condition on multi-step observation history and cross-frame correspondences during both pretraining and inference, and quantify the impact on long-horizon precision and robustness.
  • Uncertainty-aware prediction and control: Incorporate and evaluate uncertainty estimates over motion and action predictions (e.g., stochastic rollouts, ensembles) to enable risk-aware MPC and safer execution.
  • Motion extraction robustness: Quantify how noise, drift, and failures in 3D motion extraction from action-free videos (occlusions, non-Lambertian surfaces, scale ambiguity, rapid motion) affect pretraining and downstream control, and design noise-tolerant objectives or confidence-weighted supervision.
  • Metric scale consistency in action-free videos: Ensure consistent metric 3D motion across heterogeneous human videos (monocular depth scale ambiguity) and evaluate the effect of scale errors on action learning.
  • RGB-only deployment: Assess UMA’s dependence on RGB-D at inference and explore RGB-only operation (e.g., learned monocular depth priors, multi-view fusion) with quantified performance trade-offs.
  • Pose-invariant task latent vs. absolute goals: Investigate when pose-invariant task latents (c) hinder tasks requiring absolute global alignment, and develop mechanisms to encode absolute targets when needed without sacrificing invariance benefits.
  • Data mixture optimization and scaling laws: Systematically characterize how performance scales with the quantities and ratios of simulated action-motion pairs, action-free human videos, and real robot data; learn mixture weights or curricula rather than using fixed blends.
  • Masking strategy design: Replace random mask sampling with learned or curriculum-based masking schedules to target specific capabilities (e.g., long-horizon motion, sparse keypoints, partial actions) and evaluate gains.
  • Long-horizon planning: Study UMA’s behavior for horizons beyond H, including compounding prediction errors, and integrate hierarchical planning or receding-horizon strategies calibrated to UMA’s rollout reliability.
  • Precision in contact-rich manipulation: Address the dominant “execution failures” by integrating constraint-aware low-level controllers, force/tactile feedback, or contact-implicit objectives during pretraining, and quantify their effect on insertion and other high-precision tasks.
  • Deformable object modeling: Evaluate whether point-trajectory representations sufficiently capture complex deformable dynamics; explore augmentations with mesh/particle states or differentiable simulators for cloth/fluids.
  • Multi-object and cluttered scenes: Test UMA on tasks with simultaneous manipulation of multiple objects, heavy clutter, and inter-object interactions; extend tokenization and attention to robustly handle multi-object coupling.
  • Multi-camera and viewpoint robustness: Assess performance under large viewpoint changes, moving cameras, and multi-camera setups; develop fusion mechanisms that exploit multi-view geometry during inference.
  • Language-conditioned task specification: Move beyond the optional mention to implement and thoroughly evaluate language-to-task-latent conditioning, including cross-modal contrastive pretraining, out-of-domain language, and compositional instructions.
  • Cross-embodiment generalization at test time: Rigorously test zero-shot transfer across robot arms, grippers, and action spaces, and explore universal action parameterizations (e.g., end-effector twists, motion primitives) enabling wider reuse.
  • Real-time latency and resource constraints: Measure control-loop latency and compute/memory footprint under realistic hardware budgets; optimize UMA (e.g., distillation, token pruning, fewer denoising steps) for time-critical tasks.
  • Safety and failure recovery: Incorporate safety constraints, collision prediction, and recovery behaviors into UMA’s control loop, and evaluate safety-critical metrics beyond task success (near misses, contact forces).
  • Evaluation breadth and metrics: Expand evaluation beyond three tabletop tasks and MSE/success rate to include robustness to sensor noise, calibration errors, domain shifts, and ablations on extreme conditions.
  • Keypoint selection and sparsity: Study how the number, distribution, and selection strategy of motion keypoints (K) affect performance, and develop adaptive or task-aware keypoint sampling policies.
  • Soft prompt tuning efficiency: Characterize data efficiency and stability of soft prompt tuning across 1–25-shot regimes, analyze overfitting/forgetting dynamics, and compare against alternative parameter-efficient tuning methods.
  • Planning with UMA dynamics: Provide a thorough head-to-head with strong MPC/planning baselines using UMA as the forward model (optimizers, costs, sampling strategies), and analyze when motion-conditioned vs action-conditioned rollouts are preferable.
  • Robustness to extrinsic errors: Quantify UMA’s degradation under extrinsic calibration errors and propose online correction mechanisms (e.g., aligning predicted vs observed motion via SE(3) consistency checks).
  • Open-source motion supervision pipeline: Release and benchmark the full 3D motion extraction stack across datasets, reporting failure modes and providing benchmarks to standardize motion-supervised pretraining.

Practical Applications

Immediate Applications

The following applications can be deployed now using the paper’s methods and findings, given the stated system constraints and available tooling.

  • UMA-powered changeovers in manufacturing and assembly (Robotics, Industrial Automation)
    • Use case: Reprogram cell tasks like insertions, fastening, cable routing, or part reorientation from a handful of in-situ demos or short human videos recorded on the line.
    • Product/workflow: “Motion Prompt Tuner” (soft-prompt adaptation from 10–25 demos); ROS2 node that serves motion-conditioned actions; UMA Dynamics Scorer for pre-execution rollouts.
    • Assumptions/dependencies: Calibrated camera-to-base extrinsics; embodiment-specific action head trained once per robot family; access to RGB-D; short adaptation time window; on-prem GPU for low-latency inference; tasks not far OOD from pretraining distribution.
  • Rapid deployment of cleaning and logistics skills (Facilities, Warehousing, Hospitality)
    • Use case: Sweeping, wiping, sorting, or bin pre-alignment tasks adapted from a few robot demos or human videos.
    • Product/workflow: Mobile base with arm; closed-loop motion-conditioned policy that replans at 5–10 Hz; operator UI to load reference motion from a “golden” run.
    • Assumptions/dependencies: Moderate scene variability; reliable 3D point tracking during pretraining; basic end-effectors (squeegees, brooms); safety zones and stops; calibrated cameras.
  • Video-to-robot transfer to reduce labeling costs (Robotics Software, Data Ops, Academia)
    • Use case: Leverage action-free human video corpora to broaden task coverage without collecting paired action labels.
    • Product/workflow: “UMA Data Engine” to extract 3D motion tokens from raw videos using off-the-shelf trackers (e.g., TAPIP3D, MegaSAM + depth/pose) and feed UMA pretraining; dataset curation dashboards.
    • Assumptions/dependencies: Rights-cleared videos; consistent camera intrinsics for depth or monocular depth models; acceptable tracker accuracy; storage and compute for large-scale pretraining.
  • Safer execution via motion-based dynamics prediction (Robotics, Safety/QA)
    • Use case: Score candidate action sequences against desired object motion before execution using UMA’s action-conditioned motion rollouts (MPC scoring).
    • Product/workflow: UMA Dynamics Scorer embedded in a model-predictive controller; configurable thresholds to block high-risk trajectories; logs for auditability.
    • Assumptions/dependencies: Real-time inference budget; reliable candidate action sampling; calibrated extrinsics; tasks within UMA’s learned dynamics regime.
  • Teleoperation assistance and “auto-complete” (Manufacturing, Remote Ops)
    • Use case: Operator provides short motion sketch or reference clip; UMA fills in low-level actions and stabilizes execution in closed loop.
    • Product/workflow: Tablet/VR UI for selecting or sketching reference motion; UMA policy runs onboard; human-in-the-loop override.
    • Assumptions/dependencies: Low-latency networking; operator training; motion-sketch to token conversion; clear safety interlocks.
  • Few-shot curriculum building for labs and classrooms (Academia, Education)
    • Use case: Teach manipulation skills (insertion, folding, tool use) by adapting only the task latent across new scenes; run ablations on object-centric representations.
    • Product/workflow: “UMA Starter Kit” with pretrained checkpoint, ROS2 adapters, tutorial notebooks for soft-prompt tuning and evaluation; reproducible benchmarks using DROID-like setups.
    • Assumptions/dependencies: Access to RGB-D cameras and a 6-DoF arm; adherence to dataset licenses; modest GPU.
  • Simulation-to-real regularization for policy robustness (Robotics R&D)
    • Use case: Pair dense simulated action–motion data with limited real demos to reduce execution failures and improve generalization.
    • Product/workflow: Automated sim randomization pipeline that generates diverse motion-action pairs; periodic real-world validation.
    • Assumptions/dependencies: Quality simulated assets, randomized physics; sim-to-real gaps managed by domain randomization; continuous calibration checks.
  • Internal policy guidance on data use and safety (Policy, Compliance)
    • Use case: Draft organizational SOPs for using action-free videos in training, documenting calibration, and instituting pre-execution rollouts for safety.
    • Product/workflow: Data governance checklists; safety case templates including calibration records and rollout-based risk screens.
    • Assumptions/dependencies: Legal review on video data rights and privacy; integration with existing safety standards (e.g., ISO 10218, ISO/TS 15066).
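
The pre-execution rollout screening mentioned in several of the applications above can be sketched as follows. `toy_dynamics` is a hypothetical stand-in for a learned action-conditioned motion model, and the candidate sampler, error metric, and threshold are all illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def toy_dynamics(actions):
    """Hypothetical rollout model: keypoint position is the cumulative sum of actions."""
    return np.cumsum(actions, axis=0)

def screen_candidates(predict_motion, candidates, desired_motion, max_error):
    """Score each candidate by predicted-vs-desired motion MSE.

    Returns (index of best candidate, or None if even the best exceeds max_error,
             array of errors for every candidate).
    """
    errors = np.array([
        np.mean((predict_motion(actions) - desired_motion) ** 2)
        for actions in candidates
    ])
    best = int(np.argmin(errors))
    return (best if errors[best] <= max_error else None), errors

# Toy usage: an 8-step horizon with 3D end-effector displacements as actions.
rng = np.random.default_rng(0)
true_actions = rng.normal(scale=0.01, size=(8, 3))
desired_motion = toy_dynamics(true_actions)

# Three candidates: heavily perturbed, unperturbed, lightly perturbed.
candidates = [true_actions + rng.normal(scale=s, size=(8, 3)) for s in (0.05, 0.0, 0.02)]
best, errors = screen_candidates(toy_dynamics, candidates, desired_motion, max_error=1e-4)
```

Returning `None` when no candidate meets the threshold gives the controller an explicit signal to block execution, which is the behavior the safety-oriented workflows rely on.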

Long-Term Applications

The following applications are feasible with additional research, scaling, or engineering to address current limitations (embodiment-specific action heads, calibration needs, single-frame conditioning, and OOD robustness).

  • Cross-embodiment, plug-and-play manipulation across fleets (Robotics, Cloud Ops)
    • Use case: Deploy one UMA backbone across heterogeneous robots with learned action adapters, minimizing robot-specific data.
    • Product/workflow: Cloud-served UMA with automatic kinematic remapping; lightweight per-robot adapter learned from a handful of kinesthetic traces.
    • Assumptions/dependencies: Reliable action-space unification or learned remapping; on-device safety monitors; robust network connectivity.
  • Open-world household assistants learning from public videos (Consumer Robotics)
    • Use case: Home robots learn chores (tidying, dish sorting, surface cleaning, laundry) from smartphone/YouTube videos without action labels.
    • Product/workflow: “MotionSketch Studio” for selecting/annotating reference motions from videos; UMA-based home policy with continuous refinement.
    • Assumptions/dependencies: Highly robust 3D motion extraction under occlusion, variable lighting, and handheld cameras; strong safety and privacy guarantees; certification/regulation.
  • Hospital logistics and sterile manipulation (Healthcare, MedTech)
    • Use case: Non-contact tasks (instrument staging, linen folding, cart loading) adapted from brief staff videos; strict safety oversight.
    • Product/workflow: UMA integrated with hospital robots; pre-execution rollouts plus force/vision monitors; audit trails.
    • Assumptions/dependencies: Regulatory approval; validated failure bounds; additional sensing (force/torque, tactile); robust decontamination workflows.
  • Construction and field tool use (AEC, Utilities, Energy)
    • Use case: Tool-centric tasks (sweeping debris, applying sealant, cable manipulation) guided by motion references; adaptation to outdoor variability.
    • Product/workflow: Multi-camera UMA with history-aware conditioning and dense cross-frame correspondences; ruggedized hardware.
    • Assumptions/dependencies: Multi-view calibration at scale; weather and dust-robust perception; improved OOD generalization.
  • Autonomous factory lines with intent mining from video logs (Manufacturing, MES/PLM)
    • Use case: Infer task latents from historical production videos to document processes, auto-generate specs, and bootstrap new cells.
    • Product/workflow: “UMA Planner” module integrated with MES/PLM to extract task embeddings, simulate rollouts, and propose controllers.
    • Assumptions/dependencies: Reliable retrieval and indexing of long-horizon video; multi-step observation history in UMA; traceability and versioning.
  • Shared autonomy via real-time intent inference (Human–Robot Collaboration)
    • Use case: Robots infer human intent from ambient video, aligning their actions on the fly without explicit commands.
    • Product/workflow: On-device task-latent inference with continual update; proactive assistance cues; safety-certified interaction policies.
    • Assumptions/dependencies: Fast, reliable intent estimation; social and physical safety layers; privacy-preserving video processing.
  • Sector-wide standards for object-motion representations (Policy, Standards)
    • Use case: Establish common formats and benchmarks for motion tokens, calibration metadata, and safety validation with rollout models.
    • Product/workflow: NIST-style evaluation suites; data/format registries; compliance badges for motion-extraction pipelines.
    • Assumptions/dependencies: Multi-stakeholder alignment (academia, vendors, regulators); testbeds spanning rigid/deformable/tool tasks.
  • Tooling ecosystem around object-motion interfaces (Software, Integrators)
    • Use case: Mature product line including MotionSketch Studio (task authoring), UMA SDK (ROS2/Python), and UMA Dynamics Scorer (planning).
    • Product/workflow: End-to-end toolchain from video import to deployment and monitoring; integrations with Isaac/ROS MoveIt and commercial planners.
    • Assumptions/dependencies: Vendor support; long-term maintenance; clear licensing for pretrained backbones and datasets.
  • Energy infrastructure maintenance (Energy, Renewables)
    • Use case: Robots learn panel cleaning, blade inspection prep, cable handling from tech videos; plan and verify motion with UMA rollouts.
    • Product/workflow: Fleet management with cloud UMA; site-specific adaptation from a few human videos; MPC-based safety screening.
    • Assumptions/dependencies: Outdoor robustness; remote calibration; compliance with utility safety regulations.
  • ROI modeling and procurement strategies (Operations/Finance)
    • Use case: Quantify savings from replacing action-labeled data collection with action-free video pretraining and few-shot adaptation.
    • Product/workflow: Cost calculators bundled with UMA deployment proposals; dashboards tracking success rates and failure modes over time.
    • Assumptions/dependencies: Access to historical costs; reliable KPIs (success %, time-to-changeover); consistent evaluation protocols.

Glossary

  • 6-DoF: Six degrees of freedom describing a rigid body's 3D position and orientation (x, y, z, roll, pitch, yaw). "6-DoF manipulation (Insertion)"
  • Adaptive layer norm: A variant of layer normalization whose parameters are modulated (e.g., by diffusion time) to condition the network. "condition on diffusion time via adaptive layer norm."
  • Action-conditioned dynamics prediction: Forecasting future environment or object states given a candidate sequence of actions. "For action-conditioned dynamics prediction, used inside model predictive control, the model conditions on candidate future actions in $\mathcal{G}_a$ and sets $\mathcal{P}_x$ to the resulting object-motion rollout."
  • Action-free video: Video data without corresponding action labels, used for learning from passive observations. "leaving action-free video data unused."
  • Contrastive objective: A learning objective that pulls together representations of similar inputs and pushes apart dissimilar ones to enforce invariances. "a contrastive objective that disentangles task intent from scene geometry"
  • Diffusion policies: Control policies trained via diffusion-style denoising objectives to generate actions. "standard diffusion policies~\citep{chi2025diffusion}"
  • Diffusion Transformer (DiT): A transformer architecture used as the denoising backbone in diffusion models. "a masked diffusion transformer~\citep{Peebles2022DiT, dasari2024ditpi} with structured spatiotemporal attention"
  • Embodiment-agnostic: Independent of a specific robot’s physical form or action space, enabling cross-robot generalization. "without an embodiment-agnostic intermediate"
  • Extrinsics: Camera-to-robot or camera-to-world calibration parameters used to transform between coordinate frames. "transformed to the robot base frame at deployment via calibrated extrinsics."
  • Flow matching: A training technique that learns continuous-time vector fields to transport noise to data, used as an alternative to standard diffusion losses. "trained with flow matching~\citep{lipman_flow_matching_2023}"
  • Forward dynamics model: A model that predicts how the state of a system changes in response to actions. "particle- and point-based forward dynamics models predict scene response to candidate actions"
  • Hindsight Experience Replay (HER): A relabeling strategy that treats outcomes achieved during exploration as goals for training, improving sample efficiency. "We further extend hindsight experience replay~\cite{andrychowicz2017her} from goal-conditioned to motion-conditioned training"
  • Latent rollouts: Simulated future trajectories performed in a learned latent state space for planning or evaluation. "latent rollouts for planning~\cite{ebert2018visualforesight, hafner2019planet, hafner2019dreamer, hafner2020dreamerv2}"
  • Masked autoencoding: Learning to reconstruct masked portions of the input, here over motion and action tokens, to enable self-supervised pretraining. "masked autoencoding over object-motion and action tokens."
  • Masked DiT blocks: DiT (Diffusion Transformer) blocks adapted to apply different modulation to masked (predicted) versus unmasked (conditioning) tokens. "these Masked DiT blocks apply stronger denoising to targets while preserving unmodulated conditioning"
  • Masked generative objective: A unified training objective that predicts masked tokens (e.g., motion or actions) conditioned on the unmasked context. "under a masked generative objective"
  • Model predictive control (MPC): A control method that plans actions by repeatedly optimizing over predicted future trajectories. "supports model predictive control by scoring candidate actions against the target reference motion"
  • Motion-conditioned visuomotor control: Generating robot actions conditioned on desired object motion trajectories given visual inputs. "motion-conditioned visuomotor control"
  • PointNet++: A neural architecture for hierarchical feature extraction from point clouds. "via a PointNet++~\citep{qi2017pointnet++} backbone"
  • Reference motion: A partial motion trajectory used to specify task intent for conditioning or encoding. "We call such a partial observation a reference motion"
  • SE(3): The group of 3D rigid-body transformations (rotations and translations). "random SE(3) transformations applied jointly to $o_0$ and $x_c$ at the encoder input"
  • SimCLR: A contrastive learning framework that uses augmented views to learn invariant representations. "a SimCLR-style contrastive objective $\mathcal{L}_c$~\citep{chen2020simple}"
  • Soft prompt tuning: Adapting a frozen model to new tasks by optimizing a small learnable prompt or latent vector instead of the full weights. "fast adaptation through soft prompt tuning at deployment."
  • Spatiotemporal attention: Attention mechanisms structured to capture relationships across space and time in sequence data. "structured spatiotemporal attention~\citep{coil2025}"
  • Task latent: A learned latent variable encoding task intent in a scene- and embodiment-invariant way. "and $c$ is a latent variable we call the task latent"
  • Vision–Language–Action (VLA) models: Models that jointly process visual inputs, natural language, and action outputs for robotic control. "vision--language--action models~\cite{brohan_rt-2_2023, kim24openvla, pi0}"
  • World models: Predictive models of environment dynamics, often learned from pixels or latent features, used for planning or policy learning. "world models~\cite{hafner2020dreamerv2, yang2023unisim, agarwal2025cosmos}"
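
To make the glossary entries for model predictive control and action-conditioned dynamics prediction more concrete, here is a minimal sketch of scoring candidate action sequences against a reference motion. It is not the paper's implementation: `toy_dynamics` is a hypothetical stand-in for a learned dynamics model, the random-shooting sampler is a generic planning heuristic chosen for brevity, and all names and parameters are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Action = Tuple[float, float, float]


def toy_dynamics(start: Point, actions: Sequence[Action]) -> List[Point]:
    """Hypothetical stand-in for a learned model: the object shifts by each action."""
    x, y, z = start
    rollout = []
    for dx, dy, dz in actions:
        x, y, z = x + dx, y + dy, z + dz
        rollout.append((x, y, z))
    return rollout


def motion_error(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Mean squared Euclidean distance between predicted and reference waypoints."""
    total = 0.0
    for p, r in zip(pred, ref):
        total += sum((pi - ri) ** 2 for pi, ri in zip(p, r))
    return total / len(ref)


def plan_by_random_shooting(
    start: Point,
    reference: Sequence[Point],
    dynamics: Callable[[Point, Sequence[Action]], List[Point]],
    num_candidates: int = 256,
    max_step: float = 0.1,
    seed: int = 0,
) -> Tuple[List[Action], float]:
    """Sample action sequences, roll each out, and keep the lowest-error one."""
    rng = random.Random(seed)
    horizon = len(reference)
    best_actions: List[Action] = []
    best_cost = float("inf")
    for _ in range(num_candidates):
        actions = [
            (
                rng.uniform(-max_step, max_step),
                rng.uniform(-max_step, max_step),
                rng.uniform(-max_step, max_step),
            )
            for _ in range(horizon)
        ]
        cost = motion_error(dynamics(start, actions), reference)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    reference = [(0.05, 0.0, 0.0), (0.10, 0.0, 0.0)]  # desired object waypoints
    actions, cost = plan_by_random_shooting((0.0, 0.0, 0.0), reference, toy_dynamics)
    print("first action to execute:", actions[0])
    print("predicted motion error:", cost)
```

In a receding-horizon loop, only the first action of the best sequence would typically be executed before replanning from the new observation.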
