Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Published 17 Jun 2026 in cs.RO and cs.CV | (2606.19333v1)

Abstract: How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

Summary

  • The paper introduces a novel two-stage pipeline that reconstructs 4D hand-object dynamics and robustly retargets them onto multi-fingered robotic hands.
  • It combines vision foundation models (HaWoR, SAM 3D, SAM 3, MoGe) with dynamics-aware retargeting that adds warmup steps, random force perturbations, and a transition reward to make retargeted grasps more robust.
  • Retargeting success rises from 25% to 71% on reconstructed in-the-wild references and from 72% to 81% on clean MoCap data, and the resulting trajectories are also executed on real robot hardware.

Scalable Dexterous Manipulation Dataset Generation from Monocular Human Videos with DO AS I DO

Problem Statement and Motivation

The primary challenge addressed is the scalable collection of robot-relevant dexterous manipulation data on human-like, multi-fingered robotic hands. While observational human video data is abundant, existing methodologies either require specialized sensor setups (e.g., MoCap, RGBD), rely on restrictive assumptions (predefined object classes, simple grasping), or perform poorly on in-the-wild monocular RGB videos due to the complexity of hand-object interaction estimation and profound embodiment gaps. This limits the usability of the vast corpus of monocular human videos for robotic learning, particularly for dexterous manipulation tasks.

Methodology

DO AS I DO comprises a two-stage pipeline: monocular hand-object 4D reconstruction, followed by robust, dynamics-aware retargeting onto dexterous robot hands.

Hand-Object Interaction Reconstruction

The pipeline exploits recent foundation models and robust optimization heuristics to resolve the hand and object state trajectories:

  1. Hand Tracking: Utilizes HaWoR [45], a world-space hand pose tracker, to robustly estimate hand pose in diverse video conditions.
  2. Object Reconstruction and Tracking: Employs SAM 3D [11], an image-conditioned 3D generative diffusion model, to reconstruct the object's shape and to estimate its pose in every frame. Temporal consistency is enforced through a guided-diffusion tracking mechanism that anchors each frame's pose inference to the previous frame's pose estimate.
  3. Segmentation and Depth Estimation: Combines SAM 3 [64] for mask generation and MoGe [10] for monocular depth and camera intrinsic estimation, facilitating accurate spatial alignment.
  4. Hand-Object Alignment: Aligns the reconstructed object to the tracked hand by a depth-consistent translation scaling method, achieving a consistent near-metric 4D trajectory.

Two components are new. The first is adaptive guidance of object pose estimation: 2D point tracking sets the guidance strength of the guided-diffusion tracker dynamically. The second is an efficient pose selection phase that clusters candidate poses in SE(3), which was empirically validated to match the performance of computationally intensive likelihood-based ranking.
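
The consensus idea behind pose selection can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each sampled pose is a rotation matrix plus a translation, uses a hypothetical distance (geodesic rotation angle plus a weighted translation distance), and stands in for clustering with a simple neighbor count. The function names and the `radius` and `trans_weight` parameters are assumptions made for this example.

```python
import math

def rot_z(theta):
    """Rotation matrix for a rotation of theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_angle(Ra, Rb):
    """Geodesic angle between two rotations, from trace(Ra^T Rb)."""
    trace = sum(Ra[i][j] * Rb[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)

def pose_distance(pose_a, pose_b, trans_weight=1.0):
    """Hypothetical SE(3) distance: rotation angle + weighted translation gap."""
    (Ra, ta), (Rb, tb) = pose_a, pose_b
    return rotation_angle(Ra, Rb) + trans_weight * math.dist(ta, tb)

def select_consensus_pose(poses, radius=0.2):
    """Return the index of the sample with the most neighbors within `radius`.

    Ties are broken by the smallest total distance to those neighbors.
    """
    best_index, best_key = None, None
    for i, pose in enumerate(poses):
        dists = [pose_distance(pose, other) for other in poses]
        inliers = [d for d in dists if d <= radius]
        key = (-len(inliers), sum(inliers))
        if best_key is None or key < best_key:
            best_index, best_key = i, key
    return best_index

# Three samples agree closely; the fourth is an outlier.
samples = [
    (rot_z(0.00), [0.00, 0.00, 0.5]),
    (rot_z(0.05), [0.01, 0.00, 0.5]),
    (rot_z(-0.04), [0.00, 0.01, 0.5]),
    (rot_z(3.00), [0.30, 0.20, 0.9]),
]
consensus_index = select_consensus_pose(samples)
```

A neighbor count only needs pairwise distances between samples, so it avoids scoring each sample under the generative model, which is the kind of saving the clustering step is described as providing.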

Dynamics-Aware Dexterous Retargeting

Mapping reconstructed human hand motions to robotic actions is executed via a trajectory optimization approach built on parallelizable physics simulation (MuJoCo Warp [13]):

  1. Sampling-Based Optimization: Adopts a Model Predictive Path Integral (MPPI)-style annealed sampler [15] to generate dynamically feasible robot hand trajectories tracking the noisy, reconstructed references.
  2. Retargeting Innovations: Three key algorithmic innovations were introduced:
    • Warmup Steps: Initial trajectory warmup allows the robot hand to reach feasible pre-grasp configurations, mitigating irrecoverable states arising from reference noise.
    • Random Force Perturbations: Sim-to-real-style domain randomization via sampled forces during rollout, enforcing robustness to reference inaccuracies and simulating real-world uncertainties.
    • Transition Reward: Auxiliary rewards/penalties on critical hand-object contact transitions, specifically penalizing missed grasping or placement events.
  3. Real-World Deployment: The resulting trajectories are mapped to the robot's workspace and executed on physical bimanual platforms comprising Sharpa Wave hands and UR3e arms.
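
To make the sampling-based optimization concrete, here is a minimal sketch of an MPPI-style update with an annealed noise scale and random force perturbations during rollouts. It is a toy under stated assumptions, not the paper's retargeter: the "robot" is a one-dimensional unit point mass rather than a hand in MuJoCo Warp, the cost is plain squared tracking error, and every name and hyperparameter (`sigma0`, `decay`, `temperature`, `push_std`) is hypothetical.

```python
import math
import random

DT = 0.05  # integration step of the toy simulator (assumed)

def rollout_cost(forces, reference, rng=None, push_std=0.0):
    """Simulate a unit point mass and return the summed squared tracking error.

    If `rng` is given, a random force with standard deviation `push_std`
    is added at every step to mimic perturbations during rollout.
    """
    pos, vel, cost = 0.0, 0.0, 0.0
    for force, target in zip(forces, reference):
        push = rng.gauss(0.0, push_std) if rng is not None else 0.0
        vel += (force + push) * DT
        pos += vel * DT
        cost += (pos - target) ** 2
    return cost

def annealed_mppi(reference, iterations=40, samples=128, sigma0=2.0,
                  decay=0.9, temperature=0.1, push_std=0.5, seed=0):
    """Refine a force sequence by softmax-weighted averaging of noisy samples."""
    rng = random.Random(seed)
    mean = [0.0] * len(reference)
    sigma = sigma0
    for _ in range(iterations):
        candidates, costs = [], []
        for _ in range(samples):
            candidate = [m + rng.gauss(0.0, sigma) for m in mean]
            candidates.append(candidate)
            costs.append(rollout_cost(candidate, reference, rng, push_std))
        best = min(costs)
        weights = [math.exp(-(c - best) / temperature) for c in costs]
        total = sum(weights)
        mean = [sum(w * cand[t] for w, cand in zip(weights, candidates)) / total
                for t in range(len(reference))]
        sigma *= decay  # anneal: explore broadly first, then refine
    return mean

# A smooth reference position trajectory for the toy system to track.
reference = [0.5 * (1.0 - math.cos(0.1 * t)) for t in range(40)]
initial_cost = rollout_cost([0.0] * len(reference), reference)
plan = annealed_mppi(reference)
final_cost = rollout_cost(plan, reference)
```

Because each candidate is scored under random pushes, plans that only work under exactly nominal dynamics tend to receive lower weight, which is the intuition behind using perturbations for robustness.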

Empirical Evaluation

Hand-Object Reconstruction

On the DexYCB [71] and HOI4D [72] benchmarks, the method sets a new state of the art in F-5/F-10 score and Chamfer distance, outperforming prior image- and video-based methods ([48, 49, 50, 51, 54]). In human studies on 150 internet and synthetic videos, DO AS I DO was preferred 67% of the time over FoundationPose [17]. The approach tracks object pose more reliably under occlusion and motion blur, yielding temporally consistent reconstructions where prior methods fail.
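
For readers unfamiliar with the reconstruction metrics, the sketch below computes a symmetric Chamfer distance and an F-score at a distance threshold between two small point sets. It rests on assumptions: F-5 and F-10 are read here as F-scores at 5 mm and 10 mm thresholds, which is a common convention, and the paper's exact evaluation protocol (sampling density, alignment, squared versus unsquared distances) is not given in this summary. Function names are illustrative.

```python
import math

def nearest_distances(source, target):
    """For each source point, the distance to its nearest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean nearest distance in both directions."""
    pred_to_gt = nearest_distances(pred, gt)
    gt_to_pred = nearest_distances(gt, pred)
    return sum(pred_to_gt) / len(pred_to_gt) + sum(gt_to_pred) / len(gt_to_pred)

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    pred_to_gt = nearest_distances(pred, gt)
    gt_to_pred = nearest_distances(gt, pred)
    precision = sum(d <= threshold for d in pred_to_gt) / len(pred_to_gt)
    recall = sum(d <= threshold for d in gt_to_pred) / len(gt_to_pred)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Toy point sets in millimeters: one point is 3 mm off, one is 8 mm off.
gt_points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
pred_points = [(0.0, 0.0, 3.0), (10.0, 0.0, 0.0), (0.0, 10.0, 8.0)]

chamfer = chamfer_distance(pred_points, gt_points)
f5 = f_score(pred_points, gt_points, 5.0)
f10 = f_score(pred_points, gt_points, 10.0)
```

In this toy case the point that is 8 mm off counts against the 5 mm score but not the 10 mm score, which is why the two thresholds are reported together.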

Ablation Studies

Adaptive pose guidance and clustering-based sample selection both contribute substantially to performance, with clustering matching likelihood-based ranking at lower computational cost. This highlights the necessity of temporal adaptation and robust sample consensus in 3D object tracking from monocular video.

Dexterous Retargeting

Experiments cover both reconstructed in-the-wild references (n=655) and a large-scale clean MoCap dataset (OakInk2 [73], n=1352). On the reconstructed references, success rises from 25% for the annealed sampling baseline (SPIDER [15]) to 71% with DO AS I DO; on MoCap data, it rises from 72% to 81%. The gains stem mainly from the warmup and transition reward terms. Grasps are also qualitatively more robust and natural, across simulated and real-world executions spanning a range of tasks and object types.
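
A success rate of this kind can be computed from per-trajectory tracking errors. The sketch below assumes success means that the mean object position error and mean rotation error both stay under fixed thresholds. The threshold values, the quaternion convention (w, x, y, z), and the function names are hypothetical, since the paper's exact criterion is not reproduced in this summary.

```python
import math

def quat_angle(qa, qb):
    """Geodesic angle in radians between two unit quaternions (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(qa, qb)))
    return 2.0 * math.acos(min(1.0, dot))

def tracking_errors(sim_pos, ref_pos, sim_quat, ref_quat):
    """Mean position error and mean rotation error over a trajectory."""
    n = len(ref_pos)
    pos_err = sum(math.dist(s, r) for s, r in zip(sim_pos, ref_pos)) / n
    rot_err = sum(quat_angle(s, r) for s, r in zip(sim_quat, ref_quat)) / n
    return pos_err, rot_err

def is_success(pos_err, rot_err, max_pos_err=0.05, max_rot_err=0.5):
    """Hypothetical thresholds: 5 cm mean position error, 0.5 rad rotation."""
    return pos_err <= max_pos_err and rot_err <= max_rot_err

def success_rate(outcomes):
    """Fraction of trajectories marked successful."""
    return sum(outcomes) / len(outcomes)

identity = (1.0, 0.0, 0.0, 0.0)
yaw_02 = (math.cos(0.1), 0.0, 0.0, math.sin(0.1))  # 0.2 rad about z

pos_err, rot_err = tracking_errors(
    sim_pos=[(0.0, 0.0, 0.01), (0.0, 0.0, 0.13)],
    ref_pos=[(0.0, 0.0, 0.00), (0.0, 0.0, 0.10)],
    sim_quat=[identity, yaw_02],
    ref_quat=[identity, identity],
)
trajectory_ok = is_success(pos_err, rot_err)
rate = success_rate([True, False, True, True])
```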

Data Curation Insights

A systematic analysis on filtering human internet video datasets (e.g., 100DOH [76]) revealed that less than 5% of raw clips are suitable for dexterous manipulation learning, underscoring the importance of rigorous pre-processing and quality control in scaling human demonstration data for robotics.
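
A back-of-the-envelope calculation shows what this attrition means for data collection. The sketch assumes, purely for illustration, that the usable fraction after filtering and the retargeting success rate apply independently and multiply. The 5% figure is the upper bound quoted above, 71% is the retargeting success reported on reconstructed references, and the target of 500 trajectories is an example size, not a requirement from the paper.

```python
import math

def raw_clips_needed(target_trajectories, usable_fraction, retarget_success):
    """Estimate how many raw clips are needed to reach a target dataset size."""
    if not (0.0 < usable_fraction <= 1.0 and 0.0 < retarget_success <= 1.0):
        raise ValueError("fractions must lie in (0, 1]")
    overall_yield = usable_fraction * retarget_success
    return math.ceil(target_trajectories / overall_yield)

# 500 / (0.05 * 0.71) is roughly 14,085 raw clips.
clips = raw_clips_needed(500, usable_fraction=0.05, retarget_success=0.71)
```

Under these assumptions, a few hundred usable trajectories already require on the order of ten thousand raw clips, which is why automated pre-filtering matters at scale.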

Practical and Theoretical Implications

DO AS I DO materially broadens the applicability of monocular human video for generating physically executable robot trajectories, removing the need for specialized capture hardware or curated datasets. By leveraging vision foundation models and robust optimization in physics simulation, the pipeline closes critical gaps in hand-object interaction perception and human-to-robot embodiment mapping. Practically, it enables the construction of large, diverse, and realistic robot manipulation datasets by simply "watching" everyday human activity, with minimal annotation or hardware overhead.

Theoretically, the work demonstrates that reliable 4D hand-object state estimation and physically feasible retargeting are possible even under significant observation noise and model mismatch, provided that robust temporal priors and sim-to-real optimization techniques are employed. The modularity of the pipeline also allows future integration with generalist world models, richer parameterized object priors, or scene-level reasoning frameworks.

Limitations and Future Directions

Primary limitations include the rigid-object assumption and sensitivity to monocular depth ambiguities, which restrict applicability to non-deformable object manipulation and impact precise contact understanding. The approach currently does not reconstruct or reason about broader scene context (e.g., obstacles, articulated environments), which will be vital for generalizing to real-world tasks involving complex hand-scene interactions. Further, real-world deployment fidelity is inherently capped by the accuracy and expressiveness of contemporary physics simulators.

Future research directions include relaxing rigidity constraints (e.g., via generative models for articulated or deformable objects), improving monocular depth and contact estimation, scaling to continuous in-the-wild video streams, and integrating scene-level geometric priors and affordance reasoning for holistic observation-to-action pipelines. Enhancements in physics engine realism and transfer learning across morphologically distinct robots will further elevate deployment efficacy.

Conclusion

DO AS I DO establishes a robust pipeline for extracting and retargeting dexterous manipulation trajectories from ordinary human videos to multi-fingered robotic hands, setting new state-of-the-art results on reconstruction and retargeting benchmarks and demonstrating the pipeline in both simulation and real-world rollouts. The framework offers substantial promise for dataset scaling and behavior learning in robotic manipulation, given continued advances in 3D perception, optimization under uncertainty, and simulation-to-reality transfer.

Explain it Like I'm 14

Overview

This paper is about teaching robot hands to do tricky, human-like tasks (like whisking, pouring, writing, or picking up objects) just by watching everyday videos of people. The system they built is called DO AS I DO. It takes a normal phone-style video (one camera, regular color frames) of someone using their hands and turns it into step-by-step robot actions that a real robot hand can perform.

What questions does the paper ask?

  • Can we turn the huge number of regular human videos online into useful training data for robots?
  • Can we figure out, from just a video, where the hand and the object are in 3D (how they move, rotate, and touch)?
  • Can we “retarget” those human motions onto a very different robot hand so the robot can actually do the same task, safely and reliably?

How does it work? (Methods in everyday language)

Think of DO AS I DO as a two-part process: first “understand the video,” then “make the robot do it.”

Part 1: Understanding the video (Reconstruction)

  • Goal: Recreate what happened in 3D—where the person’s hand and the object were and how they moved over time.
  • Inputs: A normal video (no special sensors), frame by frame.

What they do:

  • Track the hand: They use a hand-tracking model to estimate the 3D position and shape of the human hand across frames.
  • Find and model the object: They use smart vision tools to segment the object (separate it from the background) and to “inflate” a single image of it into a 3D shape (like turning a photo into a simple 3D toy model).
  • Follow the object through time: They use a “guided diffusion” tracker. Imagine starting with a blurry guess for the object’s position each frame and slowly sharpening it, while gently nudging it to stay consistent with the previous frame so it doesn’t “jump” around. They also estimate how fast the object seems to rotate in 2D to tune how strong that nudge should be.
  • Pick a stable pose each frame: They sample several possible object positions/orientations for that frame and choose the best one by finding a consensus (the one most samples agree on), which is fast and reliable.
  • Line up scales: Because hand and object are reconstructed separately, they have to align their sizes and positions. They do this using estimated depth (how far things are) and make sure the whole scene lines up with gravity, so “down” in the video is also “down” for the robot.
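
The gravity step can be pictured as a single rotation applied to the whole reconstructed scene. The sketch below builds a rotation that maps an estimated "down" direction onto the world's negative z-axis using Rodrigues' formula. How the pipeline actually estimates the gravity direction is not described here, so the input vector, the choice of negative z as "down", and all function names are assumptions for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def rotation_aligning(a, b):
    """Rotation matrix R such that R @ unit(a) equals unit(b)."""
    a, b = normalize(a), normalize(b)
    v = cross(a, b)
    c = sum(x * y for x, y in zip(a, b))
    if c < -1.0 + 1e-9:
        # Opposite vectors: rotate by pi about any axis perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = normalize(cross(a, helper))
        return [[2.0 * u[i] * u[j] - (1.0 if i == j else 0.0)
                 for j in range(3)] for i in range(3)]
    # Rodrigues' formula: R = I + K + K^2 / (1 + c), with K the skew matrix of v.
    K = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    K2 = matmul(K, K)
    return [[(1.0 if i == j else 0.0) + K[i][j] + K2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]

# Example: the video's estimated "down" is mostly along the camera's +y axis.
estimated_down = [0.1, 0.9, 0.2]
world_down = [0.0, 0.0, -1.0]
R = rotation_aligning(estimated_down, world_down)
aligned = matvec(R, normalize(estimated_down))
```

Applying the same `R` to every hand and object position would put the reconstructed motion in a frame where "down" matches the simulator's gravity.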

Key terms explained:

  • Monocular RGB video: A regular color video from one camera (like a phone) without any depth sensor.
  • 3D reconstruction: Estimating a 3D shape and position from 2D images—like guessing the shape of a cup from a photo.
  • Diffusion model (here, guided diffusion): A method that starts from noise and refines a guess step by step into a clean, consistent result.

Part 2: Making the robot do it (Retargeting)

  • Goal: Convert the human hand/object motion into robot hand actions that actually work in the real world.
  • Challenge: Human hands and robot hands are very different (finger lengths, joints). Simply copying angles won’t work, and it might cause the robot to drop or poke through the object.

What they do:

  • Physics-aware optimization: They try lots of small variations of robot movements inside a fast physics simulator and pick the ones that best match the target motion while obeying real-world rules (no ghosting through objects, stable grasps).
  • Three practical upgrades that make this robust to messy video estimates:
    • Warmup steps: The robot gets a short “practice” phase before the main motion starts. The object is temporarily held in place so the robot can adjust its pose and get a good grasp rather than starting in a bad position and failing right away.
    • Random force bumps: They add tiny random pushes in simulation (like gentle shoves) so the robot learns grasps that can handle small disturbances, which helps in the real world too.
    • Transition reward: They add a bonus/penalty around key moments—like when the object should be on the table vs. in the hand—so the robot clearly learns to pick up and put down rather than hovering awkwardly.
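
One way to picture the transition reward is as a penalty applied when the simulated robot misses a pick-up or put-down that the video shows. The sketch below labels each reference frame as "rest" or "in_hand" with a simple hand-object distance threshold, finds the frames where the label changes, and penalizes disagreement shortly after each change. The threshold, window, weight, and function names are hypothetical, and the paper's actual reward may be shaped differently.

```python
def label_phases(hand_object_dist, threshold=0.02):
    """Label frames 'in_hand' when the hand is within `threshold` of the object."""
    return ["in_hand" if d <= threshold else "rest" for d in hand_object_dist]

def transition_frames(phases):
    """Frames at which the phase differs from the previous frame."""
    return [t for t in range(1, len(phases)) if phases[t] != phases[t - 1]]

def transition_penalty(ref_phases, sim_phases, window=2, weight=10.0):
    """Penalize frames right after each reference transition where sim disagrees."""
    penalty = 0.0
    for t in transition_frames(ref_phases):
        end = min(len(ref_phases), t + window + 1)
        missed = sum(ref_phases[k] != sim_phases[k] for k in range(t, end))
        penalty += weight * missed
    return penalty

# Reference: the hand approaches, holds the object for three frames, releases.
distances = [0.10, 0.06, 0.015, 0.01, 0.01, 0.05, 0.12]
ref_phases = label_phases(distances)

# A rollout that never grasps is penalized around the missed pick-up.
never_grasped = ["rest"] * len(ref_phases)
missed_penalty = transition_penalty(ref_phases, never_grasped)
perfect_penalty = transition_penalty(ref_phases, ref_phases)
```

Concentrating the penalty around transitions means a rollout that tracks the pose well on average but never actually lifts the object still scores poorly.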

Analogy:

  • Imagine learning a dance from a video. First, you figure out the steps (reconstruction). Then you practice the moves on your own body, adjusting for your height and flexibility (retargeting), and you try on a stage that mimics real physics (no floating!). Warmup is your rehearsal; random bumps are like practicing with small distractions; transition rewards are cues to hit your marks.

Main findings and why they matter

The authors tested both parts carefully:

  • Better 3D tracking from regular videos:
    • On standard benchmarks, their method outperformed previous top approaches at recovering how objects move with hands in 3D.
    • On 150 “in-the-wild” videos (messy, real-life clips), humans preferred their object tracking over a strong baseline 67% of the time. This means it stays locked onto the object more reliably despite blur or occlusion.
  • Stronger robot retargeting:
    • On challenging, reconstructed video references, their full method achieved a 71% success rate (vs. 25% for a strong baseline without their upgrades).
    • On a big, clean motion-capture dataset, they improved success from 72% to 81%—showing that the method scales and helps even when the reference is already good.
  • Real-world demos:
    • They produced 500 high-quality robot trajectories from internet, egocentric, and AI-generated videos.
    • They ran many tasks on real robot hands and arms (like whisking, pouring, writing, picking), proving the pipeline can go from “video on the internet” to “robot actually doing it.”

Why this matters:

  • Most human know-how is available as videos. Turning those videos into real robot skills could massively speed up how robots learn—especially for complex, finger-based tasks.

Implications and impact

  • Scaling robot learning: If robots can learn from regular videos, we can build huge training sets without expensive motion-capture labs or teleoperation sessions.
  • Practical advice for data collection: The authors found that only around 4–5% of random “hands interacting with objects” web clips are truly useful for dexterous learning. Careful filtering and curation are critical.
  • Current limits:
    • Works best with rigid objects (not squishy or flexible).
    • Estimating exact contact from a single camera is hard—sometimes what looks like touching is just overlapping in 2D.
    • It doesn’t reconstruct the whole room or all obstacles.
    • Simulators aren’t perfect, so there’s still a gap to the real world.

Big picture:

  • DO AS I DO is a promising step toward robots that can watch us do everyday tasks and learn to do them too. With better video models, richer scene understanding, and improved simulators, this approach could help robots handle a much wider range of real-life activities.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or left unexplored, framed to guide actionable future research.

  • Reconstruction: reliance on monocular metric depth
    • How to self-calibrate absolute scale robustly from monocular RGB when MoGe’s metric depth is inaccurate, especially under strong parallax, rolling shutter, or unknown intrinsics?
    • Can scene priors (e.g., human hand anthropometrics, gravity constraints, furniture dimensions) be used to regularize or correct absolute scale?
  • Contact inference from monocular video
    • How to disambiguate true physical contact vs. visual occlusion from RGB alone?
    • Can physics-informed priors, uncertainty-aware tracking, or multi-modal cues (audio, IMU) enable reliable contact state estimation and timing?
  • Rigid-object assumption
    • Extension to articulated (e.g., scissors, lids), deformable (cloth, sponges), and fluid-containing objects (pouring) remains open; what object representations and simulators are needed?
    • How to model within-video object state changes (e.g., deformation, liquid level) and retarget them physically?
  • Single-object, partial-scene modeling
    • The method reconstructs only the hand and a single object; how to reconstruct full scenes (surfaces, obstacles, containers, articulated furniture) and reason about hand-scene and object-scene constraints?
    • How to scale to multi-object interactions, tool use, and tasks requiring target objects (e.g., hammer-and-nail, writing on paper)?
  • Temporal consistency and drift
    • Pose tracking is per-frame guided diffusion with shape fixed at an anchor frame; can a global, temporally coherent optimization (e.g., bundle-adjustment over SE(3) and shape) reduce drift and re-acquire under long occlusions?
    • How to jointly optimize hand and object trajectories over time, rather than modular post-hoc alignment?
  • Physical parameter estimation
    • Mass, inertia, and friction are not inferred; how to estimate or identify object/material physical properties from video to improve simulation fidelity and grasp stability?
  • Uncertainty quantification and propagation
    • The tracker selects a single pose sample; how to represent and propagate pose uncertainty (e.g., distributions over SE(3)) into retargeting, planning, and safety checks?
  • Robustness under adverse imaging conditions
    • Performance under severe motion blur, specular/transparent objects, tiny objects, extreme occlusions, and rapid egocentric motion is not characterized; what robustness interventions (e.g., temporal super-resolution, reflection-aware priors) help?
  • Bimanual reconstruction at scale
    • Although retargeting is evaluated on bimanual MoCap, in-the-wild reconstruction of two hands interacting with one or more objects is not demonstrated; how to robustly reconstruct and retarget hand–hand–object interactions?
  • Coupled arm–hand retargeting
    • Retargeting optimizes the hand in simulation and then applies arm IK post hoc; how to jointly optimize arm and hand (whole-body) under environmental constraints to avoid collisions and exploit arm contacts?
  • Contact-aware retargeting objectives
    • Current objectives emphasize pose/rotation tracking with simple transition penalties; how to incorporate explicit, learned or physics-derived contact goals (contact locations, normal forces, frictional stability, no-slip constraints)?
  • Warmup and welded-object heuristic
    • The warmup strategy “holds” the object to recover from poor initial states; can contact timing and grasp initiation be inferred directly from video and enforced without non-physical welds?
  • Task-level success beyond pose error
    • Success is defined by mean position/rotation error; how to evaluate semantic task completion (e.g., liquid transfer angle/flow, paint coverage, screw insertion), grasp robustness, and force stability in sim and real?
  • Real-world evaluation scale and repeatability
    • Real deployments show 10 qualitative tasks without systematic success rates or variability analysis; can large-scale, repeatable real-world benchmarks (dozens of tasks × trials) quantify reliability and failure modes?
  • Automatic video-to-robot world registration
    • Manual (x, y, z, yaw) alignment is required; how to automatically localize and register the reconstructed scene to the robot workspace (e.g., via fiducials, object recognition, SLAM) for fully hands-off deployment?
  • Data utility for policy learning
    • The paper generates 500 trajectories but does not measure downstream learning gains; do policies trained on this data outperform teleoperation/simulation baselines on held-out tasks and objects?
  • Data triage and quality control at scale
    • The “playbook” identifies heavy attrition in internet clips, but no automated filtering is provided; can scalable, model-based triage (quality scoring, contact visibility, action detectability) be built to preselect useful videos?
  • Computational efficiency and throughput
    • Guided diffusion with per-frame sampling and clustering is compute-heavy; what temporal models or amortized trackers can cut cost by 10–100× for million-scale video ingestion?
  • Generalization across robot embodiments
    • Results are shown on Sharpa Wave + UR3e; how well does the retargeter transfer to underactuated, tendon-driven, or low-DoF hands and different arms without re-tuning?
  • Multi-stage, long-horizon tasks
    • The approach uses a simple distance threshold for “rest” vs. “in-hand”; how to infer multi-stage task graphs (subgoals, regrasp events, tool switching) and retarget them hierarchically over long horizons?
  • Closed-loop, real-time execution
    • The pipeline produces open-loop trajectories; how to integrate online perception and tactile feedback for closed-loop corrections, recovery from slippage, and robust execution under disturbances?
  • Evaluation on generated videos
    • Although claimed to handle generative videos, there is no systematic analysis of generation artifacts, domain gaps, or their impact on retargeting; what criteria and diagnostics are needed before using synthetic clips?
  • Benchmarks for in-the-wild ground truth
    • In-the-wild evaluation relies on human preference; can semi-synthetic datasets with accurate ground truth (e.g., photorealistic renders with known states) be created to benchmark reconstruction/retargeting under realistic degradations?
  • Handling fast camera cuts and shot boundaries
    • Clips with shot transitions fail; how to detect and bridge across cuts (e.g., track re-identification, cross-shot shape/pose optimization) to recover longer, edited demonstrations?
  • Safety and failure handling
    • What safety monitors and predictive checks (e.g., uncertainty thresholds, collision risk, torque/force limits) are needed to prevent hardware damage when deploying trajectories derived from noisy reconstructions?
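
As a concrete reference point for the registration gap listed above, the manual (x, y, z, yaw) alignment amounts to a planar rotation plus a translation applied to every point of the reconstructed trajectory. The sketch below is illustrative only: the function name and parameter values are hypothetical, it handles positions but not orientations (which would also need to be composed with the same yaw rotation), and an automatic system would have to estimate these four numbers rather than receive them from an operator.

```python
import math

def apply_xyz_yaw(points, x, y, z, yaw):
    """Rotate points about the vertical axis by `yaw`, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * px - s * py + x, s * px + c * py + y, pz + z)
            for px, py, pz in points]

# Reconstructed object positions in the video's gravity-aligned frame.
video_trajectory = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.2)]

# Hypothetical operator-chosen alignment into the robot workspace.
robot_trajectory = apply_xyz_yaw(video_trajectory,
                                 x=0.5, y=0.0, z=0.1, yaw=math.pi / 2)
```

Restricting the alignment to yaw is only valid once the scene is gravity-aligned, since roll and pitch are then already fixed.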

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that can leverage the paper’s method today, together with relevant sectors, potential tools/products/workflows, and feasibility notes.

  • Scalable dataset generation for dexterous manipulation from existing human videos
    • Sector: Robotics, Software (ML Ops for robotics), Education
    • What: Convert large stores of monocular RGB videos (egocentric/exocentric/internet clips) into robot-executable hand-object trajectories for training imitation and control policies, reducing dependence on teleoperation and MoCap.
    • Tools/Products/Workflows:
    • “Do-As-I-Do Dataset Builder” pipeline integrated with GPU simulators (MuJoCo Warp, Isaac).
    • Plugins for ROS/Isaac Sim/MuJoCo to ingest videos and export retargeted action sequences.
    • Course/lab kit for universities to turn YouTube or egocentric videos into dexterous control datasets.
    • Assumptions/Dependencies:
    • Rigid objects; semi-accurate monocular depth (MoGe) and segmentations (SAM 3).
    • Access to vision foundation models (SAM 3D, HaWoR) and GPU compute.
    • Physics simulators available; fidelity limitations may cap real-world success.
  • Rapid prototyping of robot demos from online videos
    • Sector: Robotics (R&D labs, startups), Marketing/Demos
    • What: Extract a reference from an instructive video and retarget to a lab’s dexterous hand/arm to quickly prototype motions like pouring, stirring, dusting, squeezing, picking, spreading, tamping, hammering, writing, erasing, whisking.
    • Tools/Products/Workflows:
    • “Video-to-Servo” tool that outputs robot joint trajectories and arm IK waypoints; one-click deployment to a bimanual setup (e.g., Sharpa Wave + UR3e).
    • Motion library builder indexed by task/verb (ref. the 20 verbs demonstrated in the paper).
    • Assumptions/Dependencies:
    • Manual initial pose alignment to workspace (x, y, z, yaw) still needed.
    • Robustness enhanced by warmup/perturbations, but task reliability varies with video quality.
  • Physics-aware retargeting for existing reference datasets
    • Sector: Robotics, Academia
    • What: Use the retargeting module (MPPI + warmup + random force perturbation + transition reward) to turn MoCap or reconstructed references into physically stable trajectories, improving success vs. vanilla annealed sampling.
    • Tools/Products/Workflows:
    • Drop-in retargeting backend for open datasets (e.g., OakInk2) to create higher-quality robot demonstrations.
    • Benchmarking suite comparing kinematic-only vs. physics-aware retargeting.
    • Assumptions/Dependencies:
    • Requires a physics simulator and contact models; performance depends on friction/contact fidelity.
  • Video curation and QA using the “efficacy playbook”
    • Sector: Data Ops for Robotics, Policy in Organizations (governance of training data)
    • What: Apply the paper’s data filtering guidance to identify the ~5% of clips suitable for dexterous learning, reducing waste in data acquisition and labeling.
    • Tools/Products/Workflows:
    • Automated pre-filtering (shot boundary detection, visibility checks, motion/occlusion metrics) and human-in-the-loop QA.
    • Curatorial dashboards for dataset managers.
    • Assumptions/Dependencies:
    • Quality thresholds depend on segmentation/depth trackers; improves as foundation models improve.
    • Compliance with licensing/consent for internet videos.
  • Synthetic data generation from generative videos
    • Sector: Robotics, Software/Content
    • What: Leverage outputs from generative video models (which the paper reports handling) to seed novel dexterous trajectories and expand rare behaviors.
    • Tools/Products/Workflows:
    • “Gen-to-Robot” bridge: curate synthetic clips, reconstruct and retarget to create task diversity.
    • Assumptions/Dependencies:
    • Synthetic video quality must preserve hand-object geometry sufficiently for reconstruction.
    • Biases/artifacts in synthetic content may impact downstream policies.
  • Benchmarking and method development in academia
    • Sector: Academia (CV, RL, Control), Education
    • What: Use the modular pipeline (SAM 3D guided tracking + HaWoR + MPPI retargeting) as a baseline to study hand-object reconstruction, generative tracking, dynamics-aware retargeting, and sim-to-real.
    • Tools/Products/Workflows:
    • Public code modules for each stage; ablation-friendly experimentation.
    • Assignment kits to teach 4D reconstruction and retargeting.
    • Assumptions/Dependencies:
    • Availability of GPUs and simulators; open weights or accessible APIs for the foundation models.
  • Reducing teleoperation dependence in industrial R&D
    • Sector: Manufacturing & Logistics R&D, Robotics vendors
    • What: Mine internal training or CCTV-like videos (with proper consent) to bootstrap dexterous skills for rigid-object manipulation, tool handling, assembly primitives, and quality checks without heavy teleop investment.
    • Tools/Products/Workflows:
    • Internal “Do-as-I-Do” service on a robotics cloud; connectors to asset libraries and digital twins.
    • Assumptions/Dependencies:
    • Rigid-object focus; scene constraints not modeled; repeatability requires consistent camera setups or manual alignment.

Long-Term Applications

These use cases require further research, scaling, or productization beyond the current paper’s constraints (notably rigid-object and monocular limitations, scene understanding, and sim fidelity).

  • Household robots that learn new tasks from user-shot videos
    • Sector: Consumer Robotics, Smart Home
    • What: A phone/AR-glasses app where users record a task demonstration; the system reconstructs and retargets for an in-home robot to replicate personalized routines (e.g., specific kitchen or cleaning tasks).
    • Tools/Products/Workflows:
    • On-device capture + cloud reconstruction/retargeting + home robot deployment workflow; automatic workspace calibration.
    • Assumptions/Dependencies:
    • Needs robust scene reconstruction, articulated environments, deformable object handling, and safe autonomy in clutter; strong privacy controls for in-home video.
  • Industrial “video-first” skill acquisition at production scale
    • Sector: Manufacturing, Energy, Field Service
    • What: Replace MoCap-heavy pipelines with monocular video-based knowledge capture of skilled operators for assembly, maintenance, and tool-use across diverse plants (e.g., valve turning, wiring, fastening).
    • Tools/Products/Workflows:
    • Plant-wide video ingestion, automatic scene graph reconstruction, robot-specific retargeting, simulation-in-the-loop verification, and safety certification workflows.
    • Assumptions/Dependencies:
    • Requires reliable scene-level reasoning, contact-rich accuracy, compliance with safety/regulatory standards, and integration with plant digital twins.
  • Surgical and medical dexterity learned from procedure videos
    • Sector: Healthcare (Surgical Robotics, Rehabilitation)
    • What: Learn fine motor skills from endoscopic/operating room videos for training simulators, robot assistance, or rehabilitation devices.
    • Tools/Products/Workflows:
    • Video-to-simulator curriculum; haptic training modules for clinicians; assistive policies for robotic end-effectors.
    • Assumptions/Dependencies:
    • Extension to non-rigid, fluid, and soft-tissue dynamics; high-precision contact modeling; stringent clinical safety and data governance.
  • Real-time “video-to-teleop” retargeting without specialized rigs
    • Sector: Robotics, Remote Operations
    • What: Convert a monocular operator video feed (e.g., from smart glasses) into a live retargeting signal to control a remote dexterous robot, reducing hardware needs for teleop.
    • Tools/Products/Workflows:
    • Low-latency online reconstruction + predictive retargeting; closed-loop visual feedback and correction.
    • Assumptions/Dependencies:
    • Demands robust and fast 4D tracking under occlusion/blur, low-latency networks, and strong safety interlocks.
  • Generalist dexterous agents trained at internet scale
    • Sector: Robotics, AI Platforms
    • What: Combine internet-scale video reconstruction with world models/VLA models to learn broadly generalizable dexterous policies capable of tool use and multi-task execution.
    • Tools/Products/Workflows:
    • Data lake of reconstructed trajectories; joint training with visuomotor diffusion policies and world models; continual learning pipelines.
    • Assumptions/Dependencies:
    • Large-scale compute; better handling of task semantics, long horizons, and cross-object generalization; robust sim-to-real.
  • Knowledge preservation for skilled trades via “learning from masters”
    • Sector: Workforce Development, Industry Associations
    • What: Capture expert craft demonstrations (e.g., machining, artisan work) via conventional cameras and translate them into robot-executable knowledge bases for training or assistance.
    • Tools/Products/Workflows:
    • Curated capture protocols; video-to-skill libraries linked to shop-floor robots or mixed-reality training tools.
    • Assumptions/Dependencies:
    • Handling complex tool interactions, deformables, and nuanced contact strategies; IP/consent frameworks with unions and employers.
  • Scene-aware task execution with obstacle/articulation constraints
    • Sector: Robotics (Home, Warehouse, Service)
    • What: Extend from hand+object to full-scene reconstruction to reason about obstacles, articulated objects (doors, drawers), and task constraints for safe, reliable execution.
    • Tools/Products/Workflows:
    • Scene graph reconstruction and dynamic articulation modeling integrated with retargeting; hybrid planning + control.
    • Assumptions/Dependencies:
    • Advances in monocular scene understanding, constraint reasoning, and contact-rich planning.
  • Robust dexterity with deformable objects and fluids
    • Sector: Food Services, Healthcare, Manufacturing, Home
    • What: Manipulate cloth, food, cables, or liquids by learning from human videos (e.g., folding, kneading, pouring with fluid dynamics).
    • Tools/Products/Workflows:
    • Extended simulators with deformable/viscoelastic models; perception for deformables; data augmentation with synthetic videos.
    • Assumptions/Dependencies:
    • Accurate non-rigid modeling and tracking; improved perception of contact vs. occlusion.
  • Standard-setting and governance for video-sourced robot training
    • Sector: Policy/Regulation, Standards Bodies, Legal
    • What: Develop guidelines for consent, copyright, privacy, bias, and safety when using internet or workplace videos to train robots.
    • Tools/Products/Workflows:
    • Auditable data pipelines; provenance tracking; bias and safety audits; opt-in consent frameworks.
    • Assumptions/Dependencies:
    • Multi-stakeholder coordination; evolving legal landscape around AI training data.

Cross-cutting assumptions/dependencies to monitor

  • Algorithmic: The current method assumes rigid objects and semi-accurate monocular metric depth, remains subject to monocular ambiguity between contact and occlusion, and does not model the full scene.
  • Hardware: Availability of dexterous multi-fingered hands and reliable arms; high-rate control; tactile sensing (optional but beneficial).
  • Simulation: Contact/friction fidelity and domain randomization affect sim-to-real transfer; warmup steps, random force perturbation, and the transition reward improve robustness but do not replace detailed physics.
  • Data: Quality of segmentation and tracking; adherence to the data filtering playbook boosts utility; licensing/consent for internet videos is essential.
  • Operations: Initial manual pose alignment to workspace remains in the loop for real deployments; automated calibration is a future need.

Glossary

  • 6-DoF: Six degrees of freedom describing a rigid body's 3D position and orientation; often used for object pose estimation and tracking. "model-based 6-DoF trackers [17, 47]"
  • Affordances: Action possibilities inferred from visual inputs that indicate how objects can be used or manipulated. "such as affordances [36, 37]"
  • Annealed Sampling: An optimization strategy that gradually reduces exploration noise or kernel bandwidth during iterative sampling. "SPIDER serves as the Annealed Sampling baseline"
  • Bimanual: Involving two hands (or two robotic end-effectors/arms) coordinating on a task. "Real-world deployment results shown here are on a bimanual setup with Sharpa Wave hands and UR3e arms"
  • Chamfer distance (CD): A metric measuring the distance between two point sets, commonly used to compare reconstructed 3D shapes. "Chamfer distance (CD)"
  • Dynamics-aware retargeting: Mapping human motions to robots while explicitly accounting for physical interactions, forces, and contact stability in simulation. "DO AS I DO instead performs dynamics-aware retargeting, which follows the reference while ensuring realism within physics simulation."
  • Egocentric: First-person viewpoint captured from the actor’s perspective (e.g., head- or body-mounted camera). "egocentric and exocentric in-the-wild video sources."
  • Exocentric: Third-person viewpoint captured from an external observer’s perspective. "egocentric and exocentric in-the-wild video sources."
  • Flow matching: A generative modeling paradigm that trains a velocity field to transport a simple noise distribution to the data distribution; samples are drawn by integrating the learned velocity field. "we exploit the flow matching inference itself"
  • Generative foundation model: A large pre-trained generative model with broad priors that can generalize across diverse inputs and tasks. "image-conditioned 3D generative foundation models [46, 23, 11]"
  • Geodesic angle: The length of the shortest path between two points on a curved manifold; for rotations, the minimal rotation angle between two orientations. "the geodesic angle on SO(3)."
  • Guided diffusion: Diffusion-based generation steered by guidance signals or targets during sampling to achieve desired outcomes. "Object Tracking via Guided Diffusion."
  • In-the-wild: Uncontrolled, real-world data with diverse conditions, as opposed to curated lab settings. "in-the-wild video sources."
  • Intersection over Union (IoU): An overlap metric for comparing predicted and ground-truth masks or boxes. "consensus filtering and mask-IoU recovers the mode-best pose"
  • Inverse kinematics (IK): Computing joint configurations that achieve a desired end-effector pose. "before computing arm IK and deploying in the real world."
  • Kernel annealing: Gradually shrinking a sampling kernel’s scale over iterations/horizons to transition from exploration to refinement. "with a kernel annealed across both iterations and the prediction horizon,"
  • Latent space: The internal representation space learned by a model where high-level factors (e.g., shape, pose) are encoded. "shape and pose share the same latent space"
  • Model Predictive Path Integral (MPPI): A sampling-based optimal control method that uses stochastic rollouts to optimize actions. "we perform an MPPI-style sampling-based optimization"
  • MoCap: Motion capture; systems and data capturing precise human or object movements. "ground-truth hand-object poses from, e.g., MoCap."
  • Monocular RGB: Single-camera color imagery without depth, used for perception and reconstruction. "monocular RGB videos of humans"
  • Ordinary Differential Equation (ODE): A continuous-time equation describing dynamics; integrated here to sample from flow models. "by integrating the ODE backward from t=1 to t=0."
  • Point tracks: Sequences of 2D points linked across frames to estimate motion (e.g., for pose guidance). "2D point tracks [67]"
  • Quaternion: A 4D representation for 3D rotations that avoids singularities of Euler angles. "unit-quaternion components"
  • Random Force Perturbation: Injecting random forces during simulated rollouts to promote robustness and avoid brittle solutions. "Random Force Perturbation."
  • Sampling-based optimization: Optimizing controls by sampling candidate action sequences and selecting/refining the best performers. "retargets them onto the robot via sampling-based optimization in simulation."
  • SE(3): The Lie group of 3D rigid motions (translations and rotations) used to represent full object/hand poses. "weighted SE(3) distance."
  • Sim-to-real: Transferring policies or behaviors learned in simulation to real-world robots, often using robustness techniques. "drawing inspiration from sim-to-real [69, 70]"
  • SO(3): The Lie group of 3D rotations used to represent orientations. "the geodesic angle on SO(3)."
  • Teleoperation: Remotely controlling a robot by a human operator, often via specialized interfaces. "Teleoperation is bottlenecked by operator expertise, cost of operation, and mechanical transparency of the teleoperation rig."
  • Transition Reward: An additional reward term encouraging correct state/contact transitions (e.g., pick/place) in retargeting. "Transition Reward."
  • Warmup steps: An initial rollout phase that lets the system stabilize (e.g., achieve a grasp) before tracking the reference trajectory. "Warmup Steps."
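
Three of the metrics defined above (Chamfer distance, the geodesic angle on SO(3), and mask IoU) are compact enough to sketch in code. The snippet below is a minimal, standard-library-only illustration and not the paper's evaluation code. The function names, the (w, x, y, z) quaternion ordering, the unsquared symmetric-mean form of Chamfer distance, and the empty-mask convention are all assumptions made for this example; published implementations often differ (for instance, using squared distances or summing rather than averaging).

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean (unsquared) nearest-neighbor distance,
    summed over both directions.
    """
    def one_way(src, dst):
        # For every source point, find the distance to its closest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def geodesic_angle(q1, q2):
    """Minimal rotation angle in radians between two unit quaternions.

    Assumed ordering: (w, x, y, z). The absolute value handles the fact
    that q and -q encode the same rotation.
    """
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    # Clamp to guard against values slightly above 1 from rounding.
    return 2.0 * math.acos(min(1.0, dot))


def mask_iou(mask_a, mask_b):
    """Intersection over Union of two binary masks (equally shaped nested lists)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0
```

As a usage example, a 90-degree rotation about the z-axis compared with the identity, `geodesic_angle((1, 0, 0, 0), (math.cos(math.pi / 4), 0, 0, math.sin(math.pi / 4)))`, evaluates to approximately π/2. The brute-force nearest-neighbor search in `chamfer_distance` is quadratic in the number of points, so a spatial index would be preferable for dense reconstructions.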
