Do as I Do: Dexterous Manipulation Data from Everyday Human Videos
Abstract: How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.
Explain it Like I'm 14
Overview
This paper is about teaching robot hands to do tricky, human-like tasks (like whisking, pouring, writing, or picking up objects) just by watching everyday videos of people. The system they built is called DO AS I DO. It takes a normal phone-style video (one camera, regular color frames) of someone using their hands and turns it into step-by-step robot actions that a real robot hand can perform.
What questions does the paper ask?
- Can we turn the huge number of regular human videos online into useful training data for robots?
- Can we figure out, from just a video, where the hand and the object are in 3D (how they move, rotate, and touch)?
- Can we “retarget” those human motions onto a very different robot hand so the robot can actually do the same task, safely and reliably?
How does it work? (Methods in everyday language)
Think of DO AS I DO as a two-part process: first “understand the video,” then “make the robot do it.”
Part 1: Understanding the video (Reconstruction)
- Goal: Recreate what happened in 3D—where the person’s hand and the object were and how they moved over time.
- Inputs: A normal video (no special sensors), frame by frame.
What they do:
- Track the hand: They use a hand-tracking model to estimate the 3D position and shape of the human hand across frames.
- Find and model the object: They use smart vision tools to segment the object (separate it from the background) and to “inflate” a single image of it into a 3D shape (like turning a photo into a simple 3D toy model).
- Follow the object through time: They use a “guided diffusion” tracker. Imagine starting with a blurry guess for the object’s position each frame and slowly sharpening it, while gently nudging it to stay consistent with the previous frame so it doesn’t “jump” around. They also estimate how fast the object seems to rotate in 2D to tune how strong that nudge should be.
- Pick a stable pose each frame: They sample several possible object positions/orientations for that frame and choose the best one by finding a consensus (the one most samples agree on), which is fast and reliable.
- Line up scales: Because hand and object are reconstructed separately, they have to align their sizes and positions. They do this using estimated depth (how far things are) and make sure the whole scene lines up with gravity, so “down” in the video is also “down” for the robot.
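The "pick a stable pose" step can be illustrated with a small sketch. It assumes the consensus is the medoid of the sampled poses under a weighted rotation-plus-translation distance. That rule, the function names, and the weights are illustrative assumptions, not the paper's exact procedure, and any mask-based check is left out.

```python
import numpy as np

def geodesic_angle(R_a, R_b):
    """Minimal rotation angle (radians) between two 3x3 rotation matrices."""
    cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def consensus_pose(rotations, translations, rot_weight=1.0, trans_weight=1.0):
    """Return the index of the medoid: the sample with the smallest
    total weighted distance to all other samples (hypothetical rule)."""
    n = len(rotations)
    total = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_rot = geodesic_angle(rotations[i], rotations[j])
            d_trans = np.linalg.norm(translations[i] - translations[j])
            total[i] += rot_weight * d_rot + trans_weight * d_trans
    return int(np.argmin(total))

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Four pose samples agree roughly; the last one is an outlier.
yaws = [0.0, 0.1, 0.2, 0.3, 2.5]
rotations = [rot_z(y) for y in yaws]
translations = [np.array([0.0, 0.0, 0.5]) for _ in yaws]
translations[4] = np.array([0.3, 0.0, 0.5])

best = consensus_pose(rotations, translations)
```

The outlier barely affects the choice, because the selected sample must be close to most of the others. A weighted mean, by contrast, would be dragged toward the outlier.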
Key terms explained:
- Monocular RGB video: A regular color video from one camera (like a phone) without any depth sensor.
- 3D reconstruction: Estimating a 3D shape and position from 2D images—like guessing the shape of a cup from a photo.
- Diffusion model (here, guided diffusion): A method that starts from noise and refines a guess step by step into a clean, consistent result.
Part 2: Making the robot do it (Retargeting)
- Goal: Convert the human hand/object motion into robot hand actions that actually work in the real world.
- Challenge: Human hands and robot hands are very different (finger lengths, joints). Simply copying angles won’t work, and it might cause the robot to drop or poke through the object.
What they do:
- Physics-aware optimization: They try lots of small variations of robot movements inside a fast physics simulator and pick the ones that best match the target motion while obeying real-world rules (no ghosting through objects, stable grasps).
- Three practical upgrades that make this robust to messy video estimates:
- Warmup steps: The robot gets a short “practice” phase before the main motion starts. The object is temporarily held in place so the robot can adjust its pose and get a good grasp rather than starting in a bad position and failing right away.
- Random force bumps: They add tiny random pushes in simulation (like gentle shoves) so the robot learns grasps that can handle small disturbances, which helps in the real world too.
- Transition reward: They add a bonus/penalty around key moments—like when the object should be on the table vs. in the hand—so the robot clearly learns to pick up and put down rather than hovering awkwardly.
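The "try lots of small variations and keep the best" idea can be sketched on a toy problem. This is a hedged illustration on a one-dimensional point mass, not the paper's retargeter. The dynamics, cost, sample count, temperature, and noise schedule are all assumptions, and the warmup phase and transition reward are omitted. Random force bumps are included during rollouts so that brittle control sequences score worse.

```python
import numpy as np

def rollout_costs(candidates, reference, rng, dt=0.05, force_std=0.0):
    """Simulate a unit-mass 1D point for every candidate control sequence
    and return each one's summed squared tracking error."""
    n_samples, horizon = candidates.shape
    x = np.zeros(n_samples)
    v = np.zeros(n_samples)
    costs = np.zeros(n_samples)
    for k in range(horizon):
        # Random force bump, drawn independently per sample and per step.
        bump = rng.normal(0.0, force_std, size=n_samples)
        v = v + (candidates[:, k] + bump) * dt
        x = x + v * dt
        costs += (x - reference[k]) ** 2
    return costs

def mppi_update(mean_controls, reference, rng, n_samples=256,
                noise_std=1.0, temperature=0.01, force_std=0.2):
    """One MPPI-style update: sample around the mean, weight by exp(-cost)."""
    noise = rng.normal(0.0, noise_std, size=(n_samples, mean_controls.size))
    candidates = mean_controls + noise
    costs = rollout_costs(candidates, reference, rng, force_std=force_std)
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return weights @ candidates

rng = np.random.default_rng(0)
horizon = 20
reference = np.linspace(0.0, 1.0, horizon)  # target position at each step
controls = np.zeros(horizon)
initial_cost = rollout_costs(controls[None, :], reference, rng)[0]

noise_std = 2.0
for _ in range(30):
    controls = mppi_update(controls, reference, rng, noise_std=noise_std)
    noise_std *= 0.9  # anneal the sampling noise across iterations

final_cost = rollout_costs(controls[None, :], reference, rng)[0]
```

Shrinking `noise_std` over iterations moves the search from broad exploration to fine refinement. In a real pipeline the toy rollout would be replaced by a physics simulator with a robot hand and object.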
Analogy:
- Imagine learning a dance from a video. First, you figure out the steps (reconstruction). Then you practice the moves on your own body, adjusting for your height and flexibility (retargeting), and you try on a stage that mimics real physics (no floating!). Warmup is your rehearsal; random bumps are like practicing with small distractions; transition rewards are cues to hit your marks.
Main findings and why they matter
The authors tested both parts carefully:
- Better 3D tracking from regular videos:
- On standard benchmarks, their method outperformed previous top approaches at recovering how objects move with hands in 3D.
- On 150 “in-the-wild” videos (messy, real-life clips), humans preferred their object tracking over a strong baseline 67% of the time, which suggests it stays locked onto the object more reliably despite blur or occlusion.
- Stronger robot retargeting:
- On challenging, reconstructed video references, their full method achieved a 71% success rate (vs. 25% for a strong baseline without their upgrades).
- On a big, clean motion-capture dataset, they improved success from 72% to 81%—showing that the method scales and helps even when the reference is already good.
- Real-world demos:
- They produced 500 high-quality robot trajectories from internet, egocentric, and AI-generated videos.
- They ran many tasks on real robot hands and arms (like whisking, pouring, writing, picking), showing the pipeline can go from “video on the internet” to “robot actually doing it.”
Why this matters:
- Most human know-how is available as videos. Turning those videos into real robot skills could massively speed up how robots learn—especially for complex, finger-based tasks.
Implications and impact
- Scaling robot learning: If robots can learn from regular videos, we can build huge training sets without expensive motion-capture labs or teleoperation sessions.
- Practical advice for data collection: The authors found that only around 4–5% of random “hands interacting with objects” web clips are truly useful for dexterous learning. Careful filtering and curation are critical.
- Current limits:
- Works best with rigid objects (not squishy or flexible).
- Estimating exact contact from a single camera is hard—sometimes what looks like touching is just overlapping in 2D.
- It doesn’t reconstruct the whole room or all obstacles.
- Simulators aren’t perfect, so there’s still a gap to the real world.
Big picture:
- DO AS I DO is a promising step toward robots that can watch us do everyday tasks and learn to do them too. With better video models, richer scene understanding, and improved simulators, this approach could help robots handle a much wider range of real-life activities.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains missing, uncertain, or left unexplored, framed to guide actionable future research.
- Reconstruction: reliance on monocular metric depth
- How to self-calibrate absolute scale robustly from monocular RGB when MoGe’s metric depth is inaccurate, especially under strong parallax, rolling shutter, or unknown intrinsics?
- Can scene priors (e.g., human hand anthropometrics, gravity constraints, furniture dimensions) be used to regularize or correct absolute scale?
- Contact inference from monocular video
- How to disambiguate true physical contact vs. visual occlusion from RGB alone?
- Can physics-informed priors, uncertainty-aware tracking, or multi-modal cues (audio, IMU) enable reliable contact state estimation and timing?
- Rigid-object assumption
- Extension to articulated (e.g., scissors, lids), deformable (cloth, sponges), and fluid-containing objects (pouring) remains open; what object representations and simulators are needed?
- How to model within-video object state changes (e.g., deformation, liquid level) and retarget them physically?
- Single-object, partial-scene modeling
- The method reconstructs only the hand and a single object; how to reconstruct full scenes (surfaces, obstacles, containers, articulated furniture) and reason about hand-scene and object-scene constraints?
- How to scale to multi-object interactions, tool use, and tasks requiring target objects (e.g., hammer-and-nail, writing on paper)?
- Temporal consistency and drift
- Pose tracking is per-frame guided diffusion with shape fixed at an anchor frame; can a global, temporally coherent optimization (e.g., bundle-adjustment over SE(3) and shape) reduce drift and re-acquire the object after long occlusions?
- How to jointly optimize hand and object trajectories over time, rather than modular post-hoc alignment?
- Physical parameter estimation
- Mass, inertia, and friction are not inferred; how to estimate or identify object/material physical properties from video to improve simulation fidelity and grasp stability?
- Uncertainty quantification and propagation
- The tracker selects a single pose sample; how to represent and propagate pose uncertainty (e.g., distributions over SE(3)) into retargeting, planning, and safety checks?
- Robustness under adverse imaging conditions
- Performance under severe motion blur, specular/transparent objects, tiny objects, extreme occlusions, and rapid egocentric motion is not characterized; what robustness interventions (e.g., temporal super-resolution, reflection-aware priors) help?
- Bimanual reconstruction at scale
- Although retargeting is evaluated on bimanual MoCap, in-the-wild reconstruction of two hands interacting with one or more objects is not demonstrated; how to robustly reconstruct and retarget hand–hand–object interactions?
- Coupled arm–hand retargeting
- Retargeting optimizes the hand in simulation and then applies arm IK post hoc; how to jointly optimize arm and hand (whole-body) under environmental constraints to avoid collisions and exploit arm contacts?
- Contact-aware retargeting objectives
- Current objectives emphasize pose/rotation tracking with simple transition penalties; how to incorporate explicit, learned or physics-derived contact goals (contact locations, normal forces, frictional stability, no-slip constraints)?
- Warmup and welded-object heuristic
- The warmup strategy “holds” the object to recover from poor initial states; can contact timing and grasp initiation be inferred directly from video and enforced without non-physical welds?
- Task-level success beyond pose error
- Success is defined by mean position/rotation error; how to evaluate semantic task completion (e.g., liquid transfer angle/flow, paint coverage, screw insertion), grasp robustness, and force stability in sim and real?
- Real-world evaluation scale and repeatability
- Real deployments show 10 qualitative tasks without systematic success rates or variability analysis; can large-scale, repeatable real-world benchmarks (dozens of tasks × trials) quantify reliability and failure modes?
- Automatic video-to-robot world registration
- Manual (x, y, z, yaw) alignment is required; how to automatically localize and register the reconstructed scene to the robot workspace (e.g., via fiducials, object recognition, SLAM) for fully hands-off deployment?
- Data utility for policy learning
- The paper generates 500 trajectories but does not measure downstream learning gains; do policies trained on this data outperform teleoperation/simulation baselines on held-out tasks and objects?
- Data triage and quality control at scale
- The “playbook” identifies heavy attrition in internet clips, but no automated filtering is provided; can scalable, model-based triage (quality scoring, contact visibility, action detectability) be built to preselect useful videos?
- Computational efficiency and throughput
- Guided diffusion with per-frame sampling and clustering is compute-heavy; what temporal models or amortized trackers can cut cost by 10–100× for million-scale video ingestion?
- Generalization across robot embodiments
- Results are shown on Sharpa Wave + UR3e; how well does the retargeter transfer to underactuated, tendon-driven, or low-DoF hands and different arms without re-tuning?
- Multi-stage, long-horizon tasks
- The approach uses a simple distance threshold for “rest” vs. “in-hand”; how to infer multi-stage task graphs (subgoals, regrasp events, tool switching) and retarget them hierarchically over long horizons?
- Closed-loop, real-time execution
- The pipeline produces open-loop trajectories; how to integrate online perception and tactile feedback for closed-loop corrections, recovery from slippage, and robust execution under disturbances?
- Evaluation on generated videos
- Although claimed to handle generative videos, there is no systematic analysis of generation artifacts, domain gaps, or their impact on retargeting; what criteria and diagnostics are needed before using synthetic clips?
- Benchmarks for in-the-wild ground truth
- In-the-wild evaluation relies on human preference; can semi-synthetic datasets with accurate ground truth (e.g., photorealistic renders with known states) be created to benchmark reconstruction/retargeting under realistic degradations?
- Handling fast camera cuts and shot boundaries
- Clips with shot transitions fail; how to detect and bridge across cuts (e.g., track re-identification, cross-shot shape/pose optimization) to recover longer, edited demonstrations?
- Safety and failure handling
- What safety monitors and predictive checks (e.g., uncertainty thresholds, collision risk, torque/force limits) are needed to prevent hardware damage when deploying trajectories derived from noisy reconstructions?
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that can leverage the paper’s method today, together with relevant sectors, potential tools/products/workflows, and feasibility notes.
- Scalable dataset generation for dexterous manipulation from existing human videos
- Sector: Robotics, Software (ML Ops for robotics), Education
- What: Convert large stores of monocular RGB videos (egocentric/exocentric/internet clips) into robot-executable hand-object trajectories for training imitation and control policies, reducing dependence on teleoperation and MoCap.
- Tools/Products/Workflows:
- “Do-As-I-Do Dataset Builder” pipeline integrated with GPU simulators (MuJoCo Warp, Isaac).
- Plugins for ROS/Isaac Sim/MuJoCo to ingest videos and export retargeted action sequences.
- Course/lab kit for universities to turn YouTube or egocentric videos into dexterous control datasets.
- Assumptions/Dependencies:
- Rigid objects; semi-accurate monocular depth (MoGe) and segmentations (SAM 3).
- Access to vision foundation models (SAM 3D, HaWoR) and GPU compute.
- Physics simulators available; fidelity limitations may cap real-world success.
- Rapid prototyping of robot demos from online videos
- Sector: Robotics (R&D labs, startups), Marketing/Demos
- What: Extract a reference from an instructive video and retarget to a lab’s dexterous hand/arm to quickly prototype motions like pouring, stirring, dusting, squeezing, picking, spreading, tamping, hammering, writing, erasing, whisking.
- Tools/Products/Workflows:
- “Video-to-Servo” tool that outputs robot joint trajectories and arm IK waypoints; one-click deployment to a bimanual setup (e.g., Sharpa Wave + UR3e).
- Motion library builder indexed by task/verb (ref. the 20 verbs demonstrated in the paper).
- Assumptions/Dependencies:
- Manual initial pose alignment to workspace (x, y, z, yaw) still needed.
- Robustness enhanced by warmup/perturbations, but task reliability varies with video quality.
- Physics-aware retargeting for existing reference datasets
- Sector: Robotics, Academia
- What: Use the retargeting module (MPPI + warmup + random force perturbation + transition reward) to turn MoCap or reconstructed references into physically stable trajectories, improving success vs. vanilla annealed sampling.
- Tools/Products/Workflows:
- Drop-in retargeting backend for open datasets (e.g., OakInk2) to create higher-quality robot demonstrations.
- Benchmarking suite comparing kinematic-only vs. physics-aware retargeting.
- Assumptions/Dependencies:
- Requires a physics simulator and contact models; performance depends on friction/contact fidelity.
- Video curation and QA using the “efficacy playbook”
- Sector: Data Ops for Robotics, Policy in Organizations (governance of training data)
- What: Apply the paper’s data filtering guidance to identify the roughly 4–5% of clips suitable for dexterous learning, reducing waste in data acquisition and labeling.
- Tools/Products/Workflows:
- Automated pre-filtering (shot boundary detection, visibility checks, motion/occlusion metrics) and human-in-the-loop QA.
- Curatorial dashboards for dataset managers.
- Assumptions/Dependencies:
- Quality thresholds depend on segmentation/depth trackers; improves as foundation models improve.
- Compliance with licensing/consent for internet videos.
- Synthetic data generation from generative videos
- Sector: Robotics, Software/Content
- What: Leverage outputs from generative video models (shown supported) to seed novel dexterous trajectories and expand rare behaviors.
- Tools/Products/Workflows:
- “Gen-to-Robot” bridge: curate synthetic clips, reconstruct and retarget to create task diversity.
- Assumptions/Dependencies:
- Synthetic video quality must preserve hand-object geometry sufficiently for reconstruction.
- Biases/artifacts in synthetic content may impact downstream policies.
- Benchmarking and method development in academia
- Sector: Academia (CV, RL, Control), Education
- What: Use the modular pipeline (SAM 3D guided tracking + HaWoR + MPPI retargeting) as a baseline to study hand-object reconstruction, generative tracking, dynamics-aware retargeting, and sim-to-real.
- Tools/Products/Workflows:
- Public code modules for each stage; ablation-friendly experimentation.
- Assignment kits to teach 4D reconstruction and retargeting.
- Assumptions/Dependencies:
- Availability of GPUs and simulators; open weights or accessible APIs for the foundation models.
- Reducing teleoperation dependence in industrial R&D
- Sector: Manufacturing & Logistics R&D, Robotics vendors
- What: Mine internal training or CCTV-like videos (with proper consent) to bootstrap dexterous skills for rigid-object manipulation, tool handling, assembly primitives, and quality checks without heavy teleop investment.
- Tools/Products/Workflows:
- Internal “Do-as-I-Do” service on a robotics cloud; connectors to asset libraries and digital twins.
- Assumptions/Dependencies:
- Rigid-object focus; scene constraints not modeled; repeatability requires consistent camera setups or manual alignment.
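The automated pre-filtering listed under "Video curation and QA" above could start with a simple shot boundary detector, since clips containing cuts are reported to fail. The sketch below flags a cut when the grayscale intensity histogram changes sharply between consecutive frames. The function name, bin count, and threshold are illustrative assumptions, and the frames are synthetic.

```python
import numpy as np

def find_shot_boundaries(frames, bins=16, threshold=0.5):
    """Return indices i where frame i appears to start a new shot,
    based on the change in normalized intensity histograms."""
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()
        if prev_hist is not None:
            # Half the L1 distance between histograms lies in [0, 1].
            change = 0.5 * np.abs(hist - prev_hist).sum()
            if change > threshold:
                boundaries.append(i)
        prev_hist = hist
    return boundaries

# Synthetic clip: three dark frames followed by three bright frames.
rng = np.random.default_rng(0)
dark = [rng.integers(0, 64, size=(32, 32), dtype=np.uint8) for _ in range(3)]
bright = [rng.integers(192, 256, size=(32, 32), dtype=np.uint8) for _ in range(3)]

cuts = find_shot_boundaries(dark + bright)
```

A clip with any detected cut could be split at the boundaries or dropped before reconstruction. Real footage would need a tuned threshold and likely a more robust detector.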
Long-Term Applications
These use cases require further research, scaling, or productization beyond the current paper’s constraints (notably rigid-object and monocular limitations, scene understanding, and sim fidelity).
- Household robots that learn new tasks from user-shot videos
- Sector: Consumer Robotics, Smart Home
- What: A phone/AR-glasses app where users record a task demonstration; the system reconstructs and retargets for an in-home robot to replicate personalized routines (e.g., specific kitchen or cleaning tasks).
- Tools/Products/Workflows:
- On-device capture + cloud reconstruction/retargeting + home robot deployment workflow; automatic workspace calibration.
- Assumptions/Dependencies:
- Needs robust scene reconstruction, articulated environments, deformable object handling, and safe autonomy in clutter; strong privacy controls for in-home video.
- Industrial “video-first” skill acquisition at production scale
- Sector: Manufacturing, Energy, Field Service
- What: Replace MoCap-heavy pipelines with monocular video-based knowledge capture of skilled operators for assembly, maintenance, and tool-use across diverse plants (e.g., valve turning, wiring, fastening).
- Tools/Products/Workflows:
- Plant-wide video ingestion, automatic scene graph reconstruction, robot-specific retargeting, simulation-in-the-loop verification, and safety certification workflows.
- Assumptions/Dependencies:
- Requires reliable scene-level reasoning, contact-rich accuracy, compliance with safety/regulatory standards, and integration with plant digital twins.
- Surgical and medical dexterity learned from procedure videos
- Sector: Healthcare (Surgical Robotics, Rehabilitation)
- What: Learn fine motor skills from endoscopic/operating room videos for training simulators, robot assistance, or rehabilitation devices.
- Tools/Products/Workflows:
- Video-to-simulator curriculum; haptic training modules for clinicians; assistive policies for robotic end-effectors.
- Assumptions/Dependencies:
- Extension to non-rigid, fluid, and soft-tissue dynamics; high-precision contact modeling; stringent clinical safety and data governance.
- Real-time “video-to-teleop” retargeting without specialized rigs
- Sector: Robotics, Remote Operations
- What: Convert a monocular operator video feed (e.g., from smart glasses) into a live retargeting signal to control a remote dexterous robot, reducing hardware needs for teleop.
- Tools/Products/Workflows:
- Low-latency online reconstruction + predictive retargeting; closed-loop visual feedback and correction.
- Assumptions/Dependencies:
- Demands robust and fast 4D tracking under occlusion/blur, low-latency networks, and strong safety interlocks.
- Generalist dexterous agents trained at internet scale
- Sector: Robotics, AI Platforms
- What: Combine internet-scale video reconstruction with world models/VLA models to learn broadly generalizable dexterous policies capable of tool use and multi-task execution.
- Tools/Products/Workflows:
- Data lake of reconstructed trajectories; joint training with visuomotor diffusion policies and world models; continual learning pipelines.
- Assumptions/Dependencies:
- Large-scale compute; better handling of task semantics, long horizons, and cross-object generalization; robust sim-to-real.
- Knowledge preservation for skilled trades via “learning from masters”
- Sector: Workforce Development, Industry Associations
- What: Capture expert craft demonstrations (e.g., machining, artisan work) via conventional cameras and translate them into robot-executable knowledge bases for training or assistance.
- Tools/Products/Workflows:
- Curated capture protocols; video-to-skill libraries linked to shop-floor robots or mixed-reality training tools.
- Assumptions/Dependencies:
- Handling complex tool interactions, deformables, and nuanced contact strategies; IP/consent frameworks with unions and employers.
- Scene-aware task execution with obstacle/articulation constraints
- Sector: Robotics (Home, Warehouse, Service)
- What: Extend from hand+object to full-scene reconstruction to reason about obstacles, articulated objects (doors, drawers), and task constraints for safe, reliable execution.
- Tools/Products/Workflows:
- Scene graph reconstruction and dynamic articulation modeling integrated with retargeting; hybrid planning + control.
- Assumptions/Dependencies:
- Advances in monocular scene understanding, constraint reasoning, and contact-rich planning.
- Robust dexterity with deformable objects and fluids
- Sector: Food Services, Healthcare, Manufacturing, Home
- What: Manipulate cloth, food, cables, or liquids by learning from human videos (e.g., folding, kneading, pouring with fluid dynamics).
- Tools/Products/Workflows:
- Extended simulators with deformable/viscoelastic models; perception for deformables; data augmentation with synthetic videos.
- Assumptions/Dependencies:
- Accurate non-rigid modeling and tracking; improved perception of contact vs. occlusion.
- Standard-setting and governance for video-sourced robot training
- Sector: Policy/Regulation, Standards Bodies, Legal
- What: Develop guidelines for consent, copyright, privacy, bias, and safety when using internet or workplace videos to train robots.
- Tools/Products/Workflows:
- Auditable data pipelines; provenance tracking; bias and safety audits; opt-in consent frameworks.
- Assumptions/Dependencies:
- Multi-stakeholder coordination; evolving legal landscape around AI training data.
Cross-cutting assumptions/dependencies to monitor
- Algorithmic: Current method assumes rigid objects and semi-accurate monocular metric depth; monocular ambiguity around contact vs. occlusion; no full-scene modeling.
- Hardware: Availability of dexterous multi-fingered hands and reliable arms; high-rate control; tactile sensing (optional but beneficial).
- Simulation: Contact/friction fidelity and domain randomization affect sim-to-real; warmup/perturbation/transition reward improve robustness but do not replace detailed physics.
- Data: Quality of segmentation and tracking; adherence to the data filtering playbook boosts utility; licensing/consent for internet videos is essential.
- Operations: Initial manual pose alignment to workspace remains in the loop for real deployments; automated calibration is a future need.
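The manual (x, y, z, yaw) alignment mentioned above amounts to a rotation about the vertical axis followed by a translation. The sketch below assumes the reconstructed trajectory is already gravity-aligned with z pointing up, which is why yaw is the only rotation needed. The function name and the offset values are hypothetical.

```python
import numpy as np

def align_to_workspace(positions, x, y, z, yaw):
    """Rotate points about the vertical (z) axis by yaw radians,
    then translate them by (x, y, z)."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return positions @ R.T + np.array([x, y, z])

# Two reconstructed object positions (meters) in the video's frame.
trajectory = np.array([[0.1, 0.0, 0.00],
                       [0.2, 0.0, 0.05]])

# Hypothetical offsets chosen by an operator for a particular robot cell.
aligned = align_to_workspace(trajectory, x=0.4, y=-0.1, z=0.02, yaw=np.pi / 2)
```

Orientations along the trajectory would be handled the same way, by left-multiplying each rotation matrix by the yaw rotation.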
Glossary
- 6-DoF: Six degrees of freedom describing a rigid body's 3D position and orientation; often used for object pose estimation and tracking. "model-based 6-DoF trackers [17, 47]"
- Affordances: Action possibilities inferred from visual inputs that indicate how objects can be used or manipulated. "such as affordances [36, 37]"
- Annealed Sampling: An optimization strategy that gradually reduces exploration noise or kernel bandwidth during iterative sampling. "SPIDER serves as the Annealed Sampling baseline"
- Bimanual: Involving two hands (or two robotic end-effectors/arms) coordinating on a task. "Real-world deployment results shown here are on a bimanual setup with Sharpa Wave hands and UR3e arms"
- Chamfer distance (CD): A metric measuring the distance between two point sets, commonly used to compare reconstructed 3D shapes. "Chamfer distance (CD)"
- Dynamics-aware retargeting: Mapping human motions to robots while explicitly accounting for physical interactions, forces, and contact stability in simulation. "DO AS I DO instead performs dynamics-aware retargeting, which follows the reference while ensuring realism within physics simulation."
- Egocentric: First-person viewpoint captured from the actor’s perspective (e.g., head- or body-mounted camera). "egocentric and exocentric in-the-wild video sources."
- Exocentric: Third-person viewpoint captured from an external observer’s perspective. "egocentric and exocentric in-the-wild video sources."
- Flow matching: A generative modeling and inference paradigm that aligns probability flows, enabling sampling by integrating learned velocity fields. "we exploit the flow matching inference itself"
- Generative foundation model: A large pre-trained generative model with broad priors that can generalize across diverse inputs and tasks. "image-conditioned 3D generative foundation models [46, 23, 11]"
- Geodesic angle: The shortest-angle distance on a curved manifold; for rotations, the minimal rotation angle between orientations. "the geodesic angle on SO(3)."
- Guided diffusion: Diffusion-based generation steered by guidance signals or targets during sampling to achieve desired outcomes. "Object Tracking via Guided Diffusion."
- In-the-wild: Uncontrolled, real-world data with diverse conditions, as opposed to curated lab settings. "in-the-wild video sources."
- Intersection over Union (IoU): An overlap metric for comparing predicted and ground-truth masks or boxes. "consensus filtering and mask-IoU recovers the mode-best pose"
- Inverse kinematics (IK): Computing joint configurations that achieve a desired end-effector pose. "before computing arm IK and deploying in the real world."
- Kernel annealing: Gradually shrinking a sampling kernel’s scale over iterations/horizons to transition from exploration to refinement. "with a kernel annealed across both iterations and the prediction horizon,"
- Latent space: The internal representation space learned by a model where high-level factors (e.g., shape, pose) are encoded. "shape and pose share the same latent space"
- Model Predictive Path Integral (MPPI): A sampling-based optimal control method that uses stochastic rollouts to optimize actions. "we perform an MPPI-style sampling-based optimization"
- MoCap: Motion capture; systems and data capturing precise human or object movements. "ground-truth hand-object poses from, e.g., MoCap."
- Monocular RGB: Single-camera color imagery without depth, used for perception and reconstruction. "monocular RGB videos of humans"
- Ordinary Differential Equation (ODE): A continuous-time equation describing dynamics; integrated here to sample from flow models. "by integrating the ODE backward from t=1 to t=0."
- Point tracks: Sequences of 2D points linked across frames to estimate motion (e.g., for pose guidance). "2D point tracks [67]"
- Quaternion: A 4D representation for 3D rotations that avoids singularities of Euler angles. "unit-quaternion components"
- Random Force Perturbation: Injecting random forces during simulated rollouts to promote robustness and avoid brittle solutions. "Random Force Perturbation."
- Sampling-based optimization: Optimizing controls by sampling candidate action sequences and selecting/refining the best performers. "retargets them onto the robot via sampling-based optimization in simulation."
- SE(3): The Lie group of 3D rigid motions (translations and rotations) used to represent full object/hand poses. "weighted SE(3) distance."
- Sim-to-real: Transferring policies or behaviors learned in simulation to real-world robots, often using robustness techniques. "drawing inspiration from sim-to-real [69, 70]"
- SO(3): The Lie group of 3D rotations used to represent orientations. "the geodesic angle on SO(3)."
- Teleoperation: Remotely controlling a robot by a human operator, often via specialized interfaces. "Teleoperation is bottlenecked by operator expertise, cost of operation, and mechanical transparency of the teleoperation rig."
- Transition Reward: An additional reward term encouraging correct state/contact transitions (e.g., pick/place) in retargeting. "Transition Reward."
- Warmup steps: An initial rollout phase that lets the system stabilize (e.g., achieve a grasp) before tracking the reference trajectory. "Warmup Steps."
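Two of the metrics defined in the glossary, Chamfer distance and mask IoU, are simple enough to sketch directly. Conventions for Chamfer distance vary (squared versus unsquared distances, summing versus averaging the two directions). The version below is one common choice and is not necessarily the one used in the paper.

```python
import numpy as np

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance from
    A to B plus mean nearest-neighbor distance from B to A."""
    diffs = points_a[:, None, :] - points_b[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)  # shape (len(A), len(B))
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())

def mask_iou(mask_a, mask_b):
    """Intersection over Union of two binary masks of the same shape."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as a perfect match
    return float(np.logical_and(a, b).sum() / union)

# Two tiny point sets: B is A shifted by 0.1 along z.
cloud_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cloud_b = cloud_a + np.array([0.0, 0.0, 0.1])
cd = chamfer_distance(cloud_a, cloud_b)

# Two 2x2 squares on a 4x4 grid that overlap in one pixel.
mask_a = np.zeros((4, 4), dtype=bool)
mask_b = np.zeros((4, 4), dtype=bool)
mask_a[0:2, 0:2] = True
mask_b[1:3, 1:3] = True
iou = mask_iou(mask_a, mask_b)
```

Here each point's nearest neighbor is 0.1 away, so the two directional means sum to 0.2. The masks share 1 pixel out of 7 in their union, giving an IoU of 1/7.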