DexImit: Automated Bimanual Dexterous Imitation

Updated 4 July 2026

DexImit is an automated framework that reconstructs near-metric 4D hand-object interactions from monocular videos to generate robot demonstration data.
The framework employs a four-stage pipeline—reconstruction, scheduling, synthesis, and augmentation—to bridge the human-robot embodiment gap.
Zero-shot deployment is achieved by training a 3D Diffusion Policy on physics-aware, augmented demonstrations with structured bimanual coordination.

DexImit is an automated framework for learning bimanual dexterous manipulation from monocular human videos by converting those videos into physically plausible robot data and then using the generated data to train policies for zero-shot real-world deployment (Mu et al., 10 Feb 2026). It is motivated by the claim that data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, while human manipulation videos are a direct carrier of manipulation knowledge and are available at much larger scale than robot demonstrations. The central technical problem is the embodiment gap between human hands and robotic dexterous hands: human videos contain rich manipulation structure, but human appearance, kinematics, geometry, and action spaces are not directly executable on robot hardware. DexImit addresses this by reconstructing near-metric 4D hand-object interactions from monocular video, decomposing the task into subtasks with bimanual scheduling, synthesizing robot grasps and motions consistent with the demonstrated interactions, and augmenting the resulting robot trajectories to train a 3D diffusion policy for real deployment (Mu et al., 10 Feb 2026).

1. Problem formulation and conceptual positioning

DexImit addresses the data bottleneck in bimanual dexterous manipulation. The difficulty is not merely the absence of demonstrations, but the interaction of three factors: dexterous hands are highly articulated, bimanual coordination multiplies the space of feasible interactions, and collecting large robot datasets through teleoperation is expensive and labor-intensive. The framework therefore treats human manipulation videos, including videos from the Internet and video generation models, as a scalable supervisory source (Mu et al., 10 Feb 2026).

A common misconception is that DexImit performs direct human-to-robot behavior cloning from pixels or human joint angles. It does not. The formulation in DexImit is explicitly object- and interaction-centric: the robot imitates the induced object motions and hand-object relationships rather than directly copying human joint trajectories. The paper argues that direct human-video pretraining struggles because of both a visual embodiment gap and an action embodiment gap, whereas reconstructing the underlying 3D hand-object interaction yields an embodiment-agnostic reference that can later be converted into robot-consistent motions (Mu et al., 10 Feb 2026).

This positioning places DexImit within dexterous imitation learning, but outside teleoperation-first pipelines. Open TeleDex, for example, is designed as a hardware-agnostic teleoperation and data collection framework for dexterous imitation learning, with synchronized robot trajectories and multi-modal recordings as its output (Chi et al., 16 Oct 2025). DexImit instead starts from ordinary monocular human videos and removes the need for task-specific robot demonstrations by reconstructing interaction geometry and synthesizing robot data offline (Mu et al., 10 Feb 2026).

2. Four-stage generation pipeline

DexImit follows a four-stage pipeline: reconstruction, scheduling, robot trajectory synthesis, and augmentation. The input is a monocular RGB video

$V=\{I_i\}_{i=0}^K$

from an arbitrary viewpoint, without depth or camera intrinsics. The reconstruction stage outputs per-frame, near-metric object poses $\{p_o^t\}_{t=0}^{K_t}\in SE(3)$ , per-frame hand poses $\{p_h^t\}_{t=0}^{K_t}$ , and world-frame trajectories through a camera-to-world transform $\mathbf{T}_{c\rightarrow w}$ . The scheduling stage decomposes the reconstructed interaction into structured Tasks and Subactions, then computes a bimanual action schedule over time. The synthesis stage generates grasp configurations and continuous robot trajectories that are consistent with the reconstructed interactions while respecting robot kinematics and physical plausibility. The augmentation stage randomizes object pose, object scale, camera pose, and point-cloud observations to produce a larger dataset for training a 3D Diffusion Policy (DP3) (Mu et al., 10 Feb 2026).

The task representation is explicit: $\tau = \big(\mathcal{E}_\tau,\; o_\tau,\; \mathcal{S}_\tau,\; k_\tau\big),$ where $\mathcal{E}_\tau\subseteq\{1,\dots,N\}$ denotes the involved embodiments, $o_\tau$ the manipulated object, $\mathcal{S}_\tau$ the ordered list of subactions, and $k_\tau$ the current subaction index. A Subaction is represented as

$s=(a_s,t_s), \quad a_s\in\{\text{pregrasp},\text{grasp},\text{motion},\text{release}\}.$

This structure makes the pipeline applicable to unimanual actions, cooperative bimanual actions, independent bimanual actions, and long-horizon sequences that combine them (Mu et al., 10 Feb 2026).
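The tuple notation above maps naturally onto a small data structure. The following is a minimal sketch, not the authors' implementation: the class and field names are hypothetical, and interpreting $t_s$ as a keyframe index is an assumption of this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class ActionType(Enum):
    """The four subaction types a_s listed in the text."""
    PREGRASP = "pregrasp"
    GRASP = "grasp"
    MOTION = "motion"
    RELEASE = "release"


@dataclass
class Subaction:
    action: ActionType  # a_s
    t: int              # t_s, treated here as a keyframe index (assumption)


@dataclass
class Task:
    embodiments: Set[int]        # E_tau, a subset of {1, ..., N}
    obj: str                     # o_tau, name of the manipulated object
    subactions: List[Subaction]  # S_tau, ordered list of subactions
    k: int = 0                   # k_tau, index of the current subaction

    @property
    def done(self) -> bool:
        return self.k >= len(self.subactions)

    def current(self) -> Optional[Subaction]:
        """Return the active subaction, or None once the task is finished."""
        return None if self.done else self.subactions[self.k]

    def advance(self) -> None:
        """Move on to the next subaction."""
        self.k += 1


# A cooperative bimanual task involves both embodiments (1 and 2).
lift_pot = Task(
    embodiments={1, 2},
    obj="pot",
    subactions=[Subaction(a, t) for t, a in enumerate(ActionType)],
)
```

Under this representation a unimanual task has a single embodiment index, a cooperative bimanual task has two, and a long-horizon sequence is simply a list of such tasks.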

A plausible implication is that DexImit is best understood not as a monolithic imitation learner but as a staged data-generation and policy-training system. The paper’s emphasis is on converting weakly structured human video into robot-usable supervision, rather than on replacing all components with a single end-to-end model.

3. Monocular reconstruction and world-frame inference

The first stage reconstructs hand-object interaction from monocular video with near-metric scale. DexImit resamples the video to a fixed frame rate, giving the frames indexed by $t=0,\dots,K_t$, and then chains off-the-shelf models: Qwen3-VL identifies the manipulated objects, Grounded-SAM2 produces object, hand, and table masks, SpatialTracker v2 estimates unscaled depth, Wilor estimates hand meshes, SAM3D generates object meshes, and FoundationPose++ tracks 6D object poses (Mu et al., 10 Feb 2026).

Metric scale is obtained through a hand-size prior. DexImit extracts a first-frame hand point cloud from the hand mask, estimates a hand mesh with Wilor, aligns the mesh center to the point-cloud center, renders the visible mesh vertices, and computes a scale factor as a PCA-based size ratio between the rendered mesh and the hand point cloud. This factor rescales the depth maps, producing metric-scaled depth. The same align-render-align strategy is then used to align object mesh scale to the rescaled depth (Mu et al., 10 Feb 2026).
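One way to read the PCA-based size ratio is as a ratio of extents along the first principal axis. The sketch below illustrates that reading only; the function names are hypothetical, and both the choice of the first principal axis and the direction of the ratio (mesh over depth) are assumptions rather than details confirmed by the paper.

```python
import numpy as np


def pca_extent(points: np.ndarray) -> float:
    """Extent (max minus min) of an (N, 3) point set along its first principal axis."""
    centered = points - points.mean(axis=0)
    # The rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[0]
    return float(projected.max() - projected.min())


def hand_scale_factor(mesh_points: np.ndarray, depth_points: np.ndarray) -> float:
    """Ratio that maps the unscaled depth point cloud to the size of the hand mesh.

    Assumption: the mesh carries the metric hand-size prior, so multiplying
    the unscaled depth by this factor yields near-metric depth.
    """
    return pca_extent(mesh_points) / pca_extent(depth_points)
```

Because the extent is measured along a principal axis, the ratio is unaffected by the rotation and translation between the mesh and the point cloud, which is why a coarse center alignment is enough before computing it.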

World coordinates are inferred from scene structure. The table normal defines the world vertical axis, the first-frame direction between the hands is projected onto the table plane to define a horizontal axis, and the third axis is set by right-handed completion. The origin is derived from the center of the axis-aligned bounding box of manipulated objects in the first frame and shifted to a canonical manipulation region. This yields a tabletop world frame with physically meaningful scale from arbitrary camera viewpoints (Mu et al., 10 Feb 2026).
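The frame construction can be sketched with a few vector operations. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the table normal is mapped to $z$ and the projected hand direction to $x$ as a choice of this example, and the shift to a canonical manipulation region is left to the caller through the `origin` argument.

```python
import numpy as np


def camera_to_world(table_normal, left_hand, right_hand, origin) -> np.ndarray:
    """Build a 4x4 camera-to-world transform from first-frame scene structure.

    All inputs are 3-vectors expressed in the camera frame.
    """
    z = np.asarray(table_normal, dtype=float)
    z = z / np.linalg.norm(z)
    # Project the hand-to-hand direction onto the table plane.
    d = np.asarray(right_hand, dtype=float) - np.asarray(left_hand, dtype=float)
    x = d - np.dot(d, z) * z
    x = x / np.linalg.norm(x)
    # Right-handed completion: y = z cross x.
    y = np.cross(z, x)
    rotation = np.stack([x, y, z])  # rows are the world axes in camera coordinates
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = -rotation @ np.asarray(origin, dtype=float)
    return transform


# Example: a camera whose y-axis points down, so the table normal is -y.
T_cw = camera_to_world(
    table_normal=[0.0, -1.0, 0.0],
    left_hand=[-0.2, 0.0, 1.0],
    right_hand=[0.2, 0.05, 1.0],
    origin=[0.0, 0.1, 1.0],
)
```

Applying `T_cw` to a homogeneous camera-frame point gives its world coordinates; the chosen origin maps to zero and the table normal maps to the world $z$ direction.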

The reconstruction is mesh- and pose-based rather than contact-labeled. Contact is implicit in hand-object proximity and later used as a prompt for grasp synthesis. The best reported object trajectory reconstruction configuration is SpatialTracker v2 with FoundationPose++, which achieves an 82% success rate on 100 short-horizon tasks, outperforming alternatives such as TA+RANSAC and VGGT+PCR (Mu et al., 10 Feb 2026).

4. Subtask decomposition, bimanual scheduling, and trajectory synthesis

After reconstruction, DexImit performs semantically guided subtask decomposition with Qwen3-VL. For long-horizon videos, the model produces segment-level descriptions and structured annotations, which are parsed into Tasks and Subactions; the authors additionally allow optional manual refinement for difficult cases. Scheduling is performed by an Action-Centric Scheduling algorithm that maintains a priority queue of active tasks, assigns subactions to embodiments over time, and ensures conflict-free bimanual coordination through per-hand action queues (Mu et al., 10 Feb 2026).
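The interplay between a priority queue of tasks and per-hand action queues can be illustrated with a simple list scheduler. This is a sketch of the general idea and not the paper's Action-Centric Scheduling algorithm: the dictionary layout, the `earliest` start time, and the per-subaction durations are all hypothetical.

```python
import heapq
from collections import defaultdict


def schedule(tasks):
    """Assign subactions to per-hand queues without overlaps.

    Each task is a dict with hypothetical keys: 'name', 'hands' (set of hand
    labels), 'earliest' (earliest start time), and 'subactions' (a list of
    (label, duration) pairs executed in order).
    """
    hand_free = defaultdict(float)  # time at which each hand becomes free
    queues = defaultdict(list)      # per-hand list of (start, end, task, label)
    # Heap entries: (ready time, task index, subaction index).
    heap = [(t["earliest"], i, 0) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    while heap:
        ready, i, k = heapq.heappop(heap)
        task = tasks[i]
        label, duration = task["subactions"][k]
        # A subaction starts only when every hand it needs is free.
        start = max([ready] + [hand_free[h] for h in task["hands"]])
        end = start + duration
        for h in task["hands"]:
            queues[h].append((start, end, task["name"], label))
            hand_free[h] = end
        if k + 1 < len(task["subactions"]):
            heapq.heappush(heap, (end, i, k + 1))
    return dict(queues)


steps = [("pregrasp", 1.0), ("grasp", 0.5), ("motion", 2.0), ("release", 0.5)]
plan = schedule([
    {"name": "place_apple", "hands": {"left"}, "earliest": 0.0, "subactions": steps},
    {"name": "place_cup", "hands": {"right"}, "earliest": 0.0, "subactions": steps},
    {"name": "lift_pot", "hands": {"left", "right"}, "earliest": 5.0, "subactions": steps},
])
```

In this example the two unimanual tasks run in parallel on separate hands, and the cooperative task is entered into both hand queues with identical time intervals, so neither hand is ever double-booked.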

Robot trajectory synthesis is not a direct joint retargeting step. Instead, reconstructed human hand poses act as a geometric and semantic prompt for grasp generation. For each grasping subaction, DexImit solves a BODex-style optimization over hand poses and contact forces. The first term of the objective enforces target-wrench matching for force closure; the remaining terms penalize off-surface contacts, hand-object collision, and hand-hand collision (Mu et al., 10 Feb 2026).

Candidate grasps are then ranked by similarity to the reconstructed human hand pose using translation and rotation error, and DexImit performs a stability check through simulation rollout. A candidate is accepted when the point-cloud discrepancy between the target object transformation and the simulated final transformation is below a threshold. After a grasp is selected, the grasped object and hand are treated as rigidly coupled, and motion planning produces smooth, collision-free trajectories between object keyframes (Mu et al., 10 Feb 2026).
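The ranking and acceptance test can be sketched as follows. This is a minimal illustration with assumptions stated explicitly: the function names, the rotation weight `w_rot`, the threshold `eps`, and the use of mean point distance as the discrepancy measure are hypothetical, and the simulation rollout that produces the final object transform is not modeled.

```python
import numpy as np


def rotation_angle(R_a: np.ndarray, R_b: np.ndarray) -> float:
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


def rank_grasps(candidates, human_pose, w_rot=0.1):
    """Sort candidate 4x4 hand poses by closeness to the reconstructed human hand pose."""
    def cost(pose):
        translation_error = np.linalg.norm(pose[:3, 3] - human_pose[:3, 3])
        rotation_error = rotation_angle(pose[:3, :3], human_pose[:3, :3])
        return translation_error + w_rot * rotation_error
    return sorted(candidates, key=cost)


def is_stable(object_points, T_target, T_simulated, eps=0.01) -> bool:
    """Accept a grasp if the object point cloud ends close to its target placement.

    T_simulated stands for the object transform at the end of a simulation
    rollout, which would be supplied by a physics simulator.
    """
    def apply(T):
        return object_points @ T[:3, :3].T + T[:3, 3]
    discrepancy = np.linalg.norm(apply(T_target) - apply(T_simulated), axis=1).mean()
    return bool(discrepancy < eps)
```

A caller would iterate over the ranked list and keep the first candidate whose rollout passes `is_stable`, so the selected grasp is the most human-like one that also survives the physical check.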

This design addresses another frequent misunderstanding: DexImit is not limited to replaying video-derived kinematics. The generated robot data are physically filtered through force-closure, collision penalties, stability checks, and motion planning before they are used for policy learning.

5. Augmentation, policy learning, and empirical performance

The augmentation stage transforms a clean source trajectory into a training set for DP3. DexImit applies object pose randomization, object scale randomization, camera pose randomization, and point-cloud corruption. Scale factors are sampled from a fixed range, and the framework deliberately reuses the same motion structure across scales while refining finger articulations rather than regenerating entirely new grasps and motions for each scaled object. The paper argues that this avoids inconsistent supervision across demonstrations of the same task (Mu et al., 10 Feb 2026).
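The observation-side randomizations can be sketched on a single object point cloud. This is an illustrative sketch only: the function name and every numeric default (scale range, shift, noise level, dropout probability) are placeholder assumptions and not values from the paper, and the finger-articulation refinement and camera pose randomization are not modeled here.

```python
import numpy as np


def augment_point_cloud(points, rng, scale_range=(0.9, 1.1), max_shift=0.05,
                        noise_std=0.002, drop_prob=0.1):
    """Apply scale, planar pose, noise, and dropout augmentation to (N, 3) points.

    Assumes a tabletop world frame whose third coordinate is vertical.
    """
    scale = rng.uniform(*scale_range)
    yaw = rng.uniform(-np.pi, np.pi)
    shift = np.append(rng.uniform(-max_shift, max_shift, size=2), 0.0)
    c, s = np.cos(yaw), np.sin(yaw)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    center = points.mean(axis=0)
    # Scale and rotate about the object center, then slide it on the table.
    out = ((points - center) * scale) @ rotation.T + center + shift
    # Point-cloud corruption: Gaussian jitter plus random point dropout.
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    keep = rng.random(len(out)) >= drop_prob
    return out[keep], {"scale": scale, "yaw": yaw, "shift": shift}


rng = np.random.default_rng(0)
cloud = rng.normal(size=(200, 3)) * 0.05
variants = [augment_point_cloud(cloud, rng)[0] for _ in range(100)]
```

The returned parameter dictionary matters in practice, because the same object pose and scale change has to be applied to the source trajectory so that observations and actions stay consistent within each augmented demonstration.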

The trained policy is a 3D Diffusion Policy. Observations are 3D point clouds of the scene, primarily objects, together with proprioceptive states such as joint angles, in the standard DP3 setup. Actions are robot control variables following DP3-style parameterization. The training regime is supervised imitation learning over generated demonstrations, and each input video is converted into a source trajectory and then into 100 augmented demonstrations for policy training (Mu et al., 10 Feb 2026).

DexImit is evaluated on tasks that include tool use, long-horizon manipulation, and fine-grained stacking. Reported examples include Cut-Apple, Make-Beverage, and Stack Six Cups, as well as the simulation tasks Put Cup, Grapefruit, Fruits, Pour, Pot, and Stack Cups. In simulation, DexImit attains 100% on Put Cup, Grapefruit, Fruits, and Pour, 78% on Pot, and 52% on Stack Cups. The baselines RigVid and DexMan fail on many bimanual or long-horizon tasks, and DexImit is reported as the only method that succeeds on Pot and Stack Cups (Mu et al., 10 Feb 2026).

On real hardware, DexImit is deployed zero-shot on two UR5e arms with XHand dexterous hands and a Microsoft Azure Kinect RGB-D camera. The evaluated meta-tasks are Place Apple, Place Potato & Pepper, Place Pot, and Pour Water. In this context, “zero-shot” means no real-world robot demonstrations for these tasks and no task-specific real-robot fine-tuning. The paper reports high success rates across the four meta-tasks and qualitative evidence of robust performance (Mu et al., 10 Feb 2026).

The ablations are central to the interpretation of the method. Removing scale augmentation significantly reduces performance. Recomputing grasps and motions under scale augmentation yields worse performance than even no scale augmentation, because inconsistent demonstrations across scales confuse the diffusion policy. Removing point-cloud noise augmentation also reduces performance, indicating overfitting to clean simulation when real sensor noise is not modeled (Mu et al., 10 Feb 2026).

6. Relation to neighboring dexterous learning systems

DexImit occupies a distinct position within the dexterous learning landscape. Open TeleDex addresses the demonstration-collection problem through a ROS2-native, hardware-agnostic teleoperation stack that supports “AnyExternalDevice,” “AnyArm,” and “AnyHand,” produces synchronized multi-modal robot trajectories, and is intended as a data collection backbone for imitation learning (Chi et al., 16 Oct 2025). DexImit instead bypasses robot teleoperation at data-collection time and derives robot demonstrations from human video through reconstruction and synthesis (Mu et al., 10 Feb 2026).

DexViTac addresses a different but related regime: collecting human visuo-tactile-kinematic demonstrations for contact-rich dexterous manipulation with a portable system that records first-person vision, high-density fingertip tactile sensing, end-effector poses, and hand kinematics, then trains ACT policies from a kinematics-grounded tactile representation. It reports over 2,400 demonstrations, collection efficiency exceeding 248 demonstrations per hour, and an average success rate exceeding 85% across four real-world tasks (Chen et al., 18 Mar 2026). This suggests a complementary contrast: DexViTac scales human demonstration capture with dense multimodal sensing, whereas DexImit scales learning from ordinary monocular human videos (Mu et al., 10 Feb 2026).

DexSynRefine is closer to DexImit in that it also learns from human-object interaction data rather than direct robot teleoperation. Its pipeline combines a generative HOI motion prior, task-space residual RL, and contact-and-dynamics adaptation, and improves over kinematic retargeting by 50–70 percentage points on five real-robot tasks (Lee et al., 7 May 2026). The distinction is methodological: DexSynRefine starts from sparse HOI demonstrations with explicit human motion capture and uses RL to refine references, while DexImit starts from monocular videos, synthesizes robot trajectories through reconstruction and planning, and trains a 3D diffusion policy by imitation (Mu et al., 10 Feb 2026).

DexSim2Real provides an orthogonal comparison. It is a pure RL framework trained entirely in simulation, with Foundation Model-Guided Domain Randomization, a Tactile-Visual Cross-Attention Policy, and a Progressive Skill Curriculum, and it emphasizes zero-shot sim-to-real transfer without real demonstrations (Zeng et al., 3 May 2026). DexImit, by contrast, is not an RL-only alternative; it treats human videos as the scalable data source and uses synthetic robot demonstrations to train policies (Mu et al., 10 Feb 2026).

DexImit’s limitations define its current scope. The authors identify pipeline modularity and error propagation, the absence of support for deformable or articulated objects, the lack of in-hand manipulation, a tabletop-only assumption, and limited fully automatic quality for very long-horizon tasks, where manual correction can still be useful. These limitations constrain the present system, but they also clarify its contribution: an automated route from monocular human video to robot-executable, bimanual dexterous behavior under near-metric reconstruction, structured scheduling, physics-aware synthesis, and zero-shot deployment (Mu et al., 10 Feb 2026).