DexJoCo: Benchmark for Dexterous Manipulation
- DexJoCo is a benchmark that systematically evaluates dexterous robotic manipulation using multi-fingered hands, emphasizing tool-use and bimanual coordination.
- The toolkit integrates a robust teleoperation platform and a large-scale, high-frequency dataset of human demonstrations to enhance reproducibility.
- Rigorous evaluation protocols reveal challenges in visual and dynamics randomization, underscoring the need for advanced perception and adaptive control strategies.
DexJoCo is a comprehensive benchmark and toolkit designed for systematic evaluation and development of dexterous robotic manipulation in simulated environments. Leveraging MuJoCo physics, DexJoCo enables reproducible testing of task-oriented dexterous hand behavior involving both complex single-arm and bimanual interactions, with a focus on manipulation capabilities that exceed those of parallel grippers. The benchmark integrates a functionally diverse set of manipulation tasks, a large-scale dataset of human teleoperated demonstrations, robust evaluation protocols, and a user-oriented toolkit for data collection and domain randomization (Wang et al., 15 May 2026).
1. Motivation and System Definition
DexJoCo addresses critical limitations of existing dexterous manipulation benchmarks: the lack of standardized, functionally challenging tasks that exploit the unique advantages of multi-fingered hands; insufficient coverage of tool use, reasoning, and bimanual coordination; and the absence of high-quality demonstration datasets for imitation learning. The benchmark pairs a 7-DoF Franka Panda arm with a 16-DoF Allegro Hand, using MuJoCo as the physics engine. This design enables the evaluation of manipulation behaviors beyond the reach of classical grippers, facilitating direct investigation of dexterous prehension, in-hand tool use, and long-horizon planning (Wang et al., 15 May 2026).
2. Task Suite and Benchmark Construction
DexJoCo’s task suite consists of eleven “functionally grounded” manipulation scenarios. Each scenario is specified by two components: the set of interactive objects (e.g., hammer, nail, tongs, iPad, disks) and a set of functional constraints covering temporal order, object pose, articulated joint states, and required contact. Success demands that all constraints be satisfied simultaneously.
Single-arm tasks emphasize tool-use and fine manipulation (e.g., “Hammer Nail,” “Click Mouse,” “Fold Glasses,” “Water Plant”). Bimanual and long-horizon tasks require asymmetric or coordinated dual-arm roles and reasoning (e.g., “Unlock iPad,” “Tower of Hanoi,” “Assembly,” “Microwave Cook,” “Photograph”). Each task’s success criteria are tightly linked to functional outcomes, not just geometric configuration, ensuring evaluation reflects true manipulative skill.
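The success rule described above, where a task counts as solved only when every functional constraint holds at once, can be sketched as follows. This is an illustrative sketch, not the benchmark's implementation: `TaskSpec`, the dictionary-based state, the constraint names, and the thresholds are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a simulator state is modeled as a plain dict of scalars.
# This is for illustration only, not DexJoCo's state representation.
State = Dict[str, float]
Constraint = Callable[[State], bool]


@dataclass
class TaskSpec:
    """Hypothetical task description: interactive objects plus functional constraints."""
    name: str
    objects: List[str]
    constraints: Dict[str, Constraint]

    def unmet(self, state: State) -> List[str]:
        """Return the names of the constraints that the state violates."""
        return [label for label, check in self.constraints.items() if not check(state)]

    def is_success(self, state: State) -> bool:
        """Success requires every constraint to hold simultaneously."""
        return not self.unmet(state)


# Illustrative "Hammer Nail"-style task; the thresholds are made up.
hammer_nail = TaskSpec(
    name="hammer_nail",
    objects=["hammer", "nail"],
    constraints={
        "hammer_grasped": lambda s: s["hammer_in_hand"] > 0.5,
        "nail_driven": lambda s: s["nail_depth_m"] >= 0.02,
    },
)

partial = {"hammer_in_hand": 1.0, "nail_depth_m": 0.005}
done = {"hammer_in_hand": 1.0, "nail_depth_m": 0.025}
print(hammer_nail.is_success(partial), hammer_nail.unmet(partial))
print(hammer_nail.is_success(done))
```

Reporting the unmet constraints, rather than a single boolean, makes it easier to see which functional requirement a policy missed.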
Simulation employs MuJoCo v2.x with assets from the MuJoCo Menagerie, RoboSuite, and RoboCasa. The robot model supports 6D end-effector control and 16D Allegro hand joint control, with observations comprising multi-view RGB, depth, joint states, and object poses. The system interfaces via a Python API, exposing reset, actuation, and structured success queries (Wang et al., 15 May 2026).
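To make the control interface concrete, the sketch below shows a reset, actuation, and success-query loop with a 22-dimensional action (6D end-effector command plus 16D hand joints). `MockDexEnv` and its method names are hypothetical stand-ins written for this example; they are not the DexJoCo Python API, whose actual names and signatures are not given in the text.

```python
import random
from typing import Dict, List, Tuple

ARM_DIM = 6    # 6D end-effector command, per the text
HAND_DIM = 16  # 16D Allegro hand joint command, per the text


class MockDexEnv:
    """Stand-in environment; NOT the DexJoCo API. All names are hypothetical."""

    def __init__(self, horizon: int = 5, seed: int = 0) -> None:
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def _observe(self) -> Dict[str, List[float]]:
        # A real observation would also carry multi-view RGB and depth images.
        return {
            "joint_states": [self.rng.uniform(-1.0, 1.0) for _ in range(7 + HAND_DIM)],
            "object_pose": [0.5, 0.0, 0.1, 1.0, 0.0, 0.0, 0.0],  # xyz + unit quaternion
        }

    def reset(self) -> Dict[str, List[float]]:
        self.t = 0
        return self._observe()

    def step(self, action: List[float]) -> Tuple[Dict[str, List[float]], bool]:
        if len(action) != ARM_DIM + HAND_DIM:
            raise ValueError(f"expected {ARM_DIM + HAND_DIM}-D action, got {len(action)}")
        self.t += 1
        return self._observe(), self.t >= self.horizon

    def is_success(self) -> bool:
        # Placeholder rule; a real check would evaluate the task's functional constraints.
        return self.t >= self.horizon


env = MockDexEnv()
obs = env.reset()
finished = False
steps = 0
while not finished:
    action = [0.0] * ARM_DIM + [0.0] * HAND_DIM  # zero command for arm and hand
    obs, finished = env.step(action)
    steps += 1
print(steps, env.is_success())
```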
3. Toolkit: Teleoperation and Data Collection
A salient feature of DexJoCo is an affordable teleoperation platform (~$2.3K) that combines a Rokoko Smartglove (per-finger pose at 120 Hz), HTC Vive trackers (6D wrist pose), and custom mounts. This system enables the collection of rich human demonstration trajectories that serve as ground truth.
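The retargeting model, GeoRT, maps human operator fingertip keypoints $x_H$ to robot hand joint configurations $q_R$ and is trained with the composite objective $C = L_\text{dir} + L_\text{cover} + L_\text{flat} + L_\text{pinch} + L_\text{col}$. […] $g \in G$ […]
Baseline policies predict a chunk of $k$ future actions from the most recent $h$ observations and a language instruction $\ell$: $P(a_{t:t+k-1}) = \pi_\theta(a_{t:t+k-1} \mid s_{t-h+1:t}, \ell)$.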
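The chunked policy form $\pi_\theta(a_{t:t+k-1} \mid s_{t-h+1:t}, \ell)$ can be illustrated with a small rollout loop that keeps a sliding window of the last $h$ observations and executes $k$ actions per policy query. This is a sketch under stated assumptions: the scalar observations and actions, `toy_policy`, and the identity transition are invented for the example, and executing the whole chunk before re-querying is only one possible execution scheme.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Obs = float     # toy scalar observation (assumption for illustration)
Action = float  # toy scalar action (assumption for illustration)
Policy = Callable[[Tuple[Obs, ...], str], List[Action]]


def rollout(policy: Policy, step_fn: Callable[[Action], Obs], first_obs: Obs,
            instruction: str, h: int, k: int, num_steps: int) -> List[Action]:
    """Query the policy on the last h observations, then execute its k-action chunk."""
    # Pad with the first observation so the window is full at t = 0.
    history: Deque[Obs] = deque([first_obs] * h, maxlen=h)
    executed: List[Action] = []
    while len(executed) < num_steps:
        chunk = policy(tuple(history), instruction)
        if len(chunk) != k:
            raise ValueError(f"policy returned {len(chunk)} actions, expected {k}")
        for action in chunk:
            if len(executed) >= num_steps:
                break
            executed.append(action)
            history.append(step_fn(action))  # oldest observation drops out
    return executed


def toy_policy(history: Tuple[Obs, ...], instruction: str) -> List[Action]:
    """Hypothetical policy: offsets from the mean of the observation window."""
    base = sum(history) / len(history)
    return [base + i + 1 for i in range(3)]


executed = rollout(toy_policy, step_fn=lambda a: a, first_obs=0.0,
                   instruction="hammer the nail", h=2, k=3, num_steps=5)
print(executed)
```

With the identity transition used here, the second query sees the window (2.0, 3.0), so the printed sequence is `[1.0, 2.0, 3.0, 3.5, 4.5]`.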
Rigorous ablation studies analyze visual and dynamics randomization, joint vs. separate training, and partial action-head re-use for adaptable architectures (Wang et al., 15 May 2026).
5. Empirical Findings and Failure Analysis
Baseline results show that $\pi_{0.5}$ achieves the highest success rate in the single-task regime (52.5% ± 1.4), with Diffusion Policy competitive, especially on bimanual tasks. Visual domain randomization causes a drop of 15 to 20 percentage points in success rates, exposing limited robustness to visual perturbations. Dynamics randomization depresses performance further and also sharpens differences in sim-to-real robustness; for example, $\pi_{0.5}$ outperforms DP-T when friction and mass vary.
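As a rough illustration of the dynamics randomization discussed above, the sketch below draws a friction coefficient and a mass scale per episode from uniform ranges. The ranges, the uniform distribution, and the names `DynamicsSample` and `sample_dynamics` are assumptions made for this example and are not taken from the benchmark.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DynamicsSample:
    """Hypothetical per-episode dynamics parameters."""
    friction: float    # sliding friction coefficient
    mass_scale: float  # multiplier applied to each body's nominal mass


def sample_dynamics(
    rng: random.Random,
    friction_range: Tuple[float, float] = (0.5, 1.5),    # illustrative range
    mass_scale_range: Tuple[float, float] = (0.8, 1.2),  # illustrative range
) -> DynamicsSample:
    """Draw one set of dynamics parameters uniformly from the given ranges."""
    return DynamicsSample(
        friction=rng.uniform(*friction_range),
        mass_scale=rng.uniform(*mass_scale_range),
    )


rng = random.Random(0)  # fixed seed so evaluation episodes are reproducible
samples: List[DynamicsSample] = [sample_dynamics(rng) for _ in range(100)]

nominal_hammer_mass_kg = 0.3  # made-up nominal value
randomized_masses = [nominal_hammer_mass_kg * s.mass_scale for s in samples]
print(min(randomized_masses) >= 0.24, max(randomized_masses) <= 0.36)
```

Seeding the generator keeps the randomized evaluation episodes identical across policies, so differences in success rate reflect the policies rather than the sampled dynamics.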
Critical failure modes are identified:
- Button-based tasks (Unlock iPad, Click Mouse): frequent misses due to lack of precise fingertip localization.
- Insertion and stacking tasks (Assembly, Tower of Hanoi): alignment and memory errors.
- Pinch Tongs: policies execute grasps but fail at hinge actuation, indicating limited temporal and force control.
- Multi-stage tasks (Microwave): subgoal misordering, especially under temporal constraints and partial observability.
Multi-task training generally degrades policy performance versus per-task specialization, with only isolated task-specific improvements. Partially reusing pretrained action-head weights (partial pretrain-AH) offers a consistent 3–5 percentage point boost over full random reinitialization.
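One way to read partial action-head reuse is sketched below: rows of a pretrained head that correspond to shared action dimensions are copied into a larger head, and the remaining rows are freshly initialized. This interpretation is an assumption, as are the 7-D pretrained head, the choice of the first six rows as shared, and the helper names `init_head` and `partially_reuse`; the text does not specify which weights are reused.

```python
import random
from typing import List

Matrix = List[List[float]]  # rows = action dimensions, columns = feature dimensions


def init_head(out_dim: int, feat_dim: int, rng: random.Random, scale: float = 0.02) -> Matrix:
    """Randomly initialize a linear action head as a nested list."""
    return [[rng.gauss(0.0, scale) for _ in range(feat_dim)] for _ in range(out_dim)]


def partially_reuse(pretrained: Matrix, out_dim: int, shared_rows: int,
                    rng: random.Random) -> Matrix:
    """Copy the first `shared_rows` rows from `pretrained`; initialize the rest randomly."""
    if shared_rows > min(len(pretrained), out_dim):
        raise ValueError("shared_rows exceeds the size of one of the heads")
    head = init_head(out_dim, len(pretrained[0]), rng)
    for i in range(shared_rows):
        head[i] = list(pretrained[i])  # copy so later updates do not alias
    return head


rng = random.Random(0)
# Hypothetical pretrained gripper head: 6 arm dimensions + 1 gripper dimension.
gripper_head = init_head(out_dim=7, feat_dim=8, rng=rng)
# Dexterous head: 6 arm dimensions (reused) + 16 hand joints (new).
dex_head = partially_reuse(gripper_head, out_dim=22, shared_rows=6, rng=rng)
print(len(dex_head), dex_head[:6] == gripper_head[:6])
```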
6. Challenges, Limitations, and Future Directions
DexJoCo exposes several open challenges in dexterous robot learning:
- Current vision–language–action (VLA) models are predominantly gripper-centric and not optimized for the high-DoF actuation demanded by dexterous hands. Large-scale dexterous hand pretraining and flexible action head design are pressing needs.
- Vision-only policies lack the force and contact feedback needed for precise insertion, sliding, and articulated object control; integrating tactile sensing and multi-modal observation spaces is identified as a priority.
- Despite extensive domain randomization, sim-to-real transfer remains incomplete, highlighting the need for higher-fidelity physics and sensor modeling, especially for contact-rich manipulations.
Comprehensive empirical analysis underscores that fine-grained manipulation, visual robustness, and temporal reasoning remain substantive challenges. A plausible implication is that progress will require coordinated advances in representation learning, multi-modal perception, simulation realism, and adaptive control architectures (Wang et al., 15 May 2026).
7. Context: Related Work and Distinctions
Relative to prior benchmarks, DexJoCo uniquely offers broad coverage of tool-use, bimanual, long-horizon, and reasoning-based tasks, alongside functionally-driven success metrics and high-quality human demonstration data. Its modular toolkit, data collection infrastructure, and domain randomization methodology set it apart as a standardization vehicle for future research in dexterous manipulation (Wang et al., 15 May 2026).
While DexJoCo as defined by Wang et al. (15 May 2026) pertains primarily to ground-based dexterous manipulator arms, parallel work addresses dexterous manipulation for free-flying robots (e.g., space robotics). These adaptations, such as simulations with the DexCoHand on Astrobee, are structurally distinct in hardware, compliance, and operational context (Su et al., 18 May 2026).
DexJoCo constitutes both a rigorous testbed and a resource for advancing dexterous robot learning, emphasizing reproducibility, realistic task complexity, and actionable empirical diagnostics. Its benchmark suite and toolkit will likely underpin future progress in dexterous manipulation research and deployment.