- The paper introduces the DexWild framework, which co-trains on human and robot demonstrations to achieve a 68.5% success rate in novel environments.
- It employs the portable, calibration-free DexWild-System to collect high-fidelity human data 4.6× faster than traditional teleoperation.
- The diffusion-based policy enables effective zero-shot skill and cross-embodiment transfer, outperforming robot-only baselines by up to 8.3×.
This paper, "DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies" (arXiv:2505.07813), presents a framework for learning dexterous robot manipulation policies that generalize effectively to novel objects, environments, and robot embodiments. The core challenge addressed is the difficulty of acquiring the large-scale, diverse robot datasets needed to train such generalist policies. Existing methods like robot teleoperation are expensive and hard to scale, while internet videos lack the precise action information required for low-level control.
DexWild proposes leveraging large-scale, high-fidelity human demonstrations collected in diverse real-world environments and co-training them with a smaller amount of robot demonstration data. This approach aims to combine the diversity and scale of human interaction data with the necessary embodiment grounding provided by robot data.
The paper introduces the DexWild-System, a portable, low-cost hardware system designed for efficient and high-fidelity human data collection in the wild. Key aspects of the system include:
- Portability: Lightweight components (tracking camera, mini-PC, sensor pod with glove and palm cameras) allowing setup in minutes across diverse locations.
- Calibration-Free Operation: Achieved through a relative state-action representation and ArUco-based wrist-pose tracking, which remain robust in feature-sparse environments and under occlusion. Palm-mounted cameras on both the human and robot embodiments provide aligned egocentric visual observations (see the first sketch after this list).
- High Fidelity: Utilizes motion capture gloves for accurate hand pose estimation, robust to occlusions, and palm cameras for detailed, synchronized interaction views with minimal motion blur.
- Embodiment-Agnostic Design: Aligning visual observations and retargeting human fingertip positions to the robot hand's kinematics lets the collected data be used across different robot hands (see the second sketch below).
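The calibration-free idea can be made concrete with a short sketch. This is a minimal illustration, not the authors' code: it assumes 4×4 homogeneous wrist poses from the tracking camera, and shows why expressing motion relative to a previous wrist frame removes the need for any global camera-to-world calibration.

```python
# Minimal sketch of a relative state-action representation (illustrative only).
import numpy as np

def relative_pose(T_prev: np.ndarray, T_curr: np.ndarray) -> np.ndarray:
    """Express the current wrist pose in the frame of the previous one,
    so no fixed world frame (and hence no calibration) is required."""
    return np.linalg.inv(T_prev) @ T_curr

# Two absolute wrist poses yield the same relative motion regardless of
# where the portable rig happened to be placed in the environment.
T0 = np.eye(4)
T1 = np.eye(4)
T1[:3, 3] = [0.05, 0.0, 0.02]  # 5 cm forward, 2 cm up
print(relative_pose(T0, T1)[:3, 3])  # -> [0.05 0.   0.02]
```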
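Fingertip retargeting can be sketched in the same spirit. The optimizer and the 2-link planar finger below are stand-ins chosen for brevity (the paper targets real hands such as the LEAP Hand), and the `scale` parameter modeling the human-to-robot size difference is hypothetical.

```python
# Illustrative fingertip retargeting, not the paper's exact optimizer:
# solve for robot finger joints whose forward kinematics best match the
# (scaled) human fingertip position.
import numpy as np
from scipy.optimize import minimize

LINK = np.array([0.05, 0.04])  # toy link lengths in metres

def fk(q: np.ndarray) -> np.ndarray:
    """Fingertip position of a 2-link planar finger with joint angles q."""
    x = LINK[0] * np.cos(q[0]) + LINK[1] * np.cos(q[0] + q[1])
    y = LINK[0] * np.sin(q[0]) + LINK[1] * np.sin(q[0] + q[1])
    return np.array([x, y])

def retarget(human_tip: np.ndarray, scale: float = 0.9) -> np.ndarray:
    """Solve for joint angles whose fingertip matches the scaled human tip."""
    target = scale * human_tip
    res = minimize(lambda q: np.sum((fk(q) - target) ** 2), x0=np.zeros(2))
    return res.x

q = retarget(np.array([0.06, 0.03]))
print(q, fk(q))  # recovered joints and the fingertip they reach
```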
Using the DexWild-System, the authors collected DH, a large human demonstration dataset (9,290 demonstrations across 93 environments and 5 tasks), 4.6× faster than traditional robot teleoperation. A smaller robot dataset, DR (1,395 demonstrations), was collected using an xArm fitted with a LEAP Hand V2 Advanced.
The DexWild Learning Framework co-trains policies on these two datasets using a behavior cloning (BC) objective. Training batches are sampled from DH and DR according to a fixed ratio. Observations consist of synchronized palm-camera images processed by a pre-trained Vision Transformer (ViT) encoder plus a history of relative end-effector positions. Actions are represented as 26-dimensional vectors comprising relative arm pose and finger joint positions (52 dimensions for bimanual tasks). Action normalization and heuristic-based demo filtering make the two datasets compatible and improve data quality. A diffusion-based policy architecture is used because it can model the multimodal action distributions present in human data; training minimizes a standard diffusion loss that predicts the noise added to action chunks. The two sketches below illustrate the batch mixing and the diffusion objective.
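First, fixed-ratio batch mixing. This is a hedged sketch: the 1:2 robot:human split matches the ratio reported in the experiments, while the function and argument names are hypothetical.

```python
# Minimal sketch of ratio-based co-training batches (illustrative only).
import random

def make_batch(robot_demos, human_demos, batch_size=96, robot_frac=1/3):
    """Sample one training batch with a fixed robot:human ratio (here 1:2)."""
    n_robot = round(batch_size * robot_frac)
    batch = random.choices(robot_demos, k=n_robot)                # robot samples
    batch += random.choices(human_demos, k=batch_size - n_robot)  # human samples
    random.shuffle(batch)                                         # mix within batch
    return batch

# e.g. make_batch(list_of_robot_transitions, list_of_human_transitions)
```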
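Second, a minimal version of the standard diffusion noise-prediction objective on action chunks. The noise-prediction network `eps_net`, the timestep count, and the noise schedule below are illustrative placeholders, not the authors' exact architecture or settings; only the 26-dimensional action layout comes from the paper.

```python
# Standard DDPM-style noise-prediction loss on action chunks (sketch).
import torch
import torch.nn.functional as F

T_DIFF = 100                                    # diffusion timesteps (placeholder)
betas = torch.linspace(1e-4, 0.02, T_DIFF)      # illustrative noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(eps_net, obs_emb, actions):
    """actions: (B, chunk, 26) relative arm pose + finger joint targets."""
    B = actions.shape[0]
    t = torch.randint(0, T_DIFF, (B,))          # random timestep per sample
    eps = torch.randn_like(actions)             # Gaussian noise
    ab = alphas_bar[t].view(B, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps  # forward process
    return F.mse_loss(eps_net(noisy, t, obs_emb), eps)   # predict the noise
```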
The paper presents extensive real-world experiments across five diverse manipulation tasks (Spray Bottle, Toy Cleanup, Pouring, Bimanual Florist, Clothes Folding) and three types of evaluation environments (In-Domain, In-the-Wild, In-the-Wild Extreme). Key experimental findings demonstrate:
- In-the-Wild Generalization: Co-training (specifically at a 1:2 robot:human data ratio) significantly improves performance in novel environments compared to policies trained on robot data alone. Policies trained with DexWild achieved a 68.5% success rate in completely unseen environments, over three times the 22.0% of robot-only policies.
- Cross-Task Transfer: DexWild enables effective zero-shot skill transfer. A policy trained on human pouring demonstrations and robot spray-bottle demonstrations performed the pouring task in novel environments with a 94% success rate, demonstrating transfer of the underlying manipulation primitives.
- Cross-Embodiment Generalization: The embodiment-agnostic design facilitates zero-shot transfer to new robot arms (Franka Panda) and different dexterous hands (LEAP Hand V1), showing substantial performance gains over robot-only baselines (e.g., 8.3× improvement for cross-arm transfer).
- Scalability: Performance of DexWild policies scales positively with the amount of human data used in co-training, indicating that collecting more human data can lead to further improvements.
- Efficient Data Collection: The DexWild-System enables data collection at a rate of 201 demos/hour, 4.6× faster than a GELLO-based robot teleoperation system (roughly 44 demos/hour).
The authors acknowledge several limitations: the reliance on some robot data for grounding, the lack of explicit error recovery examples in successful human demonstrations, and the current limitation to visual and kinematic data, which might hinder performance in contact-rich tasks.
Overall, DexWild demonstrates that leveraging large-scale, diverse human interaction data, efficiently collected in the wild with a purpose-built system and effectively combined with minimal robot data through co-training, is a promising path towards achieving generalizable and dexterous robot manipulation policies. The project website provides videos, code, and hardware instructions for implementation.