
ActiveUMI: Robotic Manipulation with Active Perception from Robot-Free Human Demonstrations (2510.01607v1)

Published 2 Oct 2025 in cs.RO and cs.CV

Abstract: We present ActiveUMI, a framework for a data collection system that transfers in-the-wild human demonstrations to robots capable of complex bimanual manipulation. ActiveUMI couples a portable VR teleoperation kit with sensorized controllers that mirror the robot's end-effectors, bridging human-robot kinematics via precise pose alignment. To ensure mobility and data quality, we introduce several key techniques, including immersive 3D model rendering, a self-contained wearable computer, and efficient calibration methods. ActiveUMI's defining feature is its capture of active, egocentric perception. By recording an operator's deliberate head movements via a head-mounted display, our system learns the crucial link between visual attention and manipulation. We evaluate ActiveUMI on six challenging bimanual tasks. Policies trained exclusively on ActiveUMI data achieve an average success rate of 70\% on in-distribution tasks and demonstrate strong generalization, retaining a 56\% success rate when tested on novel objects and in new environments. Our results demonstrate that portable data collection systems, when coupled with learned active perception, provide an effective and scalable pathway toward creating generalizable and highly capable real-world robot policies.

Summary

  • The paper introduces ActiveUMI, a framework that uses VR-based, robot-free human demonstrations to improve robotic manipulation via active perception, achieving a 70% average success rate on in-distribution tasks.
  • It employs a novel setup combining VR tracking, custom controllers, and portable computation to accurately map human motions to robot kinematics and drive visuomotor learning.
  • The approach outperforms conventional methods by offering efficient data collection, precise calibration, and robust generalization with a 56% success rate in novel scenarios.

Introduction

The paper presents ActiveUMI, a data collection framework designed to improve robotic manipulation through active perception by leveraging human demonstrations without the direct involvement of robots during the data collection process. ActiveUMI integrates a VR setup to capture human movements, synchronize them with robot kinematics, and enhance the learning of visuomotor policies. This method bridges the gap between human demonstrations in everyday environments and the embodiment requirements of robots, focusing on active perception to achieve higher task success rates.

System Architecture and Setup

ActiveUMI utilizes a VR headset, custom controllers, and a portable computational setup to map human actions to robotic movements:

  • VR Setup: A VR headset with front-facing cameras handles motion tracking and captures the operator's egocentric perspective. Controllers fitted with replicas of the robot's grippers let human hand movements be recorded in a form that maps directly onto the robot's kinematics.
  • Portable Computation: The system is self-contained within a backpack, hosting the computational resources required for real-time data processing and avoiding dependency on stationary setups.
  • Active Perception: This is achieved by tracking head movements via the VR headset, allowing the robot's policy to learn visual attention patterns crucial for overcoming occlusions and improving task performance (Figure 1).

    Figure 1: Overview of ActiveUMI Hardware. A VR headset with custom controllers designed to replicate the structure of the robot's grippers. A portable backpack that holds a battery and a PC for self-contained operation.
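The pose alignment described above can be pictured as composing each VR-tracked controller pose with a fixed rigid transform to the gripper's tool frame. The sketch below is illustrative only: the function names and the 12 cm mount offset are assumptions, not values from the paper, and a real system would take the mount transform from the controller's CAD model.

```python
import numpy as np

def pose_to_matrix(position, quaternion):
    """Convert a position (x, y, z) and unit quaternion (w, x, y, z)
    into a 4x4 homogeneous SE(3) transform."""
    w, x, y, z = quaternion
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = position
    return T

# Assumed fixed transform from the tracked controller frame to the
# gripper TCP frame (in practice derived from the mount geometry).
T_CONTROLLER_TO_GRIPPER = np.eye(4)
T_CONTROLLER_TO_GRIPPER[:3, 3] = [0.0, 0.0, 0.12]  # 12 cm along the tool axis

def controller_to_gripper_pose(T_world_controller):
    """Map a VR-tracked controller pose to the corresponding gripper pose
    by right-composing the fixed mount transform."""
    return T_world_controller @ T_CONTROLLER_TO_GRIPPER
```

Because the offset is composed on the right, it is expressed in the controller's local frame, so the gripper pose rotates together with the operator's hand.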

Data Collection Process

The data collection process is designed for capturing diverse human demonstrations that are aligned with robotic capabilities:

  • Gripper and Camera Integration: Grippers are mounted on the VR controllers, and additional wrist-mounted cameras provide comprehensive views of the operational environment, enriching the data fed into visuomotor models.
  • Calibration: Several calibration methods ensure accurate mapping of VR-tracked movements to robotic actions, including in-situ environment setup, physical placeholders for consistent calibration, and haptic feedback for zero-point precision (Figure 2).

    Figure 2: Overview of ActiveUMI. The left side illustrates our data collection process and the detailed dataset configuration. The right side shows the model deployment and inference process.
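The zero-point step can be illustrated with a simplified translational sketch (the paper's actual procedure also involves physical placeholders and haptic feedback; the function names here are hypothetical): the operator rests the controller at a known reference point, and the residual between the averaged tracked position and that reference becomes a constant correction applied to subsequent frames.

```python
import numpy as np

def estimate_zero_point_offset(tracked_positions, reference_position):
    """Average the positions recorded while the controller rests at a known
    physical placeholder, and return the constant correction that maps
    tracked coordinates onto the reference frame."""
    tracked = np.asarray(tracked_positions, dtype=float)
    return np.asarray(reference_position, dtype=float) - tracked.mean(axis=0)

def correct(position, offset):
    """Apply the calibration offset to a later tracked position."""
    return np.asarray(position, dtype=float) + offset
```

With noisy samples drifting around a systematic bias, averaging cancels the per-frame noise and the offset removes the bias from every later measurement.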

Experimental Evaluation

ActiveUMI's effectiveness was validated through experiments on six challenging bimanual tasks, testing both in-distribution performance and generalization:

  • In-Distribution Performance: Policies trained exclusively on ActiveUMI data achieved an average success rate of 70% across the six tasks, outperforming policies trained with traditional UMI setups that lack active perception. These tasks demand precision, adaptive viewpoint control, and handling of occlusions.
  • Generalization: ActiveUMI policies retained a 56% success rate when confronted with novel objects and environments, demonstrating the framework's robustness and adaptability to unseen scenarios (Figure 3).

    Figure 3: Evaluated Tasks. Each task evaluated involves different skill sets such as precision, deformable object manipulation, and long-horizon task execution.
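One way to picture how active perception enters the learned policy is through its observation and action spaces: alongside end-effector targets, the policy also commands the camera viewpoint imitated from the operator's head motion. The field names and 7-DoF pose shapes below are hypothetical, sketched from the system description rather than taken from the paper's code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    head_rgb: np.ndarray         # egocentric image from the head-mounted camera
    left_wrist_rgb: np.ndarray   # wrist-mounted camera views
    right_wrist_rgb: np.ndarray
    proprio: np.ndarray          # current end-effector poses and gripper widths

@dataclass
class Action:
    left_ee_pose: np.ndarray     # target left end-effector pose (xyz + quaternion)
    right_ee_pose: np.ndarray    # target right end-effector pose
    left_grip: float             # gripper open/close commands in [0, 1]
    right_grip: float
    head_pose: np.ndarray        # commanded viewpoint: the active-perception output
```

Treating the head pose as just another action dimension is what lets the same imitation objective learn both where to look and how to move.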

Comparison with Conventional Approaches

An in-depth comparison was made between ActiveUMI and other prevalent methods, such as static camera setups and wrist-camera-based data collection. The results highlighted:

  • Efficiency: ActiveUMI's data collection process is faster and yields higher-quality data compared to teleoperation systems. Using tasks like rope boxing and shirt folding as benchmarks, ActiveUMI demonstrated substantial throughput advantages.
  • Accuracy: The use of VR tracking inherently reduces error, offering a precise data collection modality that aligns well with robotic control requirements (Figure 4).

    Figure 4: Data Collection Comparison. Efficiency comparison among ActiveUMI, bare-hand demonstration, and teleoperation, highlighting ActiveUMI's efficiency and accuracy benefits.

Implications and Future Work

The research underscores the significance of active perception in robotic manipulation and advocates further integration of human-like visual strategies into robot learning systems. The portable, low-cost nature of ActiveUMI offers a scalable way to collect large amounts of high-quality data in real-world settings. Future work may explore augmenting the framework with additional sensory modalities and extending it to more complex, long-horizon tasks.

Conclusion

ActiveUMI effectively addresses the limitations of conventional data collection in robotics by emphasizing active perception. The successful implementation and testing of this framework showcase its potential to enhance robot autonomy and adaptability, setting a foundation for developing robust, generalizable robotic manipulation policies without the constraints of controlled environments.
