Vision in Action: Learning Active Perception from Human Demonstrations (2506.15666v1)

Published 18 Jun 2025 in cs.RO

Abstract: We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.

Summary

  • The paper introduces ViA, a novel system that learns active perception strategies from human demonstrations using a VR teleoperation interface and a 6-DoF active head.
  • It employs a DINOv2-pretrained vision transformer to encode the active head camera's RGB observations, enabling coordinated head and arm movements for complex, occlusion-challenged tasks.
  • Evaluations show a 45% improvement over fixed-camera setups and reduced user motion sickness through a low-latency point cloud VR interface.

The paper "Vision in Action: Learning Active Perception from Human Demonstrations" (2506.15666) introduces ViA, a bimanual robot manipulation system designed to learn active perception strategies directly from human demonstrations. Traditional robotic manipulation systems often rely on fixed or wrist-mounted cameras, which struggle in scenarios with visual occlusions, failing to capture rich perceptual behaviors like searching, tracking, and focusing that are crucial for complex tasks. ViA addresses this by incorporating an active camera system and a novel teleoperation method for data collection.

The core of the ViA system comprises a unique hardware setup, a specialized teleoperation interface for data collection, and a visuomotor policy learning framework.

Hardware Design:

Instead of a traditional limited-DoF neck, ViA uses an off-the-shelf 6-DoF robot arm (ARX5) as the robot's neck. This allows flexible, human-like head movements, approximating the range of motion humans achieve through coordinated upper-body movement. An iPhone 15 Pro is mounted on the end-effector of this "neck" to serve as the active head camera, providing real-time RGB and depth data along with synchronized camera poses. For bimanual manipulation, the system uses two additional 6-DoF ARX5 robot arms, each with a parallel-jaw gripper, mounted on custom shoulder structures.

Teleoperation Interface:

To collect human demonstrations that include active perception strategies, the authors developed a VR-based teleoperation interface. The operator controls the bimanual arms through an exoskeleton and the active head camera through the VR headset's tracked head pose. A key challenge in VR robot teleoperation is motion-to-photon latency, the delay between a user's head movement and the corresponding visual feedback, which can cause motion sickness.

ViA mitigates this by decoupling the user's view from the robot's physical camera movement using an intermediate 3D scene representation, specifically a point cloud built in the world frame from the active head camera's RGB-D data.
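
As a rough sketch of this intermediate representation (with illustrative names, not the authors' code), back-projecting a single RGB-D frame into a colored, world-frame point cloud amounts to standard pinhole deprojection followed by a rigid transform:

```python
import numpy as np

def rgbd_to_world_points(depth, rgb, K, T_world_cam):
    """Back-project one RGB-D frame into a colored world-frame point cloud.

    depth:        (H, W) depth map in meters
    rgb:          (H, W, 3) color image
    K:            (3, 3) camera intrinsics
    T_world_cam:  (4, 4) camera-to-world pose (e.g., neck-arm forward
                  kinematics composed with a hand-eye calibration)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    valid = z > 0  # drop pixels with no depth return

    # Pinhole back-projection into the camera frame.
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)[valid]

    # Rigid transform into the world frame.
    pts_world = pts_cam @ T_world_cam[:3, :3].T + T_world_cam[:3, 3]
    colors = rgb.reshape(-1, 3)[valid]
    return pts_world, colors
```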

  1. Point Cloud Construction: Each RGB-D frame from the robot's head camera is transformed into a world frame point cloud using camera intrinsics and the camera's pose.
  2. Low-Latency View Rendering: The VR display renders stereo RGB views in real-time from this point cloud using the user's latest head pose. This rendering happens at a high frequency (e.g., 150 Hz), ensuring smooth, low-latency visual feedback aligned with the user's head movements, even if the point cloud is slightly outdated.
  3. Asynchronous Point Cloud Updating: The robot's physical head pose is updated to match the user's aggregated head movements asynchronously at a lower frequency (e.g., 10 Hz), matching the robot's control rate. The point cloud itself is updated with new RGB-D observations from the robot at this lower frequency.

This design allows the user to move their head and receive immediate visual feedback (rendered view), preserving perceptual continuity and reducing motion sickness, while the robot's physical movements and scene updates occur at a rate dictated by robot control capabilities.
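
Below is a minimal sketch of this dual-rate structure, assuming hypothetical `vr`, `robot`, and `camera` interfaces, a `render_stereo` routine standing in for the actual stereo renderer, and a back-projection helper like the one sketched earlier; the point is only the decoupling of the fast render loop from the slow robot and scene-update loop:

```python
import threading
import time

class SharedScene:
    """Latest world-frame point cloud, shared between the two loops."""
    def __init__(self):
        self._lock = threading.Lock()
        self._cloud = None

    def update(self, cloud):
        with self._lock:
            self._cloud = cloud

    def snapshot(self):
        with self._lock:
            return self._cloud

def render_loop(scene, vr, render_stereo, hz=150.0):
    """High-rate loop: render stereo views of the latest point cloud at the
    user's current head pose, independent of the robot's physical latency."""
    while True:
        head_pose = vr.get_head_pose()            # hypothetical VR interface
        cloud = scene.snapshot()
        if cloud is not None:
            vr.show(render_stereo(cloud, head_pose))
        time.sleep(1.0 / hz)

def robot_loop(scene, vr, robot, camera, backproject, hz=10.0):
    """Low-rate loop: track the operator's head with the robot neck and
    refresh the shared point cloud with the newest RGB-D observation."""
    while True:
        robot.move_neck_to(vr.get_head_pose())    # hypothetical robot interface
        rgb, depth, K, T_world_cam = camera.get_frame()
        scene.update(backproject(depth, rgb, K, T_world_cam))
        time.sleep(1.0 / hz)

# The two loops would run in separate threads at different rates, e.g.:
# threading.Thread(target=render_loop, args=(scene, vr, render_stereo), daemon=True).start()
# threading.Thread(target=robot_loop, args=(scene, vr, robot, camera, rgbd_to_world_points), daemon=True).start()
```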

Policy Learning Framework:

The system learns visuomotor policies using an approach based on Diffusion Policy. The policy takes the current RGB image from the active head camera and the robot's proprioceptive state as input.

  • Visual Input: The RGB image from the head camera is processed by a DINOv2-pretrained ViT backbone. The classification token (a 384-dimensional vector) serves as a compact semantic representation.
  • Proprioceptive State: This includes the end-effector poses (position and quaternion, $\in \mathbb{R}^7$ each) for the neck, left arm, and right arm, plus the two gripper widths (2 scalars), giving a state in $\mathbb{R}^{23}$.
  • Policy Output: The policy predicts a sequence of future actions ($\in \mathbb{R}^{n_p \times 23}$), where each action specifies the desired future end-effector poses of the neck and both arms, plus the gripper widths, in the world frame. Only the first $n_a$ actions of the predicted sequence are executed. The authors use a prediction horizon $n_p = 16$ and an execution horizon $n_a = 8$, with the policy operating at 10 Hz.

This policy learns coordinated head and arm movements to achieve task goals, inferring active perception strategies (like searching and focusing) from the human demonstrations collected via the VR interface.
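
Stripped of the diffusion-model details, one receding-horizon control step of such a policy might look roughly like the following sketch; `encoder` and `policy.sample` are placeholders rather than the authors' API, and concatenating the visual feature with proprioception is an illustrative conditioning choice:

```python
import numpy as np

POSE_DIM = 7                    # xyz + quaternion per end effector
STATE_DIM = 3 * POSE_DIM + 2    # neck + two arms + two gripper widths = 23
N_PRED, N_EXEC = 16, 8          # prediction / execution horizons
CONTROL_HZ = 10                 # policy and robot control rate

def control_step(policy, encoder, rgb, proprio):
    """One receding-horizon step (illustrative sketch).

    encoder: maps the head-camera RGB image to a 384-d DINOv2 [CLS] feature
    policy:  samples an action chunk of shape (N_PRED, STATE_DIM) given the
             observation (e.g., via diffusion denoising)
    """
    assert proprio.shape == (STATE_DIM,)
    obs_feat = encoder(rgb)                      # (384,)
    obs = np.concatenate([obs_feat, proprio])    # illustrative conditioning
    action_chunk = policy.sample(obs)            # (N_PRED, STATE_DIM)
    # Execute only the first N_EXEC actions, i.e., N_EXEC / CONTROL_HZ seconds
    # of motion, before re-planning with a fresh observation.
    return action_chunk[:N_EXEC]
```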

Evaluation:

The system was evaluated on three challenging bimanual manipulation tasks involving significant visual occlusion:

  1. Bag Task: Open a bag, peek inside to find a target object, and retrieve it. Requires interacting with the environment (opening bag) and active viewing (peeking inside).
  2. Cup Task: Find and pick up a cup from a shelf, hand it to the other arm, and place it on a hidden saucer. Requires active viewpoint switching to see objects in cluttered/occluded locations.
  3. Lime & Pot Task: Find a lime, place it in a pot, lift the pot bimanually, and precisely align it on a trivet. Requires searching, bimanual coordination, and precise visual guidance.

Performance was measured by stage-wise success rates, where success is cumulative through stages.
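
As an illustrative sketch (not the authors' evaluation code), such cumulative stage-wise rates can be computed by counting, for each stage, the episodes that completed that stage and every earlier one:

```python
def stagewise_success_rates(episodes):
    """episodes: list of per-episode boolean lists, one flag per stage, in order.
    A stage counts as a success only if all earlier stages in that episode
    also succeeded (cumulative success)."""
    n_stages = len(episodes[0])
    n_eps = len(episodes)
    return [sum(all(ep[: s + 1]) for ep in episodes) / n_eps
            for s in range(n_stages)]

# e.g. stagewise_success_rates([[True, True, False], [True, False, False]])
# -> [1.0, 0.5, 0.0]
```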

Experimental Findings:

  • Camera Setup: Comparing ViA (active head only), active head + wrist cameras, and chest + wrist cameras, ViA significantly outperformed the baselines, with a 45% higher average success rate than the chest + wrist camera setup. Interestingly, adding wrist cameras did not improve performance, suggesting the active head camera already provided sufficient information and that additional views may introduce noise or redundancy.
  • Visual Representation: Comparing ViA (DINOv2), ResNet-DP, and DP3 (3D point cloud) on the same active head camera input, DINOv2 yielded the best results. This highlights the benefit of the strong semantic understanding provided by the pretrained DINOv2 backbone for tasks requiring object search and identification. The DP3 point cloud representation struggled, exhibiting failure modes such as hallucination, possibly due to the lack of pretrained visual priors and the difficulty of reconstructing high-quality 3D geometry from single frames.
  • Teleoperation Interface: A user study comparing the proposed point cloud rendering interface with traditional RGB streaming showed that while point cloud rendering resulted in slightly longer demonstration times, it drastically reduced motion sickness; 6 of 8 participants preferred the point cloud rendering system.

Implementation Considerations and Limitations:

  • Computational Requirements: The system involves real-time processing of RGB-D data, point cloud construction, view rendering, and policy inference using a large visual backbone (DINOv2 ViT). This requires significant computational resources, likely GPUs for visual processing and policy execution.
  • Hardware Complexity: While using an off-the-shelf arm for a neck is simpler than a custom design, a 6-DoF arm still adds complexity to the overall robot system control and kinematics. Bimanual arms further increase this complexity.
  • Data Collection: Although the VR interface improves comfort, collecting a sufficient number of diverse demonstrations for complex tasks (125-260 per task in this work) remains labor-intensive.
  • Policy Limitations: The current policy lacks memory capabilities, which could be important for tasks involving extensive search. It is also not conditioned on language, which could enable more flexible and generalizable task execution based on human instructions. The paper notes that fusing observations from different cameras in a learned shared space might improve performance compared to simple feature concatenation.
  • Teleoperation Fidelity: The point cloud generated from single RGB-D frames can be noisy and incomplete, leading to lower visual fidelity compared to direct RGB streaming, which might make fine-grained manipulation challenging for new users. Future work could explore more advanced 3D reconstruction/rendering techniques like dynamic Gaussian Splatting [4DGS] or NeRFs [eth_nerf_teleop, jacobinerf] for better visual quality.

In practice, deploying such a system would require careful calibration between the robot arms, the neck arm, and the camera to ensure accurate pose estimation for point cloud construction and action execution via inverse kinematics. The asynchronous control loop of the teleoperation system needs robust handling of time synchronization and data buffering. The policy inference speed is critical for a 10 Hz control loop, necessitating efficient hardware and model optimization.
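
For instance, the camera pose used for point cloud construction and for grounding actions in the world frame is the composition of the neck mount pose, the neck arm's forward kinematics, and a hand-eye calibration of the head camera; an error anywhere in this chain appears as misaligned points or misplaced actions. A minimal sketch with illustrative names:

```python
import numpy as np

def head_camera_pose(T_world_neckbase, T_neckbase_ee, T_ee_cam):
    """Compose the chain world <- neck base <- neck end effector <- camera.

    T_world_neckbase: where the neck arm is mounted (from the robot layout)
    T_neckbase_ee:    neck-arm forward kinematics at the current joint angles
    T_ee_cam:         hand-eye calibration of the head camera on the neck flange
    All inputs are 4x4 homogeneous transforms; the result maps camera-frame
    points into the world frame.
    """
    return T_world_neckbase @ T_neckbase_ee @ T_ee_cam
```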

Overall, ViA presents a practical approach to enabling and learning active perception in robot manipulation by addressing key hardware and teleoperation challenges, demonstrating its effectiveness on challenging tasks involving occlusions. The work highlights the significant benefits of actively controlling camera viewpoints based on task needs, particularly when learning from human examples.
