- The paper introduces Reactive Diffusion Policy (RDP), a slow-fast visual-tactile imitation learning algorithm, along with TactAR, a system providing real-time AR tactile feedback, to tackle contact-rich manipulation challenges.
- RDP employs a hierarchical structure combining a slow latent diffusion policy for high-level action chunks and a fast asymmetric tokenizer for high-frequency tactile feedback control, enabling precise force adjustment.
- Experiments demonstrate that RDP significantly outperforms Diffusion Policy baselines that use visual input alone or with simple tactile conditioning, achieving over 35% performance improvement on the Peeling, Wiping, and Bimanual Lifting tasks, and that it is compatible with diverse tactile/force sensors.
The paper introduces Reactive Diffusion Policy (RDP), a slow-fast imitation learning algorithm designed for contact-rich manipulation tasks, along with TactAR, a versatile and low-cost teleoperation system. The core idea behind RDP is to mimic the human control strategy of combining feedforward/predictive actions with closed-loop fine-tuning based on sensory feedback, particularly tactile signals.
The primary contribution lies in addressing the limitations of existing visual imitation learning (IL) approaches that rely on action chunking, which struggle to respond to real-time tactile feedback during chunk execution. The paper also tackles the lack of fine-grained tactile/force feedback in existing teleoperation systems, which restricts the range of tasks for which high-quality demonstrations can be collected.
To overcome these limitations, the authors propose TactAR, which leverages Augmented Reality (AR) via Meta Quest 3 to provide real-time tactile feedback. TactAR represents tactile/force feedback as a 3D deformation field, applicable across different sensors, and renders it in AR, attached to the robot end-effector. This allows the user to perceive contact information, including tactile images, normal force, shear force, and torsional torques.
Complementing TactAR, RDP employs a two-level hierarchy:
- A slow latent diffusion policy (LDP) predicts high-level action chunks in a latent space at a low frequency (1-2 Hz), akin to predictive action. The LDP is a diffusion model operating in the latent space of an asymmetric tokenizer.
- A fast asymmetric tokenizer (AT) enables closed-loop tactile feedback control at a high frequency (20-30 Hz), analogous to closed-loop fine-tuning. The AT is an encoder-decoder structure that reconstructs actions based on latent action chunks and high-frequency tactile feedback.
This slow-fast hierarchical structure allows for modeling complex, non-Markovian actions via the diffusion model and action chunking in the slow network, while enabling real-time response to tactile feedback in the fast network for precise force control.
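The interaction between the two levels can be viewed as a nested control loop: the slow policy refreshes a latent action chunk at 1-2 Hz, while the fast decoder converts it into low-level actions at 20-30 Hz conditioned on the latest tactile reading. Below is a minimal Python-style sketch of such a loop; `slow_policy.predict_latent_chunk`, `fast_decoder.decode_step`, and the `camera`/`tactile`/`robot` interfaces are hypothetical placeholders, not the authors' implementation.

```python
import time

def run_slow_fast_control(slow_policy, fast_decoder, camera, tactile, robot,
                          slow_hz=2.0, fast_hz=20.0):
    """Hypothetical slow-fast control loop: the slow policy replans a latent
    action chunk at ~1-2 Hz; the fast decoder turns it into actions at
    ~20-30 Hz, conditioned on fresh tactile feedback."""
    latent_chunk = None
    next_slow_update = 0.0
    step_in_chunk = 0

    while True:
        now = time.monotonic()

        # Slow path: re-plan a latent action chunk from current observations.
        if latent_chunk is None or now >= next_slow_update:
            obs = {"rgb": camera.read(), "tactile": tactile.read()}
            latent_chunk = slow_policy.predict_latent_chunk(obs)  # shape (t, d)
            step_in_chunk = 0
            next_slow_update = now + 1.0 / slow_hz

        # Fast path: decode the next action conditioned on the latest tactile input.
        tactile_feedback = tactile.read()
        action = fast_decoder.decode_step(latent_chunk, tactile_feedback, step_in_chunk)
        robot.send_action(action)
        step_in_chunk += 1

        time.sleep(1.0 / fast_hz)
```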
The paper details the components of TactAR, including the extraction of the 3D deformation field from marker motion on optical tactile sensors. The image captured by the optical tactile sensor at frame $t$ is denoted $I_t$. The normalized 2D locations of the marker array, $D_t$, are extracted from $I_t$ with OpenCV. The 2D optical flow from the undeformed frame $D_0$ to the current frame $D_t$ is computed as $F_t = [d_x, d_y] = \text{Flow}(D_0, D_t)$, where $d_x$ and $d_y$ are the displacements along the x and y axes. This yields the 3D deformation field $V_t = [d_x, d_y, o_z]$ used in AR, where $o_z$ is a z-axis offset added for 3D visualization. For force sensors, $V_t = [f_x, f_y, f_z]$ is visualized instead, where $f_x$, $f_y$, and $f_z$ are the forces along the x, y, and z axes.
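The following sketch illustrates one plausible OpenCV-based pipeline for this step. The paper only states that marker locations are extracted with OpenCV, so the thresholding/contour approach, the assumption that markers are already matched in order across frames, and the constant `z_offset` are all illustrative assumptions.

```python
import cv2
import numpy as np

def detect_marker_centers(image):
    """Detect marker centroids on the tactile image (one plausible approach)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 60, 255, cv2.THRESH_BINARY_INV)  # dark markers
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centers = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centers.append([m["m10"] / m["m00"], m["m01"] / m["m00"]])
    return np.asarray(centers, dtype=np.float32)

def normalize(centers, image_shape):
    """Normalize pixel coordinates by image width/height, as in D_t."""
    h, w = image_shape[:2]
    return centers / np.array([w, h], dtype=np.float32)

def deformation_field(d0, dt, z_offset=0.01):
    """2D flow F_t = D_t - D_0 (assuming markers are matched in the same order),
    lifted to a 3D field V_t = [dx, dy, o_z] for AR rendering."""
    flow = dt - d0                                  # (N, 2): dx, dy per marker
    oz = np.full((flow.shape[0], 1), z_offset)      # constant z offset for display
    return np.concatenate([flow, oz], axis=1)       # (N, 3)
```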
The TactAR system consists of a workstation node, RGB camera nodes, tactile camera nodes, a robot controller node, and the VR headset node, synchronized using ROS2.
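As an illustration of how such streams might be aligned, here is a minimal ROS2 (rclpy) node that time-synchronizes an RGB camera topic with a tactile camera topic using `message_filters`. The topic names and message types are hypothetical; the paper does not specify the exact synchronization mechanism.

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from message_filters import Subscriber, ApproximateTimeSynchronizer

class TactARSyncNode(Node):
    """Minimal sketch: time-synchronize an RGB stream with a tactile camera
    stream (topic names are hypothetical)."""
    def __init__(self):
        super().__init__("tactar_sync_node")
        rgb_sub = Subscriber(self, Image, "/camera/color/image_raw")
        tactile_sub = Subscriber(self, Image, "/gelsight/image_raw")
        # Approximate time sync tolerates small timestamp offsets (slop in seconds).
        self.sync = ApproximateTimeSynchronizer([rgb_sub, tactile_sub],
                                                queue_size=10, slop=0.05)
        self.sync.registerCallback(self.on_synced)

    def on_synced(self, rgb_msg, tactile_msg):
        # Forward the aligned pair to AR rendering / data recording.
        self.get_logger().info("Received synchronized RGB + tactile frames")

def main():
    rclpy.init()
    rclpy.spin(TactARSyncNode())

if __name__ == "__main__":
    main()
```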
The asymmetric tokenizer (AT) comprises a 1D-CNN encoder $\mathscr{E}$ and a Gated Recurrent Unit (GRU) decoder $\mathscr{D}$. Given an action chunk $\mathbf{A} \in \mathbb{R}^{T \times D}$, the encoder downsamples it to a latent chunk $\mathbf{Z} = \mathscr{E}(\mathbf{A}) \in \mathbb{R}^{t \times d}$. The decoder reconstructs the action as $\hat{\mathbf{A}} = \mathscr{D}(\text{concat}([\mathbf{Z}, \mathbf{F}^{reduced}]))$, where $\mathbf{F}^{reduced}$ is the corresponding tactile representation sequence. The AT is trained using an L1 reconstruction loss and a Kullback-Leibler (KL) penalty loss (see the code sketch after the notation list below):
$L_{AT} = \mathbb{E}_{(\mathbf{A}, \mathbf{F}^{reduced}) \in \mathcal{D}_{policy}} \left[ \|\mathbf{A} - \hat{\mathbf{A}}\|_1 + \lambda_{KL} L_{KL} \right]$.
Where:
- $L_{AT}$ is the loss of the asymmetric tokenizer
- $\mathbf{A}$ is the action chunk
- $\hat{\mathbf{A}}$ is the reconstructed action chunk
- $\mathbf{F}^{reduced}$ is the reduced tactile representation
- $\mathcal{D}_{policy}$ is the policy learning dataset
- $\lambda_{KL}$ is the coefficient of the Kullback-Leibler divergence loss
- $L_{KL}$ is the Kullback-Leibler divergence loss
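A minimal PyTorch sketch of this encoder-decoder structure is shown below. Layer sizes, the downsampling factor, and the exact architecture are assumptions rather than the paper's values, and the KL term is simplified to a Gaussian penalty on deterministic latents (the paper's tokenizer presumably parameterizes a latent distribution).

```python
import torch
import torch.nn as nn

class AsymmetricTokenizer(nn.Module):
    """Sketch of the asymmetric tokenizer: a 1D-CNN encoder compresses an
    action chunk into a shorter latent sequence; a GRU decoder reconstructs
    the actions conditioned on the latent chunk and tactile features."""
    def __init__(self, action_dim, tactile_dim, latent_dim=64, hidden_dim=256):
        super().__init__()
        # Encoder E: temporal 1D convolutions with stride-2 downsampling (T -> ~T/4).
        self.encoder = nn.Sequential(
            nn.Conv1d(action_dim, hidden_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden_dim, latent_dim, kernel_size=5, stride=2, padding=2),
        )
        # Decoder D: GRU over the upsampled latent chunk concatenated with tactile features.
        self.decoder_rnn = nn.GRU(latent_dim + tactile_dim, hidden_dim, batch_first=True)
        self.decoder_head = nn.Linear(hidden_dim, action_dim)

    def encode(self, actions):                        # actions: (B, T, action_dim)
        z = self.encoder(actions.transpose(1, 2))     # (B, latent_dim, t)
        return z.transpose(1, 2)                      # (B, t, latent_dim)

    def decode(self, z, tactile):                     # tactile: (B, T, tactile_dim)
        T = tactile.shape[1]
        # Upsample the latent chunk back to the action horizon before conditioning.
        z_up = nn.functional.interpolate(z.transpose(1, 2), size=T,
                                         mode="linear").transpose(1, 2)
        h, _ = self.decoder_rnn(torch.cat([z_up, tactile], dim=-1))
        return self.decoder_head(h)                   # (B, T, action_dim)

def tokenizer_loss(model, actions, tactile, lambda_kl=1e-3):
    """L1 reconstruction loss plus a simplified stand-in for the KL penalty."""
    z = model.encode(actions)
    recon = model.decode(z, tactile)
    l1 = (actions - recon).abs().mean()
    kl = 0.5 * (z ** 2).mean()  # pushes latents toward a unit Gaussian
    return l1 + lambda_kl * kl
```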
The slow policy is modeled as a Diffusion Policy operating on latent action chunks, termed the Latent Diffusion Policy (LDP). During training, given the observation $\mathbf{O}$, the gradient field is learned by $\epsilon_\theta$ with the Denoising Diffusion Probabilistic Models (DDPM) training objective (a code sketch of this objective follows the notation list below):
$L_{LDP} = \mathbb{E}_{(\mathbf{O}, \mathbf{A}^0) \in \mathcal{D}_{policy},\, k,\, \epsilon^k} \left\| \epsilon^k - \epsilon_\theta(\mathbf{O}, \mathbf{Z}^0 + \epsilon^k, k) \right\|^2$
Where:
- $L_{LDP}$ is the loss of the latent diffusion policy
- $\mathbf{O}$ is the observation
- $\mathbf{A}^0$ is the original (noise-free) action chunk
- $\mathcal{D}_{policy}$ is the policy learning dataset
- $k$ is the denoising iteration index
- $\epsilon^k$ is the random noise added at iteration $k$
- $\epsilon_\theta$ is the learned gradient field (noise-prediction network)
- $\mathbf{Z}^0$ is the latent representation of $\mathbf{A}^0$ produced by the tokenizer encoder
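The sketch below computes this objective with PyTorch. The equation abbreviates the noising step as $\mathbf{Z}^0 + \epsilon^k$; this sketch uses the standard DDPM forward process with a simple linear beta schedule, and `eps_theta` and `obs_encoding` are placeholders for the denoising network and observation features (both assumptions, not the paper's exact implementation).

```python
import torch

def ldp_training_loss(eps_theta, obs_encoding, z0, num_train_timesteps=100,
                      alphas_cumprod=None):
    """DDPM objective on latent action chunks: add noise to the clean latent Z^0
    at a random step k and train eps_theta to predict that noise."""
    B = z0.shape[0]
    if alphas_cumprod is None:
        # Simple linear beta schedule; the paper's scheduler may differ.
        betas = torch.linspace(1e-4, 2e-2, num_train_timesteps)
        alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    k = torch.randint(0, num_train_timesteps, (B,), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod.to(z0.device)[k].view(B, 1, 1)

    # Forward diffusion: noisy latent chunk at step k.
    z_k = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps

    # Predict the injected noise and regress onto it (MSE).
    eps_pred = eps_theta(obs_encoding, z_k, k)
    return torch.nn.functional.mse_loss(eps_pred, eps)
```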
Experiments were conducted on three challenging contact-rich tasks: Peeling, Wiping, and Bimanual Lifting. The tasks require precision, adaptive force control, and bimanual coordination, respectively. The authors compared RDP against Diffusion Policy baselines with visual input only, tactile images, and tactile embeddings.
The results indicate that RDP significantly outperforms the baselines, achieving a performance improvement of over 35% across the three tasks. The experiments also demonstrate that RDP is applicable across different tactile/force sensors, including GelSight Mini, MCTac, and built-in joint torque sensors. Ablation studies validate the importance of the slow-fast hierarchy, relative trajectory prediction, and latency matching for RDP's performance. The paper also shows that TactAR improves the quality of human demonstrations by stabilizing the contact forces applied during teleoperation.
The paper identifies limitations such as the non-intuitive nature of AR-based tactile feedback, the restriction to two-finger grippers, and the inability to process high-frequency image inputs in the fast policy. Future research directions include improving teleoperation efficiency, extending the system to dexterous hands, incorporating high-frequency visual inputs, and integrating RDP with Vision-Language-Action (VLA) models.