- The paper introduces Reactive Diffusion Policy (RDP), a slow-fast visual-tactile imitation learning algorithm, along with TactAR, a system providing real-time AR tactile feedback, to tackle contact-rich manipulation challenges.
- RDP employs a hierarchical structure combining a slow latent diffusion policy for high-level action chunks and a fast asymmetric tokenizer for high-frequency tactile feedback control, enabling precise force adjustment.
- Experiments demonstrate that RDP significantly outperforms Diffusion Policy baselines that use visual input alone or with simple tactile conditioning, achieving over 35% performance improvement on the Peeling, Wiping, and Bimanual Lifting tasks, and that it is compatible with diverse tactile/force sensors.
The paper introduces Reactive Diffusion Policy (RDP), a slow-fast imitation learning algorithm designed for contact-rich manipulation tasks, along with TactAR, a versatile and low-cost teleoperation system. The core idea behind RDP is to mimic the human control strategy of combining feedforward/predictive actions with closed-loop fine-tuning based on sensory feedback, particularly tactile signals.
The primary contribution lies in addressing the limitations of existing visual imitation learning (IL) approaches that rely on action chunking, which struggle to respond to real-time tactile feedback during chunk execution. The paper also tackles the lack of fine-grained tactile/force feedback in existing teleoperation systems, which restricts the range of tasks for which high-quality demonstrations can be collected.
To overcome these limitations, the authors propose TactAR, which leverages Augmented Reality (AR) via Meta Quest 3 to provide real-time tactile feedback. TactAR represents tactile/force feedback as a 3D deformation field, applicable across different sensors, and renders it in AR, attached to the robot end-effector. This allows the user to perceive contact information, including tactile images, normal force, shear force, and torsional torques.
Complementing TactAR, RDP employs a two-level hierarchy:
- A slow latent diffusion policy (LDP) predicts high-level action chunks in a latent space at a low frequency (1-2 Hz), akin to predictive action. The LDP is a diffusion model operating in the latent space of an asymmetric tokenizer.
- A fast asymmetric tokenizer (AT) enables closed-loop tactile feedback control at a high frequency (20-30 Hz), analogous to closed-loop fine-tuning. The AT is an encoder-decoder structure that reconstructs actions based on latent action chunks and high-frequency tactile feedback.
This slow-fast hierarchical structure allows for modeling complex, non-Markovian actions via the diffusion model and action chunking in the slow network, while enabling real-time response to tactile feedback in the fast network for precise force control.
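The interaction between the two levels can be viewed as a nested control loop: the slow policy refreshes a latent action chunk at 1-2 Hz, while the fast decoder converts it into low-level actions at 20-30 Hz conditioned on the latest tactile reading. Below is a minimal Python-style sketch of such a loop; `slow_policy.predict_latent_chunk`, `fast_decoder.decode_step`, and the `camera`/`tactile`/`robot` interfaces are hypothetical placeholders, not the authors' implementation.

```python
import time

def run_slow_fast_control(slow_policy, fast_decoder, camera, tactile, robot,
                          slow_hz=2.0, fast_hz=20.0):
    """Hypothetical slow-fast control loop: the slow policy replans a latent
    action chunk at ~1-2 Hz; the fast decoder turns it into actions at
    ~20-30 Hz, conditioned on fresh tactile feedback."""
    latent_chunk = None
    next_slow_update = 0.0
    step_in_chunk = 0

    while True:
        now = time.monotonic()

        # Slow path: re-plan a latent action chunk from current observations.
        if latent_chunk is None or now >= next_slow_update:
            obs = {"rgb": camera.read(), "tactile": tactile.read()}
            latent_chunk = slow_policy.predict_latent_chunk(obs)  # shape (t, d)
            step_in_chunk = 0
            next_slow_update = now + 1.0 / slow_hz

        # Fast path: decode the next action conditioned on the latest tactile input.
        tactile_feedback = tactile.read()
        action = fast_decoder.decode_step(latent_chunk, tactile_feedback, step_in_chunk)
        robot.send_action(action)
        step_in_chunk += 1

        time.sleep(1.0 / fast_hz)
```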
The paper details the components of TactAR, including the extraction of the 3D deformation field from marker motion on optical tactile sensors. The image captured by the optical tactile sensor at frame $t$ is denoted $I_t$. The normalized 2D locations of the marker array, $D_t$, are extracted from $I_t$ with OpenCV. The 2D optical flow from the undeformed frame $D_0$ to the current frame $D_t$ is computed as $F_t = [d_x, d_y] = \text{Flow}(D_0, D_t)$, where $d_x$ and $d_y$ are the displacements along the x and y axes. This yields the 3D deformation field $V_t = [d_x, d_y, o_z]$ used in AR, where $o_z$ is a z-axis offset added for 3D visualization. For force sensors, $V_t = [f_x, f_y, f_z]$ is visualized instead, where $f_x$, $f_y$, and $f_z$ are the forces along the x, y, and z axes.
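The following sketch illustrates one plausible OpenCV-based pipeline for this step. The paper only states that marker locations are extracted with OpenCV, so the thresholding/contour approach, the assumption that markers are already matched in order across frames, and the constant `z_offset` are all illustrative assumptions.

```python
import cv2
import numpy as np

def detect_marker_centers(image):
    """Detect marker centroids on the tactile image (one plausible approach)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 60, 255, cv2.THRESH_BINARY_INV)  # dark markers
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centers = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centers.append([m["m10"] / m["m00"], m["m01"] / m["m00"]])
    return np.asarray(centers, dtype=np.float32)

def normalize(centers, image_shape):
    """Normalize pixel coordinates by image width/height, as in D_t."""
    h, w = image_shape[:2]
    return centers / np.array([w, h], dtype=np.float32)

def deformation_field(d0, dt, z_offset=0.01):
    """2D flow F_t = D_t - D_0 (assuming markers are matched in the same order),
    lifted to a 3D field V_t = [dx, dy, o_z] for AR rendering."""
    flow = dt - d0                                  # (N, 2): dx, dy per marker
    oz = np.full((flow.shape[0], 1), z_offset)      # constant z offset for display
    return np.concatenate([flow, oz], axis=1)       # (N, 3)
```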
The TactAR system consists of a workstation node, RGB camera nodes, tactile camera nodes, a robot controller node, and the VR headset node, synchronized using ROS2.
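As an illustration of how such streams might be aligned, here is a minimal ROS2 (rclpy) node that time-synchronizes an RGB camera topic with a tactile camera topic using `message_filters`. The topic names and message types are hypothetical; the paper does not specify the exact synchronization mechanism.

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from message_filters import Subscriber, ApproximateTimeSynchronizer

class TactARSyncNode(Node):
    """Minimal sketch: time-synchronize an RGB stream with a tactile camera
    stream (topic names are hypothetical)."""
    def __init__(self):
        super().__init__("tactar_sync_node")
        rgb_sub = Subscriber(self, Image, "/camera/color/image_raw")
        tactile_sub = Subscriber(self, Image, "/gelsight/image_raw")
        # Approximate time sync tolerates small timestamp offsets (slop in seconds).
        self.sync = ApproximateTimeSynchronizer([rgb_sub, tactile_sub],
                                                queue_size=10, slop=0.05)
        self.sync.registerCallback(self.on_synced)

    def on_synced(self, rgb_msg, tactile_msg):
        # Forward the aligned pair to AR rendering / data recording.
        self.get_logger().info("Received synchronized RGB + tactile frames")

def main():
    rclpy.init()
    rclpy.spin(TactARSyncNode())

if __name__ == "__main__":
    main()
```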
The asymmetric tokenizer (AT) comprises a 1D-CNN encoder $\mathscr{E}$ and a Gated Recurrent Unit (GRU) decoder $\mathscr{D}$. Given an action chunk $\mathbf{A} \in \mathbb{R}^{T \times D}$, the encoder downsamples it to a latent chunk $\mathbf{Z} = \mathscr{E}(\mathbf{A}) \in \mathbb{R}^{t \times d}$. The decoder reconstructs the action as $\hat{\mathbf{A}} = \mathscr{D}(\text{concat}([\mathbf{Z}, \mathbf{F}^{reduced}]))$, where $\mathbf{F}^{reduced}$ is the corresponding tactile representation sequence. The AT is trained using an L1 reconstruction loss and a Kullback-Leibler (KL) penalty loss (see the code sketch after the notation list below):
$L_{AT} = \mathbb{E}_{(\mathbf{A}, \mathbf{F}^{reduced}) \in \mathcal{D}_{policy}} \left[ \|\mathbf{A} - \hat{\mathbf{A}}\|_1 + \lambda_{KL} L_{KL} \right]$.
Where:
- $L_{AT}$ is the loss of the asymmetric tokenizer
- $\mathbf{A}$ is the action chunk
- $\hat{\mathbf{A}}$ is the reconstructed action chunk
- $\mathbf{F}^{reduced}$ is the reduced tactile representation
- $\mathcal{D}_{policy}$ is the policy learning dataset
- $\lambda_{KL}$ is the coefficient of the Kullback-Leibler divergence loss
- $L_{KL}$ is the Kullback-Leibler divergence loss
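A minimal PyTorch sketch of this encoder-decoder structure is shown below. Layer sizes, the downsampling factor, and the exact architecture are assumptions rather than the paper's values, and the KL term is simplified to a Gaussian penalty on deterministic latents (the paper's tokenizer presumably parameterizes a latent distribution).

```python
import torch
import torch.nn as nn

class AsymmetricTokenizer(nn.Module):
    """Sketch of the asymmetric tokenizer: a 1D-CNN encoder compresses an
    action chunk into a shorter latent sequence; a GRU decoder reconstructs
    the actions conditioned on the latent chunk and tactile features."""
    def __init__(self, action_dim, tactile_dim, latent_dim=64, hidden_dim=256):
        super().__init__()
        # Encoder E: temporal 1D convolutions with stride-2 downsampling (T -> ~T/4).
        self.encoder = nn.Sequential(
            nn.Conv1d(action_dim, hidden_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden_dim, latent_dim, kernel_size=5, stride=2, padding=2),
        )
        # Decoder D: GRU over the upsampled latent chunk concatenated with tactile features.
        self.decoder_rnn = nn.GRU(latent_dim + tactile_dim, hidden_dim, batch_first=True)
        self.decoder_head = nn.Linear(hidden_dim, action_dim)

    def encode(self, actions):                        # actions: (B, T, action_dim)
        z = self.encoder(actions.transpose(1, 2))     # (B, latent_dim, t)
        return z.transpose(1, 2)                      # (B, t, latent_dim)

    def decode(self, z, tactile):                     # tactile: (B, T, tactile_dim)
        T = tactile.shape[1]
        # Upsample the latent chunk back to the action horizon before conditioning.
        z_up = nn.functional.interpolate(z.transpose(1, 2), size=T,
                                         mode="linear").transpose(1, 2)
        h, _ = self.decoder_rnn(torch.cat([z_up, tactile], dim=-1))
        return self.decoder_head(h)                   # (B, T, action_dim)

def tokenizer_loss(model, actions, tactile, lambda_kl=1e-3):
    """L1 reconstruction loss plus a simplified stand-in for the KL penalty."""
    z = model.encode(actions)
    recon = model.decode(z, tactile)
    l1 = (actions - recon).abs().mean()
    kl = 0.5 * (z ** 2).mean()  # pushes latents toward a unit Gaussian
    return l1 + lambda_kl * kl
```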
The slow policy is modeled as a Diffusion Policy operating on latent action chunks, termed the Latent Diffusion Policy (LDP). During training, given the observation $\mathbf{O}$, the gradient field is learned by $\epsilon_\theta$ with the Denoising Diffusion Probabilistic Models (DDPM) training objective (a code sketch of this objective follows the notation list below):
$L_{LDP} = \mathbb{E}_{(\mathbf{O}, \mathbf{A}^0) \in \mathcal{D}_{policy},\, k,\, \epsilon^k} \left\| \epsilon^k - \epsilon_\theta(\mathbf{O}, \mathbf{Z}^0 + \epsilon^k, k) \right\|^2$
Where:
- $L_{LDP}$ is the loss of the latent diffusion policy
- $\mathbf{O}$ is the observation
- $\mathbf{A}^0$ is the original (noise-free) action chunk
- $\mathcal{D}_{policy}$ is the policy learning dataset
- $k$ is the denoising iteration index
- $\epsilon^k$ is the random noise added at iteration $k$
- $\epsilon_\theta$ is the learned gradient field (noise-prediction network)
- $\mathbf{Z}^0$ is the latent representation of $\mathbf{A}^0$ produced by the tokenizer encoder
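The sketch below computes this objective with PyTorch. The equation abbreviates the noising step as $\mathbf{Z}^0 + \epsilon^k$; this sketch uses the standard DDPM forward process with a simple linear beta schedule, and `eps_theta` and `obs_encoding` are placeholders for the denoising network and observation features (both assumptions, not the paper's exact implementation).

```python
import torch

def ldp_training_loss(eps_theta, obs_encoding, z0, num_train_timesteps=100,
                      alphas_cumprod=None):
    """DDPM objective on latent action chunks: add noise to the clean latent Z^0
    at a random step k and train eps_theta to predict that noise."""
    B = z0.shape[0]
    if alphas_cumprod is None:
        # Simple linear beta schedule; the paper's scheduler may differ.
        betas = torch.linspace(1e-4, 2e-2, num_train_timesteps)
        alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    k = torch.randint(0, num_train_timesteps, (B,), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod.to(z0.device)[k].view(B, 1, 1)

    # Forward diffusion: noisy latent chunk at step k.
    z_k = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps

    # Predict the injected noise and regress onto it (MSE).
    eps_pred = eps_theta(obs_encoding, z_k, k)
    return torch.nn.functional.mse_loss(eps_pred, eps)
```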
Experiments were conducted on three challenging contact-rich tasks: Peeling, Wiping, and Bimanual Lifting. The tasks require precision, adaptive force control, and bimanual coordination, respectively. The authors compared RDP against Diffusion Policy baselines with visual input only, tactile images, and tactile embeddings.
The results indicate that RDP significantly outperforms the baselines, achieving a performance improvement of over 35% across the three tasks. The experiments also demonstrate that RDP is applicable across different tactile/force sensors, including GelSight Mini, MCTac, and built-in joint torque sensors. Ablation studies validate the importance of the slow-fast hierarchy, relative trajectory prediction, and latency matching for RDP's performance. The paper also shows that TactAR improves the quality of human demonstrations by stabilizing the contact forces applied during teleoperation.
The paper identifies limitations such as the non-intuitive nature of AR-based tactile feedback, the restriction to two-finger grippers, and the inability to process high-frequency image inputs in the fast policy. Future research directions include improving teleoperation efficiency, extending the system to dexterous hands, incorporating high-frequency visual inputs, and integrating RDP with Vision-Language-Action (VLA) models.