- The paper provides a comprehensive survey of how diffusion models robustly model complex, multi-modal distributions for robotic manipulation tasks such as grasping, trajectory generation, and data augmentation.
- It details the mathematical foundations of score-based and denoising diffusion models and describes architectural improvements such as faster sampling techniques and robot-specific conditioning.
- The survey reviews how these methods are evaluated in both simulation and real-world settings, and discusses the generalizability and sampling-speed challenges critical for practical robotic applications.
This survey provides a comprehensive review of the application of diffusion models (DMs) in robotic manipulation, covering grasp learning, trajectory planning, and data augmentation (2504.08438). It highlights the advantages of DMs, such as their ability to model complex, multi-modal distributions robustly in high-dimensional spaces, often outperforming Gaussian mixture models (GMMs), energy-based models (EBMs), and generative adversarial networks (GANs). The survey notes a significant increase in DM applications in robotics since 2022.
Fundamentals of Diffusion Models:
The paper first introduces the mathematical foundations of two primary DM frameworks:
- Score-Based DMs (SMLD/NCSN): Learn the score (the gradient of the log data density) of perturbed data distributions and reverse the noise-adding (forward) process via Langevin dynamics.
- Denoising Diffusion Probabilistic Models (DDPM): Train a network to predict the noise added during the forward process and use this prediction to iteratively denoise samples in the reverse process.
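For reference, the standard objectives behind these two frameworks (textbook formulations from the diffusion literature, using the usual notation rather than anything specific to this survey) are:

```latex
% DDPM: noise the data, train \epsilon_\theta to predict the added noise
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big),
\qquad \bar\alpha_t = \textstyle\prod_{s=1}^{t}(1-\beta_s)

\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{x_0,\,\epsilon\sim\mathcal{N}(0,I),\,t}
\big[\,\lVert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t)\rVert^2\,\big]

% Score-based view: the two parameterizations are linked by
s_\theta(x_t, t) \approx \nabla_{x_t}\log p_t(x_t) = -\,\epsilon_\theta(x_t, t)\,/\,\sqrt{1-\bar\alpha_t}
```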
It discusses key architectural improvements aimed at addressing the slow sampling speed inherent in DMs:
- Faster Sampling: Techniques like Denoising Diffusion Implicit Models (DDIM) (2010.02502), SDE/ODE formulations (2011.13456), and specialized solvers (e.g., DPM-Solver (2206.00927)) allow for fewer sampling steps, often using deterministic processes. Non-uniform step sizes and learned noise schedules (iDDPM (2102.09672)) further enhance sample quality and speed.
- Robotics Adaptations: Conditioning the denoising process on robot observations (proprioception, visual data such as images/point clouds, language instructions) is crucial. Handling temporally correlated data like trajectories often involves predicting subsequences with receding horizon control (see the sketch below).
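The receding-horizon pattern is easy to state in code. A minimal sketch, assuming a hypothetical `policy.sample_actions(obs, horizon)` that runs the reverse diffusion process to produce an H-step action chunk, and a gym-style `env`:

```python
def receding_horizon_rollout(env, policy, horizon=16, n_execute=8, max_steps=200):
    """Predict an H-step action chunk, execute only a prefix, then replan.

    `policy.sample_actions(obs, horizon)` is a stand-in for a diffusion
    policy's reverse process; `env` follows a gym-style reset/step API.
    """
    obs = env.reset()
    for _ in range(max_steps // n_execute):
        actions = policy.sample_actions(obs, horizon)  # (horizon, action_dim)
        for a in actions[:n_execute]:                  # execute first few actions
            obs, reward, done, info = env.step(a)
            if done:
                return obs
    return obs
```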
Architectures for Robotic Manipulation:
Three main architectures are used for the denoising network in robotic DMs:
- CNNs (Temporal U-Nets): Adapted from image-generation U-Nets (1505.04597), using 1D temporal convolutions. Conditioning is often done via FiLM layers (1709.07871; a minimal sketch follows this list). They are generally robust but can over-smooth predicted action sequences. (2204.08487, 2303.04137)
- Transformers: Process sequences of observations, actions, and time steps as tokens, using attention mechanisms for conditioning. They excel at long-range dependencies but can be hyperparameter-sensitive and resource-intensive. (2303.04137, 2212.09748)
- MLPs: Often used in RL settings, computationally efficient but may struggle with high-dimensional (visual) inputs unless combined with encoders. (2208.06193)
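To illustrate the FiLM conditioning mentioned for temporal U-Nets, here is a minimal PyTorch sketch; all dimensions and layer choices are illustrative, not taken from any cited paper:

```python
import torch
import torch.nn as nn

class FiLMConv1dBlock(nn.Module):
    """1D temporal conv block modulated by FiLM: h -> scale * h + shift,
    where (scale, shift) are predicted from a conditioning vector
    (e.g., encoded observations)."""

    def __init__(self, channels: int, cond_dim: int, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.norm = nn.GroupNorm(8, channels)
        self.film = nn.Linear(cond_dim, 2 * channels)  # predicts scale & shift

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), cond: (batch, cond_dim)
        h = self.norm(self.conv(x))
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = scale.unsqueeze(-1) * h + shift.unsqueeze(-1)
        return torch.relu(h) + x  # residual connection

# usage: y = FiLMConv1dBlock(64, 128)(torch.randn(2, 64, 16), torch.randn(2, 128))
```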
The number of sampling steps is a critical speed-quality trade-off; DDIM is commonly run with 5-10 steps at inference, down from the 50-100 steps used during training (see the sketch below).
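A deterministic DDIM update (eta = 0) shows why sparse step schedules work: each update jumps directly between non-adjacent noise levels. This is the standard formula from (2010.02502); the model interface below is a placeholder:

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, x, alpha_bars, step_indices):
    """Deterministic DDIM sampling (eta = 0) over a sparse step schedule.

    eps_model(x, t) predicts the added noise; alpha_bars[t] is the cumulative
    product of (1 - beta); step_indices is a short decreasing schedule,
    e.g. 10 of the original 100 training steps.
    """
    for t, t_prev in zip(step_indices[:-1], step_indices[1:]):
        a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean sample
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # jump to t_prev
    return x
```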
Applications in Robotic Manipulation:
- Trajectory Generation:
- Imitation Learning (IL): DMs generate smooth, multi-modal trajectories conditioned on observations. Key aspects include:
- Action Representation: Predicting end-effector poses (task space), joint angles (joint space), or even image sequences (image space). Receding horizon control is common.
- Visual Input: Using 2D images or increasingly 3D data (point clouds, embeddings) for better geometric understanding. (2403.0395, 2402.07799)
- Long-Horizon/Multi-Task: Addressed via hierarchical planning, skill learning (often using DMs for low-level skills), or integrating Vision-Language-Action models (VLAs) where DMs refine VLA outputs or generate actions directly. (2310.17702, 2410.07864)
- Constrained Planning: Achieved through classifier guidance (a separate model steers the diffusion process) or classifier-free guidance (conditioning integrated into the DM itself; sketched after this list). (2207.12598, 2112.10741)
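Classifier-free guidance combines conditional and unconditional noise predictions at sampling time, per the standard formulation in (2207.12598); the `eps_model` interface here is a placeholder:

```python
def cfg_eps(eps_model, x, t, cond, guidance_scale=4.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    eps_model(x, t, cond) is assumed to accept cond=None for the
    unconditional branch (trained with condition dropout).
    """
    eps_uncond = eps_model(x, t, None)
    eps_cond = eps_model(x, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```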
- Offline Reinforcement Learning (RL): Integrates rewards into the DM framework:
- Guidance: Using reward models to guide sampling (Diffuser (2205.09991)).
- Conditioning: Directly conditioning the DM on returns (Decision Diffuser (2202.13485)).
- Q-Learning Integration: Modifying the DM loss with a critic (Diffusion-QL (2208.06193)).
Offline RL can leverage suboptimal data better than IL but requires careful tuning and often relies on state inputs rather than raw visual data.
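As a sketch of the Q-learning integration above, following the overall structure of Diffusion-QL (2208.06193) but with placeholder interfaces for the policy and critic:

```python
def diffusion_ql_loss(policy, critic, states, actions, alpha=1.0):
    """Diffusion-QL-style objective: the standard denoising (behavior-cloning)
    loss plus a critic term that pushes sampled actions toward high Q-values.

    `policy.denoising_loss` and `policy.sample` stand in for the diffusion
    policy's epsilon-prediction loss and differentiable reverse-process sampler.
    """
    bc_loss = policy.denoising_loss(states, actions)  # epsilon-prediction MSE
    sampled_actions = policy.sample(states)           # differentiable sampling
    q_loss = -critic(states, sampled_actions).mean()  # maximize expected Q
    return bc_loss + alpha * q_loss
```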
- Robotic Grasp Generation:
- SE(3) Diffusion: Directly generating 6-DoF grasp poses, addressing the non-Euclidean nature of SE(3) via EBMs on Lie groups (2209.03855), flow matching (2210.02747, 2403.04413), or SE(3)-equivariant architectures (2306.08941). This applies to both parallel-jaw and dexterous grasps (2309.12085); a noising sketch follows this list.
- Latent Diffusion: Performing diffusion in a latent space learned by a VAE (2309.02194).
- Affordance/Task-Driven: Using language guidance (2403.03181), learning pre-grasp manipulations (2403.12421), synthesizing human-object interactions (HOI) (2403.13798), or diffusing object poses for rearrangement (2212.01793).
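To make the non-Euclidean issue concrete: rotations cannot be noised by adding Gaussian noise to matrix entries, so a common workaround perturbs poses in the Lie algebra and maps back via the exponential map. A minimal SciPy sketch (illustrative, not any specific paper's noising scheme):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_se3(rotation: R, translation: np.ndarray,
                sigma_rot=0.1, sigma_trans=0.01):
    """Noise a 6-DoF pose: Gaussian noise in the so(3) tangent space,
    mapped onto the rotation manifold via the exponential map (from_rotvec),
    plus ordinary Gaussian noise on the translation."""
    rot_noise = R.from_rotvec(np.random.normal(scale=sigma_rot, size=3))
    trans_noise = np.random.normal(scale=sigma_trans, size=3)
    return rot_noise * rotation, translation + trans_noise

# usage: noisy_rot, noisy_t = perturb_se3(R.identity(), np.zeros(3))
```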
- Visual Data Augmentation:
- Scaling Data: Using pretrained DMs (such as Stable Diffusion (2112.10752)) for inpainting to change textures, objects, or backgrounds, increasing dataset size and diversity for IL/RL (2302.06671); a minimal inpainting sketch follows this list.
- Sensor Data Reconstruction: Completing partial point clouds or images from sensors using DM-based inpainting, sometimes combined with view planning (2304.04531, 2403.03733).
- Object Rearrangement: Generating target scene arrangements from language prompts using text-to-image DMs, often combined with other models like LLMs or NeRFs (2211.10959, 2210.02438).
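The texture/background augmentation described above can be prototyped with an off-the-shelf inpainting pipeline. A minimal sketch using Hugging Face diffusers; the checkpoint name, file paths, and prompt are illustrative, and the cited papers use task-specific pipelines:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load a pretrained inpainting model (checkpoint name is illustrative).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("obs_frame.png")  # robot camera frame (path illustrative)
mask = Image.open("table_mask.png")  # white = region to repaint

# Replace the masked region (e.g., the table surface) with a new texture,
# keeping the rest of the observation intact.
augmented = pipe(
    prompt="a wooden kitchen table",
    image=image,
    mask_image=mask,
).images[0]
```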
Experiments and Benchmarks:
The survey lists common benchmarks (CALVIN, RLBench, D4RL Kitchen, Meta-World) and baselines (SE(3)-Diffusion Policy, Diffuser, Diffusion Policy, 3D Diffusion Policy). Most methods are evaluated in simulation; many are also tested on real robots, though this typically requires training on real data or sim-to-real transfer techniques.
Conclusion, Limitations, and Outlook:
DMs excel at modeling multi-modal distributions and handling high-dimensional data, making them powerful tools for robotic manipulation. Key limitations remain:
- Generalizability: Performance often depends heavily on training data quality and diversity (covariate shift in IL, distribution shift in offline RL). Data augmentation helps but has limits. VLAs offer promise but need refinement.
- Sampling Speed: Iterative sampling remains a bottleneck for real-time control, despite improvements like DDIM. Faster samplers need more investigation in robotics contexts.
Future directions include exploring faster sampling methods, improving generalizability through continual learning and foundation model integration, enhancing robustness in complex/occluded scenes, and leveraging semantic reasoning capabilities.