Diffusion Models for Robotic Manipulation: A Survey (2504.08438v1)
Abstract: Diffusion generative models have demonstrated remarkable success in visual domains such as image and video generation. They have also recently emerged as a promising approach in robotics, especially in robot manipulation. Diffusion models leverage a probabilistic framework and stand out for their ability to model multi-modal distributions and their robustness to high-dimensional input and output spaces. This survey provides a comprehensive review of state-of-the-art diffusion models in robotic manipulation, including grasp learning, trajectory planning, and data augmentation. Diffusion models for scene and image augmentation lie at the intersection of robotics and computer vision for vision-based tasks, enhancing generalizability and mitigating data scarcity. This paper also presents the two main frameworks of diffusion models and their integration with imitation learning and reinforcement learning. In addition, it discusses the common architectures and benchmarks and points out the challenges and advantages of current state-of-the-art diffusion-based methods.
Summary
- The paper provides a comprehensive survey on how diffusion models robustly model complex, multi-modal distributions for robotic manipulation tasks such as grasping, trajectory generation, and data augmentation.
- It details the mathematical foundations of score-based and denoising diffusion models and describes architectural improvements like faster sampling techniques and robot-specific conditioning.
- The survey benchmarks these methods in both simulation and real-world settings, while discussing challenges in generalizability and sampling speed critical for practical robotic applications.
This survey provides a comprehensive review of the application of diffusion models (DMs) in robotic manipulation, covering grasp learning, trajectory planning, and data augmentation (Diffusion Models for Robotic Manipulation: A Survey, 11 Apr 2025). It highlights the advantages of DMs, such as their ability to model complex, multi-modal distributions robustly in high-dimensional spaces, often outperforming methods like Gaussian mixture models (GMMs), energy-based models (EBMs), and GANs. The survey notes a significant increase in DM applications in robotics since 2022.
Fundamentals of Diffusion Models:
The paper first introduces the mathematical foundations of two primary DM frameworks:
- Score-Based DMs (SMLD/NCSN): Learn the score (the gradient of the log data density) of perturbed data distributions to reverse a noise-adding (forward) process using Langevin dynamics.
- Denoising Diffusion Probabilistic Models (DDPM): Train a network to predict the noise added during the forward process and use this prediction to iteratively denoise samples in the reverse process.
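To make the DDPM formulation concrete, here is a minimal PyTorch sketch of the noise-prediction loss and one reverse (ancestral sampling) step. The network `eps_model`, the linear schedule, and all hyperparameters are illustrative assumptions, not the survey's reference implementation:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product \bar{alpha}_t

def training_loss(eps_model, x0):
    """DDPM loss: predict the noise injected by the closed-form forward process."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *[1] * (x0.dim() - 1))
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # q(x_t | x_0) in one shot
    return F.mse_loss(eps_model(x_t, t), eps)

@torch.no_grad()
def reverse_step(eps_model, x_t, t):
    """One denoising step x_t -> x_{t-1} using the predicted noise."""
    eps_hat = eps_model(x_t, torch.full((x_t.shape[0],), t))
    mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + betas[t].sqrt() * noise
```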
It discusses key algorithmic improvements and robotics-specific adaptations, notably those addressing the slow sampling inherent in DMs:
- Faster Sampling: Techniques like Denoising Diffusion Implicit Models (DDIM) (Denoising Diffusion Implicit Models, 2020), SDE/ODE formulations (Score-Based Generative Modeling through Stochastic Differential Equations, 2020), and specialized solvers (e.g., DPM-Solver (DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps, 2022)) allow for fewer sampling steps, often using deterministic processes. Non-uniform step sizes and learned noise schedules (iDDPM (Improved Denoising Diffusion Probabilistic Models, 2021)) further enhance sample quality and speed.
- Robotics Adaptations: Conditioning the denoising process on robot observations (proprioception, visual data like images/point clouds, language instructions) is crucial. Handling temporally correlated data like trajectories often involves predicting subsequences using receding horizon control (see the sketch below).
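As a hedged illustration of receding horizon control with a diffusion policy, the loop below denoises an action subsequence, executes only its first few actions, and replans; `policy.sample` and the Gym-style `env` are assumed interfaces, not an API from the surveyed papers:

```python
def run_receding_horizon(env, policy, horizon=16, n_execute=8, max_steps=400):
    """Replan a denoised action chunk every `n_execute` environment steps."""
    obs = env.reset()
    for _ in range(max_steps // n_execute):
        # Denoise a length-`horizon` action sequence conditioned on the observation.
        action_seq = policy.sample(obs, horizon=horizon)
        # Execute only the first `n_execute` actions, then replan with fresh observations.
        for action in action_seq[:n_execute]:
            obs, reward, done, info = env.step(action)
            if done:
                return obs
    return obs
```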
Architectures for Robotic Manipulation:
Three main architectures are used for the denoising network in robotic DMs:
- CNNs (Temporal U-Nets): Adapted from image-generation U-Nets (U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015), using 1D temporal convolutions. Conditioning is often done via FiLM layers (FiLM: Visual Reasoning with a General Conditioning Layer, 2017); a minimal FiLM layer is sketched after this list. They are generally robust but can cause over-smoothing. (Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, 2023)
- Transformers: Process sequences of observations, actions, and time steps as tokens, using attention mechanisms for conditioning. They excel at long-range dependencies but can be hyperparameter-sensitive and resource-intensive. (Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, 2023, Scalable Diffusion Models with Transformers, 2022)
- MLPs: Often used in RL settings, computationally efficient but may struggle with high-dimensional (visual) inputs unless combined with encoders. (Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning, 2022)
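To make the FiLM conditioning used by the U-Net variants concrete, here is a minimal PyTorch sketch: a conditioning vector (e.g., an observation embedding) predicts a per-channel scale and shift applied to intermediate temporal features. Shapes and names are assumptions for illustration:

```python
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: features are scaled/shifted per channel."""
    def __init__(self, cond_dim, n_channels):
        super().__init__()
        # One linear layer predicts both gamma (scale) and beta (shift).
        self.proj = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, features, cond):
        # features: (B, C, T) temporal feature map; cond: (B, cond_dim)
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * features + beta.unsqueeze(-1)
```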
The number of sampling steps presents a critical trade-off between speed and quality; DDIM is commonly run with 5-10 inference steps, down from the 50-100 steps used during training.
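A deterministic DDIM update over a subsampled schedule, reusing the `alpha_bars` and `eps_model` conventions from the DDPM sketch above, might look like the following (a sketch under those assumptions, with eta = 0 for fully deterministic sampling):

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, x, alpha_bars, n_steps=10, T=1000):
    """Run e.g. 10 inference steps subsampled from a 1000-step training schedule."""
    timesteps = torch.linspace(T - 1, 0, n_steps).long()
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps_hat = eps_model(x, torch.full((x.shape[0],), t))
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
        # Predict the clean sample, then jump directly to the earlier timestep.
        x0_hat = (x - (1 - ab_t).sqrt() * eps_hat) / ab_t.sqrt()
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps_hat
    return x
```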
Applications in Robotic Manipulation:
- Trajectory Generation:
- Imitation Learning (IL): DMs generate smooth, multi-modal trajectories conditioned on observations. Key aspects include:
- Action Representation: Predicting end-effector poses (task space), joint angles (joint space), or even image sequences (image space). Receding horizon control is common.
- Visual Input: Using 2D images or, increasingly, 3D data (point clouds, embeddings) for better geometric understanding. (2403.0395)
- Long-Horizon/Multi-Task: Addressed via hierarchical planning, skill learning (often using DMs for low-level skills), or integrating Vision-Language-Action models (VLAs), where DMs refine VLA outputs or generate actions directly. (RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation, 10 Oct 2024)
- Constrained Planning: Achieved through classifier guidance (a separate model steers diffusion) or classifier-free guidance (conditioning integrated into the DM); a guidance sketch follows this list. (Classifier-Free Diffusion Guidance, 2022, GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, 2021)
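A minimal sketch of classifier-free guidance at sampling time: one network, trained with conditioning randomly dropped, supplies both conditional and unconditional noise estimates, which are blended with a guidance weight `w` (all names here are illustrative):

```python
def cfg_eps(eps_model, x_t, t, cond, null_cond, w=2.0):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    eps_cond = eps_model(x_t, t, cond)         # conditioned on e.g. a goal or language embedding
    eps_uncond = eps_model(x_t, t, null_cond)  # learned "empty" conditioning token
    # w = 0 recovers the purely conditional model; larger w sharpens adherence to cond.
    return (1.0 + w) * eps_cond - w * eps_uncond
```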
- Offline Reinforcement Learning (RL): Integrates rewards into the DM framework:
- Guidance: Using reward models to guide sampling (Diffuser (Planning with Diffusion for Flexible Behavior Synthesis, 2022)).
- Conditioning: Directly conditioning the DM on returns (Decision Diffuser (Is Conditional Generative Modeling all you need for Decision-Making?, 2022)).
- Q-Learning Integration: Modifying the DM loss with a critic (Diffusion-QL (Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning, 2022)); a loss sketch follows this list. Offline RL can leverage suboptimal data better than IL but requires careful tuning and often relies on state inputs rather than raw visual data.
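A hedged sketch of the Diffusion-QL-style objective: the standard noise-prediction (behavior-cloning) loss is augmented with a critic term that pushes actions sampled from the diffusion policy toward high Q-values. `policy`, `critic`, and the weighting `alpha` are illustrative placeholders:

```python
def diffusion_ql_loss(policy, critic, state, action, alpha=1.0):
    """Behavior cloning via diffusion plus Q-maximization of generated actions."""
    bc_loss = policy.diffusion_loss(state, action)  # noise-prediction MSE, as in DDPM
    sampled_action = policy.sample(state)           # denoised action (kept differentiable)
    q_loss = -critic(state, sampled_action).mean()  # maximize Q of generated actions
    return bc_loss + alpha * q_loss
```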
- Robotic Grasp Generation:
- SE(3) Diffusion: Directly generating 6-DoF grasp poses, addressing the non-Euclidean nature of SE(3) via EBMs on Lie groups (SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion, 2022), flow matching (Flow Matching for Generative Modeling, 2022), or ensuring SE(3)-equivariance. This applies to both parallel-jaw and dexterous grasps; a minimal SE(3) noising step is sketched after this list.
- Latent Diffusion: Performing diffusion in a latent space learned by a VAE.
- Affordance/Task-Driven: Using language guidance, learning pre-grasp manipulations (Dexterous Functional Pre-Grasp Manipulation with Diffusion Policy, 19 Mar 2024), synthesizing human-object interactions (HOI), or diffusing object poses for rearrangement.
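To illustrate why SE(3) grasp diffusion needs non-Euclidean treatment, here is a minimal noising step: translation is perturbed additively, while rotation noise is sampled in the Lie algebra so(3) and mapped back through the exponential map so the result remains a valid rotation. Noise scales and the right-perturbation convention are assumptions:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_grasp(rotation: R, translation: np.ndarray,
                  sigma_rot=0.1, sigma_trans=0.01):
    """One forward (noising) step on an SE(3) grasp pose."""
    omega = np.random.randn(3) * sigma_rot             # tangent vector in so(3)
    noisy_rotation = rotation * R.from_rotvec(omega)   # exp-map, right-perturbation
    noisy_translation = translation + np.random.randn(3) * sigma_trans
    return noisy_rotation, noisy_translation
```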
- Visual Data Augmentation:
- Scaling Data: Using pretrained DMs (like Stable Diffusion (High-Resolution Image Synthesis with Latent Diffusion Models, 2021)) for inpainting to change textures, objects, or backgrounds, increasing dataset size and diversity for IL/RL (GenAug: Retargeting behaviors to unseen situations via Generative Augmentation, 2023); an inpainting-based augmentation sketch follows this list.
- Sensor Data Reconstruction: Completing partial point clouds or images from sensors using DM-based inpainting, sometimes combined with view planning.
- Object Rearrangement: Generating target scene arrangements from language prompts using text-to-image DMs, often combined with other models like LLMs or NeRFs (DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics, 2022).
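As a hedged example of inpainting-based augmentation with an off-the-shelf pretrained model via Hugging Face diffusers, the snippet below replaces a masked region (e.g., the table surface) according to a text prompt; the checkpoint name, file paths, and prompt are examples, not choices prescribed by the survey:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("frame.png")  # original camera frame
mask = Image.open("mask.png")    # white pixels mark the region to replace

augmented = pipe(
    prompt="a wooden table surface",
    image=image,
    mask_image=mask,
).images[0]
augmented.save("augmented_frame.png")
```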
Experiments and Benchmarks:
The survey lists common benchmarks (CALVIN, RLBench, D4RL Kitchen, Meta-World) and baselines (SE(3)-Diffusion Policy, Diffuser, Diffusion Policy, 3D Diffusion Policy). Most methods are evaluated in simulation, and many are also tested on real robots, though real-world deployment typically requires training on real data or sim-to-real transfer techniques.
Conclusion, Limitations, and Outlook:
DMs excel at modeling multi-modal distributions and handling high-dimensional data, making them powerful tools for robotic manipulation. Key limitations remain:
- Generalizability: Performance often depends heavily on training data quality and diversity (covariate shift in IL, distribution shift in offline RL). Data augmentation helps but has limits. VLAs offer promise but need refinement.
- Sampling Speed: Iterative sampling remains a bottleneck for real-time control, despite improvements like DDIM. Faster samplers need more investigation in robotics contexts.
Future directions include exploring faster sampling methods, improving generalizability through continual learning and foundation model integration, enhancing robustness in complex/occluded scenes, and leveraging semantic reasoning capabilities.