Steering Your Diffusion Policy with Latent Space Reinforcement Learning (2506.15799v1)

Published 18 Jun 2025 in cs.RO and cs.LG

Abstract: Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies -- a state-of-the-art BC methodology -- we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.

Authors (8)
  1. Andrew Wagenmaker (20 papers)
  2. Mitsuhiko Nakamoto (5 papers)
  3. Yunchu Zhang (8 papers)
  4. Seohong Park (18 papers)
  5. Waleed Yagoub (1 paper)
  6. Anusha Nagabandi (10 papers)
  7. Abhishek Gupta (226 papers)
  8. Sergey Levine (531 papers)

Summary

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

This paper introduces 'Diffusion Steering via Reinforcement Learning' (DSRL), a method that autonomously improves behavioral cloning (BC)-trained robotic control policies using sample-efficient real-world reinforcement learning (RL). Central to the method is the adaptation of state-of-the-art diffusion-based BC policies by optimizing over their latent-noise space, rather than modifying the underlying policy weights. This design is motivated by the inefficiencies and complexities of finetuning diffusion policies directly, which is both computation-intensive and complicated by the multi-step sampling process of diffusion models.

Overview of Diffusion Steering

Diffusion policies, which employ state-of-the-art generative modeling techniques such as denoising diffusion probabilistic models (DDPMs) or flow matching, are challenging to adapt after training. Traditional finetuning approaches modify model weights, requiring extensive online interaction data and computational resources. DSRL instead circumvents these issues by altering the distribution of the latent noise fed into the policy, effectively steering the output action distribution toward better task performance.
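
To make the steering idea concrete, the sketch below (a minimal Python illustration, with a hypothetical `denoiser` network standing in for the learned model) shows how a diffusion policy with a deterministic sampler maps an initial latent-noise vector to an action: for a fixed observation, the choice of noise fully determines the output, which is precisely the handle DSRL exploits.

```python
import numpy as np

def diffusion_policy_action(obs, latent_noise, denoiser, num_steps=16):
    """Run a deterministic reverse (denoising) process from noise to action.

    `denoiser(obs, x, t)` is a hypothetical learned network performing one
    denoising step; with a deterministic sampler (e.g., DDIM-style), the
    map from latent noise to action is a fixed function, so choosing the
    noise is enough to steer which action comes out.
    """
    x = latent_noise
    for t in reversed(range(num_steps)):
        x = denoiser(obs, x, t)  # one deterministic denoising step
    return x  # fully denoised sample = action

# Toy stand-in denoiser: pulls the sample toward an observation-dependent
# target, just to make the noise -> action dependence visible.
def toy_denoiser(obs, x, t):
    return x + 0.1 * (obs - x)

obs = np.array([0.5, -0.2])
a1 = diffusion_policy_action(obs, np.zeros(2), toy_denoiser)
a2 = diffusion_policy_action(obs, np.ones(2), toy_denoiser)
# a1 != a2: same observation, different latent noise, different action.
```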

DSRL exploits the structure of the diffusion model's latent-noise space, conducting RL over this space to optimize performance in new environments. The method is highly sample-efficient and supports black-box adaptation: it requires only actions sampled from the policy, with no access to its internal weights. This substantially reduces computational overhead and makes the approach applicable to a broad range of settings, including scenarios where only API-level access to the BC policy is available.
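
A minimal sketch of this loop follows, under assumed interfaces (`bc_policy`, `noise_agent`, and the Gym-style `env` are illustrative names, not the paper's code): the RL agent's action space is the latent-noise space, and the frozen BC policy is queried purely as a black box.

```python
def dsrl_rollout(env, bc_policy, noise_agent, horizon=200):
    """Collect one episode in which RL chooses latent noise, not raw actions.

    All interfaces here are illustrative assumptions: `bc_policy(obs, w)`
    is the frozen black-box BC policy (could be a remote API call) mapping
    observation + latent noise to a robot action, `noise_agent` is any
    off-the-shelf RL learner acting in the noise space, and `env` follows
    a Gym-style reset/step protocol.
    """
    obs = env.reset()
    transitions = []
    for _ in range(horizon):
        w = noise_agent.select_noise(obs)          # RL action = latent noise
        action = bc_policy(obs, w)                 # black-box: noise -> action
        next_obs, reward, done = env.step(action)  # real environment step
        # From the RL agent's perspective, the environment's action space
        # *is* the latent-noise space; these transitions train the agent.
        transitions.append((obs, w, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions
```

Because only `bc_policy(obs, w)` queries are needed, the base policy's weights never change, which is what makes the black-box and API-only settings feasible.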

Experimental Results

DSRL is demonstrated across multiple simulated and real-world domains. In experiments on robotic manipulation tasks from the Robomimic benchmark and continuous-control tasks from OpenAI Gym, DSRL showed substantial gains in sample efficiency over existing finetuning methods, achieving comparable or superior performance with notably fewer episodes. The paper reports that, in some cases, DSRL raised policy success rates from 20% to 90% with fewer than 50 episodes of online interaction.

Furthermore, DSRL was evaluated on more complex real-world scenarios involving multi-task robotic arms, where it adapted pretrained policies to achieve state-of-the-art performance on novel tasks. An essential highlight is its ability to enhance the behavior of existing generalist policies, particularly the $\pi_0$ model: DSRL significantly improved $\pi_0$'s task success rate, reinforcing the method's potential for reliable and robust policy adaptation in diverse deployment environments.

Implications and Future Developments

The primary theoretical implication of DSRL is its potential to simplify the adaptation of highly complex models, enabling online policy improvement without conventional finetuning. Practically, the approach suggests that robust autonomous adaptation can become more accessible, allowing practitioners to deploy models with the assurance that they can self-improve in their operating environments.

Future research may explore the broader applicability of latent-space optimization across different domains, such as RL tasks with high-dimensional input spaces or even creative domains like procedural content generation. Additionally, understanding the theoretical limits on policy expressiveness when acting only through the noise space could further clarify when the approach is robust.

The paper also opens avenues for integrating DSRL with other RL subfields, such as meta-reinforcement learning to improve agent generalization, or exploration strategies for data-sparse environments. The approach might further extend beyond robotics to application areas such as autonomous navigation and intelligent systems management.
