Score Regularized Policy Optimization through Diffusion Behavior (2310.07297v3)

Published 11 Oct 2023 in cs.LG

Abstract: Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.

Citations (12)

Summary

  • The paper introduces SRPO, an offline RL algorithm that regularizes policy gradients using score functions from pre-trained diffusion models.
  • It extracts a deterministic policy from a critic and diffusion model to bypass the slow iterative diffusion sampling process.
  • Experiments demonstrate state-of-the-art performance and a 25x boost in action sampling speed on D4RL locomotion tasks.

The paper "Score Regularized Policy Optimization through Diffusion Behavior" introduces a new offline reinforcement learning (RL) algorithm, SRPO, that leverages the strengths of diffusion models for representing complex behavior policies while avoiding their computational bottleneck during action sampling.

Problem: Offline RL aims to learn policies from pre-collected datasets without further environment interaction. Behavior regularization is crucial for keeping the learned policy close to the data distribution. Diffusion models are powerful for representing complex, multimodal behavior policies, but their slow iterative sampling process hinders their practical use in RL, especially in computationally sensitive domains like robotics.

Proposed Solution (SRPO): The core idea is to extract a simple, deterministic policy from a critic (Q-function) and a pre-trained diffusion behavior model. Instead of directly sampling from the diffusion policy, SRPO regularizes the policy gradient using the score function of the behavior distribution, which is effectively approximated by the pre-trained diffusion model. This avoids diffusion sampling during both training and evaluation.
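
Schematically, and with notation assumed here rather than taken verbatim from the paper ($Q_\psi$ a learned critic, $\mu$ the behavior policy, $\beta$ a temperature coefficient), the extraction objective and the score substitution look like:

$$
\max_{\theta}\; \mathbb{E}_{s \sim \mathcal{D}}\Big[\, Q_{\psi}\big(s, \pi_{\theta}(s)\big) \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot \mid s) \,\|\, \mu(\cdot \mid s)\big) \Big],
\qquad
\nabla_{a_t} \log \mu_t(a_t \mid s) \;\approx\; -\,\frac{\epsilon_{\phi}(a_t, s, t)}{\sigma_t},
$$

where $\epsilon_{\phi}$ is the pretrained diffusion behavior model's noise predictor, $a_t = \alpha_t a + \sigma_t \epsilon$ is a diffused action, and $\mu_t$ is the behavior distribution at noise level $t$. Because the KL gradient only needs this score, no iterative diffusion sampling is required.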

Key Contributions and Technical Details:

  1. Score Regularized Policy Optimization: The paper derives a policy optimization objective where the gradient of the KL divergence between the learned policy and the behavior policy is related to the score function of the behavior distribution. The pre-trained diffusion model is then used to approximate this score function, directly regularizing the policy gradient.
  2. Deterministic Policy Extraction: SRPO extracts a deterministic policy to avoid iterative diffusion sampling. The reverse-KL objective encourages mode-seeking behavior during optimization.
  3. Practical Algorithm: The paper provides a practical algorithm combining SRPO with Implicit Q-Learning (IQL) and continuous-time diffusion behavior modeling.
  4. Techniques Inspired by Text-to-3D Research: Analogous to DreamFusion, the paper ensembles score approximations across different diffusion times to exploit the pretrained behavior model more fully. It also employs a baseline term (subtracting the injected action noise) to reduce variance in gradient estimation, stabilizing training (a sketch of this update follows this list).
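
A minimal PyTorch-style sketch of how such a score-regularized update could look. All interfaces are assumptions for illustration rather than the paper's reference implementation: `eps_model(a_t, s, t)` is the frozen pretrained noise predictor, `q_net(s, a)` a critic that is not updated here, `actor(s)` the deterministic policy, and the noise schedule and time weighting are simple placeholder choices.

```python
import torch

def srpo_style_update(actor, q_net, eps_model, states, beta, actor_opt, n_times=4):
    """One policy-extraction step (sketch): maximize Q while regularizing the
    gradient with behavior scores from a frozen diffusion model."""
    # Assumed VP-style schedule: alpha_t = sqrt(1 - t^2), sigma_t = t.
    alpha = lambda t: (1 - t ** 2).sqrt()
    sigma = lambda t: t

    actions = actor(states)                        # deterministic actions pi_theta(s)
    q_loss = -q_net(states, actions).mean()        # Q-maximization term (critic not updated)

    # Score regularization, ensembled over random diffusion times
    # (the DreamFusion-style score-distillation analogy).
    reg = 0.0
    for _ in range(n_times):
        t = torch.rand(states.shape[0], 1, device=states.device) * 0.96 + 0.02
        noise = torch.randn_like(actions)
        a_t = alpha(t) * actions + sigma(t) * noise    # diffuse the policy's action
        with torch.no_grad():
            eps_pred = eps_model(a_t, states, t)       # behavior noise prediction
        # Subtracting the injected noise is the variance-reducing baseline:
        # it leaves the expected gradient unchanged but stabilizes training.
        w_t = sigma(t) ** 2                            # assumed weighting over times
        reg = reg + (w_t * (eps_pred - noise) * actions).sum(dim=-1).mean()

    loss = q_loss + beta * reg / n_times
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```

Because `eps_pred - noise` is detached, its only effect is to steer the gradient flowing through `actions`, which is exactly the gradient-level behavior regularization described above.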

Methodology:

  1. Pre-train Diffusion Behavior Model: A conditional diffusion model is trained to approximate the behavior policy from the offline dataset. The model predicts noise added to diffused action samples.
  2. Train Q-Networks: Implicit Q-Learning (IQL) is used to estimate the Q-function via expectile regression (both pretraining losses are sketched after this list).
  3. Extract Deterministic Policy: The deterministic policy is optimized using the gradient of the Q-function, regularized by the score function of the behavior policy, which is approximated using the pretrained diffusion model.
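
For completeness, a compact sketch of the two pretraining losses behind steps 1 and 2, under the same assumed interfaces and schedule as the earlier sketch (step 3 then reuses the score-regularized update shown above). The Q-network's own TD regression toward $r + \gamma V(s')$ is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def diffusion_behavior_loss(eps_model, states, actions):
    """Step 1 (sketch): conditional denoising loss -- predict the noise
    that was added to dataset actions at a random diffusion time."""
    alpha = lambda t: (1 - t ** 2).sqrt()   # assumed schedule, as before
    sigma = lambda t: t
    t = torch.rand(actions.shape[0], 1, device=actions.device) * 0.96 + 0.02
    noise = torch.randn_like(actions)
    a_t = alpha(t) * actions + sigma(t) * noise
    return F.mse_loss(eps_model(a_t, states, t), noise)

def iql_value_loss(q_net, v_net, states, actions, tau=0.7):
    """Step 2 (sketch): IQL expectile regression of V toward Q, i.e. the
    asymmetric squared error |tau - 1(u < 0)| * u^2."""
    with torch.no_grad():
        target = q_net(states, actions)
    diff = target - v_net(states)
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()
```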

Experiments and Results:

  1. D4RL Benchmarks: Evaluated SRPO on D4RL locomotion and AntMaze tasks. Achieved state-of-the-art performance in locomotion tasks, comparable to other diffusion-based methods but with significantly improved computational efficiency.
  2. Computational Efficiency: Demonstrated a 25x or greater boost in action sampling speed and significantly lower computational cost compared to other diffusion-based methods.
  3. 2D Bandit Experiments: Showed that SRPO successfully constrains the learned policy to various complex behavior distributions in 2D environments. These experiments illustrate how varying a temperature parameter modulates the transition between a purely greedy (Q-function-maximizing) policy and a conservative behavior-regularized policy (the two limiting cases are written out after this list). They also compare SRPO against other behavior-constrained methods such as BCQ and BEAR.
  4. Ablation Studies: Analyzed the impact of different implementation details, such as the weighting function for ensembling diffusion times and the baseline term for variance reduction.
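
In the notation assumed earlier (with $\beta$ weighting the KL term), the two limits that the 2D experiments illustrate are:

$$
\beta \to 0:\;\; \pi_{\theta}(s) \to \arg\max_{a} Q_{\psi}(s, a) \;\;\text{(purely greedy)},
\qquad
\beta \to \infty:\;\; \pi_{\theta}(s) \to \text{a mode of } \mu(\cdot \mid s) \;\;\text{(pure behavior regularization)}.
$$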

Key Advantages:

  • Computational Efficiency: Avoids diffusion sampling during both training and evaluation, enabling faster action sampling and reducing computational cost.
  • Strong Generative Capability: Leverages diffusion models for representing complex and heterogeneous behavior policies.
  • Mode-Seeking Behavior: The reverse-KL objective drives the extracted deterministic policy toward a high-density mode of the behavior distribution rather than its mean.
  • Improved Stability: Utilizes baseline techniques to reduce variance and stabilize training.

In summary, SRPO is a novel and efficient offline RL algorithm that bridges the gap between the expressiveness of diffusion models and the computational constraints of real-world applications. The method presents a clever way to achieve behavior regularization at the gradient level, leading to significant speedups and making diffusion-based offline RL more practical.