- The paper introduces CFGRL, a framework that fuses diffusion model guidance with reinforcement learning to controllably improve policies via a tunable guidance weight.
- It demonstrates that a product policy, combining a reference policy and an optimality factor, can be systematically enhanced by adjusting the guidance weight during sampling.
- CFGRL achieves notable performance gains in offline RL and goal-conditioned behavioral cloning, offering a practical, test-time adjustable alternative to traditional policy extraction methods.
Reinforcement learning (RL) faces significant challenges in scaling, particularly when learning from offline datasets, which often contain suboptimal trajectories. In contrast, generative modeling techniques like diffusion models have demonstrated remarkable scalability and training stability, finding applications in areas like behavioral cloning. The paper "Diffusion Guidance Is a Controllable Policy Improvement Operator" (arXiv:2505.23458) introduces a framework, CFGRL, that aims to combine the policy-improvement capabilities of RL with the simple, scalable training of generative models.
The core idea is to establish a direct relationship between policy improvement and the guidance mechanism used in diffusion models. The paper defines policies as a product of a reference policy π̂(a|s) and an "optimality" factor f(A(s,a)), where f is a non-negative, monotonically non-decreasing function of the advantage A(s,a) under the reference policy π̂. That is, π(a|s) ∝ π̂(a|s) f(A(s,a)). The paper proves that under these conditions on f, the resulting product policy π is an improvement over π̂. Furthermore, for the attenuated family π_w(a|s) ∝ π̂(a|s) f(A(s,a))^w, any pair w₂ > w₁ ≥ 0 makes π_w₂ an improvement over π_w₁. This implies that increasing the weight w can lead to further policy improvement, although it also increases the deviation from the reference policy. This trade-off between optimality and adherence is central to many offline RL algorithms.
The key practical insight is that sampling from such product distributions can be achieved naturally using diffusion guidance. Diffusion models learn the score function ∇_a log p(a), and for a product distribution p(a) ∝ p₁(a)p₂(a) the score is additive: ∇_a log p(a) = ∇_a log p₁(a) + ∇_a log p₂(a). The paper leverages classifier-free guidance, a technique from generative modeling, to compose the reference-policy factor and the optimality factor. Instead of explicitly learning the optimality distribution p(o|s,a), it can be handled implicitly by training a policy conditioned on an optimality variable o, denoted π̂(a|s,o). Using Bayes' rule, the score of the product policy can be expressed as:
∇_a log π(a|s) = ∇_a log π̂(a|s) + w (∇_a log π̂(a|s,o) − ∇_a log π̂(a|s))
Here, w is a guidance weight. The term ∇_a log π̂(a|s) is the score of the unconditional reference policy, and (∇_a log π̂(a|s,o) − ∇_a log π̂(a|s)) represents the contribution of the optimality conditioning. A single network can be trained to represent both the conditional and unconditional policies, and the guidance weight w can be adjusted during sampling to control the degree of policy improvement. This means the trade-off between adherence to the prior and optimality can be tuned at test time without retraining.
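The additive-score identity is easy to verify in closed form for one-dimensional Gaussians, since a product of Gaussian densities is again Gaussian. The toy check below is ours, not code from the paper:

```python
import numpy as np

def gauss_score(a, mu, var):
    # d/da log N(a; mu, var) = -(a - mu) / var
    return -(a - mu) / var

mu1, v1 = 0.0, 1.0   # stands in for the reference factor
mu2, v2 = 2.0, 0.5   # stands in for the optimality factor

# A product of Gaussians is Gaussian with precision-weighted parameters.
v_prod = 1.0 / (1.0 / v1 + 1.0 / v2)
mu_prod = v_prod * (mu1 / v1 + mu2 / v2)

a = 0.7
lhs = gauss_score(a, mu_prod, v_prod)                    # score of the product
rhs = gauss_score(a, mu1, v1) + gauss_score(a, mu2, v2)  # sum of factor scores
```

The two quantities agree, which is exactly what lets guidance compose independently trained (or jointly trained) score factors.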
CFGRL implements this by training a single diffusion model (specifically, a flow model) to predict the velocity field v_θ(a_t, t, s, o), where a_t is a partially noised action, t is the noise scale, s is the state, and o is the optimality variable. The network is trained with a standard diffusion/flow-matching objective:
L(θ) = E_{s,a∼D}[‖v_θ(a_t, t, s, o) − (a − a_0)‖²]
where a_t = (1−t)a_0 + t·a, t ∼ U(0,1), and a_0 ∼ N(0, I). The optimality variable o is typically a binary indicator. For policy extraction in offline RL, o = 1 can be defined for data points (s,a) where the learned advantage satisfies A(s,a) ≥ 0, and o = 0 otherwise. Crucially, the training loss uses uniform weighting, avoiding the peaked gradients seen in methods like Advantage-Weighted Regression (AWR).
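A one-sample version of this objective can be sketched as follows; the function and argument names are illustrative, not the paper's API:

```python
import numpy as np

def flow_matching_loss(v_theta, s, a, o, a0, t):
    """CFGRL training loss for one (s, a, o) tuple (sketch).
    v_theta: callable (a_t, t, s, o) -> predicted velocity.
    a0: noise sample from N(0, I); t: noise scale in [0, 1]."""
    a_t = (1.0 - t) * a0 + t * a   # linear interpolation, noise -> action
    target = a - a0                # flow-matching velocity target
    pred = v_theta(a_t, t, s, o)
    # Uniform weighting: every datapoint contributes equally, unlike AWR.
    return float(np.mean((pred - target) ** 2))
```

During training, a0 ~ N(0, I) and t ~ U(0, 1) would be resampled per example, with o set by the binary advantage indicator described above.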
The sampling process involves initializing with noise and iteratively applying the velocity field vθ, incorporating the guidance as shown in Algorithm 2:
Algorithm: CFGRL Sampling

    Input: state s, optimality condition o (e.g., o = 1 for optimal actions), guidance weight w
    Initialize action a ~ N(0, I)
    Initialize time t = 0
    For n in [0, ..., N-1]:
        v_uncond = v_theta(a, t, s, empty_o)   # empty_o denotes unconditional conditioning
        v_cond   = v_theta(a, t, s, o)
        v_guided = (1 - w) * v_uncond + w * v_cond
        a = a + (1/N) * v_guided               # simplified Euler velocity update
        t = t + 1/N                            # simplified uniform time update
    Return a
(Note: The paper's Algorithm 2 provides a simplified velocity/time update rule for illustration; actual diffusion/flow sampling uses more complex schedules).
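A runnable version of this sampler, using plain Euler integration with the uniform 1/N step above, might look like the following sketch (names and defaults are illustrative, not the paper's API):

```python
import numpy as np

def cfgrl_sample(v_theta, s, o, w, n_steps=16, action_dim=2, seed=0):
    """Sample an action by integrating the guided velocity field (sketch).
    v_theta(a, t, s, o) -> velocity; passing o=None selects the
    unconditional branch (the "empty" conditioning)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(action_dim)   # start from noise, a ~ N(0, I)
    dt = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        v_uncond = v_theta(a, t, s, None)             # reference-policy velocity
        v_cond = v_theta(a, t, s, o)                  # optimality-conditioned velocity
        v_guided = (1.0 - w) * v_uncond + w * v_cond  # classifier-free guidance
        a = a + dt * v_guided                         # Euler step along the flow
        t += dt
    return a
```

With w = 0 this integrates only the reference flow, w = 1 follows the conditional flow, and w > 1 extrapolates past it.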
Applications in Offline RL:
A common practice in offline RL is to first learn a Q-function and then extract a policy that maximizes this Q-function while staying close to the data distribution (e.g., using AWR). AWR trains a policy with a weighted supervised objective: J_AWR(θ) = E_{(s,a)∼D}[log π_θ(a|s) · exp(A(s,a)/β)]. This objective can suffer from highly variable weights, concentrating the learning signal on a few high-advantage examples.
CFGRL offers an alternative policy extraction method. By setting the optimality variable o based on A(s,a) ≥ 0, CFGRL trains a conditional diffusion model without explicit data weighting. The guidance weight w in CFGRL serves a purpose similar to AWR's inverse temperature 1/β, controlling the trade-off between prior adherence and optimality; unlike AWR, however, CFGRL allows tuning w at test time. Experiments on ExORL and OGBench tasks show that CFGRL generally achieves higher performance than AWR, suggesting its policy extraction mechanism is more effective.
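The difference in learning signal can be illustrated on toy advantage values (the numbers here are invented for illustration):

```python
import numpy as np

adv = np.array([-2.0, -0.5, 0.1, 3.0])   # toy advantages for four datapoints
beta = 0.5

# AWR: exponential weights concentrate the gradient on rare high-advantage samples.
awr_weights = np.exp(adv / beta)
awr_share = awr_weights / awr_weights.sum()

# CFGRL: a binary optimality label; every datapoint keeps uniform loss weight.
o = (adv >= 0).astype(int)
```

In this toy case the single largest-advantage sample captures over 99% of the AWR weight mass, while CFGRL simply labels the two non-negative-advantage samples o = 1 and trains on all four uniformly.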
Applications in Goal-Conditioned Behavioral Cloning (GCBC):
GCBC is a simple method for goal-conditioned RL that avoids explicit value function learning by treating goal reaching as the "optimality" signal. It trains a policy π(a|s,g) by maximizing the likelihood of actions that lead to state g in the future. While simple, it inherits any suboptimality present in the data. The paper shows that standard GCBC is implicitly a CFGRL policy with w=1, where the optimality o is related to reaching the goal g. By using CFGRL with w>1, improvement over the naive GCBC policy can be obtained. This improvement is achieved simply by contrasting the goal-conditioned policy with an unconditional policy, without needing to train a value function. Experiments on OGBench state-based and visual tasks demonstrate that CFGRL consistently outperforms BC, Flow BC, GCBC, and Flow GCBC, often by a significant margin. Hierarchical versions of CFGRL also show strong performance gains.
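Under this view, the per-step guidance computation needs only two forward passes and no value function. A minimal sketch, assuming the goal g plays the role of the optimality condition and that names are illustrative:

```python
import numpy as np

def gcbc_guided_velocity(v_theta, a, t, s, g, w):
    """Contrast the goal-conditioned flow against the goal-free flow (sketch).
    w = 1 recovers plain GCBC; w > 1 extrapolates toward goal-reaching
    behavior. Passing g=None selects the unconditional (goal-free) branch."""
    v_goal = v_theta(a, t, s, g)      # velocity conditioned on reaching g
    v_free = v_theta(a, t, s, None)   # unconditional velocity
    return v_free + w * (v_goal - v_free)
```

This is the same guidance rule as in offline RL extraction, with the goal standing in for the binary optimality variable.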
Implementation Considerations:
- CFGRL relies on training a diffusion or flow model, which can be computationally more intensive than training simple feedforward policies as in traditional BC or AWR, especially in terms of sampling time (multiple steps are required).
- The optimality variable o needs to be defined based on the task or available data. For offline RL with a learned value function, A(s,a)≥0 is a natural choice. For GCBC, o is essentially tied to the goal g.
- The guidance weight w is a key hyperparameter. While it can be swept at test time, finding the optimal w might still require running multiple evaluations. The paper shows performance generally increases with w up to a point before distribution shift becomes detrimental.
- The choice of base generative model (diffusion, flow) and its architecture (MLP for state, CNN for pixels) affects performance and resource requirements. The paper uses flow matching, finding it effective.
- When implementing, ensure the conditional and unconditional inputs to the velocity network are correctly handled, often via a learnable embedding for the optimality variable, as suggested in the ablation studies.
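One way to realize the last point is an embedding table with a dedicated "null" row for the unconditional branch. The layout below is an assumption for illustration, not the paper's exact architecture:

```python
import numpy as np

class OptimalityEmbedding:
    """Embedding for the optimality variable o (sketch).
    Rows 0 and 1 encode o = 0 / o = 1; a third, learnable "null" row is
    fed to the network on the unconditional pass, so a single model
    represents both the conditional and unconditional policies."""
    NULL_IDX = 2

    def __init__(self, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        # In practice this table would be trained jointly with the network.
        self.table = 0.01 * rng.standard_normal((3, dim))

    def __call__(self, o):
        idx = self.NULL_IDX if o is None else int(o)
        return self.table[idx]
```

During training, conditioning is randomly dropped (o replaced by the null row) for a fraction of samples so that both branches of the guidance formula are learned by one network.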
CFGRL provides a practical, simple-to-train method for policy improvement by leveraging generative model guidance. It offers a plug-and-play replacement for policy extraction in offline RL and unlocks performance gains in goal-conditioned imitation learning without requiring value functions. The ability to control improvement at test time via the guidance weight is a significant practical advantage. The code repository at https://github.com/kvfrans/cfgrl provides implementations for reproducing the experiments and serves as a starting point for applying CFGRL.