Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies

Published 13 Jan 2026 in cs.LG and eess.SY | (2601.08136v1)

Abstract: Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty in online RL is the lack of direct samples from the target distribution; instead, the target is an unnormalized Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which utilizes a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. Yet, it remains unclear how these objectives relate formally or if they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that effectively reduce importance sampling variance. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and enables the principled combination of Q-value and Q-gradient information to derive an optimal, minimum-variance estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL, and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.


Authors (3)

Summary

The paper introduces RFM, a unified framework that trains both diffusion and flow policies by approximating unnormalized Boltzmann distributions in online RL.
It leverages a reverse inferential perspective and Langevin Stein control variates to reduce estimator variance and stabilize policy optimization.
Empirical studies show that RFM delivers superior performance and reduced variance across continuous control tasks compared to existing baselines.

Reverse Flow Matching: A Unified Approach for Training Diffusion and Flow Policies in Online Reinforcement Learning

Introduction and Motivation

Recent advances in generative modeling, particularly diffusion and flow models, have made policy classes for sequential decision-making considerably more expressive. These approaches have achieved strong results in settings such as imitation learning and offline RL, where direct samples from the target action distribution are accessible. The online RL regime presents a fundamental challenge: the target policy distribution, typically a Boltzmann distribution over actions induced by the learned $Q$-function, is unnormalized and intractable to sample from directly. This impedes the adoption of diffusion and flow policies in online RL, where existing training objectives either suffer from high variance or bias, or are tightly coupled to a specific parameterization and do not generalize across diffusion and flow architectures.

This paper introduces Reverse Flow Matching (RFM), a unified, statistically rigorous framework for directly training both diffusion and flow models to approximate unnormalized Boltzmann action distributions in online RL. RFM views policy training through a reverse inferential lens, transforming the challenge of missing target samples into a tractable posterior mean estimation problem. It further employs control variates based on Langevin Stein operators to reduce estimator variance, enabling stable and efficient policy optimization. The framework unifies prior approaches, revealing noise-expectation and gradient-expectation methods as special cases, and enables, for the first time, principled minimum-variance estimators that combine $Q$-value and $Q$-gradient information.

Technical Framework

Online RL with Boltzmann-Induced Policies

Let $Q(s, a)$ denote the soft state-action value function, and consider maximum entropy RL, where the policy update step aims to match the Boltzmann distribution $\pi_{\text{new}}(a \mid s) \propto \exp\left(\frac{1}{\lambda} Q(s, a)\right)$, with temperature $\lambda > 0$. Direct sampling from this distribution is infeasible because its normalizing constant is unknown, and the naive alternative of differentiating through a diffusion- or flow-based policy sampler incurs significant computational overhead and instability.
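To make the Boltzmann target concrete, the following minimal sketch normalizes $\exp(Q/\lambda)$ over a discretized one-dimensional action grid. The critic `toy_q`, the grid, and the temperatures are hypothetical choices for illustration and are not taken from the paper. Normalization is only possible here because the action space has been reduced to a small grid; in continuous, higher-dimensional action spaces the normalizer is not available, which is the difficulty RFM is designed to avoid.

```python
import math

def boltzmann_on_grid(q_values, lam):
    """Normalize exp(Q / lam) over a finite grid of actions (max-shifted for stability)."""
    logits = [q / lam for q in q_values]
    m = max(logits)
    unnorm = [math.exp(l - m) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def toy_q(a):
    """Hypothetical two-mode critic: higher mode at a = -1, lower mode at a = +1."""
    return max(1.0 - (a + 1.0) ** 2, 0.5 - (a - 1.0) ** 2)

actions = [-2.0 + 0.05 * i for i in range(81)]   # grid on [-2, 2]
q_vals = [toy_q(a) for a in actions]
pi_warm = boltzmann_on_grid(q_vals, lam=1.0)     # high temperature: spread out
pi_cold = boltzmann_on_grid(q_vals, lam=0.1)     # low temperature: concentrated
best = max(range(len(actions)), key=lambda i: q_vals[i])
```

Lowering `lam` concentrates probability mass on the highest-value action while the second mode keeps nonzero mass, which is the kind of multi-modal target an expressive policy is meant to represent.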

Reverse Flow Matching: Problem Formulation

RFM circumvents the direct sampling bottleneck by inverting the generative process: rather than constructing noisy samples from a known source and target, it infers the unknown endpoint (source or target) given the intermediate noisy sample $x_t$. Using linear interpolation between source $x_0$ and target $x_1$ with schedule $(\alpha_t, \beta_t)$, so that $x_t = \alpha_t x_1 + \beta_t x_0$, RFM formulates the posterior distribution over the source as $q_{0|t}^*(x_0 \mid x_t) \propto p_0(x_0)\,p_1\left(\frac{1}{\alpha_t}x_t - \frac{\beta_t}{\alpha_t}x_0\right)$. Here, $p_0$ is the source distribution and $p_1$ is the unnormalized Boltzmann distribution. Policy training becomes regression of the velocity field toward the posterior mean of the endpoints, which is itself estimated via sampling from $q_{0|t}^*$. This reformulation applies to both flow (deterministic ODE) and diffusion (stochastic SDE) models.
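The posterior mean above can be approximated without target samples by drawing proposals from the source and weighting them by the unnormalized target. The sketch below is an illustration under stated assumptions, not the paper's implementation: it takes $p_0 = \mathcal{N}(0, 1)$ as the proposal, reads the interpolation $x_t = \alpha_t x_1 + \beta_t x_0$ off the argument of $p_1$ in the formula above, and uses a hypothetical quadratic critic so that the answer can be checked against a closed form. The function name and all numeric values are illustrative.

```python
import math
import random

def snis_posterior_mean(x_t, alpha, beta, log_p1_unnorm, n=20000, seed=0):
    """Self-normalized importance sampling estimate of E[x_0 | x_t], proposal p_0 = N(0, 1).

    Each proposal x_0 is weighted by the unnormalized target density at the
    implied endpoint x_1 = (x_t - beta * x_0) / alpha.
    """
    rng = random.Random(seed)
    x0s = [rng.gauss(0.0, 1.0) for _ in range(n)]
    log_w = [log_p1_unnorm((x_t - beta * x0) / alpha) for x0 in x0s]
    m = max(log_w)                                  # shift for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    return sum(wi * x0 for wi, x0 in zip(w, x0s)) / sum(w)

# Hypothetical critic Q(a) = -(a - mu)^2 / 2, so p_1 is N(mu, lam) up to a constant.
mu, lam = 1.0, 1.0
log_p1 = lambda a: -((a - mu) ** 2) / (2.0 * lam)

alpha, beta, x_t = 0.5, 0.5, 0.8
estimate = snis_posterior_mean(x_t, alpha, beta, log_p1)
# Closed-form posterior mean, valid only for this Gaussian toy case with lam = 1.
exact = beta * (x_t - alpha * mu) / (alpha ** 2 + beta ** 2)
```

Only evaluations of the unnormalized log-density are needed, because the unknown normalizer cancels in the ratio of weighted sums.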

Unified Objective and Prior Methods

The RFM objective encompasses both noise-expectation training (an average over noise that uses only $Q$ values) and gradient-expectation training (an average over $Q$ gradients) as instances of a more general posterior mean estimator. A coefficient $\eta \in [0,1]$ interpolates between the two, so the estimator linearly combines $Q$-based and $\nabla Q$-based targets. Prior approaches correspond to fixed values of $\eta$: $\eta=0$ recovers noise-expectation methods [ma2025efficient, dong2025maximum], and $\eta=1$ recovers gradient-expectation methods [akhound2024iterated, jain2025sampling]. The optimal $\eta$ (or, more generally, a diagonal weighting $\Lambda$) can be computed to minimize estimator variance, which substantially stabilizes policy optimization.
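One way to see why a single coefficient can interpolate between the two families is the short derivation below. It is a reconstruction under assumptions that the summary does not state explicitly: a standard Gaussian source $p_0 = \mathcal{N}(0, I)$, the target $p_1 \propto \exp(Q/\lambda)$, and enough regularity for the Stein identity to hold. The paper's exact parameterization, sign conventions, and weighting $\Lambda$ may differ.

```latex
% Score of the posterior q^*_{0|t}(x_0 | x_t) with x_1 = (x_t - \beta_t x_0) / \alpha_t:
\nabla_{x_0} \log q_{0|t}^*(x_0 \mid x_t)
  = -x_0 - \frac{\beta_t}{\alpha_t \lambda}\, \nabla_a Q(s, x_1).

% Stein identity: the score has zero mean under the posterior.
\mathbb{E}_{q_{0|t}^*}\!\left[ \nabla_{x_0} \log q_{0|t}^*(x_0 \mid x_t) \right] = 0.

% Adding \eta times the zero-mean score leaves the posterior mean unchanged:
\mathbb{E}_{q_{0|t}^*}[x_0]
  = \mathbb{E}_{q_{0|t}^*}\!\left[ (1 - \eta)\, x_0
      - \eta\, \frac{\beta_t}{\alpha_t \lambda}\, \nabla_a Q(s, x_1) \right].
```

Under these assumptions, $\eta = 0$ leaves a pure average over noise and $\eta = 1$ leaves a pure average over $Q$ gradients. Every $\eta$ gives the same expectation, so the coefficient can be chosen solely to reduce variance.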

Variance Reduction via Langevin Stein Control Variates

Estimating posterior means under $q_{0|t}^*$ is nontrivial because $p_1$ is unnormalized. RFM introduces Langevin Stein operators to construct zero-mean control variates on top of self-normalized importance sampling (SNIS), yielding a family of estimators of the form $X_0 + \text{Stein control variate}$, where the Stein operator uses the gradients of both the source and target log-densities. The coefficients of the control variate are derived analytically to minimize estimator variance. This variance-reduction mechanism is theoretically justified and empirically shown to be crucial for stable, high-performance training.
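The effect of such a control variate can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's estimator. It assumes a standard Gaussian source, a hypothetical quadratic critic $Q(a) = -(a - \mu)^2/2$, and a scalar coefficient `eta` multiplying the posterior score. It reports the weighted spread of the per-sample targets as a proxy for estimator variance. The posterior is Gaussian in this toy case, so the optimal coefficient has a closed form; that will not hold for a general critic.

```python
import math
import random

def snis_with_stein_cv(x_t, alpha, beta, mu, lam, eta, n=5000, seed=0):
    """SNIS estimate of E[x_0 | x_t] from per-sample targets x_0 + eta * g(x_0).

    g is the score of the posterior over x_0, with p_0 = N(0, 1) and
    p_1 proportional to exp(Q / lam), Q(a) = -(a - mu)^2 / 2 (toy critic).
    Returns (estimate, weighted variance of the per-sample targets).
    """
    rng = random.Random(seed)
    x0s = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x1s = [(x_t - beta * x0) / alpha for x0 in x0s]
    log_w = [-((x1 - mu) ** 2) / (2.0 * lam) for x1 in x1s]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    # Score: -x_0 from the source, plus grad Q(x_1) / lam times dx_1/dx_0 = -beta / alpha.
    g = [-x0 + (beta / (alpha * lam)) * (x1 - mu) for x0, x1 in zip(x0s, x1s)]
    targets = [x0 + eta * gi for x0, gi in zip(x0s, g)]
    mean = sum(wi * ti for wi, ti in zip(w, targets)) / z
    var = sum(wi * (ti - mean) ** 2 for wi, ti in zip(w, targets)) / z
    return mean, var

alpha, beta, x_t, mu, lam = 0.5, 0.5, 0.8, 1.0, 1.0
# Optimal coefficient for this Gaussian toy case only (equals the posterior variance).
eta_star = alpha ** 2 * lam / (alpha ** 2 * lam + beta ** 2)
noise_mean, noise_var = snis_with_stein_cv(x_t, alpha, beta, mu, lam, eta=0.0)
grad_mean, grad_var = snis_with_stein_cv(x_t, alpha, beta, mu, lam, eta=1.0)
best_mean, best_var = snis_with_stein_cv(x_t, alpha, beta, mu, lam, eta=eta_star)
```

In this toy setting, `eta=0.0` and `eta=1.0` give noisy estimates of the same posterior mean, while `eta_star` makes every per-sample target equal to it. With a non-Gaussian posterior the variance would be reduced rather than eliminated, and the coefficient would have to be estimated.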

Practical Algorithm and Empirical Evaluation

Integration with Online RL and Flow Policies

RFM is instantiated with flow-matching velocity parameterizations and evaluated on continuous-control environments from the DeepMind Control Suite. Actions are sampled via ODE integration with the learned velocity field. The critic uses a double Q-network architecture; policy and Q-network updates proceed with standard off-policy RL sampling, and posterior means for RFM losses are computed with SNIS augmented by learned Stein control variates.
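Action sampling by ODE integration can be sketched with a fixed-step Euler solver. The example below is a hedged illustration rather than the paper's sampler: the solver, step count, and the schedule $\alpha_t = t$, $\beta_t = 1 - t$ are assumptions, and `toy_velocity` is an analytic stand-in for the learned velocity network, valid only for a Gaussian source and a Gaussian target.

```python
import math
import random

def sample_action(velocity, n_steps=100, rng=random):
    """Draw one action by Euler-integrating dx/dt = velocity(x, t) from t = 0 to t = 1."""
    x = rng.gauss(0.0, 1.0)          # source sample x_0 ~ N(0, 1)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x += dt * velocity(x, k * dt)
    return x

def toy_velocity(x, t, mu=1.0):
    """Exact marginal velocity carrying N(0, 1) to N(mu, 1) under x_t = t*x_1 + (1-t)*x_0.

    Stand-in for a learned network; derived as E[x_1 - x_0 | x_t = x] for
    independent Gaussian endpoints.
    """
    return mu + (2.0 * t - 1.0) * (x - t * mu) / (t ** 2 + (1.0 - t) ** 2)

rng = random.Random(0)
actions = [sample_action(toy_velocity, rng=rng) for _ in range(4000)]
mean = sum(actions) / len(actions)
std = math.sqrt(sum((a - mean) ** 2 for a in actions) / len(actions))
```

The sampled actions should have mean close to `mu` and standard deviation close to 1, up to Monte Carlo and Euler discretization error. In an RL agent the velocity would also be conditioned on the state, which is omitted here for brevity.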

Empirical Results

RFM outperforms representative baselines—including SAC (mean-field Gaussian policy), Q-score matching (QSM, Langevin sampling from score models), Q-weighted noise estimation (QNE), and diffusion Q-sampling (DQS)—across all eight evaluation environments. It is the only approach to demonstrate both high average performance and low variance across all tested domains.

Figure 1: Training curves on eight environments demonstrate RFM's consistent superiority in mean episodic return and training stability compared to all baselines.

The results indicate that RFM's improved posterior mean estimation translates into gains in both task performance and learning stability, particularly in complex or multi-modal continuous-control scenarios.

Theoretical Implications and Generalizations

RFM provides a principled unification of previously disparate methods by showing that they are instances of a single posterior mean estimation principle. Its construction extends Boltzmann-targeted training from diffusion policies to flow policies, and it allows $Q$-value and $Q$-gradient information to be combined in a variance-optimal way.

Moreover, the control variate formalism based on Langevin Stein operators has broader implications: it is not only crucial for variance reduction in RL but may also be employed in general generative modeling, semi-supervised learning, and Bayesian inference settings where target distributions are unnormalized or otherwise intractable. Notably, this framework is flexible enough to accommodate further amortized or learned control variates beyond the diagonal or isotropic forms discussed in the paper.

Limitations and Future Developments

While RFM significantly advances training methodology for expressive policies in RL, practical scalability to ultra-high-dimensional action spaces or environments with complex temporal structure may introduce additional computational considerations (e.g., per-iteration cost of posterior estimation, integration with model-based or hierarchical decomposition in RL).

The reverse-inference perspective and control variate construction also naturally suggest future extensions:

Learned or amortized control variates to further reduce estimation variance across state, time, and environment;
Extensions to offline RL and imitation settings leveraging the RFM posterior mean formulation for general unnormalized targets;
Generalization to other structured generative models for policy learning or structured output tasks, including more sophisticated source distributions or hybrid latent structures.

Conclusion

Reverse Flow Matching formalizes and extends policy training for expressive generative models in online RL. By transforming the intractable sampling challenge of Boltzmann action distributions into a posterior mean estimation problem, and by employing rigorously constructed variance-reduced estimators, RFM establishes a unified objective that encompasses the prior noise-expectation and gradient-expectation methods and improves on them. Empirically, it achieves strong performance and stability across the eight continuous-control environments evaluated. Theoretically, RFM's foundations in statistical estimation and control variates open further avenues for robust generative policy learning and scalable approximate inference.
