Effective training of flow policies to sample from Boltzmann distributions

Develop an effective training methodology for flow matching policies that produces samples from the Boltzmann distribution over actions defined by a learned Q-function in maximum entropy online reinforcement learning. The method must overcome two obstacles: no direct samples from the target distribution are available to serve as flow-matching targets, and the Boltzmann density is known only up to an intractable normalizing constant.
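To make the two obstacles concrete, the following is a minimal sketch of the target distribution, pi(a|s) proportional to exp(Q(s,a)/alpha). The toy Q-function, the temperature alpha, and the 1-D discretized action grid are illustrative assumptions, not the paper's setup; the grid is exactly what does not scale, since in continuous high-dimensional action spaces the normalizer is intractable and no target samples exist for flow matching.

```python
import numpy as np

ALPHA = 0.5  # entropy temperature (assumed for illustration)

def q_value(state, actions):
    # Toy stand-in for a learned Q-function over a 1-D action space:
    # actions near the state value get the highest Q.
    return -(actions - state) ** 2

def boltzmann_probs(state, actions):
    # pi(a|s) proportional to exp(Q(s,a)/alpha). The normalizer Z(s) is only
    # computable here because we discretize a 1-D action grid; in continuous,
    # high-dimensional action spaces Z(s) is intractable, and no samples from
    # pi are available as regression targets -- the gap described above.
    logits = q_value(state, actions) / ALPHA
    logits -= logits.max()  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Discretized workaround, feasible only in this toy 1-D setting.
actions = np.linspace(-2.0, 2.0, 401)
probs = boltzmann_probs(0.3, actions)
sample = np.random.default_rng(0).choice(actions, p=probs)
```

The mode of the resulting distribution sits at the action maximizing Q, with entropy controlled by alpha; a flow policy would have to reproduce this distribution without ever enumerating the action space.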

Background

Prior methods targeting Boltzmann distributions in online reinforcement learning have been limited to diffusion policies, leaving flow policies without a principled way to sample from the Q-induced Boltzmann action distribution.

The paper identifies this gap explicitly as an open problem and proposes a reverse inferential framework (Reverse Flow Matching) that extends Boltzmann-targeted training from diffusion policies to flow policies.

References

Furthermore, these methods have been limited to diffusion policies, leaving the effective training of flow policies to sample from Boltzmann distributions as an open problem.

Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies (2601.08136 - Li et al., 13 Jan 2026) in Section 1 (Introduction)