Effective training of flow policies to sample from Boltzmann distributions
Develop an effective training methodology for flow-matching policies that produces samples from the Boltzmann distribution over actions defined by a learned Q-function in maximum-entropy online reinforcement learning. The central obstacles are the absence of direct target samples from that distribution and the intractability of its unnormalized Boltzmann density.
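To make the intractability concrete, the sketch below (a toy illustration, not the paper's method) writes down the Boltzmann target π(a|s) ∝ exp(Q(s,a)/α) for a stand-in quadratic Q on a 1-D action space. There the partition function Z(s) can be brute-forced on a grid; with a neural Q and a realistic action dimension, no such computation is available, which is exactly why a flow policy cannot be trained by regressing onto target samples.

```python
import numpy as np

# Hypothetical stand-in for a learned critic: a quadratic Q peaked at the
# state-dependent optimal action (any neural Q-network would play this role).
def q_value(action, state=0.0):
    return -(action - state) ** 2

ALPHA = 0.5  # entropy temperature of the max-entropy objective

# Unnormalized Boltzmann density over actions: exp(Q(s, a) / alpha).
def unnormalized_density(action, state=0.0):
    return np.exp(q_value(action, state) / ALPHA)

# The partition function Z(s) = \int exp(Q(s, a) / alpha) da has no closed
# form for a general Q.  We can brute-force it here only because the action
# space is 1-D; this step is what becomes intractable in realistic settings.
grid = np.linspace(-5.0, 5.0, 20001)
da = grid[1] - grid[0]
z = unnormalized_density(grid).sum() * da  # ~ sqrt(pi/2) for this quadratic Q

# With Z in hand the normalized target density exists, but we still have no
# mechanism to draw the i.i.d. target samples that flow matching regresses on.
density = unnormalized_density(grid) / z
print(z, density.sum() * da)
```

The example highlights both obstacles named above: even when Z(s) is computable, what flow matching needs is samples from π(a|s), not density evaluations, and in high-dimensional action spaces neither is directly available.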
References
Furthermore, these methods have been limited to diffusion policies, leaving the effective training of flow policies to sample from Boltzmann distributions as an open problem.
— Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies
(2601.08136 - Li et al., 13 Jan 2026) in Section 1 (Introduction)