Discrete Flow Matching for Offline-to-Online Reinforcement Learning

Published 12 May 2026 in cs.LG and cs.AI | (2605.12379v1)

Abstract: Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets, and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address these challenges, we introduce DRIFT, an online fine-tuning method that updates an offline pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by the missing target probability mass, and that the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete-action RL tasks show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained LLMs. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines. The candidate-set mechanism is supported by a stability analysis showing that the generator error decreases exponentially with candidate coverage.


Authors (3)

Summary

The paper introduces DRIFT, a method that leverages discrete generative flow matching via CTMCs for effective offline-to-online reinforcement learning.
The methodology applies path-space KL regularization and candidate-set approximations to balance offline pretraining with online adaptation and mitigate catastrophic forgetting.
Experimental results across text games, MinAtar, and discretized MuJoCo demonstrate significant performance gains and faster adaptation compared to traditional RL approaches.

Discrete Flow Matching for Offline-to-Online Reinforcement Learning: Expert Essay

Motivation and Context

Discrete action spaces are prevalent in RL domains such as recommendation, combinatorial control, text games, and scheduling. Generative RL policies based on diffusion and flow matching have demonstrated multimodal action modeling in offline RL, but these approaches have largely remained restricted to continuous control. Moving from offline, data-driven policy learning to stable online fine-tuning (offline-to-online RL) is nontrivial: the policy must retain beneficial offline behaviors while adapting efficiently to new environmental signals. Existing generative approaches for discrete spaces, primarily discrete flow matching via continuous-time Markov chains (CTMCs), have not addressed the (non-greedy) online adaptation problem. This paper introduces DRIFT, a method for fine-tuning discrete generative policies that combines CTMC flow matching, path-space regularization, and a scalable candidate-set approximation.

Methodology

DRIFT starts from a reference CTMC generator $u_{\mathrm{ref}}$ pretrained on offline datasets via advantage-weighted discrete flow matching. Fine-tuning then iterates over the following steps:

Collecting online data through CTMC simulation and environment interaction.
Updating critic and value networks using a mixed replay buffer (a combination of online and offline transitions).
Constructing a reward-weighted, advantage-anchored target policy over a candidate set of high-probability actions. The candidate set is assembled from reference rollouts and uniform exploration, which avoids enumerating the full action space when it is large.
Updating the CTMC generator with a discrete flow matching actor objective steering towards the constructed target, combined with a path-space KL penalty regularizing the entire trajectory measure induced by the CTMC (not just terminal action probability).
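
The candidate-set and target-construction steps above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, reference rollouts are replaced by direct sampling from a categorical distribution `ref_probs`, and the target is assumed to take an AWAC-style form proportional to `ref_probs * exp(advantage / beta)`, which the summary does not specify.

```python
import numpy as np

def build_candidate_set(ref_probs, n_ref, n_uniform, rng):
    """Union of actions drawn from the reference policy and uniformly at random.

    Assumption: sampling from `ref_probs` stands in for CTMC reference rollouts.
    """
    n_actions = ref_probs.shape[0]
    from_ref = rng.choice(n_actions, size=n_ref, p=ref_probs)
    from_uniform = rng.integers(0, n_actions, size=n_uniform)
    # np.unique removes duplicates and returns sorted action indices.
    return np.unique(np.concatenate([from_ref, from_uniform]))

def candidate_target(ref_probs, advantages, candidates, beta=1.0):
    """Advantage-weighted target supported only on the candidate set.

    Assumption: weights are ref_probs * exp(advantage / beta), normalized
    over the candidates; actions outside the set receive zero mass.
    """
    logits = np.log(ref_probs[candidates] + 1e-12) + advantages[candidates] / beta
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    target = np.zeros_like(ref_probs, dtype=float)
    target[candidates] = weights / weights.sum()
    return target

# Usage with a hypothetical 128-action problem.
rng = np.random.default_rng(0)
ref_probs = np.full(128, 1.0 / 128)
advantages = rng.normal(size=128)
candidates = build_candidate_set(ref_probs, n_ref=8, n_uniform=8, rng=rng)
target = candidate_target(ref_probs, advantages, candidates, beta=0.5)
```

The cost of building the target scales with the number of candidates rather than with the size of the action space, which is the point of the approximation.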

The CTMC generator is parameterized using softplus-transformed logits for off-diagonal rates, with diagonal rates enforcing probability conservation. The actor loss regresses the learned generator against the independent-coupling transport generator, ensuring consistency with the prescribed probability path, while the path-space KL is estimated via Monte Carlo trajectories, penalizing deviations at both transition and holding times.
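
A minimal sketch of this parameterization follows, under simplifying assumptions: a single time-independent generator over a small action set, whereas in DRIFT the generator presumably depends on the environment state and the flow time. All function names are hypothetical. The per-state KL rate uses the standard expression for the relative entropy between two jump processes, $\sum_{y \neq x} \left[ u \log(u / u_{\mathrm{ref}}) - u + u_{\mathrm{ref}} \right]$, which a Monte Carlo estimator would integrate along sampled trajectories.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably.
    return np.logaddexp(0.0, x)

def generator_from_logits(logits, eps=1e-8):
    """Rate matrix with positive off-diagonal rates and rows summing to zero."""
    q = softplus(np.asarray(logits, dtype=float)) + eps
    np.fill_diagonal(q, 0.0)
    # Diagonal = minus the total exit rate, so probability is conserved.
    np.fill_diagonal(q, -q.sum(axis=1))
    return q

def euler_step(p, q, h):
    """One Euler step of the forward equation dp/dt = p Q (needs small h)."""
    return p + h * (p @ q)

def path_kl_rate(q, q_ref, x):
    """Instantaneous path-space KL rate in state x between two generators."""
    off_diag = np.arange(q.shape[0]) != x
    u, u_ref = q[x, off_diag], q_ref[x, off_diag]
    return float(np.sum(u * np.log(u / u_ref) - u + u_ref))

# Usage with hypothetical logits for a 5-action problem.
rng = np.random.default_rng(0)
q = generator_from_logits(rng.normal(size=(5, 5)))
q_ref = generator_from_logits(rng.normal(size=(5, 5)))
p_next = euler_step(np.full(5, 0.2), q, h=0.01)
kl_rate = path_kl_rate(q, q_ref, x=0)
```

Because each summand $u \log(u / u_{\mathrm{ref}}) - u + u_{\mathrm{ref}}$ is nonnegative and vanishes only when the two rates agree, the penalty is zero exactly when the fine-tuned generator matches the reference along the whole path.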

Theoretical Guarantees

The candidate-set approximation is characterized by its coverage error: the $\ell_1$ discrepancy between the ideal and candidate-restricted target policy is twice the excluded probability mass. The expected excluded mass decays exponentially with candidate set size and rollout/exploration budgets. The generator stability theorem bounds the error between the candidate-restricted and full-action CTMC generators by the coverage error, scaled inversely by the minimal bridge probability and the normalizer. These results support the candidate-set approach for large action spaces. Empirically, performance improves sharply once coverage exceeds a threshold (e.g., 16% action coverage in a 128-action gridworld).
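
The factor of two can be checked numerically. The sketch below assumes the candidate-restricted target is obtained by zeroing out non-candidate actions and renormalizing, and that candidates are drawn i.i.d. from a proposal distribution; both are illustrative assumptions, and the function names are hypothetical. Under i.i.d. sampling, an action with proposal probability $q(a)$ is missed by all $k$ draws with probability $(1 - q(a))^k$, which gives the geometric decay of the expected excluded mass.

```python
import numpy as np

def restrict_and_renormalize(target, candidates):
    """Zero out non-candidate actions and renormalize (assumed restriction)."""
    restricted = np.zeros_like(target, dtype=float)
    restricted[candidates] = target[candidates]
    return restricted / restricted.sum()

def excluded_mass(target, candidates):
    """Target probability mass that falls outside the candidate set."""
    return 1.0 - target[candidates].sum()

def expected_excluded_mass(target, proposal, k):
    """Expected excluded mass after k i.i.d. draws from `proposal`."""
    return float(np.sum(target * (1.0 - proposal) ** k))

target = np.array([0.4, 0.3, 0.2, 0.1])
candidates = np.array([0, 1])

restricted = restrict_and_renormalize(target, candidates)
l1_error = np.abs(restricted - target).sum()
eps = excluded_mass(target, candidates)
# l1_error equals 2 * eps up to floating-point rounding.
```

The identity holds because renormalization adds exactly the excluded mass back onto the candidates, and the excluded actions contribute the same mass once more to the $\ell_1$ sum.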

Experimental Results

DRIFT is evaluated across Jericho text games, MinAtar, discretized D4RL MuJoCo, and combinatorial gridworlds. Notable results include:

Jericho Text Games: DRIFT achieves the highest average normalized score (23.2%) across 10 games, outperforming LLM-driven baselines (CALM, KG-A2C) while using only GRU encoders. The largest improvements appear in games that require deep puzzle-chain reasoning, and scores are positive across all games.
MinAtar: DRIFT demonstrates stable improvement relative to offline pretraining and avoids catastrophic forgetting due to path-space KL regularization, outperforming AWAC and SPA in most games.
D4RL Discretized MuJoCo: DRIFT reliably improves upon its (weak) offline initialization in all tasks, achieving best online scores on Hopper and competitive improvements on Walker and HalfCheetah, despite the low cardinality of discretized action spaces.
Controlled Ablation Studies and Large-Action Gridworlds: Experiments on candidate-set phase transitions and path-space KL regularization show stability advantages for DRIFT, especially in environments with multimodality or reward shifts (goal-switch tasks), where DRIFT adapts $1.9\times$ faster to new rewards than DQN/PPO.

Practical and Theoretical Implications

DRIFT advances offline-to-online RL by enabling generative policies in discrete spaces with robust online adaptation and catastrophic forgetting mitigation. The path-space KL regularization maintains bounded divergence from offline behaviors, which is essential for safety and policy reliability in real-world settings. Candidate-set approximation scales generative flow matching to large discrete action spaces without exhaustive enumeration or loss of multimodal exploration.

The formal stability guarantees address practitioner concerns about scalability and approximation error, making DRIFT suitable for domains such as combinatorial optimization, language-based decision processes, and discrete robotic control. Empirical validation against strong baselines highlights DRIFT's adaptability and robustness, establishing it as a practical alternative to value-based approaches that are prone to mode collapse or forgetting.

Future Directions

Potential avenues for extending DRIFT include adaptive scheduling for path-space KL penalties, multi-objective RL leveraging factorized CTMCs, and applications to genuinely large or variable action spaces (e.g., real-time strategy games, protein sequence optimization). Further research on improved offline initialization and integration with continuous-discrete hybrid action domains could yield enhanced sample efficiency and cross-domain transfer.

Conclusion

This paper formalizes and empirically substantiates offline-to-online fine-tuning of discrete generative policies via discrete flow matching. DRIFT combines CTMC-based policy representation, candidate-set scaling, advantage-weighted updating, and rigorous path-space regularization to achieve consistent, stable improvements across varied RL tasks. The theoretical coverage and stability analysis, validated by empirical phase transitions, positions DRIFT as a robust framework for large-scale discrete RL, with strong implications for safety-critical and multimodal policy learning.
