Flow-Matching Critics in RL

Updated 4 July 2026

Flow-Matching Critics are models that use time-indexed velocity fields to incorporate intermediate dynamic information into value estimation.
They span applications in reinforcement learning and generative modeling, enabling distributional return estimation, iterative TD learning, and diagnostic evaluations.
Experimental findings demonstrate improved sample efficiency and performance stability across continuous control tasks and image anomaly detection benchmarks.

Flow-matching critics are critic constructions that use flow matching, or closely related velocity-field formulations, to produce evaluative signals rather than only samples. In reinforcement learning, the critic may itself be a conditional flow model over return distributions or Q-values; in actor–critic formulations for continuous-time generative models, the critic may be a value function over intermediate flow states or an action-sensitive Q-function whose gradients are translated into denoising-time velocity corrections; and in diagnostic settings, a learned flow field can serve as a critic by measuring disagreement between learned and geometric dynamics (Zhong et al., 26 Oct 2025, Fan et al., 20 Oct 2025, Chen et al., 21 May 2026).

1. Scope and terminology

Across the literature, the term is used in several related senses. The common object is a time-indexed field—usually a velocity field—conditioned on state, action, prompt, or intermediate latent, and then integrated, regressed, or compared to obtain a value-like signal.

Setting	Critic form	Representative papers
Reinforcement learning	Conditional flow model of $Z^\pi(s)$, $Z^\pi(s,a)$, or $Q^\pi(s,a)$	(Zhong et al., 26 Oct 2025, Agrawalla et al., 8 Sep 2025, Groom et al., 8 May 2026)
Actor–critic for flow models	Value function over intermediate states, or Q-function guiding denoising-time corrections	(Fan et al., 20 Oct 2025, Wang et al., 6 Jun 2026)
Diagnostic evaluation	Learned velocity field used as a consistency score	(Chen et al., 21 May 2026)

In the narrowest RL sense, a flow-matching critic replaces a scalar $V_\phi(s)$ or $Q_\phi(s,a)$ with a conditional generative flow model of the return distribution. In broader generative-modeling usage, the phrase can denote a critic attached to a flow-matching actor, or a critic-like diagnostic that evaluates whether a learned flow agrees with a prescribed path. This suggests that the defining feature is not a single architecture, but the use of flow-time structure to make evaluation depend on intermediate dynamics rather than only on terminal outputs.

2. Return-distribution critics in reinforcement learning

The clearest direct formulation appears in FlowCritic, which replaces the usual scalar critic with a conditional flow-matching generative model of the return distribution $p^\pi(z\mid s)$. The critic draws a prior sample $\varepsilon\sim\mathcal N(0,1)$, evolves it through a conditional ODE,

$\frac{d o^t}{dt} = f_\theta(o^t,t,s), \qquad o^0=\varepsilon,$

and returns $z=o^1=\mathcal F_\theta(s,\varepsilon)$ as a sampled return. The expectation $V^\pi(s)=\mathbb E[Z^\pi(s)]$ is then recovered by Monte Carlo, while higher-order statistics such as variance and coefficient of variation are available from the same sampled critic. FlowCritic motivates this design by contrasting it with scalar critics, ensemble critics, and distributional RL methods such as C51, QR-DQN, IQN, FQF, DSAC, and TQC; the stated claim is that scalar critics only regress a mean, ensembles only aggregate finitely many point estimates, and existing distributional methods either discretize support or impose restricted parametric forms (Zhong et al., 26 Oct 2025).
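The sampling procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FlowCritic's implementation: `toy_velocity` is a hypothetical closed-form stand-in for the learned field $f_\theta$, chosen so that the ODE maps $\mathcal N(0,1)$ noise to a Gaussian return whose mean depends on the state, and the step and sample counts are arbitrary.

```python
import random
import statistics


def toy_velocity(o, t, s):
    """Hypothetical stand-in for f_theta(o, t, s) (an assumption, not a learned model).

    It transports N(0, 1) noise to N(mu(s), sigma^2) along the straight path
    o_t = (1 - t) * eps + t * (mu + sigma * eps).
    """
    mu, sigma = 2.0 * s, 0.5
    # Recover the noise that generated o_t, then return d o_t / dt.
    eps = (o - t * mu) / (1.0 - t + t * sigma)
    return mu + (sigma - 1.0) * eps


def sample_return(s, eps, velocity=toy_velocity, n_steps=20):
    """Euler-integrate d o/dt = f(o, t, s) from o^0 = eps to o^1 (a sampled return)."""
    o, dt = eps, 1.0 / n_steps
    for k in range(n_steps):
        o += dt * velocity(o, k * dt, s)
    return o


def critic_statistics(s, n_samples=2000, seed=0):
    """Monte Carlo mean, standard deviation, and coefficient of variation."""
    rng = random.Random(seed)
    zs = [sample_return(s, rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    mean = statistics.fmean(zs)
    std = statistics.pstdev(zs)
    return mean, std, std / abs(mean)
```

Calling `critic_statistics(1.0)` returns estimates of the value $V^\pi(s)$ together with the spread of the sampled returns, which is the extra information a scalar critic does not provide.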

Training is Bellman-bootstrapped in distribution space. FlowCritic defines a distributional TD sample

$\hat z = r + \gamma\,\mathcal F_\theta(s',\varepsilon'), \qquad \varepsilon'\sim\mathcal N(0,1),$

that is, the reward plus a discounted return sampled from the critic at the next state $s'$, and a TD($\lambda$)-style target built from such bootstrapped samples.

A conditional flow-matching loss then regresses the velocity field onto the linear-path velocity between the prior sample $\varepsilon$ and the bootstrapped target. To stabilize this bootstrapped learning, FlowCritic adds velocity-field clipping analogous to PPO clipping. For baseline estimation it samples a fixed number of returns per state, sorts them, discards the largest ones, and uses the truncated mean as a pessimistic estimate. For policy learning it computes the coefficient of variation of the sampled returns (their standard deviation relative to the magnitude of their mean), forms per-state weights from it, normalizes them over the batch, and inserts them into a weighted PPO objective. The reported experimental setup uses 12 continuous-control tasks in IsaacGym, five random seeds, and average episodic return; the reported result is that FlowCritic outperforms all baselines on all 12 tasks, with especially large gains on Ant, AllegroHand, ShadowHand, and Humanoid, and that policies were also transferred to a Unitree Go2 quadruped for locomotion and stair climbing (Zhong et al., 26 Oct 2025).
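The baseline and weighting steps can be illustrated as follows. This is a sketch under assumptions: the drop fraction, the small constant in the denominator, and in particular the weight formula `1 / (1 + cv)` are hypothetical choices made for illustration, since the text above only states that weights are formed from the coefficient of variation and normalized over the batch.

```python
import statistics


def truncated_mean(returns, drop_frac=0.25):
    """Mean of sampled returns after discarding the largest ones (pessimistic)."""
    kept = sorted(returns)
    n_drop = int(len(kept) * drop_frac)  # assumed: drop a fixed top fraction
    if n_drop:
        kept = kept[:-n_drop]
    return statistics.fmean(kept)


def coefficient_of_variation(returns, eps=1e-8):
    """Standard deviation of the returns divided by the magnitude of their mean."""
    return statistics.pstdev(returns) / (abs(statistics.fmean(returns)) + eps)


def batch_weights(cvs):
    """Hypothetical weighting: noisier states get less weight; batch mean is 1."""
    raw = [1.0 / (1.0 + cv) for cv in cvs]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]
```

The resulting weights would multiply the per-sample terms of a PPO-style objective, so that states with highly dispersed return estimates contribute less to the policy update.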

A more explicitly distributional and theoretically aligned variant is FlowIQN. Its starting point is a metric mismatch: distributional RL theory relies on the distributional Bellman operator being contractive in a Wasserstein distance, whereas standard conditional flow-matching critics train with arbitrary source–target couplings. FlowIQN addresses this by sorting source and Bellman-target samples within each mini-batch to approximate the monotone optimal transport coupling. The resulting quantile-coupled loss is trained on paired order statistics, and the paper proves that the loss yields a Wasserstein-aligned approximate projection compatible with the foundations of distributional RL. It also introduces shortcut models for efficient inference. Empirically, FlowIQN improves Wasserstein return-distribution accuracy over other conditional flow-matching critics and yields competitive performance on offline RL benchmarks across multiple policy extraction methods (Groom et al., 8 May 2026).
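The sorting step is easy to show concretely for scalar returns. The sketch below is illustrative only: the function names are hypothetical, the squared-difference cost is an assumed choice, and the loss uses the plain linear-path conditional flow-matching form rather than FlowIQN's exact objective.

```python
import random


def quantile_coupling(source, target):
    """Pair the i-th smallest source sample with the i-th smallest target sample."""
    return list(zip(sorted(source), sorted(target)))


def transport_cost(pairs):
    """Average squared displacement of a coupling (assumed cost)."""
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def coupled_fm_loss(velocity, pairs, rng):
    """Linear-path flow-matching loss evaluated on coupled (source, target) pairs."""
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()
        x_t = (1.0 - t) * x0 + t * x1
        total += (velocity(x_t, t) - (x1 - x0)) ** 2
    return total / len(pairs)
```

In one dimension the sorted pairing never has a higher squared-displacement cost than the pairing in which samples happen to be drawn, which is why sorting within the mini-batch approximates the monotone optimal transport coupling.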

3. Iterative TD critics and mechanistic explanations

A second RL lineage treats flow matching less as return-distribution modeling and more as an iterative parameterization of expected-value TD learning. The paper introducing floq defines a critic whose output is the endpoint of a learned flow over a scalar latent. Starting from an initial latent $z_0$, the method integrates a velocity field $v_\theta$ conditioned on $(s,a)$ through $K$ Euler steps and defines

$z_{k+1} = z_k + \frac{1}{K}\, v_\theta\!\left(z_k, \tfrac{k}{K}, s, a\right), \qquad Q_\theta(s,a) = z_K.$

The TD target is built by integrating a target velocity field on next-state actions and averaging over multiple latent initializations, after which the critic is trained with a flow-matching loss on interpolants between the initial latent and the scalar TD target. The paper’s stated motivation is that dense supervision of intermediate computations is a hallmark of modern large-scale machine learning, and its reported result is that floq improves performance across a suite of challenging offline RL benchmarks and online fine-tuning tasks (Agrawalla et al., 8 Sep 2025).
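The readout and the training loss can be sketched as follows. This is a minimal illustration, not floq's implementation: the velocity is passed in as a plain callable, the uniform latent initialization and its range are assumptions, and the target-network integration that produces the TD target is left out.

```python
import random


def q_readout(velocity, s, a, z0, n_steps=8):
    """Endpoint of K Euler steps on a scalar latent, used as the Q-value."""
    z, dt = z0, 1.0 / n_steps
    for k in range(n_steps):
        z += dt * velocity(z, k * dt, s, a)
    return z


def flow_td_loss(velocity, s, a, td_target, rng, n_interp=16, low=-1.0, high=1.0):
    """Flow-matching loss on interpolants between initial latents and a scalar TD target."""
    total = 0.0
    for _ in range(n_interp):
        z0 = rng.uniform(low, high)          # assumed latent initialization
        t = rng.random()
        z_t = (1.0 - t) * z0 + t * td_target  # linear interpolant
        # Regress the velocity onto the straight-line displacement.
        total += (velocity(z_t, t, s, a) - (td_target - z0)) ** 2
    return total / n_interp
```

Because the loss is evaluated at many interpolation times, every intermediate step of the readout receives a supervision signal, rather than only the final scalar output.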

The later analysis of this class of critics argues that their success is not explained by distributional RL. In controlled comparisons, explicitly modeling return distributions can reduce performance, whereas integration for value readout and dense velocity supervision improve TD learning through two mechanisms: test-time recovery and plasticity. Test-time recovery means that iterative integration dampens errors made in early steps of the computation as more integration steps are performed; the paper formalizes this with a stability factor that decreases as integration proceeds, and relates it to a conic condition on the velocity field. Plasticity means that dense supervision at multiple interpolant values induces feature learning that can represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. The paper’s reported empirical result is that flow-matching critics substantially outperform monolithic critics in both final performance and sample efficiency in settings where loss of plasticity poses a challenge, such as high-UTD online RL (Agrawalla et al., 4 Mar 2026).

This line of work therefore shifts the interpretation of flow-matching critics. Rather than treating them primarily as distribution estimators, it treats them as compute-scalable TD critics whose iterative integration and dense supervision change the optimization and representation properties of value learning. A plausible implication is that the most consequential distinction from standard critics is architectural and dynamical, not merely probabilistic.

4. Critics for flow-matching actors and policies

In continuous-time flow models, the critic may instead evaluate intermediate flow states while the actor remains the velocity field itself. AC-Flow is an actor–critic framework built specifically for continuous-time flow-matching generative models. The actor is the flow model's vector field; the critic is a lightweight scalar value function defined over intermediate flow states. Because the training reward is terminal, the critic is regressed by Monte Carlo onto the shaped terminal reward, where reward shaping is implemented by per-batch min–max scaling to a fixed range. Actor updates do not use the critic immediately: during an initial warm-up phase, the critic is trained but the actor uses group-relative advantages instead. After warm-up, AC-Flow uses critic-based advantages, clips them to a bounded interval, exponentiates them into weights, and inserts those weights into a reweighted conditional flow-matching objective plus a Wasserstein-2 regularizer toward the reference model. The reported experiments on Stable Diffusion 3 state that this combination of reward shaping, advantage clipping, warm-up, and critic weighting is essential for stability, and that AC-Flow achieves the best CLIPScore and best human-preference metrics among the compared methods (Fan et al., 20 Oct 2025).
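The chain from shaped reward to loss weight can be written out compactly. The sketch below is illustrative only: the $[0,1]$ target range, the advantage definition (shaped reward minus critic value), the `clip` and `temperature` values, and all function names are assumptions, and the Wasserstein-2 regularizer and warm-up schedule are omitted.

```python
import math


def minmax_shape(rewards):
    """Per-batch min-max scaling of terminal rewards (assumed range [0, 1])."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        return [0.5 for _ in rewards]
    return [(r - lo) / (hi - lo) for r in rewards]


def advantage_weights(shaped_rewards, values, clip=1.0, temperature=0.5):
    """Clip critic-based advantages, then exponentiate them into weights."""
    weights = []
    for r, v in zip(shaped_rewards, values):
        adv = max(-clip, min(clip, r - v))  # assumed advantage: reward minus value
        weights.append(math.exp(adv / temperature))
    return weights


def weighted_cfm_loss(weights, per_sample_losses):
    """Reweighted average of per-sample flow-matching losses."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)
```

Clipping before exponentiation bounds every weight between `exp(-clip / temperature)` and `exp(clip / temperature)`, which is one simple way to keep a few extreme advantages from dominating the update.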

Q-VGM addresses a different problem: how to use a critic to fine-tune a flow-matching vision-language-action policy when direct backpropagation through the denoising chain is numerically unstable and tractable action likelihoods are unavailable. Its critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Rather than backpropagating through the full denoising chain, Q-VGM uses the critic’s action gradient $\nabla_a Q(s,a)$ to build denoising-time velocity corrections. At each denoising state, it forms a look-forward clean action estimate under the frozen base velocity, improves it by iterative Q-gradient ascent with keep-best selection, converts the resulting action displacement into an effective residual velocity, and trains the policy by residual velocity matching, that is, by regressing the policy's velocity correction onto this residual.

The key claim is that this requires no action likelihoods and no backpropagation through the denoising chain. The reported results are improvements in average success on LIBERO, on RoboTwin 2.0, and on two real-robot tabletop tasks (Wang et al., 6 Jun 2026).
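A one-dimensional toy version of the correction step is shown below. It is a sketch under assumptions rather than Q-VGM's procedure: the linear look-forward rule, the division by the remaining time $1-t$ to turn a displacement into a velocity, the step size, and the iteration count are all hypothetical choices, and the critic is passed in as plain callables `q` and `grad_q`.

```python
def look_forward(a_t, t, base_velocity):
    """Clean-action estimate obtained by following the frozen base velocity to t = 1."""
    return a_t + (1.0 - t) * base_velocity(a_t, t)


def q_gradient_ascent(a, q, grad_q, n_iters=5, step=0.1):
    """Iterative ascent on Q with keep-best selection over the visited actions."""
    best_a, best_q = a, q(a)
    for _ in range(n_iters):
        a = a + step * grad_q(a)
        if q(a) > best_q:
            best_a, best_q = a, q(a)
    return best_a


def residual_velocity_target(a_t, t, base_velocity, q, grad_q):
    """Convert the Q-improving action displacement into a residual velocity (t < 1)."""
    a_hat = look_forward(a_t, t, base_velocity)
    a_star = q_gradient_ascent(a_hat, q, grad_q)
    return (a_star - a_hat) / (1.0 - t)
```

Only the critic's gradient with respect to a single action is needed here, so nothing is differentiated through the denoising chain; keep-best selection also means an overshooting step can never make the target worse than the base estimate.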

These works show a broader meaning of the phrase. A flow-matching critic need not itself be a flow-based value estimator; it may be a value function or Q-ensemble whose outputs are made compatible with a flow-matching actor by translating scalar value information into time-indexed weights or velocity corrections.

5. Diagnostic and evaluative critics from velocity fields

A further usage treats a learned flow field itself as a critic in the evaluative or diagnostic sense. Flow Mismatching trains a standard conditional flow-matching model on normal images only, then, at test time, compares the learned normal velocity with the geometric velocity along affine paths from Gaussian noise to a test image. The local discrepancy is the mismatch between these two velocities at a given point and time on the path, and the method aggregates it across random seeds and across time to obtain pixel-wise heatmaps and image-level anomaly scores. The paper’s theoretical result is a decomposition of the population mismatch into a weighted denoising error and a Fisher-divergence term between test-path and normal-path score functions, which identifies the score-gap component that drives anomaly separation. The reported image-domain results on MVTec-AD and VisA cover pixel-level AUROC, AP, F1-max, and PRO on both benchmarks (Chen et al., 21 May 2026).
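The scoring idea can be reproduced on a toy example. The sketch below is illustrative only: images are short lists of pixel values, the path is assumed to be the linear interpolation $x_t=(1-t)\varepsilon+t\,y$, the discrepancy is assumed to be a per-pixel squared difference, and `normal_velocity` is a hypothetical closed-form stand-in for a model trained on normal data that is concentrated at a single image `c`.

```python
import random


def normal_velocity(x, t, c):
    """Hypothetical 'learned' velocity when all normal data equals the image c."""
    return [(ci - xi) / (1.0 - t) for xi, ci in zip(x, c)]


def mismatch_map(y, velocity, times=(0.2, 0.5, 0.8), n_seeds=8, seed=0):
    """Per-pixel squared gap between learned and geometric velocity, averaged
    over noise seeds and path times."""
    rng = random.Random(seed)
    heat = [0.0] * len(y)
    for _ in range(n_seeds):
        eps = [rng.gauss(0.0, 1.0) for _ in y]
        for t in times:
            x_t = [(1.0 - t) * e + t * yi for e, yi in zip(eps, y)]
            v = velocity(x_t, t)
            for i, (vi, e, yi) in enumerate(zip(v, eps, y)):
                heat[i] += (vi - (yi - e)) ** 2  # geometric velocity is y - eps
    n = n_seeds * len(times)
    return [h / n for h in heat]
```

For a test image equal to `c` the map is numerically zero, while changing a single pixel raises the score at that pixel only; taking the maximum or mean of the map would give an image-level score.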

A complementary diagnostic perspective comes from a denoising analysis of the generation process. That work establishes the formal equivalence between an optimal flow-matching velocity and an optimal denoiser,

$v^\star(x_t,t) = \frac{D^\star(x_t,t) - x_t}{1-t}, \qquad D^\star(x_t,t)=\mathbb E[x_1\mid x_t], \qquad x_t=(1-t)\,x_0+t\,x_1,$

and then studies the generation process by per-time denoising PSNR, time-dependent perturbations, and Jacobian profiles. It reports three functional phases: an early drift-sensitive phase, an intermediate content or structure phase, and a late cleanup or detail phase. Noise-type perturbations applied late in generation strongly degrade FID, whereas drift-type perturbations applied early can produce large pairwise trajectory deviations while leaving FID comparatively robust. This suggests that flow-matching critics in the diagnostic sense can be phase-aware: they can evaluate not only final sample quality but also where in time a model is sensitive to local noise, global drift, or Jacobian regularization (Gagneux et al., 28 Oct 2025).

6. Theoretical issues, limitations, and open directions

Several papers identify tensions between the expressive promise of flow-matching critics and the properties of empirical flow matching itself. One issue is coupling design. Weighted Conditional Flow Matching shows that standard conditional flow matching can be reweighted by a Gibbs kernel on source–target pairs, recovering the entropic OT coupling up to some bias in the marginals, and becoming equivalent to minibatch OT-CFM in the large-batch limit. This is directly relevant to critics, because the source–target coupling in a flow-matching loss determines which trajectories are emphasized; the same paper also characterizes when the marginal tilting remains nearly unchanged (Calvo-Ordonez et al., 29 Jul 2025).
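The reweighting idea can be sketched as follows. This is an illustration under assumptions, not the paper's objective: the kernel is taken to be $\exp(-\lVert x_0-x_1\rVert^2/\epsilon)$ with a squared Euclidean cost, the loss is self-normalized by the sum of weights, and the function names are hypothetical.

```python
import math


def gibbs_weight(x0, x1, epsilon=1.0):
    """Gibbs kernel on a source-target pair (assumed squared Euclidean cost)."""
    cost = sum((a - b) ** 2 for a, b in zip(x0, x1))
    return math.exp(-cost / epsilon)


def gibbs_weighted_cfm_loss(velocity, pairs, times, epsilon=1.0):
    """Linear-path flow-matching loss with each pair weighted by the Gibbs kernel."""
    num, den = 0.0, 0.0
    for (x0, x1), t in zip(pairs, times):
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        v = velocity(x_t, t)
        err = sum((vi - (b - a)) ** 2 for vi, a, b in zip(v, x0, x1))
        w = gibbs_weight(x0, x1, epsilon)
        num += w * err
        den += w
    return num / den
```

Pairs that are far apart receive exponentially small weight, so independently sampled pairs are softly filtered toward short transport paths without solving an optimal transport problem in each batch.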

A second issue is structural bias in empirical flow matching. An analysis of empirical flow-matching samplers shows that the empirical minimizer is almost never a gradient field, even when each conditional flow is, and is therefore intrinsically energetically suboptimal. With Gaussian sources, instantaneous and integrated kinetic energies exhibit exponential concentration; with heavy-tailed sources they have polynomial tails; and the paper argues that these behaviors are governed primarily by the choice of source distribution rather than the data. For flow-matching critics built on the same empirical FM machinery, this suggests that transport geometry and energy statistics may inherit source-driven and non-conservative biases (Lim, 18 Dec 2025).

A third issue concerns the low-noise regime. The analysis of low-noise pathology shows that as noise levels approach zero, arbitrarily small perturbations in the input can induce large variations in the velocity target, causing the condition number of the learning problem to diverge. Its proposed remedy, Local Contrastive Flow, replaces direct velocity regression with contrastive feature alignment at small noise levels while retaining standard flow matching at moderate and high noise. Although that work is framed around generative and representation learning, it presents a critic-like alternative to brittle low-noise regression and suggests that some flow-matching critics may need phase-dependent objectives rather than a single uniform regression rule (Zeng et al., 25 Sep 2025).

There are also more domain-specific limitations. FlowCritic reports computational overhead from multiple samples and ODE integrations per state, sensitivity to several hyperparameters, and on-policy-only experiments, while noting that off-policy adaptation may be non-trivial (Zhong et al., 26 Oct 2025). FlowIQN’s Wasserstein-aligned guarantee is specific to the 1D return setting with quantile coupling, so extending the same logic to higher-dimensional targets remains open (Groom et al., 8 May 2026). AC-Flow documents that naïve online actor–critic training in flow matching can blow up and collapse unless reward shaping, clipping, and warm-up are used (Fan et al., 20 Oct 2025). Q-VGM shows that directly backpropagating a Q-max objective through the full denoising chain can degrade performance at VLA scale (Wang et al., 6 Jun 2026).

At the level of underlying generative theory, a deterministic, non-asymptotic upper bound is available: if the $L^2$ flow-matching loss is bounded by $\epsilon^2$, then, under regularity assumptions on the data path and velocity fields, the terminal KL divergence is bounded, for constants $A_1$ and $A_2$ that depend only on those regularity properties, by

$A_1\,\epsilon + A_2\,\epsilon^2.$

This provides one route for relating velocity approximation error to distributional error and implies statistical convergence rates under total variation distance, but the same paper also makes clear that such guarantees depend on strong regularity assumptions (Su et al., 7 Nov 2025). A broader critical implication is that the theory of flow-matching critics is inseparable from the theory of flow matching itself: transport coupling, path regularity, integration depth, and low-noise behavior all shape what kind of critic signal is actually learned.

Flow-matching critics therefore constitute a family rather than a single method. In one branch they are generative critics for returns or Q-values; in another they are value estimators or Q-ensembles attached to flow-matching actors; in a third they are velocity-based diagnostics. What unifies them is the use of flow time as an evaluative dimension: instead of reading out value, uncertainty, or consistency in one shot, they exploit the structure of an interpolating or denoising trajectory. This suggests that future work will continue to revolve around four linked questions: how to choose source–target couplings, how to stabilize intermediate-time supervision, how to exploit critic information without differentiating through long denoising chains, and how to reconcile empirical flow matching with the transport metrics and regularity assumptions that underwrite its analysis.