Flow Reasoning Models (FRMs)

Updated 4 July 2026

Flow Reasoning Models (FRMs) are reasoning systems that treat inference as an evolving flow over intermediate states rather than a one-shot mapping.
They employ both continuous and discrete formulations, applying denoising dynamics on structured tasks like Sudoku and Zebra to achieve stable solutions.
FRMs utilize self-conditioning and flow-based verification to iteratively refine predictions, offering scalable and checkable reasoning methods with high empirical accuracy.

Flow Reasoning Models (FRMs) are reasoning systems that treat inference as an evolving flow over intermediate states rather than as a single-shot mapping from input to answer. In the narrowest current usage, the term refers to discrete flow LLMs reinterpreted as fixed-point reasoners for structured, checkable tasks such as Sudoku and Zebra, where correct solutions behave as stable attractors of denoising dynamics (Helbling et al., 28 Jun 2026). In a broader interpretation suggested by adjacent work, the same organizing idea appears as conserved attention flow on graphs, latent cognitive trajectories across recursive iterations, and probabilistic flow over chain-of-thought prefixes (Xu et al., 2018, Li et al., 30 May 2025, Liu et al., 14 Jan 2026).

1. Conceptual foundations and early formulations

A mathematical precursor to FRM-style thinking is the continuous-flow view of deep networks, where a residual network is interpreted as an Euler discretization of a characteristic ordinary differential equation, and depth becomes iterative refinement along a latent trajectory (Li et al., 2017). This does not address reasoning directly, but it establishes a vocabulary that later FRM work reuses: latent state evolution, velocity fields, transport dynamics, and discretized flow.
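The residual-network-as-Euler-step reading can be made concrete with a minimal sketch. The function name `euler_flow`, the toy velocity field `decay`, and the step size are illustrative assumptions, not taken from the cited paper; the only point is that each "layer" applies the update $x_{k+1} = x_k + h\, f(x_k, t_k)$.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_layers: int,
    t_end: float = 1.0,
) -> List[Vector]:
    """Integrate dx/dt = velocity(x, t) with forward Euler.

    Each loop iteration plays the role of one residual block:
    x_{k+1} = x_k + h * f(x_k, t_k).
    """
    h = t_end / num_layers
    x = list(x0)
    trajectory = [list(x)]
    for k in range(num_layers):
        v = velocity(x, k * h)
        x = [xi + h * vi for xi, vi in zip(x, v)]
        trajectory.append(list(x))
    return trajectory


def decay(x: Vector, t: float) -> Vector:
    """Hypothetical velocity field: pulls every coordinate toward zero."""
    return [-xi for xi in x]


traj = euler_flow([1.0, 2.0], decay, num_layers=4)
print(traj[-1])
```

With four layers and this toy field, every step scales the state by 0.75, so the latent trajectory is the discretized flow and depth is the number of integration steps.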

A more direct precursor appears in graph reasoning. “Modeling Attention Flow on Graphs” formalizes reasoning as a normalized distribution moving over graph structure rather than as a hidden process entangled only inside node vectors (Xu et al., 2018). The explicit node and edge flows are

$\tilde a_{ij}^t = T_{ij}^t a_i^t, \qquad a_j^{t+1} = \sum_i \tilde a_{ij}^t,$

with conservation constraints

$\sum_i a_i^t = 1, \qquad \sum_{ij} \tilde a_{ij}^t = 1.$

Here $a_i^t$ is focused attention on node $i$, $\tilde a_{ij}^t$ is flowing attention on edge $(i,j)$, and the destination prediction is read from the terminal flow state. This formulation already contains several FRM-defining elements: an explicit reasoning state, a learned transition operator, and a distinction between representational state and reasoning flow.
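The node and edge flow equations above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three-node graph and its transition matrix `T` are hypothetical, `T` is held fixed across steps (the formulation above allows a time-dependent $T^t$), and each row of `T` is assumed to sum to 1, which is what makes both conservation constraints hold.

```python
from typing import List, Tuple


def flow_step(a: List[float], T: List[List[float]]) -> Tuple[List[List[float]], List[float]]:
    """One step of attention flow.

    edge_flow[i][j] = T[i][j] * a[i]        (flowing attention on edge (i, j))
    next_a[j]       = sum_i edge_flow[i][j] (focused attention on node j)
    """
    n = len(a)
    edge_flow = [[T[i][j] * a[i] for j in range(n)] for i in range(n)]
    next_a = [sum(edge_flow[i][j] for i in range(n)) for j in range(n)]
    return edge_flow, next_a


# Hypothetical graph: node 0 splits its attention, nodes 1 and 2 send it to node 2.
T = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
a0 = [1.0, 0.0, 0.0]  # all attention starts on node 0

edge1, a1 = flow_step(a0, T)
edge2, a2 = flow_step(a1, T)
print(a1, a2)
```

Because the rows of `T` are normalized, total node attention and total edge attention both stay at 1 after every step, and the prediction is read from the final distribution `a2`.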

Later interpretability work sharpened the same intuition in LLMs. “The Geometry of Reasoning: Flowing Logics in Representation Space” argues that reasoning corresponds to smooth flows in representation space and that logical statements act as local controllers of these flows’ velocities (Zhou et al., 10 Oct 2025). It distinguishes position $y_t$, local increments $\Delta y_t = y_t - y_{t-1}$, and curvature, and reports that logic is more invariant in first- and second-order geometry than in raw hidden-state position. “Fluid Representations in Reasoning Models” makes a related empirical claim: during long chain-of-thought, hidden representations of actions and predicates become progressively more abstract and less tied to surface lexical form, a phenomenon the authors call Fluid Reasoning Representations (Kharlapenko et al., 4 Feb 2026). Taken together, these works suggest that FRMs are not only about explicit flow variables; they are also about the idea that reasoning quality is encoded in the trajectory itself.
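A small sketch may help separate position from first- and second-order geometry. It uses the increment definition $\Delta y_t = y_t - y_{t-1}$ from the text; the second difference is used here only as an assumed discrete proxy for curvature and is not claimed to be the paper's definition. The two toy paths are hypothetical.

```python
from typing import List

Vector = List[float]


def increments(ys: List[Vector]) -> List[Vector]:
    """First-order differences: delta_t = y_t - y_{t-1}."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(ys, ys[1:])]


def second_differences(ys: List[Vector]) -> List[Vector]:
    """Differences of increments; an assumed stand-in for curvature."""
    return increments(increments(ys))


# Two hypothetical trajectories with the same shape but different positions.
path_a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
path_b = [[p[0] + 10.0, p[1] - 5.0] for p in path_a]

print(increments(path_a) == increments(path_b))
print(second_differences(path_a))
```

The raw positions of `path_a` and `path_b` differ at every step, while their increments and second differences coincide, which is the elementary sense in which trajectory geometry can be shared across representations that sit in different places.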

2. Discrete flow LLMs as fixed-point reasoners

The most explicit use of the name appears in “Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement” (Helbling et al., 28 Jun 2026). Its starting point is a discrete flow LLM that embeds tokens into a continuous space, adds noise, and learns a denoising vector field. For a token sequence $w=(w_1,\dots,w_L)$ with embedding map $e$, the clean endpoint is

[equation missing]

and the forward path is

[equation missing]

with [condition missing]. The denoiser is

[equation missing]

The central observation is that, on structured and checkable tasks, correctness is reflected not only in what the model can generate, but also in the geometry of its denoising dynamics. Correct completed solutions tend to be stable fixed points: if a candidate solution is re-noised and then re-solved, the model tends to return to that same solution. This produces a verifier without a separate verifier network. For a candidate solution, with re-noised state

[equation missing]

and re-solve map

[equation missing]

the stability score is

[equation missing]

with lower score indicating greater dynamical stability. Best-of-$N$ ranking is then

[equation missing]

Inference therefore becomes propose-and-verify test-time scaling. An inner loop samples a candidate by self-conditioned denoising; an outer loop verifies stability and either accepts the candidate or restarts from fresh noise under a compute budget. This reframes reasoning as attractor search in denoising state space rather than textual deliberation over explicit natural-language rationales. The same model proposes, refines, and verifies.
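The propose-and-verify loop can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: re-noising is replaced by masking alternating cells, the stability score is taken to be the fraction of cells that change after re-solving, and `toy_resolve` simply knows a hidden attractor, whereas in an actual FRM the re-solve map is the learned denoiser. The names `renoise`, `stability_score`, `propose_and_verify`, and `ATTRACTOR` are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

Grid = Tuple[int, ...]
MASK = 0  # hypothetical symbol standing in for a noised cell


def renoise(candidate: Grid, trial: int) -> Grid:
    """Assumed stand-in for re-noising: mask every other cell, shifting the
    pattern by the trial index so two trials cover all cells."""
    return tuple(MASK if (i + trial) % 2 == 0 else v for i, v in enumerate(candidate))


def stability_score(candidate: Grid, resolve: Callable[[Grid], Grid], trials: int = 2) -> float:
    """Fraction of cells that change after re-noise and re-solve (lower is more stable)."""
    changed = 0
    for k in range(trials):
        resolved = resolve(renoise(candidate, k))
        changed += sum(r != c for r, c in zip(resolved, candidate))
    return changed / (trials * len(candidate))


def propose_and_verify(
    proposals: Iterable[Grid],
    resolve: Callable[[Grid], Grid],
    budget: int,
    tol: float = 0.0,
) -> Optional[Tuple[Grid, float]]:
    """Outer loop: score each proposal, accept the first stable one,
    otherwise return the most stable candidate seen within the budget."""
    best = None
    for used, candidate in enumerate(proposals):
        if used >= budget:
            break
        score = stability_score(candidate, resolve)
        if best is None or score < best[1]:
            best = (candidate, score)
        if score <= tol:
            break  # stable fixed point found; stop spending compute
    return best


ATTRACTOR: Grid = (1, 2, 3, 4)  # toy "correct solution"


def toy_resolve(noisy: Grid) -> Grid:
    """Toy re-solve map: fills masked cells with the attractor's values."""
    return tuple(a if v == MASK else v for v, a in zip(noisy, ATTRACTOR))


# Each proposal stands in for one inner-loop sample from fresh noise.
proposals = [(1, 2, 4, 4), (2, 2, 3, 4), (1, 2, 3, 4), (4, 3, 2, 1)]
result = propose_and_verify(proposals, toy_resolve, budget=4)
print(result)
```

In this toy, candidates with a wrong cell get pulled away from themselves when that cell is masked, so they score above zero, while the attractor returns to itself and is accepted. With a smaller budget the loop falls back to the least unstable candidate, which mirrors the claim that the remaining bottleneck is proposer coverage rather than the verifier signal.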

3. Self-conditioning, FlowDPO, and empirical profile

The training recipe of narrow-sense FRMs has two main parts: self-conditioning and FlowDPO (Helbling et al., 28 Jun 2026). Self-conditioning adds a channel carrying previous logits,

[equation missing]

and closes that channel at inference so that the model recursively refines its own past predictions. With a discretized time grid, refinement proceeds by

[equation missing]

and Euler stepping

[equation missing]

The model commits when the decoded assignment stops changing.
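The control flow of self-conditioned refinement can be sketched as follows. This is a schematic under stated assumptions, not the paper's update rule: the model interface `(x, t, prev_logits) -> (logits, velocity)`, the one-dimensional toy embeddings, the way `toy_model` mixes in previous logits, and the commit test against the immediately preceding decode are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Logits = List[List[float]]
Model = Callable[[List[float], float, Optional[Logits]], Tuple[Logits, List[float]]]


def argmax(row: List[float]) -> int:
    return max(range(len(row)), key=row.__getitem__)


def refine(x: List[float], model: Model, num_steps: int) -> Tuple[List[int], int]:
    """Euler-step the state while feeding the previous logits back in.

    Returns the decoded assignment and the number of model calls used.
    Commits early once the decoded assignment stops changing.
    """
    dt = 1.0 / num_steps
    prev_logits: Optional[Logits] = None  # self-conditioning channel starts empty
    prev_decoded: Optional[List[int]] = None
    for k in range(num_steps):
        logits, velocity = model(x, k * dt, prev_logits)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]  # Euler step
        decoded = [argmax(row) for row in logits]
        if decoded == prev_decoded:
            return decoded, k + 1  # assignment is stable: commit
        prev_logits, prev_decoded = logits, decoded
    return prev_decoded, num_steps


def toy_model(x, t, prev_logits, num_classes: int = 3):
    """Hypothetical denoiser: class c has embedding float(c); logits favor the
    nearest class and are nudged by the previous logits when available."""
    logits = []
    for i, xi in enumerate(x):
        row = [-(xi - c) ** 2 for c in range(num_classes)]
        if prev_logits is not None:
            row = [r + 0.5 * p for r, p in zip(row, prev_logits[i])]
        logits.append(row)
    targets = [argmax(row) for row in logits]
    velocity = [tc - xi for tc, xi in zip(targets, x)]  # move toward predicted class
    return logits, velocity


decoded, calls = refine([0.4, 1.6, 1.1], toy_model, num_steps=8)
print(decoded, calls)
```

The early return is what makes the number of forward passes adaptive: easy inputs commit after a couple of calls, while harder ones keep refining up to `num_steps`.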

FlowDPO then trains against the model’s own failed generations. The preferred sample is the gold completion; the dispreferred sample is a self-mined confident wrong completion. The contrast is localized only to wrong cells via

[equation missing]

and the masked denoising score is

[equation missing]

This makes the preference objective suppress the model’s own incorrect attractors rather than globally downweighting entire wrong sequences, including their already-correct parts.
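The localization to wrong cells can be illustrated with a minimal sketch. It assumes the mask is the indicator of cells where the dispreferred completion disagrees with the gold completion, and that the masked score is a mean of per-cell scores over masked cells; both the normalization and the names `wrong_cell_mask` and `masked_mean` are assumptions rather than the paper's definitions, and the per-cell scores are made-up numbers.

```python
from typing import List, Sequence


def wrong_cell_mask(preferred: Sequence[int], dispreferred: Sequence[int]) -> List[int]:
    """1 where the dispreferred completion differs from the gold one, else 0."""
    return [int(p != d) for p, d in zip(preferred, dispreferred)]


def masked_mean(per_cell_scores: Sequence[float], mask: Sequence[int]) -> float:
    """Average per-cell scores over masked cells only (assumed normalization)."""
    total = sum(mask)
    if total == 0:
        return 0.0  # nothing to contrast when the two completions agree
    return sum(s * m for s, m in zip(per_cell_scores, mask)) / total


gold = [5, 3, 4, 6]
wrong = [5, 3, 9, 1]                      # hypothetical confident wrong completion
per_cell_error = [0.25, 0.5, 1.0, 0.5]    # made-up per-cell denoising scores

mask = wrong_cell_mask(gold, wrong)
print(mask, masked_mean(per_cell_error, mask))
```

Cells 0 and 1 are correct in both completions, so they contribute nothing to the contrast; only the two wrong cells are penalized.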

Empirically, the framework changes the behavior of flow models on structured reasoning tasks. A base flow model trained on Sudoku solves only about [value missing] of puzzles when sampled directly. With self-conditioning alone, single-shot Sudoku jumps from approximately [value missing] to [value missing] in the six-seed mean. In the in-distribution round-scaling table, the base flow model goes from [value missing] at 1 round to [value missing] at 64 rounds, while self-conditioning reaches [value missing] at 1 round and [value missing] by 4 rounds. In the compute-matched comparison, FRM reaches [value missing] on Sudoku at 7 forward passes, whereas Adaptive MDM needs 57 forward passes for the same accuracy. With self-verification scaling, solve rate reaches approximately [value missing] on Sudoku-Shah, [value missing] on Zebra, and [value missing] on out-of-distribution Sudoku-Extreme, which is evaluation-only and never seen in training. The paper’s interpretation is that the verifier signal is already strong; the remaining bottleneck is proposer coverage.

4. Language-model variants and adjacent flow-based reasoning frameworks

Several recent methods are FRM-adjacent rather than FRMs in the strict sense. They differ mainly in what flows, what is trained, and whether the flow is the primary reasoning mechanism or an auxiliary control signal.

Paper	Flow object	Relation to FRMs
“SCOUT” (Li et al., 30 May 2025)	Latent cognitive trajectory of latent states	Discrete latent reasoning flow via recursive refinement
“CoT-Flow” (Liu et al., 14 Jan 2026)	Probabilistic flow over CoT prefixes	Dense step-wise progress for decoding and RL
“RLFR” (Zhang et al., 11 Oct 2025)	Latent-space vector field over hidden states	Reward shaping, not a flow-native reasoner
“FlowSteer” (Li et al., 5 Feb 2026)	Velocity field transporting verbose to concise activations	Inference-time control, not an FRM architecture
“Masked Language Flow Models” (Azangulov et al., 26 Jun 2026)	Continuous flow over masked token embeddings	Conditional generation with iterative token commitment
“Proof Flow” (Ho et al., 2024)	GFlowNet probability flow over proof trajectories	Distribution over reasoning paths in formal proving

“SCOUT: Teaching Pre-trained LLMs to Enhance Reasoning via Flow Chain-of-Thought” defines Flow CoT as a cognitive trajectory of latent states, with updates

[equation missing]

Each iteration is supervised by a different-capacity teacher, and the model shows monotonic average progression ([values missing]) across three iterations, for a final gain of [value missing] over SFT on the reported eight-task average (Li et al., 30 May 2025). This is a latent recursive refinement model with per-stage alignment rather than a flow-matching generator.

“Efficient Paths and Dense Rewards: Probabilistic Flow Reasoning for LLMs” defines step-wise Probabilistic Flow Progress

[equation missing]

where the argument is a partial CoT prefix (Liu et al., 14 Jan 2026). The same signal drives flow-guided decoding and verifier-free dense-reward RL. On Qwen3-4B, AIME24 improves from [value missing] under standard CoT to [value missing] under CoT-Flow, while average token consumption is reduced by more than [value missing]. This is best read as a probabilistic flow framework over reasoning trajectories, not a continuous latent-state flow model.

“RLFR” uses a latent-space vector field as an auxiliary reward model: hidden-state transitions are compared against a reference flow environment trained on high-quality trajectories, and the resulting deviation is inserted as dense token-level reward (Zhang et al., 11 Oct 2025). The paper explicitly states that RLFR is not a reasoning model whose inference itself follows a continuous learned flow; the “flow” is environmental and evaluative. “FlowSteer” likewise is an inference-time hidden-state steering method that learns a velocity field transporting verbose reasoning activations to concise ones; it modifies hidden-state geometry and thereby reasoning style, but it is not a full FRM in the stronger sense of end-to-end flow-native reasoning (Li et al., 5 Feb 2026).

“Masked Language Flow Models” occupies an intermediate position. It introduces a Brownian-bridge interpolant for masked positions,

[equation missing]

while unmasked positions remain fixed. Its online token promotion rule converts sufficiently confident latent positions into clean context during sampling: [rule missing]. The result is a hybrid of continuous denoising and discrete commitment, motivated explicitly by multi-step reasoning (Azangulov et al., 26 Jun 2026). “Proof Flow,” in turn, uses GFlowNets to train a theorem prover to sample proof trajectories proportional to reward rather than collapsing onto a single mode, making it a trajectory-distribution precursor to FRM-style reasoning over multiple valid paths (Ho et al., 2024).
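The idea of promoting confident positions into clean context can be sketched as a thresholding step. The criterion used here (top token probability at or above a fixed `threshold`) is an assumption chosen for illustration; the paper's actual promotion rule is not reproduced in this article. The function name `promote_confident` and the example probabilities are hypothetical.

```python
from typing import Dict, List


def promote_confident(
    probs: List[List[float]],
    committed: Dict[int, int],
    threshold: float,
) -> Dict[int, int]:
    """Return an updated position -> token map.

    Positions already committed stay fixed; any other position whose most
    likely token reaches the threshold is promoted to clean context.
    """
    updated = dict(committed)
    for pos, row in enumerate(probs):
        if pos in updated:
            continue  # already clean context
        best_token = max(range(len(row)), key=row.__getitem__)
        if row[best_token] >= threshold:
            updated[pos] = best_token
    return updated


# Two hypothetical sampling steps over three masked positions, two tokens each.
step1 = promote_confident([[0.9, 0.1], [0.6, 0.4], [0.05, 0.95]], {}, threshold=0.9)
step2 = promote_confident([[0.5, 0.5], [0.08, 0.92], [0.5, 0.5]], step1, threshold=0.9)
print(step1, step2)
```

Positions 0 and 2 are committed in the first step and are unaffected by later, less confident predictions; position 1 is committed only once the model becomes confident, which is the sense in which continuous denoising and discrete commitment interleave.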

5. Structured-state, graph, concept, and spatial variants

Outside the narrow language-model setting, several systems instantiate FRM-like principles by making intermediate reasoning structure explicit. “ReasoningFlow” parses long reasoning traces into left-to-right labeled DAGs with 8 node classes and 14 edge labels, turning planning, inference, verification, correction, and uncertainty into queryable subgraph motifs (Lee et al., 3 Jun 2025). On its manually annotated sample of 30 traces from QwQ-32B-Preview, average response length is [value missing] tokens and average graph size is [value missing] nodes. The framework does not train a graph-generating FRM, but it externalizes reasoning flow as an explicit semantic object.
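A labeled trace DAG with motif queries can be sketched with a small data structure. The node labels and edge labels below are hypothetical placeholders, not the 8 node classes and 14 edge labels of ReasoningFlow, and the direction convention (edges point from an earlier step to a later one) is an assumption. Requiring `src < dst` is one simple way to guarantee acyclicity in a left-to-right trace.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TraceGraph:
    node_labels: List[str] = field(default_factory=list)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_node(self, label: str) -> int:
        """Append a step to the trace and return its index."""
        self.node_labels.append(label)
        return len(self.node_labels) - 1

    def add_edge(self, src: int, dst: int, label: str) -> None:
        """Add a labeled edge; src must precede dst, so no cycle can form."""
        if not (0 <= src < dst < len(self.node_labels)):
            raise ValueError("edges must point from an earlier node to a later node")
        self.edges.append((src, dst, label))

    def find_motif(self, src_label: str, edge_label: str, dst_label: str) -> List[Tuple[int, int]]:
        """Return all (src, dst) pairs matching a one-edge labeled pattern."""
        return [
            (s, d)
            for s, d, e in self.edges
            if e == edge_label
            and self.node_labels[s] == src_label
            and self.node_labels[d] == dst_label
        ]


g = TraceGraph()
plan = g.add_node("planning")
step = g.add_node("inference")
check = g.add_node("verification")
fix = g.add_node("correction")
g.add_edge(plan, step, "executes")
g.add_edge(step, check, "verifies")
g.add_edge(check, fix, "corrects")

print(g.find_motif("inference", "verifies", "verification"))
```

Once a trace is stored this way, questions such as "how often is an inference step followed by a verification step" become simple pattern queries over the edge list.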

“Concept Flow Models” are not language reasoners, but they provide an important architectural analogy (Wang et al., 17 Jun 2026). A prediction is a probabilistic traversal through a hierarchy, where each internal node uses a localized subset of concepts to route to children: [routing equation missing] and leaf probability is

[equation missing]

The paper’s central claim is that hierarchical concept bottlenecks reduce information leakage while yielding stepwise decision flows. This is not FRM in the language-model sense, but it is a concrete example of semantically scoped reasoning flow through interpretable intermediate states.
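Probabilistic traversal through a hierarchy can be sketched by multiplying routing probabilities along each root-to-leaf path. This product form is the standard construction for hierarchical routing and is assumed here; it is not claimed to be the paper's exact expression. The routing probabilities are fixed made-up numbers, whereas in Concept Flow Models they are computed from each node's local concept subset, and the tree itself is hypothetical.

```python
from typing import Dict

Tree = Dict[str, Dict[str, float]]


def leaf_probabilities(tree: Tree, root: str) -> Dict[str, float]:
    """Leaf probability = product of routing probabilities along its path."""
    out: Dict[str, float] = {}
    stack = [(root, 1.0)]
    while stack:
        node, path_prob = stack.pop()
        children = tree.get(node)
        if not children:
            out[node] = path_prob  # no children: this node is a leaf
            continue
        for child, routing_prob in children.items():
            stack.append((child, path_prob * routing_prob))
    return out


# Hypothetical two-level hierarchy; each node's routing probabilities sum to 1.
tree: Tree = {
    "root": {"animal": 0.75, "vehicle": 0.25},
    "animal": {"cat": 0.5, "dog": 0.5},
}

leaf_probs = leaf_probabilities(tree, "root")
print(leaf_probs)
```

Each prediction can then be explained as a sequence of local routing decisions, and because every node's outgoing probabilities are normalized, the leaf probabilities sum to 1.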

“Spatial Reasoning with Denoising Models” addresses continuous variables rather than discrete tokens and is especially relevant to FRMs that use flow or denoising for conditional inference (Wewer et al., 28 Feb 2025). Its core state is a set of variables at individual denoising times,

[equation missing]

with [condition missing]. The paper reports that naive fully parallel denoising can hallucinate badly on structured reasoning tasks, while sequentialization and learned generation order matter greatly. On hard MNIST Sudoku, accuracy rises from below [value missing] to above [value missing], with the strongest reported jump from the diffusion baseline ([value missing]) to SRM with predicted order and no overlap ([value missing]). This suggests that, for FRMs over structured variables, the flow schedule itself is part of the reasoning algorithm.

6. Scope, misconceptions, and open questions

A common misconception is that any use of “flow” in reasoning automatically defines an FRM. The literature is more differentiated. RLFR explicitly states that it is not a reasoning model whose inference itself follows a continuous learned flow; it is a reward-shaping method over hidden-state trajectories (Zhang et al., 11 Oct 2025). FlowSteer explicitly presents itself as an inference-time hidden-state steering mechanism rather than a flow-native reasoning architecture (Li et al., 5 Feb 2026). SCOUT, CoT-Flow, and Proof Flow are closer in spirit, but each instantiates a different notion of flow: recursive latent evolution, probabilistic progress over CoT prefixes, or reward-proportional distributions over proof trajectories.

A second misconception is that explicit flow always improves predictive performance. The graph-attention precursor reports a more nuanced picture: explicit attention flow often improves both interpretation and prediction, especially on larger graphs and location- or history-dependent trajectories, but standard black-box GNNs can outperform explicit-flow variants on some time-dependent sine settings (Xu et al., 2018). The broader implication is that explicit flow is most valuable when the task genuinely depends on latent path structure and when the architecture uses the flow state as an active controller rather than as readout-only explanation.

A third misconception is that current FRMs solve general reasoning. The strongest fixed-point FRM results are confined to structured tasks; the paper itself states that the method applies cleanly where correctness corresponds to a checkable stable state, while open-ended generation and tasks without clear fixed points are untested (Helbling et al., 28 Jun 2026). Adjacent papers point to open directions rather than finished solutions: multi-channel or typed flows beyond a single scalar attention distribution (Xu et al., 2018), dynamic iteration control and adaptive teacher selection (Li et al., 30 May 2025), multi-layer or full-trajectory reasoning control (Li et al., 5 Feb 2026), off-policy extensions for dense flow-based RL (Liu et al., 14 Jan 2026), and automatic graph extraction plus graph-conditioned evaluation for semantic trace DAGs (Lee et al., 3 Jun 2025). This suggests that FRMs are best understood not as a settled model family but as an emerging research program centered on explicit reasoning trajectories, state-dependent transport, and the coupling of path structure with inference control.