SoFlow: Direct Generative ODE Modeling
- SoFlow is a generative modeling framework that directly learns the closed-form solution of the velocity ODE underlying diffusion-based models, enabling efficient one- or few-step data generation.
- It utilizes a novel parameterization with complementary Flow Matching and Solution Consistency losses and leverages Diffusion Transformers in VAE latent space for superior performance.
- The method significantly reduces GPU memory usage and training time while achieving competitive FID scores compared to traditional multi-step diffusion and GAN approaches.
Solution Flow Models (SoFlow) constitute a generative modeling framework that directly learns the closed-form solution of the velocity ordinary differential equation (ODE) underlying diffusion-based models, enabling efficient one-step or few-step data sample generation. By explicitly modeling the mapping from a latent prior to data in a single network pass, SoFlow overcomes the inefficiency of traditional multi-step denoising approaches. The approach is characterized by a novel parameterization of the generative ODE’s solution, a pair of complementary loss functions—Flow Matching and Solution Consistency—and an architecture leveraging Diffusion Transformers (DiT) in VAE latent space, achieving state-of-the-art performance among one-step generative models on ImageNet 256×256 (Luo et al., 17 Dec 2025).
1. Mathematical Structure of the SoFlow Framework
SoFlow starts from a continuous interpolant (“noising process”) bridging data $x \sim p_{\text{data}}$ and a tractable prior $\epsilon \sim \mathcal{N}(0, I)$:

$$x_t = a_t\,x + b_t\,\epsilon, \qquad a_0 = 1,\; b_0 = 0,\; a_1 = 0,\; b_1 = 1,$$

so that $x_0$ is a data sample and $x_1$ is pure noise. This yields a marginal velocity field

$$v(x_t, t) = \mathbb{E}\big[\dot a_t\,x + \dot b_t\,\epsilon \,\big|\, x_t\big],$$

defining the generative ODE

$$\frac{dx_t}{dt} = v(x_t, t),$$

to be solved backward in time. Rather than numerically integrating this ODE, SoFlow directly learns its solution function $f_\theta(x_t, t, s)$, satisfying

$$f_\theta(x_t, t, s) = x_t + \int_t^{s} v(x_\tau, \tau)\,d\tau = x_s$$

for $0 \le s \le t \le 1$. Thus $f_\theta$ instantaneously maps any $(x_t, t)$ to $x_s$ in closed form, fundamentally distinguishing SoFlow from velocity-based diffusion and flow-matching models (Luo et al., 17 Dec 2025).
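As a concrete illustration of what the solution function encodes, consider the linear schedule used in the experiments (Section 3) along a single conditional path; this is a motivating special case under standard flow-matching conventions, not the general marginal construction:

$$x_t = (1 - t)\,x + t\,\epsilon, \qquad \dot a_t\,x + \dot b_t\,\epsilon = \epsilon - x, \qquad x_s = x_t + (s - t)\,(\epsilon - x).$$

Here the conditional path is a straight line and its exact solution map is a single Euler step of length $s - t$, which motivates the Euler-style parameterizations $f_\theta(x_t, t, s) = x_t + (s - t)\,F_\theta(x_t, t, s)$ discussed below.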
2. Loss Function Design: Flow Matching and Solution Consistency
The parametric solution map is

$$f_\theta(x_t, t, s) = c_{\text{skip}}(t, s)\,x_t + c_{\text{out}}(t, s)\,F_\theta(x_t, t, s),$$

where $c_{\text{skip}}$ and $c_{\text{out}}$ are known coefficient functions (e.g., Euler or trigonometric parameterizations), $F_\theta$ is a neural network, and $c_{\text{skip}}(t, t) = 1$, $c_{\text{out}}(t, t) = 0$, so that $f_\theta(x_t, t, t) = x_t$ by construction. The Flow Matching loss anchors the network’s instantaneous velocity:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{x,\,\epsilon,\,t}\Big[w(t)\,\big\|v_\theta(x_t, t) - (\dot a_t\,x + \dot b_t\,\epsilon)\big\|_2^2\Big],$$

with $w(t)$ an adaptive weight and $t$ sampled logit-normal. The correct instantaneous velocity is enforced via analytic differentiation of the solution map at $s = t$,

$$v_\theta(x_t, t) = \partial_s f_\theta(x_t, t, s)\big|_{s=t},$$

which can be evaluated in closed form under the parameterization above.
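A minimal PyTorch-style sketch of this construction, assuming the Euler instance $f_\theta(x_t, t, s) = x_t + (s - t)\,F_\theta(x_t, t, s)$ and the linear schedule (function and variable names are illustrative, not taken from the paper):

```python
import torch

def solution_map(net, x_t, t, s):
    """Euler-parameterized solution map f(x_t, t, s) = x_t + (s - t) * F(x_t, t, s).

    The boundary condition f(x_t, t, t) = x_t holds by construction.
    `net` is any module taking (x, t, s) and returning a tensor shaped like x.
    """
    return x_t + (s - t).view(-1, 1, 1, 1) * net(x_t, t, s)

def flow_matching_loss(net, x, eps, t, w=None):
    """Flow Matching loss anchoring the instantaneous velocity at s = t.

    Under the linear schedule, the conditional velocity target is eps - x, and
    d/ds f(x_t, t, s) at s = t equals F(x_t, t, t), so no JVP is needed.
    """
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x + t_ * eps           # linear interpolant x_t = (1 - t) x + t eps
    v_target = eps - x                        # conditional velocity for the linear schedule
    v_pred = net(x_t, t, t)                   # analytic velocity of the solution map at s = t
    err = (v_pred - v_target) ** 2
    if w is not None:                         # optional adaptive per-sample weight w(t)
        err = w.view(-1, 1, 1, 1) * err
    return err.mean()
```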
Classifier-Free Guidance (CFG) is incorporated by interpolating class-conditional and unconditional velocity estimates during loss computation.
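The standard CFG combination of velocity estimates (a generic formula; the precise point in the loss at which SoFlow applies it follows the paper) is

$$v_{\text{cfg}}(x_t, t, c) = v_\theta(x_t, t, \varnothing) + \omega\,\big(v_\theta(x_t, t, c) - v_\theta(x_t, t, \varnothing)\big),$$

where $c$ is the class label, $\varnothing$ the null (unconditional) label, and $\omega$ the guidance scale ($\omega = 1$ recovers the purely conditional velocity).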
To guarantee correct solution mapping over finite intervals, SoFlow introduces a Solution Consistency loss:

$$\mathcal{L}_{\text{SC}} = \mathbb{E}_{x,\,\epsilon,\,s \le r \le t}\Big[\bar w\,\big\|f_\theta(x_t, t, s) - f_{\theta^-}\big(f_{\theta^-}(x_t, t, r),\, r,\, s\big)\big\|_2^2\Big],$$

where $f_{\theta^-}$ is a stopped-gradient network copy and $\bar w$ is an adaptive weight. The final objective is a weighted sum

$$\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda\,\mathcal{L}_{\text{SC}},$$

with $\lambda$ a scalar balancing coefficient. Notably, $\mathcal{L}_{\text{SC}}$ involves no Jacobian-vector products, yielding superior training efficiency compared to flow-anchored objectives (Luo et al., 17 Dec 2025).
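A sketch of the consistency term under the same assumptions, reusing `solution_map` from the previous snippet; how $r$ is drawn and how the stopped-gradient copy is maintained (e.g., via EMA) follows the paper and is only approximated here with `torch.no_grad()`:

```python
import torch

def solution_consistency_loss(net, x_t, t, r, s, w_bar=None):
    """Solution Consistency loss sketch (no Jacobian-vector products).

    Enforces the semigroup property of the solution map: mapping t -> s directly
    should agree with mapping t -> r -> s. The two-hop target is computed under
    torch.no_grad() as a stand-in for the stopped-gradient network copy.
    """
    pred = solution_map(net, x_t, t, s)        # gradients flow through this branch only
    with torch.no_grad():
        x_r = solution_map(net, x_t, t, r)     # intermediate state at time r
        target = solution_map(net, x_r, r, s)  # two-hop target at time s
    err = (pred - target) ** 2
    if w_bar is not None:                      # optional adaptive weight
        err = w_bar.view(-1, 1, 1, 1) * err
    return err.mean()

# Total objective (lambda_sc is a balancing hyperparameter, value per the paper):
# loss = flow_matching_loss(...) + lambda_sc * solution_consistency_loss(...)
```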
3. Model Architecture and Implementation Protocols
SoFlow adopts the Diffusion Transformer (DiT) backbone, operating in VAE latent space (32×32×4) for ImageNet 256×256 generation. Model variants include B/2 (131M), M/2 (308M), L/2 (459M), and XL/2 (676M) parameters, all with 2×2 patches. Training from scratch uses a batch size of 256 for 240 epochs with the AdamW optimizer (betas = (0.9, 0.99), constant learning rate), no weight decay or learning-rate decay, and an EMA decay of 0.9999; a brief configuration sketch follows.
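A minimal sketch of the implied token geometry and optimizer configuration (the learning rate below is a placeholder, not the paper's value, and the linear layer is only a stand-in for the DiT backbone):

```python
import torch

# Token geometry implied by the sizes above: 32x32x4 latents with 2x2 patches give
# (32 // 2) ** 2 = 256 tokens, each flattening 2 * 2 * 4 = 16 latent values.
latent_hw, latent_c, patch = 32, 4, 2
num_tokens = (latent_hw // patch) ** 2        # 256
patch_dim = patch * patch * latent_c          # 16

# Optimizer configuration as described above; the learning rate is a placeholder.
backbone = torch.nn.Linear(patch_dim, patch_dim)      # stand-in for the DiT backbone
optimizer = torch.optim.AdamW(backbone.parameters(),
                              lr=1e-4,                # placeholder, not the paper's value
                              betas=(0.9, 0.99),
                              weight_decay=0.0)
ema_decay = 0.9999
```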
Hyperparameters:
- Time sampling: logit-normal distributions for both losses, with separate location/scale parameters for $\mathcal{L}_{\text{FM}}$ and $\mathcal{L}_{\text{SC}}$ (see the sampling sketch after this list).
- Noising schedule: linear ($a_t = 1 - t$, $b_t = t$) with the Euler parameterization.
- CFG strength and the velocity-mix coefficient are tuned per model size; the guidance strength decays from 2.5/2.0 to 1.0 for the larger settings.
- CIFAR-10 experiments use a U-Net backbone, RAdam, batch size 1024, 800K steps, analogous settings (Luo et al., 17 Dec 2025).
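The logit-normal time sampling referenced above can be sketched as follows (the location/scale values are placeholders; the paper specifies separate settings per loss):

```python
import torch

def sample_logit_normal(batch, mu=0.0, sigma=1.0):
    """Draw t in (0, 1) from a logit-normal distribution: t = sigmoid(mu + sigma * z).

    mu and sigma are placeholders; the paper specifies separate values for the
    Flow Matching and Solution Consistency losses.
    """
    return torch.sigmoid(mu + sigma * torch.randn(batch))

# Example: times for the FM loss, and (t, s) pairs with 0 <= s <= t for the SC loss.
t_fm = sample_logit_normal(256)
t_sc = sample_logit_normal(256)
s_sc = t_sc * torch.rand(256)                 # illustrative choice of the target time s
```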
4. Empirical Performance and Benchmarks
On ImageNet 256×256, SoFlow sets new FID-50K standards among one-step generative models across all tested DiT model scales:
| Method | Params | 1–NFE FID |
|---|---|---|
| MeanFlow B/2 | 131 M | 6.17 |
| SoFlow B/2 | 131 M | 4.85 |
| MeanFlow M/2 | 308 M | 5.01 |
| SoFlow M/2 | 308 M | 3.73 |
| MeanFlow L/2 | 459 M | 3.84 |
| SoFlow L/2 | 459 M | 3.20 |
| MeanFlow XL/2 | 676 M | 3.43 |
| SoFlow XL/2 | 676 M | 2.96 |
For two function evaluations, SoFlow XL/2 achieves 2.66 FID (vs. 2.93 for MeanFlow XL/2). These results are competitive with multi-step diffusion, autoregressive, and GAN methods at comparable or lower NFE, with SoFlow’s performance realized at significantly reduced inference cost. SoFlow consistently outperforms MeanFlow, the previous strongest one-step baseline (Luo et al., 17 Dec 2025).
5. Sampling, Inference Efficiency, and Practical Implications
SoFlow’s generative sampling proceeds as:
- Sample $x_1 \sim \mathcal{N}(0, I)$;
- Produce $\hat{x}_0 = f_\theta(x_1, 1, 0)$ via a single forward pass.
Optional few-step sampling is possible by perturbing intermediate estimates and recursively invoking the solution map (a minimal sketch is given at the end of this section). The Solution Consistency loss eliminates JVPs, reducing peak GPU memory by 31% and enabling 23% faster training than MeanFlow on H100 GPUs. SoFlow inherits the computational efficiency of state-of-the-art attention kernels used in the DiT backbone (Luo et al., 17 Dec 2025).
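A sketch of one-step and few-step sampling under the conventions used above, reusing `solution_map` from Section 2; the few-step schedule and the re-noising rule are illustrative assumptions, not the paper's exact recipe:

```python
import torch

@torch.no_grad()
def sample_one_step(net, shape):
    """One-step generation: map pure noise at t = 1 directly to a sample at t = 0."""
    x_1 = torch.randn(shape)
    t = torch.ones(shape[0])
    s = torch.zeros(shape[0])
    return solution_map(net, x_1, t, s)

@torch.no_grad()
def sample_few_step(net, shape, times=(1.0, 0.5)):
    """Few-step generation: jump to a clean estimate, then re-noise and repeat.

    The time schedule and the re-noising rule (linear interpolant) are
    illustrative assumptions.
    """
    x = torch.randn(shape)
    for i, t_cur in enumerate(times):
        t = torch.full((shape[0],), t_cur)
        s = torch.zeros(shape[0])
        x0_hat = solution_map(net, x, t, s)            # clean estimate at t = 0
        if i + 1 < len(times):
            t_next = times[i + 1]
            # Re-noise the estimate to time t_next via the linear interpolant (assumption).
            x = (1.0 - t_next) * x0_hat + t_next * torch.randn(shape)
        else:
            x = x0_hat
    return x
```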
6. Limitations, Extensions, and Forward-Looking Directions
SoFlow is currently best suited to scenarios demanding minimal NFE; at ultra-low NFE budgets a quality tradeoff remains relative to deep multi-step diffusion models. Several extension directions are highlighted:
- Optimization of noising/interpolant schedules.
- Improved weighting schemes or variance reduction within the loss.
- Application to text-to-image, video, or hybrid few-step regimes.
- Empirical exploration of higher NFE hybrids (2–4 steps) for further FID improvements.
SoFlow’s formulation offers a platform for rapid progress in efficient generative modeling, with a unified framework that supports precise velocity field learning, CFG integration, and closed-form mapping from prior to data (Luo et al., 17 Dec 2025).
For the distinct SoFlow (“Semi-dilute Flow”) model relevant to polymer coil compression in Couette flow, see Dunstan (Dunstan, 2014); that framework addresses coil compression, not generative learning.