Flow-based Transformers (SiTs)
- The paper introduces SiT, a unifying framework that leverages continuous-time flow and score matching to significantly improve FID metrics compared to previous latent diffusion models.
- SiTs employ both velocity and score matching using a DiT backbone, enabling efficient deterministic (ODE) and stochastic (SDE) sampling within a modular interpolation setting.
- SiTs consistently outperform DiT baselines on ImageNet benchmarks by achieving lower FID scores while maintaining architectural consistency and scalability.
Flow-based Transformers, and specifically the Scalable Interpolant Transformers (SiTs), represent a unifying framework that merges continuous-time flow-based generative modeling with the Transformer architecture. SiTs enable both flow-matching (probability flow ODE) and score-based diffusion paradigms within a single, modular setting. They expand the generative capacity and scalability of Transformer-based models for high-fidelity and efficient large-scale synthesis. The SiT framework is most extensively detailed in "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" (Ma et al., 2024), which systematically analyzes and outperforms prior latent diffusion Transformer models.
1. Foundations: Scalable Interpolant Transformers (SiTs) and Flow-based Modeling
A SiT connects the data distribution and a latent noise distribution through a parametrized stochastic interpolant:

$$x_t = \alpha_t x_* + \sigma_t \varepsilon, \qquad x_* \sim p_{\mathrm{data}}, \quad \varepsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1],$$

where $\alpha_t, \sigma_t$ are smooth, valid schedule functions with $\alpha_0 = 1,\ \sigma_0 = 0$ (at data), $\alpha_1 = 0,\ \sigma_1 = 1$ (at pure noise), and $\alpha_t^2 + \sigma_t^2 > 0$ throughout $t \in [0, 1]$.
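As a concrete illustration, here is a minimal NumPy sketch of the interpolant under the linear schedule (assuming $\alpha_t = 1 - t$, $\sigma_t = t$; toy vectors, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha(t):  # data coefficient: 1 at t=0 (data), 0 at t=1 (noise)
    return 1.0 - t

def sigma(t):  # noise coefficient: 0 at t=0, 1 at t=1
    return t

def interpolate(x_star, eps, t):
    """Stochastic interpolant x_t = alpha_t * x_* + sigma_t * eps."""
    return alpha(t) * x_star + sigma(t) * eps

x_star = rng.standard_normal(4)  # a toy "data" sample
eps = rng.standard_normal(4)     # Gaussian noise

# Endpoint checks: t=0 recovers the data, t=1 recovers pure noise.
assert np.allclose(interpolate(x_star, eps, 0.0), x_star)
assert np.allclose(interpolate(x_star, eps, 1.0), eps)
```

The same two endpoint identities hold for any valid schedule pair, which is what makes the interpolant a bridge between the two distributions.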
This generalizes classic denoising diffusion probabilistic models (DDPM), enabling both ODE-based (flow) and SDE-based (diffusion) perspectives under a single formalism. The time-$t$ marginal $p_t(x)$ is the law of $x_t$ under this interpolation, and its time dynamics can be described equivalently by:
- A deterministic probability-flow ODE: $\dot{x}_t = v(x_t, t)$, where $v(x, t) = \mathbb{E}[\dot\alpha_t x_* + \dot\sigma_t \varepsilon \mid x_t = x]$
- A reverse-time SDE: $dx_t = v(x_t, t)\,dt - \tfrac{1}{2} w(t)\, s(x_t, t)\,dt + \sqrt{w(t)}\, d\bar{W}_t$, for a chosen diffusion coefficient $w(t)$, where $s(x, t) = \nabla_x \log p_t(x)$ is the score
The SiT framework exposes flexibility in how data and noise are connected, sampling is performed, and which target field (velocity or score) is learned (Ma et al., 2024).
2. Training Objectives: Velocity and Score Matching under Continuous Interpolants
SiT models are trained to regress one of two target fields at each $(x_t, t)$:
- Velocity matching:

$$\mathcal{L}_v(\theta) = \mathbb{E}_{x_*, \varepsilon, t}\left\| v_\theta(x_t, t) - \left(\dot\alpha_t x_* + \dot\sigma_t \varepsilon\right) \right\|^2,$$

where $x_*$ and $\varepsilon$ are the known interpolation endpoints and $\dot\alpha_t, \dot\sigma_t$ are the time derivatives of the schedule.
- Score matching:

$$\mathcal{L}_s(\theta) = \mathbb{E}_{x_*, \varepsilon, t}\left\| \sigma_t\, s_\theta(x_t, t) + \varepsilon \right\|^2,$$

with $s(x, t) = \nabla_x \log p_t(x) = -\mathbb{E}[\varepsilon \mid x_t = x] / \sigma_t$. Weighted score and velocity criteria can further appear, depending on interpolant properties and training choices.
Transformers are used to parameterize either the velocity $v_\theta(x, t)$ or the score $s_\theta(x, t)$; in practice, velocity prediction with a linear interpolant and velocity-based SDE sampling gives the best FID scores (Ma et al., 2024).
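The velocity-matching objective can be sketched as follows; the `oracle` model here is a hypothetical stand-in that returns the exact conditional velocity (a trained $v_\theta$ approximates its conditional expectation), not part of any SiT codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear interpolant alpha_t = 1 - t, sigma_t = t, and its time derivatives.
alpha,  sigma  = (lambda t: 1.0 - t), (lambda t: t)
dalpha, dsigma = (lambda t: -1.0),    (lambda t: 1.0)

def velocity_matching_loss(model, x_star, eps, t):
    """L_v = || v_theta(x_t, t) - (alpha'_t x_* + sigma'_t eps) ||^2 (one sample)."""
    x_t = alpha(t) * x_star + sigma(t) * eps
    target = dalpha(t) * x_star + dsigma(t) * eps  # conditional velocity
    return float(np.mean((model(x_t, t) - target) ** 2))

x_star = rng.standard_normal(8)  # toy "data" sample
eps = rng.standard_normal(8)     # Gaussian noise
t = 0.3

# A hypothetical oracle returning the exact conditional velocity drives
# the loss to zero; any other predictor incurs a positive penalty.
oracle = lambda x_t, tt: dalpha(tt) * x_star + dsigma(tt) * eps
assert velocity_matching_loss(oracle, x_star, eps, t) < 1e-12
```

In training, the per-sample loss above is averaged over minibatches of $(x_*, \varepsilon, t)$ draws.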
3. Transformer Architecture: DiT Backbone and Conditional Injection
SiTs reuse the DiT (Diffusion Transformer) backbone (Peebles & Xie, 2023):
- Vision Transformer operating on VAE-latent space; VAE is fixed during generative modeling.
- Patch-wise latent representation (e.g., 2×2 latent patches for a given image resolution).
- Class-conditional transformer blocks with AdaLN-Zero used for both class and diffusion time injection.
- Model scale: the S, B, L, and XL variants match DiT in layer count, hidden dimension, and GFLOPs, ensuring controlled ablations and fair comparison.
All architectural and computational aspects (parameters, FLOPs, latent encoding) are kept identical between DiT and SiT. Only the interpolant and learned target fields differ (Ma et al., 2024).
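A rough NumPy sketch of the AdaLN-Zero mechanism follows; the random weights are stand-ins (real DiT blocks use trained MLPs producing per-channel shift/scale/gate for both the attention and MLP branches), but it shows the key property that the gate's zero initialization makes each residual branch start as the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden width (illustrative)

# Conditioning (time + class embedding) is mapped to shift/scale/gate.
W = rng.standard_normal((d, 3 * d)) * 0.02  # random stand-in, not trained
gate_init = np.zeros(d)                     # "-Zero": gate starts at 0

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-6)

def adaln_zero_block(x, c, branch):
    """One residual block modulated by conditioning vector c."""
    shift, scale, _ = np.split(c @ W, 3)
    gate = gate_init                     # zero at initialization
    h = layer_norm(x) * (1 + scale) + shift
    return x + gate * branch(h)          # gated residual branch

x = rng.standard_normal(d)  # token features
c = rng.standard_normal(d)  # time + class conditioning
out = adaln_zero_block(x, c, branch=np.tanh)
assert np.allclose(out, x)  # identity at init, as intended
```

This identity-at-initialization property is what lets deep stacks of conditioned blocks train stably.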
4. Inference: ODE and SDE Samplers under Decoupled Diffusion Schedules
Inference with SiT allows for both deterministic and stochastic sampling:
- ODE-based (probability flow): solve $\dot{x}_t = v_\theta(x_t, t)$ with Heun's method, starting from noise at $t = 1$ and integrating backward to $t = 0$. Pseudocode follows the standard two-stage Heun update.
- SDE-based: employ Euler–Maruyama with an optional tunable diffusion coefficient $w(t)$:
  - Drift: $v_\theta(x_t, t) - \tfrac{1}{2} w(t)\, s_\theta(x_t, t)$
  - Sample: $x_{t - \Delta t} = x_t - \Delta t \left[ v_\theta(x_t, t) - \tfrac{1}{2} w(t)\, s_\theta(x_t, t) \right] + \sqrt{w(t)\, \Delta t}\; z, \quad z \sim \mathcal{N}(0, I)$
  - $w(t)$ can be post-hoc tuned (e.g., $w(t) = \sigma_t$ cancels the score singularity at $t = 0$ for the linear interpolant, empirically yielding the lowest FID for SiT-XL).
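Both samplers can be sketched in a toy 1-D Gaussian setting where the exact velocity and score fields are known in closed form and stand in for a trained SiT (assumptions: linear interpolant, standard-normal "data", $w(t) = \sigma_t = t$; this is an illustration of the update rules, not the paper's sampler code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: data x_* ~ N(0,1), linear interpolant x_t = (1-t)x_* + t*eps,
# so p_t = N(0, c_t) with c_t = (1-t)^2 + t^2, giving closed-form fields.
c_t = lambda t: (1 - t) ** 2 + t ** 2
v   = lambda x, t: (2 * t - 1) * x / c_t(t)  # exact velocity field
s   = lambda x, t: -x / c_t(t)               # exact score field
w   = lambda t: t                            # diffusion coefficient w(t) = sigma_t

def sample_ode_heun(n, steps=100):
    """Probability-flow ODE, integrated from t=1 (noise) to t=0 with Heun."""
    x, dt = rng.standard_normal(n), -1.0 / steps
    for i in range(steps):
        t = 1.0 + i * dt
        k1 = v(x, t)                 # slope at current point
        k2 = v(x + dt * k1, t + dt)  # slope at Euler-predicted point
        x = x + 0.5 * dt * (k1 + k2)
    return x

def sample_sde_em(n, steps=400):
    """Reverse SDE dx = [v - w/2 * s] dt + sqrt(w) dW, via Euler-Maruyama."""
    x, dt = rng.standard_normal(n), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        drift = v(x, t) - 0.5 * w(t) * s(x, t)
        x = x - dt * drift + np.sqrt(w(t) * dt) * rng.standard_normal(n)
    return x

# Both samplers should recover the data law N(0, 1) at t = 0.
for sampler in (sample_ode_heun, sample_sde_em):
    xs = sampler(50_000)
    assert abs(xs.mean()) < 0.05 and abs(xs.std() - 1.0) < 0.05
```

Note that $w(t) = t$ vanishes at $t = 0$, so the stochastic term switches off exactly where the score would otherwise blow up.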
Classifier-free guidance is incorporated directly in the velocity prediction of SiT:

$$v_\theta^\zeta(x, t; c) = v_\theta(x, t; \varnothing) + \zeta \left[ v_\theta(x, t; c) - v_\theta(x, t; \varnothing) \right],$$

which corresponds to tempering the conditional versus unconditional densities.
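A minimal sketch of the guided-velocity combination (the names `cfg_velocity` and `zeta` are illustrative, not from the SiT codebase):

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, zeta):
    """Guided velocity: v_uncond + zeta * (v_cond - v_uncond).
    zeta = 1 recovers the conditional model; zeta > 1 sharpens guidance."""
    return v_uncond + zeta * (v_cond - v_uncond)

v_c = np.array([1.0, 2.0])  # conditional velocity prediction
v_u = np.array([0.5, 1.0])  # unconditional velocity prediction
assert np.allclose(cfg_velocity(v_c, v_u, 1.0), v_c)  # zeta=1: conditional
assert np.allclose(cfg_velocity(v_c, v_u, 0.0), v_u)  # zeta=0: unconditional
```

At each sampler step, both predictions come from the same network (the unconditional one via a null class token), so guidance doubles the forward passes per step.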
5. Effects of Interpolant Choice and Diffusion Coefficient
The interpolant schedules $\alpha_t, \sigma_t$ and the choice of diffusion coefficient $w(t)$ are significant SiT hyperparameters:
- The linear interpolant ($\alpha_t = 1 - t$, $\sigma_t = t$) shortens the transport path length versus the standard variance-preserving (VP) interpolant, simplifying learning and improving FID.
- Score-based and velocity-based training are deterministically linked via $s(x, t) = \dfrac{\dot\alpha_t x - \alpha_t v(x, t)}{\sigma_t \left( \alpha_t \dot\sigma_t - \dot\alpha_t \sigma_t \right)}$; either predictor suffices.
- Decoupling the sampling SDE's diffusion coefficient $w(t)$ from the forward process allows empirical tuning for the lowest FID.
Empirical results indicate SDE samplers consistently achieve lower FID than ODE samplers across all interpolants. The optimal $w(t)$ depends on the learned target and interpolant; e.g., $w(t) = \sigma_t$ minimizes FID for the linear interpolant with velocity matching.
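The shorter-path claim can be checked numerically. The sketch below Monte-Carlo-estimates the expected conditional path length $\mathbb{E}\int_0^1 \|\dot\alpha_t x_* + \dot\sigma_t \varepsilon\|\,dt$ for the linear schedule versus a trigonometric VP-style schedule ($\alpha_t = \cos(\pi t/2)$, $\sigma_t = \sin(\pi t/2)$; an assumption standing in for the paper's (G)VP interpolant):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 4000                    # dimension and Monte-Carlo sample count
ts = np.linspace(0.0, 1.0, 201)    # uniform time grid on [0, 1]

def mean_path_length(dalpha, dsigma):
    """Estimate E[ integral_0^1 ||alpha'_t x_* + sigma'_t eps|| dt ]."""
    x_star = rng.standard_normal((n, d))
    eps = rng.standard_normal((n, d))
    speed = np.stack([np.linalg.norm(dalpha(t) * x_star + dsigma(t) * eps,
                                     axis=1) for t in ts])
    # Uniform-grid average over t approximates the integral over [0, 1].
    return float(speed.mean(axis=0).mean())

# Linear: alpha_t = 1 - t, sigma_t = t  ->  constant conditional velocity.
linear = mean_path_length(lambda t: -1.0, lambda t: 1.0)
# Trigonometric VP-style: alpha_t = cos(pi t/2), sigma_t = sin(pi t/2).
vp = mean_path_length(lambda t: -np.pi / 2 * np.sin(np.pi * t / 2),
                      lambda t:  np.pi / 2 * np.cos(np.pi * t / 2))
assert linear < vp  # the linear interpolant traces a shorter transport path
```

Under these assumptions the ratio works out to roughly $(\pi/2)/\sqrt{2} \approx 1.11$ in favor of the linear schedule, consistent with the qualitative claim above.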
6. Scaling, Quantitative Performance, and Ablations
SiT outperforms DiT across all model scales and training budgets on the (latent) ImageNet 256×256 and 512×512 benchmarks with identical architecture and compute; the figures below are FID-50K after 400K training iterations, without guidance:
| Model Variant | Params | GFLOPs | FID-50K (DiT) | FID-50K (SiT) |
|---|---|---|---|---|
| S/2 | 33 M | 6.06 | 68.4 | 57.6 |
| B/2 | 130 M | 23.0 | 43.5 | 33.5 |
| L/2 | 458 M | 80.7 | 23.3 | 18.8 |
| XL/2 | 675 M | 118.6 | 19.5 | 17.2 |
With extended training and classifier-free guidance (scale 1.5), SiT-XL establishes a new state of the art with FID-50K = 2.06 (Ma et al., 2024).
Ablation studies confirm:
- Both continuous- and discrete-time formulations are effective, but the continuous-time formulation yields better FID.
- Weighted-score and velocity matching outperform plain score matching.
- Interpolant selection and post-hoc tuning of the diffusion coefficient after training further reduce generation error.
The availability of both ODE and SDE samplers, the modularity of the training losses, and the freedom to tune diffusion schedules all contribute to the strength of the SiT paradigm.
7. Broader Impact and Applicability
The SiT architecture unifies and extends Transformer-based generative modeling:
- Accommodates both SDE and ODE-based sampling under a common interface.
- Abstracts generative dynamics away from model architecture by parameterizing the interpolant.
- Facilitates systematic exploration of how time discretization, interpolant path, and model objective impact convergence and quality.
- Consistently yields performance improvements over prior DiT baselines without altering model structure or compute requirements.
A plausible implication is that flow-based Transformer modeling, particularly via SiT, sets a generalizable framework for the development of scalable, modular, and efficient large-scale diffusion models (Ma et al., 2024).