
Flow-based Transformers SiTs

Updated 15 January 2026
  • The paper introduces SiT, a unifying framework that leverages continuous-time flow and score matching to significantly improve FID metrics compared to previous latent diffusion models.
  • SiTs employ both velocity and score matching using a DiT backbone, enabling efficient deterministic (ODE) and stochastic (SDE) sampling within a modular interpolation setting.
  • SiTs consistently outperform DiT baselines on ImageNet benchmarks by achieving lower FID scores while maintaining architectural consistency and scalability.

Flow-based Transformers, and specifically the Scalable Interpolant Transformers (SiTs), represent a unifying framework that merges continuous-time flow-based generative modeling with the Transformer architecture. SiTs enable both flow-matching (probability flow ODE) and score-based diffusion paradigms within a single, modular setting. They expand the generative capacity and scalability of Transformer-based models for high-fidelity and efficient large-scale synthesis. The SiT framework is most extensively detailed in "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" (Ma et al., 2024), which systematically analyzes and outperforms prior latent diffusion Transformer models.

1. Foundations: Scalable Interpolant Transformers (SiTs) and Flow-based Modeling

A SiT connects the data distribution $p_{\mathrm{data}}$ and the latent noise distribution $\mathcal{N}(0, I)$ through a parametrized stochastic interpolant:

$$x_t = \alpha_t x_0 + \sigma_t x_1, \qquad t \in [0,1], \qquad x_0 \sim p_{\mathrm{data}},\ x_1 \sim \mathcal{N}(0, I)$$

where $\alpha_t, \sigma_t$ are smooth, valid schedule functions with $\alpha_0 = 1$, $\sigma_0 = 0$ (pure data), $\alpha_1 = 0$, $\sigma_1 = 1$ (pure noise), and $\alpha_t^2 + \sigma_t^2 > 0$ throughout $[0,1]$.
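The interpolation above is straightforward to implement; here is a minimal NumPy sketch (function and schedule names are illustrative, not taken from the paper):

```python
import numpy as np

def sample_xt(x0, alpha, sigma, rng):
    """Draw x_t = alpha(t) * x0 + sigma(t) * x1 with x1 ~ N(0, I).
    alpha/sigma are schedule callables t -> array; names are illustrative."""
    n = x0.shape[0]
    t = rng.uniform(size=n)                      # one time per sample
    x1 = rng.standard_normal(x0.shape)           # Gaussian endpoint
    shape = (n,) + (1,) * (x0.ndim - 1)          # broadcast over data dims
    xt = alpha(t).reshape(shape) * x0 + sigma(t).reshape(shape) * x1
    return xt, x1, t

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3))
# linear interpolant: alpha_t = 1 - t, sigma_t = t
xt, x1, t = sample_xt(x0, lambda t: 1.0 - t, lambda t: t, rng)
```

At $t = 0$ the schedule returns the data exactly, and at $t = 1$ pure noise, matching the boundary conditions above.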

This generalizes classic denoising diffusion probabilistic models (DDPM), enabling both ODE-based (flow) and SDE-based (diffusion) perspectives under a single formalism. The marginal at time $t$, $p_t(x)$, is the law of $x_t$ under this interpolation, and its time dynamics can be described equivalently by:

  • A deterministic probability-flow ODE: $\mathrm{d}x_t = v(x_t, t)\,\mathrm{d}t$
  • A reverse-time SDE: $\mathrm{d}x_t = v(x_t, t)\,\mathrm{d}t + \frac{1}{2} w_t s(x_t, t)\,\mathrm{d}t + \sqrt{w_t}\,\mathrm{d}\bar{W}_t$, for a chosen diffusion coefficient $w_t > 0$

The SiT framework exposes flexibility in how data and noise are connected, sampling is performed, and which target field (velocity or score) is learned (Ma et al., 2024).

2. Training Objectives: Velocity and Score Matching under Continuous Interpolants

SiT models are trained to regress one of two target fields at each time $t$:

  1. Velocity matching:

$$\mathcal{L}_v(\theta) = \int_0^1 \mathbb{E}\left[\, \| v_\theta(x_t, t) - (\dot{\alpha}_t x_0 + \dot{\sigma}_t x_1) \|^2 \,\right]\mathrm{d}t$$

where $(x_0, x_1)$ are the known endpoints and $\dot{\alpha}_t, \dot{\sigma}_t$ are time derivatives of the schedules.

  2. Score matching:

$$\mathcal{L}_s(\theta) = \int_0^1 \mathbb{E}\left[\, \| \sigma_t s_\theta(x_t, t) + \epsilon \|^2 \,\right]\mathrm{d}t$$

with $\epsilon \sim \mathcal{N}(0, I)$ the Gaussian endpoint noise. Weighted variants of both the score and velocity criteria can also be used, depending on interpolant properties and training choices.

Transformers are used to parameterize either the velocity $v_\theta$ or the score $s_\theta$; in practice, velocity prediction with a linear interpolant and velocity-based SDE sampling gives the best FID scores (Ma et al., 2024).
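As a concrete sketch, the velocity-matching objective for the linear interpolant ($\dot{\alpha}_t = -1$, $\dot{\sigma}_t = 1$, so the regression target is $x_1 - x_0$) can be estimated by Monte Carlo; `v_model` is a placeholder for any predictor, not the paper's network:

```python
import numpy as np

def velocity_loss(v_model, x0, rng):
    """Monte-Carlo estimate of L_v for the linear interpolant
    (alpha_t = 1 - t, sigma_t = t), whose target velocity is
    alpha_dot*x0 + sigma_dot*x1 = x1 - x0."""
    t = rng.uniform(size=x0.shape[0])
    x1 = rng.standard_normal(x0.shape)
    tb = t.reshape((-1,) + (1,) * (x0.ndim - 1))   # broadcastable times
    xt = (1.0 - tb) * x0 + tb * x1                 # interpolant sample
    target = x1 - x0                               # exact conditional velocity
    return float(np.mean((v_model(xt, t) - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))
loss = velocity_loss(lambda x, t: np.zeros_like(x), x0, rng)  # trivial model
```

A trained network cannot drive this loss to zero, since it sees only $x_t$, not the endpoint pair; the minimizer is the conditional expectation of the target.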

3. Transformer Architecture: DiT Backbone and Conditional Injection

SiTs re-use the DiT (Diffusion Transformer) backbone [Peebles & Xie 2023]:

  • Vision Transformer operating on VAE-latent space; VAE is fixed during generative modeling.
  • Patch-wise latent representation (e.g., $2 \times 2$ patches for a given image resolution).
  • Class-conditional transformer blocks with AdaLN-Zero used for both class and diffusion time injection.
  • Model scales S, B, L, and XL match DiT in layer count, hidden dimension, and GFLOPs to ensure controlled ablation and fair comparison.

All architectural and computational aspects (parameters, FLOPs, latent encoding) are kept identical between DiT and SiT. Only the interpolant and learned target fields differ (Ma et al., 2024).
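The AdaLN-Zero conditioning mentioned above can be sketched as follows. This is a deliberate simplification (a single linear projection producing shift, scale, and gate; `layer_fn` standing in for the attention/MLP sub-block), not the DiT implementation:

```python
import numpy as np

def adaln_zero_block(h, c, W, layer_fn):
    """AdaLN-Zero sketch: a linear map of the conditioning vector c
    (class + time embedding) yields shift/scale/gate; the gate is
    zero-initialized so every block starts as the identity.
    W has shape (dim_c, 3 * dim_h); all names are illustrative."""
    shift, scale, gate = np.split(c @ W, 3, axis=-1)
    # parameter-free LayerNorm over the feature dimension
    h_norm = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-6)
    return h + gate * layer_fn(h_norm * (1.0 + scale) + shift)

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 16))     # token features
c = rng.standard_normal((2, 8))      # conditioning embedding
W = np.zeros((8, 48))                # zero init => block is a no-op
out = adaln_zero_block(h, c, W, lambda x: 2.0 * x)
```

With the zero-initialized projection, `out` equals `h` exactly, which is the point of the "-Zero" initialization: residual branches contribute nothing until training moves the gate.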

4. Inference: ODE and SDE Samplers under Decoupled Diffusion Schedules

Inference with SiT allows for both deterministic and stochastic sampling:

  • ODE-based (probability flow): use Heun's method to solve $\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t$, starting from noise $x_1 \sim \mathcal{N}(0, I)$ and integrating backward to $t = 0$. Pseudocode follows the standard two-stage Heun update.
  • SDE-based: employ Euler–Maruyama with an optional adaptive diffusion coefficient $w_t$:
    • Drift: $v_\theta(x_t, t) + \frac{1}{2} w_t s_\theta(x_t, t)$
    • Sample: $x_{t-\Delta t} = x_t + \Delta t\,\text{drift} + \sqrt{w_t \Delta t}\,\xi$, with $\xi \sim \mathcal{N}(0, I)$
    • $w_t$ can be tuned post hoc (e.g., $w_t = \sigma_t$ cancels the singularity at $t \to 0$ for the linear interpolant, empirically yielding the lowest FID for SiT-XL).
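A minimal sketch of the two-stage Heun ODE sampler described above, with `v_fn` standing in for the learned velocity $v_\theta$:

```python
import numpy as np

def heun_ode_sample(v_fn, x1, n_steps):
    """Two-stage Heun integration of the probability-flow ODE
    dx = v(x, t) dt, from t = 1 (noise) backward to t = 0 (data).
    `v_fn` is a placeholder for the learned velocity v_theta."""
    x = np.array(x1, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        v1 = v_fn(x, t)                   # slope at the current point
        x_pred = x - dt * v1              # Euler predictor (backward step)
        v2 = v_fn(x_pred, t - dt)         # slope at the predicted point
        x = x - 0.5 * dt * (v1 + v2)      # corrector: average the slopes
    return x
```

As a sanity check, for the toy field $v(x, t) = x$ the exact backward solution from $t=1$ to $t=0$ scales the input by $e^{-1}$, which this integrator reproduces to second order in the step size.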

Classifier-free guidance (CFG) is incorporated directly in the velocity prediction of SiT:

$$v_\theta^{(\zeta)}(x_t, t \mid y) = \zeta\, v_\theta(x_t, t \mid y) + (1-\zeta)\, v_\theta(x_t, t \mid \varnothing)$$

which corresponds to tempering the conditional versus unconditional densities.
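The guidance rule is a one-line combination of two forward passes; `v_fn` below is a hypothetical predictor taking a class label, with `None` denoting the null class $\varnothing$:

```python
import numpy as np

def cfg_velocity(v_fn, x, t, y, zeta):
    """Classifier-free guidance on the velocity field:
    v = zeta * v(x, t | y) + (1 - zeta) * v(x, t | null).
    `v_fn(x, t, y)` is a hypothetical predictor; y=None is the null class."""
    return zeta * v_fn(x, t, y) + (1.0 - zeta) * v_fn(x, t, None)

# toy predictor: adds 1 when conditioned, nothing when unconditioned
toy = lambda x, t, y: x + (1.0 if y is not None else 0.0)
v = cfg_velocity(toy, np.zeros((2, 2)), 0.5, 7, zeta=1.5)
```

Setting $\zeta = 1$ recovers the purely conditional prediction; $\zeta > 1$ extrapolates past it, away from the unconditional field.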

5. Effects of Interpolant Choice and Diffusion Coefficient

The interpolant functions $(\alpha_t, \sigma_t)$ and the choice of $w_t$ are significant SiT hyperparameters:

  • The linear interpolant ($\alpha_t = 1-t$, $\sigma_t = t$) shortens the transport path length versus the standard variance-preserving (VP) interpolant, simplifying learning and improving FID.
  • Score-based and velocity-based training targets are deterministically linked; either predictor suffices.
  • Decoupling the sampling SDE's $w_t$ from the forward process allows empirical tuning for the lowest FID.

Empirical results indicate SDE samplers consistently achieve lower FID than ODE samplers across all interpolants. The optimal $w_t$ depends on the learned target and interpolant; e.g., $w_t = \sigma_t$ minimizes FID for the linear interpolant with velocity matching.
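The claim that either predictor suffices can be checked numerically: for scalar schedules, the score is recoverable from the velocity through an interpolant identity. The arrangement below is my own derivation for a 1-D Gaussian toy model, not a formula quoted from the paper:

```python
import numpy as np

def score_from_velocity(v, x, t, alpha, sigma, alpha_dot, sigma_dot):
    """Recover the score from the velocity via the interpolant identity
    s = (alpha_dot*x - alpha*v) / (sigma * (alpha*sigma_dot - alpha_dot*sigma)).
    Derived here for scalar schedules; names are illustrative."""
    a, s, ad, sd = alpha(t), sigma(t), alpha_dot(t), sigma_dot(t)
    return (ad * x - a * v) / (s * (a * sd - ad * s))

# 1-D sanity check with Gaussian data x0 ~ N(0, 1): both fields are analytic.
t = 0.3
a, s = 1.0 - t, t                    # linear interpolant
var = a**2 + s**2                    # marginal variance of x_t
x = np.linspace(-2.0, 2.0, 5)
v_true = (-a + s) * x / var          # analytic velocity (alpha_dot=-1, sigma_dot=1)
score_true = -x / var                # analytic score
score = score_from_velocity(v_true, x, t,
                            lambda t: 1.0 - t, lambda t: t,
                            lambda t: -1.0, lambda t: 1.0)
```

For the Gaussian toy model the converted score matches the analytic one exactly, illustrating why training either field determines the other.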

6. Scaling, Quantitative Performance, and Ablations

SiT outperforms DiT across all scales and training budgets on the (latent) ImageNet 256×256 and 512×512 benchmarks, with identical architecture and compute:

Model variant   Params   GFLOPs   FID-50K (DiT)   FID-50K (SiT)
S                33 M      250        68.4            57.6
B               130 M      350        43.5            33.5
L               458 M      650        23.3            18.8
XL              675 M      900        19.5            17.2

With extended training and classifier-free guidance ($\zeta = 1.5$), SiT-XL establishes a new state of the art with FID-50K = 2.06 (Ma et al., 2024).

Ablation studies confirm:

  • Both continuous- and discrete-time formulations are effective, but continuous improves FID.
  • Weighted-score and velocity matching outperform plain score matching.
  • Interpolant selection and post-training $w_t$ tuning further reduce generation error.

Support for both ODE and SDE samplers, the modularity of the losses, and the freedom to tune schedules after training together account for the advantage of the SiT paradigm.

7. Broader Impact and Applicability

The SiT architecture unifies and extends Transformer-based generative modeling:

  • Accommodates both SDE and ODE-based sampling under a common interface.
  • Abstracts generative dynamics away from model architecture by parameterizing the interpolant.
  • Facilitates systematic exploration of how time discretization, interpolant path, and model objective impact convergence and quality.
  • Consistently yields performance improvements over prior DiT baselines without altering model structure or compute requirements.

A plausible implication is that flow-based Transformer modeling, particularly via SiT, sets a generalizable framework for the development of scalable, modular, and efficient large-scale diffusion models (Ma et al., 2024).
