
Scalable Interpolant Transformer (SiT)

Updated 14 May 2026
  • SiT is a generative modeling framework that extends DiT with a stochastic-interpolant mechanism enabling controlled data-to-noise transitions.
  • It integrates transformer-based architectures with learnable temporal conditioning and supports both deterministic and stochastic sampling strategies.
  • Empirical results show SiT’s superior performance in class-conditional ImageNet generation and robust out-of-distribution medical imaging reconstruction.

The Scalable Interpolant Transformer (SiT) is a generative modeling framework that extends the Diffusion Transformer (DiT) architecture by replacing the standard variance-preserving noising process with a fully decoupled, stochastic-interpolant mechanism. SiT enables principled control over data-to-noise transitions through learnable, time-dependent interpolants and supports both deterministic and stochastic sampling strategies. Empirical results show that SiT achieves superior performance on tasks such as class-conditional ImageNet generation, with uniform improvements over DiT at identical parameter counts and computational budgets. SiT forms the backbone of advanced applications in imaging inverse problems, notably in cross-distribution priors-driven iterative reconstruction (CDPIR) for out-of-distribution (OOD) robust sparse-view CT.

1. Model Architecture and Representational Structure

SiT adopts the transformer backbone of DiT, typically instantiated as a Vision Transformer (ViT) with modifications for generative modeling. For large-scale medical imaging (CDPIR), the SiT-B variant (CDPIR-B-2) uses 12 transformer blocks (depth = 12), a hidden embedding dimension of 768, 12 attention heads per block, and an MLP (feed-forward) inner dimension of approximately 3072, amounting to 142.8M parameters. The architecture processes image inputs via patchification (patch size = 2), linear projection to $d$-dimensional tokens, and combination with 1D positional embeddings. Temporal conditioning is injected through an MLP embedding of the continuous diffusion time $t$, and class conditioning is implemented via learned embeddings for a discrete set of labels or a null token. The output is a per-token prediction of the continuous-time velocity field $v_\theta(x_t, t; c)$, which enables direct image-domain updates during the generative diffusion process (Li et al., 16 Sep 2025, Ma et al., 2024).
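To make the patchification step concrete, the following is a minimal sketch in plain Python. The function name `patchify` and the nested-list image layout are assumptions chosen for illustration; they are not taken from the SiT or CDPIR code. It only shows how an image is cut into non-overlapping patches that become tokens; in the actual model each flattened patch would then be linearly projected to the hidden dimension.

```python
def patchify(image, patch_size=2):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Hypothetical helper for illustration only; real implementations operate on tensors.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(image[row][col])  # append every channel of this pixel
            tokens.append(token)
    return tokens


# Toy 4 x 4 single-channel image whose pixel value is row * 4 + col.
toy_image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tokens = patchify(toy_image, patch_size=2)
# (4 / 2) * (4 / 2) = 4 tokens, each holding 2 * 2 * 1 = 4 values.
```

With patch size 2, an H x W input yields (H / 2) * (W / 2) tokens, so the sequence length grows quadratically with resolution.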

2. Stochastic Interpolant Framework

SiT's fundamental innovation lies in its unified stochastic interpolant framework, which generalizes the classic variance-preserving (VP) process. Let $x_0 \sim q(x)$ denote a data sample and $\varepsilon \sim \mathcal{N}(0, I)$ a noise vector; the forward process is defined as:

$$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\, \varepsilon,$$

with $\alpha_0 = 1$, $\alpha_T \approx 0$, and the noise coefficient $\sqrt{1-\alpha_t}$ replaced by a suitable $\sigma_t$ for general interpolants. By introducing a continuous interpolation parameter $t \in [0, T]$, the schedule $(\alpha_t, \sigma_t)$ enables continuous mixing of priors from multiple datasets or domains via the conditioning variable $c$. This structure allows SiT to explore multiple interpolant families (e.g., linear, GVP) and to decouple design choices, including discrete or continuous time, velocity or score objectives, and the specific noise schedule used for sampling (Ma et al., 2024, Li et al., 16 Sep 2025).
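As an illustration of two interpolant families, the sketch below writes the path in the general form x_t = a_t x_0 + s_t ε, where a_t plays the role of the data coefficient and s_t the noise coefficient. It assumes time normalized to [0, 1], a linear schedule (a_t = 1 - t, s_t = t), and a GVP schedule (a_t = cos(πt/2), s_t = sin(πt/2)). All function names are hypothetical.

```python
import math


def linear_interpolant(t):
    """Return (a, s, da/dt, ds/dt) for a_t = 1 - t, s_t = t."""
    return 1.0 - t, t, -1.0, 1.0


def gvp_interpolant(t):
    """Return (a, s, da/dt, ds/dt) for a_t = cos(pi t / 2), s_t = sin(pi t / 2)."""
    half_pi = 0.5 * math.pi
    a, s = math.cos(half_pi * t), math.sin(half_pi * t)
    return a, s, -half_pi * s, half_pi * a


def interpolate(x0, eps, t, schedule):
    """Return (x_t, dx_t/dt) element-wise for a data sample x0 and a noise sample eps."""
    a, s, da, ds = schedule(t)
    x_t = [a * x + s * e for x, e in zip(x0, eps)]
    velocity = [da * x + ds * e for x, e in zip(x0, eps)]
    return x_t, velocity


# At t = 0 the path sits on the data sample; the linear velocity is eps - x0.
x_t, v_t = interpolate([1.0, 2.0], [0.5, -0.5], 0.0, linear_interpolant)
```

The GVP schedule keeps a_t² + s_t² = 1 for every t, which is the variance-preserving property, while the linear schedule gives a time-independent velocity target.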

3. Objective Functions and Guidance Mechanisms

Training of SiT proceeds by direct velocity matching. The primary loss is

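$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, \varepsilon,\, c} \left[ \left\| v_\theta(x_t, t; c) - \dot{x}_t \right\|^2 \right],$$

where $\dot{x}_t = \frac{d}{dt} x_t$ denotes the exact velocity implied by the interpolant schedule. Each mini-batch samples images and class labels uniformly from diverse domains to promote both domain-invariant and domain-specific prior learning. No additional explicit regularization is applied beyond Adam weight decay.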
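A Monte Carlo estimate of a velocity-matching loss can be sketched as follows. This is a simplified illustration, not the training code of either paper: it assumes a linear interpolant x_t = (1 - t) x_0 + t ε on t in (0, 1], for which the target velocity is ε - x_0, and it treats `model` as any callable taking `(x_t, t, c)`. The name `velocity_matching_loss` and the `t_min` cutoff are hypothetical.

```python
import random


def velocity_matching_loss(model, data, labels, rng, t_min=1e-3):
    """Average squared error between model(x_t, t, c) and the interpolant velocity.

    Assumes the linear interpolant x_t = (1 - t) * x0 + t * eps (illustrative choice).
    """
    total = 0.0
    for x0, c in zip(data, labels):
        t = rng.uniform(t_min, 1.0)                    # random time for this sample
        eps = [rng.gauss(0.0, 1.0) for _ in x0]        # Gaussian noise sample
        x_t = [(1.0 - t) * x + t * e for x, e in zip(x0, eps)]
        target = [e - x for x, e in zip(x0, eps)]      # d x_t / dt for the linear path
        pred = model(x_t, t, c)
        total += sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x0)
    return total / len(data)


# Sanity check on degenerate data (all zeros): then x_t = t * eps, so the
# velocity eps can be recovered exactly as x_t / t.
rng = random.Random(0)
zero_data = [[0.0, 0.0] for _ in range(8)]
zero_labels = [0] * 8
oracle_loss = velocity_matching_loss(
    lambda x_t, t, c: [x / t for x in x_t], zero_data, zero_labels, rng
)
baseline_loss = velocity_matching_loss(
    lambda x_t, t, c: [0.0 for _ in x_t], zero_data, zero_labels, rng
)
```

On real data the target depends on the unobserved pair (x_0, ε), so the minimizer is the conditional expectation of the velocity given x_t rather than a zero-loss fit.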

Classifier-free guidance (CFG) is employed to blend features from domain-invariant and domain-specific branches. During training, with probability $p_{\mathrm{uncond}}$, the class token $c$ is replaced by the null embedding $\varnothing$. At inference, the guided velocity field is computed as

$$\tilde{v}_\theta(x_t, t; c) = v_\theta(x_t, t; \varnothing) + \zeta \left[ v_\theta(x_t, t; c) - v_\theta(x_t, t; \varnothing) \right],$$

with $\zeta$ being the guidance scale (set to 1.0 for CDPIR). The conversion to a guided score employs the algebraic relation between velocity and score for the designated interpolant (Li et al., 16 Sep 2025).
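The two halves of classifier-free guidance, label dropout at training time and velocity blending at inference time, can be sketched as below. The blending form used here (unconditional output plus a scaled difference) is the common CFG convention and is assumed rather than copied from the source. The helper names and the use of `None` as the null label are hypothetical.

```python
import random


def maybe_drop_label(label, drop_prob, rng, null_label=None):
    """Training-time label dropout: return null_label with probability drop_prob."""
    return null_label if rng.random() < drop_prob else label


def guided_velocity(v_cond, v_uncond, scale):
    """Blend conditional and unconditional velocities element-wise.

    scale = 0 gives the unconditional field, scale = 1 the conditional field,
    and scale > 1 extrapolates away from the unconditional prediction.
    """
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


v_c = [1.0, 2.0]   # stand-in for the model output with the class label
v_u = [0.0, 1.0]   # stand-in for the model output with the null label
blended = guided_velocity(v_c, v_u, 1.5)
```

With a scale of 1.0 the blend reduces to the purely conditional prediction, so the unconditional branch has no effect on the sampled trajectory.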

4. Sampling, Inference, and Reconstruction Algorithms

SiT supports both deterministic (ODE/Heun) and stochastic (reverse-time SDE, Euler–Maruyama) samplers. The reverse SDE formulation is

$$dX_t = \left[ v_\theta(X_t, t; c) - \tfrac{1}{2} w_t\, s_\theta(X_t, t; c) \right] dt + \sqrt{w_t}\, d\bar{W}_t,$$

where $s_\theta$ is the score derived from $v_\theta$, $\bar{W}_t$ is a reverse-time Wiener process, and $w_t$ is a diffusion coefficient schedule independent of training, enabling post-hoc tuning for sample quality. Stochastic sampling alternates drift updates with noise injection, and the number of sampling steps can be chosen freely because the model is parameterized in continuous time.
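The following is a minimal Euler–Maruyama sketch of a reverse-time sampler, under several stated assumptions. Time runs from t = 1 (noise) down to t = 0 (data), the drift is velocity minus half the diffusion coefficient times the score, and the learned network is replaced by closed-form velocity and score functions for a one-dimensional Gaussian data distribution under the linear interpolant x_t = (1 - t) x_0 + t ε. All names and the choice w_t = t are illustrative, not the authors' settings.

```python
import math
import random

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy 1-D Gaussian "dataset" (assumption)


def _moments(t):
    """Mean and variance of x_t = (1 - t) * x0 + t * eps for the toy data."""
    return (1.0 - t) * DATA_MEAN, (1.0 - t) ** 2 * DATA_STD ** 2 + t ** 2


def toy_velocity(x, t):
    """Closed-form velocity field of the Gaussian path (stands in for v_theta)."""
    mean, var = _moments(t)
    return -DATA_MEAN + (t - (1.0 - t) * DATA_STD ** 2) * (x - mean) / var


def toy_score(x, t):
    """Closed-form score of the Gaussian marginal at time t."""
    mean, var = _moments(t)
    return -(x - mean) / var


def euler_maruyama_sample(velocity, score, diffusion, x_init, num_steps, rng):
    """Integrate the reverse-time SDE from t = 1 down to t = 0."""
    h = 1.0 / num_steps
    x = x_init
    for i in range(num_steps):
        t = 1.0 - i * h
        w = diffusion(t)
        drift = velocity(x, t) - 0.5 * w * score(x, t)
        # Stepping backward in time flips the sign of the drift contribution.
        x = x - h * drift + math.sqrt(w * h) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [
    euler_maruyama_sample(toy_velocity, toy_score, lambda t: t, rng.gauss(0.0, 1.0), 100, rng)
    for _ in range(1000)
]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

Passing `lambda t: 0.0` as the diffusion schedule removes the noise term and reduces the loop to a deterministic Euler ODE solver, which illustrates how the same trained field can serve both sampler types.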

In CDPIR, the iterative solver alternates between SiT-driven generative diffusion updates and data-consistency gradient steps:

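  • Initialization: $x_T \sim \mathcal{N}(0, I)$
  • For $t = T, \dots, 1$:

    1. Generative update: $\hat{x}_{t-1}$ is obtained from $x_t$ by one reverse-diffusion step driven by $v_\theta(x_t, t; c)$
    2. Data consistency: $x_{t-1} = \hat{x}_{t-1} - \lambda\, \nabla_x \tfrac{1}{2} \| A x - y \|_2^2 \big|_{x = \hat{x}_{t-1}}$, where $A$ is the measurement operator, $y$ the measured data, and $\lambda$ a step size
  • Output: $x_0$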

This process enables alternating enforcement of measurement fidelity and data-driven prior denoising, leading to state-of-the-art reconstructions (Li et al., 16 Sep 2025).
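The alternation between a prior-driven update and a data-consistency gradient step can be sketched on a toy linear inverse problem. This is not the CDPIR implementation: the prior step below is a hypothetical stand-in that pulls the estimate toward a fixed vector, where the real method would take a SiT diffusion step. The measurement matrix `A` observes only two of three coordinates to mimic incomplete measurements.

```python
def data_consistency_step(x, A, y, step_size):
    """One gradient step on 0.5 * ||A x - y||^2, i.e. x <- x - step * A^T (A x - y)."""
    residual = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    grad = [sum(A[r][j] * residual[r] for r in range(len(A))) for j in range(len(x))]
    return [xi - step_size * g for xi, g in zip(x, grad)]


def alternating_reconstruction(prior_step, A, y, x_init, num_steps, step_size):
    """Alternate a prior-driven update with a measurement-fidelity gradient step."""
    x = list(x_init)
    for i in range(num_steps):
        t = 1.0 - i / num_steps          # pseudo-time running from 1 down toward 0
        x = prior_step(x, t)             # stand-in for the generative diffusion update
        x = data_consistency_step(x, A, y, step_size)
    return x


PRIOR_MEAN = [0.0, 0.0, 3.0]  # hypothetical "prior knowledge" about the signal


def toy_prior_step(x, t):
    """Toy prior update: move halfway toward PRIOR_MEAN (not a real diffusion step)."""
    return [xi + 0.5 * (m - xi) for xi, m in zip(x, PRIOR_MEAN)]


A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # observe only the first two coordinates
y = [1.0, 2.0]
x_hat = alternating_reconstruction(toy_prior_step, A, y, [0.0, 0.0, 0.0], 20, 1.0)
```

In this toy setting the observed coordinates are fixed by the measurements while the unobserved coordinate is filled in by the prior, which is the division of labor the alternating scheme relies on.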

5. Empirical Performance and Robustness

On class-conditional ImageNet (256×256, 400K steps), SiT outperforms DiT across all tested model sizes with identical backbones and compute; for instance, SiT-B achieves FID-50K = 33.5 vs. 43.5 for DiT-B, and SiT-XL reaches FID = 2.06 with extended training (7M steps) and guidance (cfg = 1.5), compared to 2.27 for DiT-XL. Ablations attribute these gains to the combined effects of continuous-time training, exact interpolant schedules, velocity parameterization, stochastic sampling, and tunable diffusion coefficients.

In sparse-view CT (SVCT) reconstruction, CDPIR (built on a SiT backbone) achieves substantial OOD and in-distribution (ID) gains:

  • AAPM→XCAT (OOD): CDPIR PSNR ≈ 38.36 dB, SSIM ≈ 0.952 versus DDS at PSNR ≈ 32.09 dB, SSIM ≈ 0.878.
  • COCA→COCA (ID): CDPIR PSNR ≈ 39.67 dB, SSIM ≈ 0.958 versus DDS at PSNR ≈ 34.16 dB, SSIM ≈ 0.924.
  • Far-OOD PCCT: CDPIR improves PSNR by 3.5 dB and SSIM by 0.05 over the best conventional method (Li et al., 16 Sep 2025).
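To put the PSNR figures above in perspective, the short calculation below converts each PSNR gap into an equivalent mean-squared-error ratio. It relies only on the standard definition PSNR = 10 log10(MAX² / MSE), so a gain of Δ dB corresponds to an MSE lower by a factor of 10^(Δ/10), assuming both methods are scored with the same peak value. The dictionary keys and function names are illustrative.

```python
# (CDPIR PSNR, DDS PSNR) in dB, as reported in the list above.
reported_psnr = {
    "AAPM->XCAT (OOD)": (38.36, 32.09),
    "COCA->COCA (ID)": (39.67, 34.16),
}


def psnr_gain_db(psnr_a, psnr_b):
    """PSNR advantage of method A over method B in dB."""
    return psnr_a - psnr_b


def mse_ratio(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain (same peak value assumed)."""
    return 10.0 ** (gain_db / 10.0)


gains = {name: psnr_gain_db(a, b) for name, (a, b) in reported_psnr.items()}
ratios = {name: mse_ratio(g) for name, g in gains.items()}
for name in reported_psnr:
    print(f"{name}: +{gains[name]:.2f} dB, MSE lower by about {ratios[name]:.1f}x")
```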

Disentanglement of anatomy (domain-invariant null embeddings) and texture (domain-specific labels) via CFG, together with scalable, stable velocity-based sampling, underpins SiT’s robustness and high-fidelity OOD reconstruction.

6. Significance and Modular Research Implications

SiT demonstrates that modular generative modeling (decomposing architectural and objective choices, interpolant design, and sampling schedule) leads to uniformly improved sample quality and robustness without architectural changes or increased compute. The decoupling of training and sampling diffusion coefficients enables post-training optimization of the sampling noise schedule, giving further flexibility in practical deployment. The unification of continuous-time velocity learning with transformer architectures and stochastic interpolant theory enables application to both image generation (ImageNet) and complex medical imaging inverse problems (SVCT, PCCT), with state-of-the-art empirical performance and extensibility to diverse domains (Ma et al., 2024, Li et al., 16 Sep 2025).

Model                              Params   FID-50K (256×256)
DiT-S                              33 M     68.4
SiT-S                              33 M     57.6
DiT-B                              130 M    43.5
SiT-B                              130 M    33.5
DiT-XL (7M steps, cfg = 1.5)       675 M    2.27
SiT-XL (7M steps, cfg = 1.5)       675 M    2.06
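The relative FID reduction of SiT over DiT at each model size follows directly from the table. The snippet below is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
# (DiT FID, SiT FID) per model size, taken from the table above.
fid_table = {
    "S": (68.4, 57.6),
    "B": (43.5, 33.5),
    "XL (7M steps, cfg = 1.5)": (2.27, 2.06),
}

# Relative reduction in percent: 100 * (DiT - SiT) / DiT.
relative_reduction = {
    size: 100.0 * (dit - sit) / dit for size, (dit, sit) in fid_table.items()
}

for size, pct in relative_reduction.items():
    print(f"{size}: SiT lowers FID by {pct:.1f}% relative to DiT")
```

The absolute gap shrinks at the XL scale because both models are already close to saturating the metric, but the relative reduction remains close to ten percent.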

A plausible implication is that SiT’s modularity and OOD robustness motivate its deployment in applications with pronounced domain shifts or multi-source data, as well as in areas demanding fine-grained control over generative prior behavior.
