Scalable Interpolant Transformer (SiT)

Updated 14 May 2026

SiT is a generative modeling framework that extends DiT with a stochastic-interpolant mechanism enabling controlled data-to-noise transitions.
It integrates transformer-based architectures with learnable temporal conditioning and supports both deterministic and stochastic sampling strategies.
Empirical results show SiT’s superior performance in class-conditional ImageNet generation and robust out-of-distribution medical imaging reconstruction.

The Scalable Interpolant Transformer (SiT) is a generative modeling framework that extends the Diffusion Transformer (DiT) architecture by replacing the standard variance-preserving noising process with a fully decoupled, stochastic-interpolant mechanism. SiT enables principled control over data-to-noise transitions through learnable, time-dependent interpolants and supports both deterministic and stochastic sampling strategies. Empirical results show that SiT achieves superior performance on tasks such as class-conditional ImageNet generation, with uniform improvements over DiT at identical parameter counts and computational budgets. SiT forms the backbone of advanced applications in imaging inverse problems, notably in cross-distribution priors-driven iterative reconstruction (CDPIR) for out-of-distribution (OOD) robust sparse-view CT.

1. Model Architecture and Representational Structure

SiT adopts the transformer-based backbone configuration of DiT, typically instantiated as a Vision Transformer (ViT) with modifications for generative modeling. For large-scale medical imaging (CDPIR), the SiT-Big variant (CDPIR-B-2) uses 12 transformer blocks (depth = 12), a hidden embedding dimension of 768, 12 attention heads per block, and an MLP (feed-forward) inner dimension of approximately 3072, amounting to 142.8M parameters. The architecture processes image inputs via patchification (patch size = 2), linear projection to $d$-dimensional tokens, and combination with 1D positional embeddings. Temporal conditioning is injected through an MLP embedding of the continuous diffusion time $t$, and class conditioning is implemented via learned embeddings for a discrete set of labels or a null token. The output is a per-token prediction of the continuous-time velocity field $v_\theta(x_t, t; c)$, which allows direct image-domain updates during the generative diffusion process (Li et al., 16 Sep 2025, Ma et al., 2024).
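The patchification step described above can be illustrated with a minimal NumPy sketch. The input size (1 × 32 × 32), the random projection matrix, and the function names `patchify` and `unpatchify` are assumptions chosen for illustration; only the patch size of 2 and the token width of 768 come from the text.

```python
import numpy as np

def patchify(img, patch=2):
    """Split a (C, H, W) array into (N, patch*patch*C) tokens, row-major over the patch grid."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by the patch size"
    gh, gw = h // patch, w // patch
    x = img.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 2, 4, 0)  # (gh, gw, patch_row, patch_col, c)
    return x.reshape(gh * gw, patch * patch * c)

def unpatchify(tokens, c, h, w, patch=2):
    """Inverse of patchify: rebuild the (C, H, W) array from the token matrix."""
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, patch, patch, c)
    x = x.transpose(4, 0, 2, 1, 3)  # (c, gh, patch_row, gw, patch_col)
    return x.reshape(c, h, w)

rng = np.random.default_rng(0)
image = rng.standard_normal((1, 32, 32))        # hypothetical single-channel input
tokens = patchify(image, patch=2)               # 16 * 16 = 256 tokens, 4 values each
proj = rng.standard_normal((tokens.shape[1], 768)) * 0.02  # stand-in for the learned projection
embedded = tokens @ proj                        # (256, 768) token embeddings
print(tokens.shape, embedded.shape)
```

With a patch size of 2, the token count is (H/2) × (W/2), so halving the patch size quadruples the sequence length the transformer has to process.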

2. Stochastic Interpolant Framework

SiT's fundamental innovation lies in its unified stochastic interpolant framework, which generalizes the classic variance-preserving (VP) process. Let $x_0 \sim q(x)$ denote a data sample and $\varepsilon \sim \mathcal{N}(0, I)$ a noise vector; the forward process is defined as:

$x_t = \sqrt{\alpha_t} x_0 + \sqrt{1-\alpha_t} \varepsilon,$

with $\alpha_0=1$, $\alpha_T \approx 0$, and the noise coefficient $\sqrt{1-\alpha_t}$ replaced by a suitable $\sigma_t$ for general interpolants. By introducing an interpolation parameter, the schedule enables continuous mixing of priors from multiple datasets or domains. This structure allows SiT to explore multiple interpolant families (e.g., linear, GVP) and to decouple design choices, including discrete or continuous time, velocity or score objectives, and the specific noise schedule used for sampling (Ma et al., 2024, Li et al., 16 Sep 2025).
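As a numerical illustration of the variance-preserving forward process, the sketch below evaluates $x_t$ for one possible schedule. The choice $\alpha_t = \cos^2(\pi t / 2)$ with time normalized to $[0, 1]$ is an assumption made for the example, not a schedule taken from the cited papers; the function names are likewise hypothetical.

```python
import numpy as np

def alpha(t):
    """Assumed schedule: alpha_t = cos^2(pi * t / 2), so alpha_0 = 1 and alpha_1 = 0."""
    return np.cos(0.5 * np.pi * t) ** 2

def interpolate(x0, eps, t):
    """Forward process x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps."""
    a = alpha(t)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)    # stand-in for a data sample
eps = rng.standard_normal(8)   # Gaussian noise
for t in (0.0, 0.5, 1.0):
    x_t = interpolate(x0, eps, t)
    print(t, float(np.abs(x_t - x0).max()), float(np.abs(x_t - eps).max()))
```

At $t = 0$ the interpolant returns the data sample and at $t = 1$ it returns pure noise. The squared coefficients sum to one at every $t$, which is what makes the process variance-preserving.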

3. Objective Functions and Guidance Mechanisms

Training of SiT proceeds by direct velocity matching. The primary loss is

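$\mathcal{L}(\theta) = \mathbb{E}_{x_0, \varepsilon, t, c}\left[ \left\| v_\theta(x_t, t; c) - \dot{x}_t \right\|^2 \right],$

where $\dot{x}_t = \frac{d}{dt}\left(\sqrt{\alpha_t}\right) x_0 + \frac{d}{dt}\left(\sqrt{1-\alpha_t}\right) \varepsilon$ denotes the exact velocity implied by the interpolant schedule. Each mini-batch samples images and class labels uniformly from diverse domains to promote both domain-invariant and domain-specific prior learning. No additional explicit regularization is applied beyond Adam weight decay.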
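A velocity-matching loss of this kind can be sketched in a few lines. The schedule below ($\sqrt{\alpha_t} = \cos(\pi t/2)$, $\sqrt{1-\alpha_t} = \sin(\pi t/2)$, time in $[0, 1]$) and the callable `model(x_t, t, c)` are assumptions for illustration; the target is simply the time derivative of the interpolant for a given pair $(x_0, \varepsilon)$.

```python
import numpy as np

def coeffs(t):
    """Interpolant coefficients and their time derivatives for the assumed schedule."""
    a = np.cos(0.5 * np.pi * t)   # sqrt(alpha_t)
    s = np.sin(0.5 * np.pi * t)   # sqrt(1 - alpha_t)
    da = -0.5 * np.pi * s         # d/dt sqrt(alpha_t)
    ds = 0.5 * np.pi * a          # d/dt sqrt(1 - alpha_t)
    return a, s, da, ds

def velocity_matching_loss(model, x0, eps, t, c):
    """Mean squared error between the predicted velocity and d/dt of the interpolant."""
    a, s, da, ds = coeffs(t)
    x_t = a * x0 + s * eps
    target = da * x0 + ds * eps
    pred = model(x_t, t, c)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))          # batch of 4 stand-in data samples
eps = rng.standard_normal((4, 8))
t = rng.uniform(size=(4, 1))              # one time per sample, broadcast over features
labels = np.array([0, 1, 2, 0])           # hypothetical class labels

zero_model = lambda x_t, t, c: np.zeros_like(x_t)  # placeholder for a trained network
print(velocity_matching_loss(zero_model, x0, eps, t, labels))
```

In practice the placeholder would be a trained transformer and the expectation would be estimated over many mini-batches; the sketch only shows how the regression target follows from the schedule.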

Classifier-free guidance (CFG) is employed to blend features from domain-invariant and domain-specific branches. During training, with a fixed dropout probability, the class token $c$ is replaced by the null embedding $\varnothing$. At inference, the guided velocity field is computed as

$\tilde{v}_\theta(x_t, t; c) = v_\theta(x_t, t; \varnothing) + \omega \left[ v_\theta(x_t, t; c) - v_\theta(x_t, t; \varnothing) \right],$

with $\omega$ being the guidance scale (set to 1.0 for CDPIR). The conversion to a guided score employs the algebraic relation between velocity and score for the designated interpolant (Li et al., 16 Sep 2025).
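The guidance mechanics can be sketched as follows. The blending convention (a scale of 1 returns the conditional prediction, 0 the unconditional one) is an assumption in line with common classifier-free guidance implementations, and the velocity-to-score conversion is worked out only for a linear interpolant $x_t = (1-t)\,x_0 + t\,\varepsilon$, which is also an assumption for the example. All function names are hypothetical.

```python
import numpy as np

def drop_labels(labels, null_label, p_drop, rng):
    """Training-time label dropout: replace each label by the null label with probability p_drop."""
    mask = rng.random(labels.shape) < p_drop
    return np.where(mask, null_label, labels)

def guided_velocity(v_cond, v_uncond, scale):
    """Blend conditional and unconditional velocities (assumed CFG convention)."""
    return v_uncond + scale * (v_cond - v_uncond)

def velocity_to_score(v, x, t):
    """Score from velocity for the linear interpolant x_t = (1 - t) x0 + t eps, valid for t > 0.

    Follows from eps = x + (1 - t) * v and score = -eps / t.
    """
    return -((1.0 - t) * v + x) / t

rng = np.random.default_rng(0)
labels = np.array([3, 1, 4, 1, 5])
print(drop_labels(labels, null_label=-1, p_drop=0.1, rng=rng))

v_c = np.array([1.0, 2.0])
v_u = np.array([0.0, 0.5])
print(guided_velocity(v_c, v_u, scale=1.5))
```

Because the velocity-to-score relation is linear in the velocity, applying it to a guided velocity yields a correspondingly guided score for the same interpolant.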

4. Sampling, Inference, and Reconstruction Algorithms

SiT supports both deterministic (ODE/Heun) and stochastic (reverse-time SDE, Euler–Maruyama) samplers. The reverse SDE formulation is

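$dx_t = \left[ v_\theta(x_t, t; c) - \tfrac{1}{2} w_t\, s_\theta(x_t, t; c) \right] dt + \sqrt{w_t}\, d\bar{W}_t,$

where $s_\theta$ is the score derived from the velocity, $\bar{W}_t$ is a reverse-time Wiener process, and $w_t$ is a diffusion coefficient schedule independent of training, enabling post-hoc tuning for sample quality. Stochastic sampling iterates drift updates and noise injection, and the number of sampling steps can be chosen freely because of the continuous-time parameterization.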
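Both sampler families can be tried on a toy problem where the velocity and score are known in closed form. The sketch below assumes 1-D Gaussian data $x_0 \sim \mathcal{N}(M, S^2)$, a linear interpolant $x_t = (1-t)\,x_0 + t\,\varepsilon$ with time in $[0, 1]$, and a constant diffusion coefficient; none of these choices come from the cited papers, and the analytic `velocity` and `score` functions stand in for a trained network.

```python
import numpy as np

M, S = 2.0, 0.5  # assumed toy data distribution: x0 ~ N(M, S^2)

def marginal(t):
    """Mean and variance of x_t = (1 - t) x0 + t eps for the Gaussian toy."""
    return (1.0 - t) * M, (1.0 - t) ** 2 * S ** 2 + t ** 2

def velocity(x, t):
    """Exact velocity E[eps - x0 | x_t = x] for the Gaussian toy."""
    mean, var = marginal(t)
    return -M + (t - (1.0 - t) * S ** 2) / var * (x - mean)

def score(x, t):
    """Exact score of the Gaussian marginal of x_t."""
    mean, var = marginal(t)
    return -(x - mean) / var

def sample_ode_heun(x1, num_steps=100):
    """Deterministic sampler: Heun integration of dx = v dt from t = 1 (noise) to t = 0 (data)."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    x = np.array(x1, dtype=float)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur                   # negative: time runs backward
        k1 = velocity(x, t_cur)
        k2 = velocity(x + dt * k1, t_next)    # slope at the Euler-predicted point
        x = x + 0.5 * dt * (k1 + k2)
    return x

def sample_sde_em(x1, w=lambda t: 1.0, num_steps=200, seed=0):
    """Stochastic sampler: Euler-Maruyama on dx = [v - 0.5 w s] dt + sqrt(w) dW, run backward in time."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    x = np.array(x1, dtype=float)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur
        drift = velocity(x, t_cur) - 0.5 * w(t_cur) * score(x, t_cur)
        x = x + drift * dt + np.sqrt(w(t_cur) * abs(dt)) * rng.standard_normal(x.shape)
    return x

noise = np.random.default_rng(1).standard_normal(20000)
print(sample_ode_heun(noise).mean(), sample_sde_em(noise).mean())
```

The diffusion coefficient `w` only enters at sampling time, so it can be swapped for another positive schedule without touching the velocity or score functions, which mirrors the post-hoc tuning described above.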

In CDPIR, the iterative solver alternates between SiT-driven generative diffusion updates and data-consistency gradient steps:

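Initialization: set the starting iterate.
For each sampling step, from the noisiest time to the cleanest:
1. Prior step: update the iterate with the SiT velocity field (generative diffusion update).
2. Data-consistency step: take a gradient step on the measurement-fidelity term.
Output: the final iterate as the reconstructed image.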

This process enables alternating enforcement of measurement fidelity and data-driven prior denoising, leading to state-of-the-art reconstructions (Li et al., 16 Sep 2025).
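The alternation pattern itself, independent of the CDPIR specifics, can be sketched on a toy linear inverse problem. Everything here is an assumption for illustration: the random measurement matrix `A` stands in for a sparse-view projection operator, and `toy_prior_step` is a hypothetical placeholder (simple shrinkage) where the actual method would apply a SiT-driven diffusion update.

```python
import numpy as np

def data_consistency_step(x, A, y, step):
    """One gradient step on the measurement-fidelity term 0.5 * ||A x - y||^2."""
    return x - step * A.T @ (A @ x - y)

def toy_prior_step(x, strength=0.05):
    """Hypothetical stand-in for the generative update: shrink toward a zero-mean prior."""
    return (1.0 - strength) * x

def alternate(A, y, num_iters=200, prior_step=toy_prior_step):
    """Alternate a prior update with a data-consistency gradient step."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / largest squared singular value
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = prior_step(x)
        x = data_consistency_step(x, A, y, step)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))            # underdetermined: fewer measurements than unknowns
x_true = rng.standard_normal(50)
y = A @ x_true
x_hat = alternate(A, y)
print(np.linalg.norm(A @ x_hat - y) / np.linalg.norm(y))  # relative measurement residual
```

The step size is set from the spectral norm of `A` so that a single data-consistency step cannot increase the residual; the prior step then decides which of the many measurement-consistent solutions the iterate drifts toward.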

5. Empirical Performance and Robustness

On class-conditional ImageNet (256×256, 400K training steps), SiT outperforms DiT across all tested model sizes with identical backbones and compute; for instance, SiT-B achieves FID-50K = 33.5 vs. DiT-B = 43.5, and SiT-XL reaches FID = 2.06 with extended training (7M steps) and guidance (cfg = 1.5), compared to DiT-XL = 2.27. Ablations attribute these gains to the combined effects of continuous-time training, exact interpolant schedules, velocity parameterization, stochastic sampling, and tunable diffusion coefficients.

In sparse-view CT (SVCT) reconstruction, CDPIR (built on a SiT backbone) achieves substantial out-of-distribution (OOD) and in-distribution (ID) gains:

AAPM→XCAT (OOD): CDPIR PSNR ≈ 38.36 dB, SSIM ≈ 0.952 versus DDS at PSNR ≈ 32.09 dB, SSIM ≈ 0.878.
COCA→COCA (ID): CDPIR PSNR ≈ 39.67 dB, SSIM ≈ 0.958 versus DDS at PSNR ≈ 34.16 dB, SSIM ≈ 0.924.
Far-OOD PCCT: CDPIR improves PSNR by 3.5 dB and SSIM by 0.05 over the best conventional method (Li et al., 16 Sep 2025).

Disentanglement of anatomy (domain-invariant null embeddings) and texture (domain-specific labels) via CFG, together with scalable, stable velocity-based sampling, underpins SiT’s robustness and high-fidelity OOD reconstruction.

6. Significance and Modular Research Implications

SiT demonstrates that modular generative modeling, which decomposes architectural and objective choices, interpolant design, and sampling schedule, leads to uniformly improved sample quality and robustness without architectural changes or increased compute. The decoupling of training and sampling diffusion coefficients enables post-training optimization of the sampling noise schedule, giving further flexibility in practical deployment. The unification of continuous-velocity learning with transformer architectures and stochastic interpolant theory enables application to both density estimation (ImageNet) and complex medical imaging inverse problems (SVCT, PCCT), with state-of-the-art empirical performance and extensibility to diverse domains (Ma et al., 2024, Li et al., 16 Sep 2025).

Model	Params	FID-50K (ImageNet 256×256)
DiT-S	33M	68.4
SiT-S	33M	57.6
DiT-B	130M	43.5
SiT-B	130M	33.5
DiT-XL (7M steps, cfg=1.5)	675M	2.27
SiT-XL (7M steps, cfg=1.5)	675M	2.06
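The relative FID reductions implied by the table can be computed directly. The snippet below uses only the values listed above; the dictionary layout and function name are illustrative.

```python
# FID-50K values copied from the table: (DiT, SiT) per model size.
fid = {"S": (68.4, 57.6), "B": (43.5, 33.5), "XL": (2.27, 2.06)}

def relative_reduction(dit, sit):
    """Percentage by which SiT lowers FID relative to DiT."""
    return 100.0 * (dit - sit) / dit

for size, (dit, sit) in fid.items():
    print(f"{size}: {relative_reduction(dit, sit):.1f}% lower FID")
```

This gives reductions of roughly 15.8% (S), 23.0% (B), and 9.3% (XL). The XL comparison uses guidance and much longer training than the S and B rows, so the percentages are not directly comparable across rows.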

A plausible implication is that SiT’s modularity and OOD robustness motivate its deployment in applications with pronounced domain shifts or multi-source data, as well as in areas demanding fine-grained control over generative prior behavior.

References

Li et al., "Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction for Sparse-View CT" (16 Sep 2025)

Ma et al., "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" (2024)
