
Sparse-vDiT: Efficient Vision Diffusion Models

Updated 10 February 2026
  • Sparse-vDiT is a method that employs aggressive token dropping and sparse–dense residual fusion to achieve up to 9.8× training speedup while maintaining or improving generation quality.
  • It utilizes structured token dropping with group-wise subsampling, preserving spatial context through mask embeddings even when up to 75% of tokens are pruned.
  • The two-stage training protocol, combining extended masked pre-training and full-token fine-tuning, effectively bridges the train-inference gap and enhances performance in video and large-scale generative models.

Sparse-vDiT denotes a class of methods and system designs that leverage sparsity—at the level of tokens, activations, or attention maps—to accelerate Vision Diffusion Transformers (ViT-based DiTs and vDiTs), with an emphasis on video and large-scale generative modeling. Through strategic token dropping, structured pattern-based masking, dynamic per-sequence pruning, and residual fusion, Sparse-vDiT achieves up to an order-of-magnitude reduction in the computation and memory requirements of Transformer-based diffusion models, often with negligible or even improved generation quality. This article consolidates algorithmic principles, practical implementations, empirical results, and system-level considerations for Sparse-vDiT, with primary reference to the SPRINT workflow as detailed in (Park et al., 24 Oct 2025) as well as broader context on video DiT sparsification.

1. Architectural Principles and Core Design

The prototype Sparse-vDiT follows the Sparse–Dense Residual Fusion paradigm instantiated by the SPRINT algorithm (Park et al., 24 Oct 2025). A standard backbone consists of $L$ DiT (or SiT) blocks processing $N$ tokens of dimension $C$. The $L$ blocks are partitioned into three segments:

  • Encoder $f_\theta$: first $E$ blocks, operating on all $N$ (dense) tokens.
  • Middle blocks $g_\theta$: next $M$ blocks, performing computation only on a sparse subset of $N' = \lfloor (1-r)N \rfloor$ tokens. The drop ratio $r$ can reach 75%.
  • Decoder $h_\theta$: final $D$ blocks, restoring computation over all $N$ tokens.

During pre-training, the model processes the noisy latent $x_t \in \mathbb{R}^{B \times N \times C}$ through $f_\theta$ to obtain $Z_1$, from which $r \cdot N$ tokens are dropped before forwarding through $g_\theta$. After $g_\theta$, outputs are padded back to $N$ tokens (using a fixed mask embedding $M$ for dropped positions), and the dense path $Z_1$ is fused with the sparse path $Z_2^{pad}$ via channel concatenation and a $1 \times 1$ convolution (the fusion layer). This fused sequence is then passed to $h_\theta$ for the final prediction.
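The segment partition and pad-then-fuse flow above can be sketched as follows. This is a shapes-only illustration, not the SPRINT implementation: the block lists, `keep_idx`, `mask_emb`, and the `fuse` callable are hypothetical stand-ins.

```python
import numpy as np

def sparse_vdit_forward(x_t, blocks_f, blocks_g, blocks_h, keep_idx, mask_emb, fuse):
    """Encoder (dense) -> middle (sparse) -> decoder (dense), with pad-and-fuse."""
    B, N, C = x_t.shape
    Z1 = x_t
    for blk in blocks_f:                    # encoder: all N tokens
        Z1 = blk(Z1)
    Z2 = Z1[:, keep_idx, :]                 # keep N' = (1-r)N tokens
    for blk in blocks_g:                    # middle blocks on the sparse subset
        Z2 = blk(Z2)
    Z2_pad = np.tile(mask_emb, (B, N, 1))   # mask embedding at dropped positions
    Z2_pad[:, keep_idx, :] = Z2             # restore computed tokens
    Z = fuse(Z1, Z2_pad)                    # channel concat + 1x1 conv in SPRINT
    for blk in blocks_h:                    # decoder: all N tokens again
        Z = blk(Z)
    return Z
```

With identity blocks and an additive stand-in for the fusion layer, the function simply propagates the padded middle-path features, which makes the shape bookkeeping easy to verify.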

2. Token Dropping Formulations

Sparse-vDiT's token dropping uses structured, group-wise subsampling. For image inputs, the $H \times W$ grid is partitioned into $n \times n$ non-overlapping groups, and within each group $k$ tokens are kept ($r = 1 - k/n^2$). For instance, $n = 2$, $k = 1$ gives a 75% drop ratio. A binary mask $m \in \{0,1\}^N$ indicates which tokens are retained. After passing through $g_\theta$, outputs are mapped to a length-$N$ padded sequence by inserting the mask embedding $M$ at dropped positions.

This structured approach preserves spatial locality and allows for aggressive dropping while minimizing the loss of critical context.
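A minimal sketch of the group-wise keep-mask described above; the random choice of which token survives within each group is an assumption for illustration (the paper's exact selection rule may differ):

```python
import numpy as np

def groupwise_keep_mask(H, W, n=2, k=1, seed=0):
    """Binary keep-mask over an H x W token grid.

    The grid is tiled into n x n non-overlapping groups and k tokens are
    kept per group, giving drop ratio r = 1 - k / n**2 (n=2, k=1 -> r = 75%).
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((H, W), dtype=bool)
    for i in range(0, H, n):
        for j in range(0, W, n):
            cells = [(i + a, j + b) for a in range(n) for b in range(n)]
            for idx in rng.choice(len(cells), size=k, replace=False):
                mask[cells[idx]] = True     # keep this token
    return mask.reshape(-1)                 # m in {0,1}^N, N = H*W
```

Because exactly $k$ tokens survive in every group, the sparsity is uniform across the grid, which is what preserves spatial locality under aggressive dropping.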

3. Sparse–Dense Residual Fusion Mechanism

After shallow dense processing and deeper sparse processing, the outputs of both paths, $Z_1$ (dense, shallow) and $Z_2^{pad}$ (sparse, deep), are concatenated along the channel dimension. The fused tensor

$$Z_{fuse} = W_{fuse} \cdot [Z_1; Z_2^{pad}] + b_{fuse},$$

where $W_{fuse}$ is a learnable $1 \times 1$ convolution, is the input to $h_\theta$. This fusion lets the model combine local detail (from the dense path) with global, computationally cheap context (from the sparse path). For vision diffusion models, this strategy is empirically shown to accelerate convergence and yield richer features under sparse regimes.
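For token sequences, a $1 \times 1$ convolution over the channel-concatenated paths is just a linear map from $2C$ to $C$ channels. A NumPy sketch, with illustrative (not learned) weights:

```python
import numpy as np

def fuse(Z1, Z2_pad, W_fuse, b_fuse):
    """Z_fuse = W_fuse . [Z1; Z2_pad] + b_fuse (a 1x1 conv over channels).

    Z1, Z2_pad: (B, N, C); W_fuse: (2C, C); b_fuse: (C,)
    """
    Z_cat = np.concatenate([Z1, Z2_pad], axis=-1)   # (B, N, 2C)
    return Z_cat @ W_fuse + b_fuse                  # (B, N, C)
```

For example, a weight whose top half is the identity and bottom half is zero passes the dense path through unchanged; the learned weights instead mix both paths per channel.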

4. Training Protocol: Two-Stage Masked Pre-training and Fine-tuning

SPRINT adopts a two-stage training schedule (Park et al., 24 Oct 2025):

Stage 1: Extended masked pre-training for $T_1$ iterations (e.g., 1 million), using a fixed drop ratio $r = 75\%$ in $g_\theta$, and injecting "path-drop learning" where, with 10% probability, the deep sparse path is replaced entirely by the mask token. The loss is the standard flow/score-matching objective.

Stage 2: Full-token fine-tuning for $T_2$ short iterations ($\sim$100–200K), setting $r = 0$ (restoring dense computation in $g_\theta$) and maintaining path-drop at 10% to stabilize performance under inference settings (specifically, Path-Drop Guidance, PDG). This closes the train-inference gap and gives deep layers exposure to full contextual tokens before deployment.
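The two-stage schedule reduces to a small configuration switch. The numbers below follow the description above; the dictionary keys are placeholders, not SPRINT's actual config format:

```python
def training_config(stage):
    """Stage 1: masked pre-training; Stage 2: full-token fine-tuning."""
    if stage == 1:
        return {"drop_ratio": 0.75,    # 75% of tokens dropped in g_theta
                "path_drop_p": 0.10,   # deep path replaced by mask token w.p. 10%
                "iters": 1_000_000}    # e.g., ~1M iterations
    return {"drop_ratio": 0.0,         # dense computation restored in g_theta
            "path_drop_p": 0.10,       # kept, so PDG works at inference
            "iters": 200_000}          # short: ~100-200K iterations
```

Note that the path-drop probability is the one knob held constant across both stages, since PDG at inference relies on the model having seen the mask-token-only deep path throughout training.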

5. Empirical Performance and FLOP Analysis

On ImageNet-1K at $256 \times 256$:

  • Baseline SiT-XL/2: FDD = 79.5, FID = 2.06, training cost 427.7M FLOPs, inference 0.475T FLOPs.
  • SPRINT (Sparse-vDiT): ~677M params (0.3% overhead), 43.7M training FLOPs (9.8× speedup), inference 0.477T FLOPs, FID = 2.01.
  • SPRINT + PDG: further reduces inference FLOPs to 0.274T (43% less) and improves FDD to 63.1 and FID to 1.82 compared to classifier-free guidance (CFG).

The critical efficiency gains are:

| Setting | Training FLOPs | Inference FLOPs | FDD | FID | Speedup |
|---|---|---|---|---|---|
| Baseline | 427.7M | 0.475T | 79.5 | 2.06 | – |
| SPRINT | 43.7M | 0.477T | 79.0 | 2.01 | 9.8× |
| SPRINT+PDG | 43.7M | 0.274T | 63.1 | 1.82 | 9.8× / 1.74× |

Visual and quantitative sample quality is preserved or improved even at high drop ratios, and both training and inference benefit from significant computational savings (Park et al., 24 Oct 2025).
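The quoted speedups follow directly from the FLOP counts (a quick sanity check on the reported numbers, not new measurements):

```python
# Reported FLOP counts for SiT-XL/2 on ImageNet-1K 256x256
train_baseline, train_sprint = 427.7, 43.7   # training FLOPs (M)
infer_sprint, infer_pdg = 0.477, 0.274       # inference FLOPs (T)

train_speedup = train_baseline / train_sprint   # ~9.8x training speedup
pdg_speedup = infer_sprint / infer_pdg          # ~1.74x inference speedup with PDG
pdg_saving = 1 - infer_pdg / infer_sprint       # ~43% fewer inference FLOPs
```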

6. Pseudocode and Implementation

Sparse-vDiT can be implemented as follows:

Pre-training:

```
# One pre-training step (flow-matching objective); names follow the paper's notation
x_t = (1 - t) * x_0 + t * epsilon     # noisy latent at time t
Z1 = f_theta(x_t, c)                  # dense encoder over all N tokens
Z1_drop = Drop(Z1, r)                 # group-wise drop: keep (1-r)N tokens
Z2_drop = g_theta(Z1_drop, c)         # sparse middle blocks
Z2_pad = PadWithMask(Z2_drop, N)      # pad back to N with mask embedding
if rand() < p:                        # path-drop learning (p = 10%)
    Z2_pad = MaskToken
Z_fuse = Fuse(Z1, Z2_pad)             # channel concat + 1x1 conv
hat_v = h_theta(Z_fuse, c)            # dense decoder prediction
L = ||hat_v - v||^2                   # flow/score-matching loss
update theta
```

Fine-tuning: identical, except Drop(·) becomes a no-op ($r = 0$).

Inference with PDG: For each step, compute

```
Z1_c = f_theta(x_t, c)                        # conditional dense features
v_cond = h_theta(Fuse(Z1_c, g_theta(Z1_c, c)), c)
Z1_u = f_theta(x_t, ∅)                        # unconditional pass (null condition)
v_uncond = h_theta(Fuse(Z1_u, MaskToken), ∅)  # deep sparse path skipped (PDG)
v = v_uncond + w * (v_cond - v_uncond)        # guidance with scale w
step the sampler
```

Model architecture statistics for SiT-XL/2 + SPRINT: total blocks = 28 ($2(f) + 24(g) + 2(h)$), hidden dim = 1152, heads = 16; the fusion layer adds roughly 0.3% parameter overhead (Park et al., 24 Oct 2025).
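The quoted fusion overhead is consistent with a $1 \times 1$ convolution from $2C$ to $C$ channels at hidden dim 1152. A back-of-envelope check (whether the reported figure counts the bias, or assumes a slightly different fusion shape, is not stated in the source):

```python
C = 1152                            # hidden dimension of SiT-XL/2
fusion_params = (2 * C) * C + C     # 1x1 conv: weight (2C x C) plus bias
total_params = 677e6                # reported total parameter count
overhead = fusion_params / total_params   # ~0.4%, same ballpark as the quoted 0.3%
```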

7. Comparison, Context, and Extensions

SPRINT's Sparse-vDiT shares high-level goals with other sparsification strategies for ViT-based models:

  • Layer-wise window activation pruning (e.g., SparseViT) utilizes stagewise sparsity and evolutionary search for optimal pruning (Chen et al., 2023), but typically focuses on window-based vision tasks rather than generative diffusion.
  • Learnable token pruning (e.g., Adaptive Sparse ViT) integrates per-instance adaptive gating using attention-based scores and budget-aware fine-tuning, yielding substantial FLOPs reduction and adaptive computation per input (Liu et al., 2022).
  • Sparse regularization + prune (e.g., Sparse then Prune ViT) leverages activation sparsity during pre-training and post-training unstructured pruning to recover accuracy after weight deletion (Prasetyo et al., 2023).
  • End-to-end dynamic sparsity (e.g., SViTE) introduces joint optimization of parameter and data sparsity, providing "free lunch" regularization effects and enabling near-half reduction in FLOPs for classification without sacrificing accuracy (Chen et al., 2021).

SPRINT's path-based fusion, combined with its two-stage training regime, is unique among these in fusing dense and sparse feature streams at scale for diffusion models, with aggressive token drop ratios (75%), masking, and residual fusion as its key innovations. Inference savings are extended via PDG, which omits the deep-path computation on the unconditional sampling pass.

The approach is model-agnostic and applicable to any DiT/SiT backbone. The division of labor between local/shallow (dense) and global/deep (sparse) blocks, complemented by a channel-wise fusion, enables both efficient training and robust sample quality in large generative models. These principles readily extend to video diffusion, multi-modal, and high-resolution settings typical of state-of-the-art generative transformers.

References

  • SPRINT: "Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers" (Park et al., 24 Oct 2025)
  • SparseViT: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer" (Chen et al., 2023)
  • Adaptive Sparse ViT: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention" (Liu et al., 2022)
  • Sparse then Prune: "Sparse then Prune: Toward Efficient Vision Transformers" (Prasetyo et al., 2023)
  • SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration" (Chen et al., 2021)
