
Sparse-vDiT: Efficient Vision Diffusion Models

Updated 10 February 2026
  • Sparse-vDiT is a method that employs aggressive token dropping and sparse–dense residual fusion to achieve up to 9.8× training speedup while maintaining or improving generation quality.
  • It utilizes structured token dropping with group-wise subsampling, preserving spatial context through mask embeddings even when up to 75% of tokens are pruned.
  • The two-stage training protocol, combining extended masked pre-training and full-token fine-tuning, effectively bridges the train-inference gap and enhances performance in video and large-scale generative models.

Sparse-vDiT denotes a class of methods and system designs that leverage sparsity—at the level of tokens, activations, or attention maps—to accelerate Vision Diffusion Transformers (ViT-based DiTs and vDiTs), with an emphasis on video and large-scale generative modeling. Through strategic token dropping, structured pattern-based masking, dynamic per-sequence pruning, and residual fusion, Sparse-vDiT achieves up to an order-of-magnitude reduction in the computation and memory requirements of Transformer-based diffusion models, often with negligible or even improved generation quality. This article consolidates algorithmic principles, practical implementations, empirical results, and system-level considerations for Sparse-vDiT, with primary reference to the SPRINT workflow as detailed in (Park et al., 24 Oct 2025) as well as broader context on video DiT sparsification.

1. Architectural Principles and Core Design

The prototype Sparse-vDiT follows the Sparse–Dense Residual Fusion paradigm instantiated by the SPRINT algorithm (Park et al., 24 Oct 2025). A standard backbone consists of $L$ DiT (or SiT) blocks processing $N$ tokens of dimension $C$. The $L$ blocks are partitioned into three segments:

  • Encoder $f_\theta$: first $E$ blocks, operating on all $N$ (dense) tokens.
  • Middle blocks $g_\theta$: next $M$ blocks, performing computation only on a sparse subset of $N' = \lfloor (1-r)N \rfloor$ tokens. The drop ratio $r$ can reach 75%.
  • Decoder $h_\theta$: final $D$ blocks, restoring computation over all $N$ tokens.

During pre-training, the model processes the noisy latent $x_t \in \mathbb{R}^{B \times N \times C}$ through $f_\theta$ to obtain $Z_1$, from which $r \cdot N$ tokens are dropped before forwarding through $g_\theta$. After $g_\theta$, outputs are padded back to $N$ tokens (using a fixed mask embedding $M$ for dropped positions), and the dense path $Z_1$ is fused with the sparse path $Z_2^{pad}$ via channel concatenation and a $1 \times 1$ convolution (the fusion layer). This fused sequence is then passed to $h_\theta$ for the final prediction.
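The segment partition and pad-then-fuse flow above can be sketched as follows. This is a shapes-only illustration, not the SPRINT implementation: the block lists, `keep_idx`, `mask_emb`, and the `fuse` callable are hypothetical stand-ins.

```python
import numpy as np

def sparse_vdit_forward(x_t, blocks_f, blocks_g, blocks_h, keep_idx, mask_emb, fuse):
    """Encoder (dense) -> middle (sparse) -> decoder (dense), with pad-and-fuse."""
    B, N, C = x_t.shape
    Z1 = x_t
    for blk in blocks_f:                    # encoder: all N tokens
        Z1 = blk(Z1)
    Z2 = Z1[:, keep_idx, :]                 # keep N' = (1-r)N tokens
    for blk in blocks_g:                    # middle blocks on the sparse subset
        Z2 = blk(Z2)
    Z2_pad = np.tile(mask_emb, (B, N, 1))   # mask embedding at dropped positions
    Z2_pad[:, keep_idx, :] = Z2             # restore computed tokens
    Z = fuse(Z1, Z2_pad)                    # channel concat + 1x1 conv in SPRINT
    for blk in blocks_h:                    # decoder: all N tokens again
        Z = blk(Z)
    return Z
```

With identity blocks and an additive stand-in for the fusion layer, the function simply propagates the padded middle-path features, which makes the shape bookkeeping easy to verify.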

2. Token Dropping Formulations

Sparse-vDiT's token dropping uses structured, group-wise subsampling. For image inputs, the $H \times W$ grid is partitioned into $n \times n$ non-overlapping groups, and within each group $k$ tokens are kept ($r = 1 - k/n^2$). For instance, $n = 2$, $k = 1$ gives a 75% drop ratio. A binary mask $m \in \{0,1\}^N$ indicates which tokens are retained. After passing through $g_\theta$, outputs are mapped to a length-$N$ padded sequence by inserting the mask embedding $M$ at dropped positions.

This structured approach preserves spatial locality and allows for aggressive dropping while minimizing the loss of critical context.
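A minimal sketch of the group-wise keep-mask described above; the random choice of which token survives within each group is an assumption for illustration (the paper's exact selection rule may differ):

```python
import numpy as np

def groupwise_keep_mask(H, W, n=2, k=1, seed=0):
    """Binary keep-mask over an H x W token grid.

    The grid is tiled into n x n non-overlapping groups and k tokens are
    kept per group, giving drop ratio r = 1 - k / n**2 (n=2, k=1 -> r = 75%).
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((H, W), dtype=bool)
    for i in range(0, H, n):
        for j in range(0, W, n):
            cells = [(i + a, j + b) for a in range(n) for b in range(n)]
            for idx in rng.choice(len(cells), size=k, replace=False):
                mask[cells[idx]] = True     # keep this token
    return mask.reshape(-1)                 # m in {0,1}^N, N = H*W
```

Because exactly $k$ tokens survive in every group, the sparsity is uniform across the grid, which is what preserves spatial locality under aggressive dropping.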

3. Sparse–Dense Residual Fusion Mechanism

After shallow dense processing and deeper sparse processing, the outputs of both paths, $Z_1$ (dense, shallow) and $Z_2^{pad}$ (sparse, deep), are concatenated along the channel dimension. The fused tensor

$$Z_{fuse} = W_{fuse} \cdot [Z_1; Z_2^{pad}] + b_{fuse},$$

where $W_{fuse}$ is a learnable $1 \times 1$ convolution, is the input to $h_\theta$. This fusion lets the model combine local detail (from the dense path) with global, computationally cheap context (from the sparse path). For vision diffusion models, this strategy is empirically shown to accelerate convergence and yield richer features under sparse regimes.
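For token sequences, a $1 \times 1$ convolution over the channel-concatenated paths is just a linear map from $2C$ to $C$ channels. A NumPy sketch, with illustrative (not learned) weights:

```python
import numpy as np

def fuse(Z1, Z2_pad, W_fuse, b_fuse):
    """Z_fuse = W_fuse . [Z1; Z2_pad] + b_fuse (a 1x1 conv over channels).

    Z1, Z2_pad: (B, N, C); W_fuse: (2C, C); b_fuse: (C,)
    """
    Z_cat = np.concatenate([Z1, Z2_pad], axis=-1)   # (B, N, 2C)
    return Z_cat @ W_fuse + b_fuse                  # (B, N, C)
```

For example, a weight whose top half is the identity and bottom half is zero passes the dense path through unchanged; the learned weights instead mix both paths per channel.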

4. Training Protocol: Two-Stage Masked Pre-training and Fine-tuning

SPRINT adopts a two-stage training schedule (Park et al., 24 Oct 2025):

Stage 1: Extended masked pre-training for $T_1$ iterations (e.g., 1 million), using a fixed drop ratio $r = 75\%$ in $g_\theta$, and injecting "path-drop learning" where, with 10% probability, the deep sparse path is replaced entirely by the mask token. The loss is the standard flow/score-matching objective.

Stage 2: Full-token fine-tuning for $T_2$ short iterations ($\sim$100–200K), setting $r = 0$ (restoring dense computation in $g_\theta$) and maintaining path-drop at 10% to stabilize performance under inference settings (specifically, Path-Drop Guidance, PDG). This closes the train-inference gap and gives deep layers exposure to full contextual tokens before deployment.
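The two-stage schedule reduces to a small configuration switch. The numbers below follow the description above; the dictionary keys are placeholders, not SPRINT's actual config format:

```python
def training_config(stage):
    """Stage 1: masked pre-training; Stage 2: full-token fine-tuning."""
    if stage == 1:
        return {"drop_ratio": 0.75,    # 75% of tokens dropped in g_theta
                "path_drop_p": 0.10,   # deep path replaced by mask token w.p. 10%
                "iters": 1_000_000}    # e.g., ~1M iterations
    return {"drop_ratio": 0.0,         # dense computation restored in g_theta
            "path_drop_p": 0.10,       # kept, so PDG works at inference
            "iters": 200_000}          # short: ~100-200K iterations
```

Note that the path-drop probability is the one knob held constant across both stages, since PDG at inference relies on the model having seen the mask-token-only deep path throughout training.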

5. Empirical Performance and FLOP Analysis

On ImageNet-1K at $256 \times 256$:

  • Baseline SiT-XL/2: FDD = 79.5, FID = 2.06, training cost 427.7M FLOPs, inference 0.475T FLOPs.
  • SPRINT (Sparse-vDiT): ~677M params (0.3% overhead), 43.7M training FLOPs (9.8× speedup), inference 0.477T FLOPs, FID = 2.01.
  • SPRINT + PDG: further reduces inference FLOPs to 0.274T (43% less) and improves FDD to 63.1 and FID to 1.82 compared to classifier-free guidance (CFG).

The critical efficiency gains are:

| Setting | Training FLOPs | Inference FLOPs | FDD | FID | Speedup |
|---|---|---|---|---|---|
| Baseline | 427.7M | 0.475T | 79.5 | 2.06 | – |
| SPRINT | 43.7M | 0.477T | 79.0 | 2.01 | 9.8× |
| SPRINT+PDG | 43.7M | 0.274T | 63.1 | 1.82 | 9.8× / 1.74× |

Visual and quantitative sample quality is preserved or improved even at high drop ratios, and both training and inference benefit from significant computational savings (Park et al., 24 Oct 2025).
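The quoted speedups follow directly from the FLOP counts (a quick sanity check on the reported numbers, not new measurements):

```python
# Reported FLOP counts for SiT-XL/2 on ImageNet-1K 256x256
train_baseline, train_sprint = 427.7, 43.7   # training FLOPs (M)
infer_sprint, infer_pdg = 0.477, 0.274       # inference FLOPs (T)

train_speedup = train_baseline / train_sprint   # ~9.8x training speedup
pdg_speedup = infer_sprint / infer_pdg          # ~1.74x inference speedup with PDG
pdg_saving = 1 - infer_pdg / infer_sprint       # ~43% fewer inference FLOPs
```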

6. Pseudocode and Implementation

Sparse-vDiT can be implemented as follows:

Pre-training:

```
# One pre-training step (flow-matching objective); names follow the paper's notation
x_t = (1 - t) * x_0 + t * epsilon     # noisy latent at time t
Z1 = f_theta(x_t, c)                  # dense encoder over all N tokens
Z1_drop = Drop(Z1, r)                 # group-wise drop: keep (1-r)N tokens
Z2_drop = g_theta(Z1_drop, c)         # sparse middle blocks
Z2_pad = PadWithMask(Z2_drop, N)      # pad back to N with mask embedding
if rand() < p:                        # path-drop learning (p = 10%)
    Z2_pad = MaskToken
Z_fuse = Fuse(Z1, Z2_pad)             # channel concat + 1x1 conv
hat_v = h_theta(Z_fuse, c)            # dense decoder prediction
L = ||hat_v - v||^2                   # flow/score-matching loss
update theta
```

Fine-tuning: identical, except Drop(·) becomes a no-op ($r = 0$).

Inference with PDG: For each step, compute

```
Z1_c = f_theta(x_t, c)                        # conditional dense features
v_cond = h_theta(Fuse(Z1_c, g_theta(Z1_c, c)), c)
Z1_u = f_theta(x_t, ∅)                        # unconditional pass (null condition)
v_uncond = h_theta(Fuse(Z1_u, MaskToken), ∅)  # deep sparse path skipped (PDG)
v = v_uncond + w * (v_cond - v_uncond)        # guidance with scale w
step the sampler
```

Model architecture statistics for SiT-XL/2 + SPRINT: total blocks = 28 ($2(f) + 24(g) + 2(h)$), hidden dim = 1152, heads = 16; the fusion layer adds roughly 0.3% parameter overhead (Park et al., 24 Oct 2025).
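The quoted fusion overhead is consistent with a $1 \times 1$ convolution from $2C$ to $C$ channels at hidden dim 1152. A back-of-envelope check (whether the reported figure counts the bias, or assumes a slightly different fusion shape, is not stated in the source):

```python
C = 1152                            # hidden dimension of SiT-XL/2
fusion_params = (2 * C) * C + C     # 1x1 conv: weight (2C x C) plus bias
total_params = 677e6                # reported total parameter count
overhead = fusion_params / total_params   # ~0.4%, same ballpark as the quoted 0.3%
```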

7. Comparison, Context, and Extensions

SPRINT's Sparse-vDiT shares high-level goals with other sparsification strategies for ViT-based models:

  • Layer-wise window activation pruning (e.g., SparseViT) utilizes stagewise sparsity and evolutionary search for optimal pruning (Chen et al., 2023), but typically focuses on window-based vision tasks rather than generative diffusion.
  • Learnable token pruning (e.g., Adaptive Sparse ViT) integrates per-instance adaptive gating using attention-based scores and budget-aware fine-tuning, yielding substantial FLOPs reduction and adaptive computation per input (Liu et al., 2022).
  • Sparse regularization + prune (e.g., Sparse then Prune ViT) leverages activation sparsity during pre-training and post-training unstructured pruning to recover accuracy after weight deletion (Prasetyo et al., 2023).
  • End-to-end dynamic sparsity (e.g., SViTE) introduces joint optimization of parameter and data sparsity, providing "free lunch" regularization effects and enabling near-half reduction in FLOPs for classification without sacrificing accuracy (Chen et al., 2021).

SPRINT's path-based fusion, combined with its two-stage training regime, is unique among these in fusing dense and sparse feature streams at scale for diffusion models, with aggressive token drop ratios (75%), masking, and residual fusion as its key innovations. Inference savings are extended via PDG, which omits the deep-path computation on the unconditional sampling pass.

The approach is model-agnostic and applicable to any DiT/SiT backbone. The division of labor between local/shallow (dense) and global/deep (sparse) blocks, complemented by a channel-wise fusion, enables both efficient training and robust sample quality in large generative models. These principles readily extend to video diffusion, multi-modal, and high-resolution settings typical of state-of-the-art generative transformers.

References

  • SPRINT: "Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers" (Park et al., 24 Oct 2025)
  • SparseViT: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer" (Chen et al., 2023)
  • Adaptive Sparse ViT: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention" (Liu et al., 2022)
  • Sparse then Prune: "Sparse then Prune: Toward Efficient Vision Transformers" (Prasetyo et al., 2023)
  • SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration" (Chen et al., 2021)
