
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention (2509.24006v1)

Published 28 Sep 2025 in cs.LG, cs.AI, and cs.CV

Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.

Summary

  • The paper introduces SLA, a mechanism that combines block-sparse and linear attention to cut computation by 95% without quality loss.
  • It partitions attention weights into critical, marginal, and negligible groups, achieving 13.7× kernel and 2.2× end-to-end speedups.
  • The method leverages fused GPU kernels and dynamic block partitioning, making it practical for large-scale video and image diffusion models.

SLA: Fine-Tunable Sparse-Linear Attention for Efficient Diffusion Transformers

Introduction and Motivation

Diffusion Transformers (DiTs) have become the de facto architecture for high-fidelity video and image generation, but their scalability is fundamentally constrained by the quadratic complexity of the attention mechanism, especially as sequence lengths reach tens or hundreds of thousands. Existing approaches to efficient attention fall into two categories: sparse attention, which masks out most attention scores, and linear attention, which reformulates the computation to achieve $\mathcal{O}(N)$ complexity. However, both approaches have critical limitations in the context of video diffusion: linear attention alone leads to severe quality degradation, while sparse attention cannot achieve high enough sparsity without significant loss in fidelity.
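
For orientation, linear attention reaches $\mathcal{O}(N)$ cost by reordering the matrix products so that the $N \times N$ score matrix is never materialized. The following is a minimal non-causal sketch in PyTorch; the `elu + 1` feature map is a common generic choice and an assumption here, not necessarily the formulation SLA adopts:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: O(N * d^2) instead of O(N^2 * d).

    q, k, v: (batch, heads, seq, dim). The elu+1 feature map is a generic
    choice for illustration, not necessarily what SLA uses.
    """
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum('bhnd,bhne->bhde', phi_k, v)        # sum_j phi(k_j) v_j^T
    z = phi_k.sum(dim=2)                                  # sum_j phi(k_j)
    num = torch.einsum('bhnd,bhde->bhne', phi_q, kv)
    den = torch.einsum('bhnd,bhd->bhn', phi_q, z).unsqueeze(-1)
    return num / (den + eps)
```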

A key empirical observation in this work is that attention weights in DiTs are highly skewed: a small fraction of weights are large and high-rank, while the vast majority are extremely small and low-rank. This motivates a hybrid approach that applies full (sparse) attention to the critical weights and linear attention to the marginal ones, skipping the negligible entries entirely (Figure 1).

Figure 1: The left figure shows a typical distribution of attention weights sampled from the Wan2.1 model. The right figure shows the accuracy of sparse attention with different sparsity.

SLA: Sparse-Linear Attention Mechanism

SLA (Sparse-Linear Attention) is a trainable hybrid attention mechanism that partitions the attention weight matrix into three categories:

  • Critical: Top $k_h\%$ of attention blocks, computed exactly using block-sparse FlashAttention.
  • Marginal: Middle $k_m\%$ of blocks, approximated using linear attention.
  • Negligible: Bottom $k_l\%$ of blocks, skipped entirely.

This partitioning is performed dynamically using a compressed attention matrix $P_c$ obtained via mean pooling and softmax over block representations. The mask $M_c$ encodes the assignment of each block to one of the three categories (Figure 2).

Figure 2: Overview of SLA. The left figure illustrates the high-level idea: attention weights are classified into three categories and assigned to computations of different complexity. The right figure shows the detailed forward algorithm of SLA using the predicted compressed attention weights.
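
The following is a minimal PyTorch sketch of this block-level classification, assuming mean pooling over fixed-size blocks and per-row top/bottom selection; the paper's actual selection rule and fused kernel are more involved:

```python
import torch
import torch.nn.functional as F

def classify_blocks(q, k, block=64, k_h=0.05, k_l=0.10):
    """Illustrative block-level classification (not the paper's fused kernel).

    q, k: (batch, heads, seq, dim); seq is assumed divisible by `block`.
    Returns an integer mask of shape (batch, heads, n_blk, n_blk) with
    2 = critical, 1 = marginal, 0 = negligible.
    """
    b, h, n, d = q.shape
    n_blk = n // block
    # Mean-pool queries and keys within each block.
    q_c = q.reshape(b, h, n_blk, block, d).mean(dim=3)
    k_c = k.reshape(b, h, n_blk, block, d).mean(dim=3)
    # Compressed block-level attention map P_c.
    p_c = F.softmax(q_c @ k_c.transpose(-1, -2) / d ** 0.5, dim=-1)
    # Rank blocks within each query row by predicted weight.
    order = p_c.argsort(dim=-1, descending=True)
    rank = order.argsort(dim=-1)
    mask = torch.ones_like(rank)                    # marginal by default
    mask[rank < max(1, int(k_h * n_blk))] = 2       # top k_h%: critical
    mask[rank >= n_blk - int(k_l * n_blk)] = 0      # bottom k_l%: negligible
    return mask
```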

Sparse and Linear Components

  • Sparse Attention: For blocks marked as critical, standard block-sparse attention is computed using FlashAttention, ensuring high-rank structure is preserved where it matters most.
  • Linear Attention: For marginal blocks, a linear attention variant is applied, leveraging the low-rank structure of these entries. The output is projected via a learnable linear transformation to mitigate distribution mismatch.
  • Fusion: The outputs of the sparse and linear components are summed, with the linear component acting as a learnable compensation rather than a direct approximation.

This design enables SLA to achieve extremely high sparsity (up to 95%) while maintaining generation quality, as the computationally expensive $\mathcal{O}(N^2)$ operations are reserved only for the most important attention blocks (Figure 3).

Figure 3: Decomposition of attention weights. The left shows the full weights, the middle the top 8%, and the right the bottom 92% (low-rank structure).
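
The fusion step itself is simple to express. Below is a hedged sketch in which the sparse and linear branch outputs are assumed to be computed elsewhere, and the learnable compensation is shown as a plain linear projection:

```python
import torch

class SparseLinearFusion(torch.nn.Module):
    """Illustrative fusion of the two branches (placeholder, not the SLA kernel)."""

    def __init__(self, head_dim):
        super().__init__()
        # Learnable projection that lets the linear branch act as a
        # compensation term for the sparse branch.
        self.proj = torch.nn.Linear(head_dim, head_dim, bias=False)

    def forward(self, o_sparse, o_linear):
        # o_sparse: block-sparse attention output over critical blocks
        # o_linear: linear attention output over marginal blocks
        return o_sparse + self.proj(o_linear)
```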

Implementation and Optimization

SLA is implemented as a fused GPU kernel supporting both forward and backward passes, with several optimizations:

  • Lookup Tables: For high sparsity, nonzero block indices are precomputed to minimize memory access overhead.
  • Pre-aggregation: For linear attention, row/column sums are precomputed to reduce redundant additions.
  • Method of Four Russians: For intermediate sparsity, subset sums are precomputed for efficient block aggregation.

The forward pass involves block-wise computation of sparse and linear attention, with precomputation of intermediate results for the linear component. The backward pass fuses gradient computation for both components, following the chain rule and leveraging block structure for efficiency.
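
As an illustration of the lookup-table idea, the nonzero (critical) block indices can be gathered per query row from the block mask so that a kernel only visits those blocks. This is a host-side sketch, not the actual CUDA implementation:

```python
import torch

def build_block_lookup(block_mask):
    """Precompute per-row indices of critical (nonzero) blocks.

    block_mask: (n_q_blocks, n_k_blocks) integer mask with 2 = critical.
    Returns a padded index table (-1 = padding) and per-row counts, so a
    kernel can iterate over only the nonzero blocks of each row.
    """
    n_q, n_k = block_mask.shape
    counts = (block_mask == 2).sum(dim=-1)
    table = torch.full((n_q, n_k), -1, dtype=torch.long)
    for row in range(n_q):
        idx = (block_mask[row] == 2).nonzero(as_tuple=True)[0]
        table[row, : idx.numel()] = idx
    return table, counts
```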

Empirical Results

SLA is evaluated on the Wan2.1-1.3B video diffusion model and LightningDiT for image generation. Key results include:

  • Attention Computation Reduction: SLA achieves a 95% reduction in attention computation (FLOPs) at 95% sparsity, with no degradation in video or image quality.
  • Kernel and End-to-End Speedup: SLA delivers a 13.7× speedup in the attention kernel and a 2.2× end-to-end speedup in video generation latency on RTX5090 GPUs.
  • Quality Preservation: SLA matches or exceeds the quality of full attention and outperforms all sparse and linear baselines, even at much higher sparsity (Figure 4).

    Figure 4: Video generation examples on Wan2.1 fine-tuned with full attention, linear attention, sparse attention, and SLA. SLA achieves 95% sparsity with lossless video quality.


    Figure 5: Video examples using Wan2.1 fine-tuned with SLA and baselines. Only SLA and full attention produce high-quality, temporally consistent videos.


    Figure 6: Attention kernel speed and end-to-end generation latency of SLA and baselines on Wan2.1-1.3B with RTX5090.

Ablation studies confirm that neither sparse nor linear attention alone can achieve the same trade-off between efficiency and quality. The fusion strategy and the use of softmax as the activation function in the linear component are both critical for optimal performance.

Theoretical and Practical Implications

The decomposition of attention weights into high-rank and low-rank components provides a principled explanation for the failure of pure linear or sparse attention in high-fidelity generative models. SLA's hybrid approach leverages this structure, enabling aggressive sparsification without sacrificing expressivity or quality.

Practically, SLA enables the deployment of large-scale video and image diffusion models on commodity hardware, reducing both inference and fine-tuning costs. The method is compatible with existing DiT architectures and requires only a few thousand fine-tuning steps to adapt pretrained models.

Theoretically, this work suggests that future efficient attention mechanisms should exploit the heterogeneous structure of attention weights, rather than relying on uniform approximations. The block-wise, dynamic partitioning strategy of SLA is likely to be extensible to other domains, including long-context language modeling and multimodal generative models.

Future Directions

Potential avenues for further research include:

  • Adaptive Block Partitioning: Learning or dynamically adjusting block sizes and thresholds for the critical/marginal/negligible categories.
  • Integration with Quantization: Combining SLA with low-precision attention kernels for further efficiency gains.
  • Application to Other Modalities: Extending SLA to long-context LLMs, audio, and multimodal transformers.
  • Theoretical Analysis: Formalizing the relationship between attention weight distribution, rank, and generative quality.

Conclusion

SLA introduces a fine-tunable, hybrid sparse-linear attention mechanism that achieves substantial acceleration of diffusion transformers without compromising generation quality. By partitioning attention computation according to empirical importance and leveraging both sparse and linear approximations, SLA sets a new standard for efficient, scalable generative modeling in high-dimensional domains.


Explain it Like I'm 14

Overview

This paper introduces a new way to speed up “attention” in Diffusion Transformer (DiT) models, especially for making videos. Attention is a core part of Transformers but becomes very slow when sequences are long (like videos with many frames), because it normally checks every item against every other item. The authors propose SLA (Sparse–Linear Attention), a method that smartly combines two ideas—sparse attention and linear attention—to keep quality high while cutting the time and computation needed by attention.

Key Objectives

The paper asks a simple question: How can we make attention in video-generating Transformers much faster without hurting the quality of the videos?

To do that, the authors aim to:

  • Understand the structure of attention weights (which say how much one token should “pay attention” to another).
  • Use this understanding to split the attention work into parts that must be done exactly, parts that can be done faster with approximations, and parts that can be skipped.
  • Build a practical GPU implementation that works in both training and inference.
  • Show that this method keeps video quality while dramatically speeding up generation.

Methods and Approach

Think of attention like reading a huge textbook: in theory, you compare every sentence to every other sentence to figure out what’s important. That takes a long time. The paper breaks this problem into a simpler plan using a few key ideas.

  1. What is attention and why is it slow?
  • Standard attention compares all pairs of tokens, which takes time that grows like the square of the sequence length ($O(N^2)$). For long videos, $N$ can be 10,000–100,000, so this is extremely slow.
  2. Two known shortcuts and their problems
  • Sparse attention: Only compute the most important comparisons and skip the tiny ones. This helps, but in practice you still end up computing a lot—often not sparse enough for big speed-ups.
  • Linear attention: Reformulate attention so its cost grows linearly ($O(N)$). Sounds great, but for video diffusion models, using linear attention alone usually hurts quality.
  3. A key observation about attention weights
  • Attention weights can be split into:
    • A small set of big, important weights that are complex (high-rank) and must be computed accurately.
    • A large set of small weights that are simple (low-rank) and can be handled with a faster method.
  • In everyday terms: a few parts really matter and are complicated; many parts matter less and follow simple patterns.
  4. The SLA plan: classify attention into three types. To make this work, SLA first builds a smaller “preview map” of attention (by averaging tokens into blocks). Then it labels each block as:
  • Critical: the most important weights—compute them exactly with fast, optimized full attention.
  • Marginal: medium-importance weights—compute them using linear attention (cheap and fast).
  • Negligible: tiny weights—skip them entirely.

Because the marginal part uses linear attention and the negligible part is skipped, the overall amount of work drops drastically, while the critical part preserves accuracy where it matters most.

  5. Training and efficiency
  • SLA is fine-tuned: you swap the old attention for SLA and train the model for a small number of steps so it adapts (a rough sketch of this swap follows after this list).
  • The authors wrote a single fused GPU “kernel” that handles both sparse and linear parts together—this reduces overhead and speeds things up in both the forward and backward passes.
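
As a rough sketch of what "swap the attention and fine-tune briefly" could look like, the snippet below recursively replaces a hypothetical attention class with an SLA drop-in; the names are placeholders, not the authors' released API:

```python
import torch.nn as nn

def swap_attention(module: nn.Module, attn_cls, make_sla) -> None:
    """Recursively replace attention submodules with an SLA drop-in.

    attn_cls: the model's attention class (hypothetical placeholder).
    make_sla: factory that builds an SLA module from the original one.
    """
    for name, child in module.named_children():
        if isinstance(child, attn_cls):
            setattr(module, name, make_sla(child))
        else:
            swap_attention(child, attn_cls, make_sla)
```

After the swap, the model is fine-tuned with the usual diffusion training loss for a small number of steps (the summary below mentions on the order of 2,000).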

Analogy for “rank”: If a picture can be described with just a few simple shapes, it’s “low-rank.” If it needs lots of unique details, it’s “high-rank.” Most small attention weights look like simple, repeatable patterns (low-rank), while a few big weights need detailed, exact computation (high-rank).
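
One way to probe this rank intuition numerically (not from the paper) is to split an attention map by magnitude and compare how many significant singular values each part retains; applied to attention maps dumped from a real DiT layer, this gives a quick check of the high-rank/low-rank observation:

```python
import torch

def effective_rank(m, tol=1e-3):
    """Number of singular values above tol times the largest one."""
    s = torch.linalg.svdvals(m)
    return int((s > tol * s[0]).sum())

def split_by_magnitude(attn, top_frac=0.08):
    """Split a weight matrix into its largest `top_frac` entries and the rest."""
    k = int((1.0 - top_frac) * attn.numel())
    thresh = attn.flatten().kthvalue(k).values
    top = torch.where(attn >= thresh, attn, torch.zeros_like(attn))
    return top, attn - top

# With a real attention map A (an N x N tensor dumped from a DiT layer):
#   top, rest = split_by_magnitude(A)
#   print(effective_rank(top), effective_rank(rest))
```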

Main Findings

The results are striking:

  • SLA cuts attention computation by about 95%, i.e., roughly a 20× reduction compared to full attention.
  • The attention kernel (the GPU code that does attention) runs about 13.7× faster.
  • End-to-end video generation speeds up by about 2.2× on the Wan2.1-1.3B model.
  • Most importantly, video quality stays essentially the same as full attention, even at very high sparsity (they skip or approximate most attention work).
  • Compared to other methods that only use sparse attention or only use linear attention, SLA is both faster and produces better video quality.

In practice, they found:

  • Only about 5% of blocks need exact attention (critical).
  • About 10% can be safely skipped (negligible).
  • The rest (marginal) are handled with linear attention, which is extremely cheap in video models (less than 0.5% of full attention cost).
  • A short fine-tuning (e.g., 2,000 steps) is enough to make the model adapt and keep quality high.
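
A back-of-the-envelope check of how these fractions add up to the headline speed-up (illustrative arithmetic, not the paper's exact accounting):

```python
# Rough relative cost of SLA vs. full attention (illustrative assumptions).
critical_frac = 0.05    # ~5% of blocks given exact attention
linear_frac   = 0.005   # linear branch, <0.5% of full attention cost
total = critical_frac + linear_frac
print(f"relative cost ~ {total:.3f}, reduction ~ {1 / total:.1f}x")
# relative cost ~ 0.055, reduction ~ 18.2x  (of the same order as the ~20x claim)
```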

Implications and Impact

This research shows a practical way to make video generation models much faster without sacrificing quality. By focusing heavy computation on the few parts that truly matter and treating the rest with simple, fast methods, SLA:

  • Makes long-sequence attention more manageable.
  • Reduces costs for training and inference.
  • Helps deploy stronger video models on real hardware with lower latency.
  • Opens the door to larger, more complex models and longer videos without huge slowdowns.

In short, SLA is a smart balance: do the hard work only where it’s necessary, and use lighter methods elsewhere. This hybrid approach could influence future designs for efficient attention in not only video diffusion models but also other long-sequence tasks.


Open Problems

We found no open problems mentioned in this paper.

