LinVideo: Accelerated Video Diffusion
- LinVideo is a data-free, post-training framework that accelerates video diffusion by selectively replacing quadratic (softmax-based) attention with linear attention, achieving O(n) complexity.
- The framework employs a selective transfer mechanism that adjusts each layer’s attention mode via learnable parameters to balance efficiency and video quality.
- An anytime distribution matching (ADM) objective is used to align diffusion trajectories at every timestep, providing up to 2× overall speedup and significant latency reduction with minimal quality loss.
LinVideo is a data-free, post-training framework for accelerating video diffusion models by selectively replacing expensive quadratic (softmax-based) attention modules with linear attention in the transformer backbone, reducing attention complexity from O(n^2) to O(n) in both compute and memory. The key innovations are a progressive, task-driven “selective transfer” process that decides per layer whether to linearize, and an anytime distribution matching (ADM) objective that preserves sample quality across the full diffusion trajectory. Empirically, the framework attains up to a 2× end-to-end speedup (and up to ~16× when combined with aggressive few-step distillation) with minimal loss of perceptual or structural video quality, making efficient video generation practical at scale without costly retraining.
1. Quadratic vs. Linear Attention in Video Diffusion Models
The core challenge LinVideo addresses is the quadratic scaling of standard self-attention in long-sequence video transformers, where attention over a sequence of length $n$ incurs $O(n^2)$ cost. In standard softmax-based self-attention, each output is computed as
$$o_i = \sum_{j=1}^{n} \frac{\exp(q_i^\top k_j / \sqrt{d})}{\sum_{j'=1}^{n} \exp(q_i^\top k_{j'} / \sqrt{d})}\, v_j,$$
where $q_i$ is the $i$-th query and $k_j$, $v_j$ are the keys and values.
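For reference, a minimal sketch of this quadratic attention (illustrative PyTorch, not the paper's code) makes the $O(n^2)$ score matrix explicit:

```python
import torch

def softmax_attention(q, k, v):
    """Standard quadratic attention for one head.

    q, k, v: (n, d) tensors; materializes an (n, n) score matrix.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # (n, n): the O(n^2) bottleneck
    weights = torch.softmax(scores, dim=-1)    # row-wise normalization over keys
    return weights @ v                         # (n, d) outputs
```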
To achieve linear scaling, LinVideo employs an associative kernel mapping $\phi(\cdot)$ for linear attention:
$$o_i = \frac{\phi(q_i)^\top \sum_{j=1}^{n} \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j=1}^{n} \phi(k_j)},$$
where $\phi$ is a nonnegative feature map applied to $\hat{x}$, a linearly projected version of the input $x$. Because the summaries $\sum_j \phi(k_j) v_j^\top$ and $\sum_j \phi(k_j)$ are computed once and reused for every query, cost drops to $O(n)$.
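A minimal sketch of this computation (assuming the common elu+1 feature map from the linear-attention literature; the paper's exact $\phi$ may differ) exploits associativity to avoid the $n \times n$ matrix:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention in O(n): key/value sums are computed once.

    q, k, v: (n, d) tensors for a single head.
    """
    phi_q = F.elu(q) + 1                  # nonnegative feature map (assumed form)
    phi_k = F.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v      # (d, d): aggregate keys and values once
    z = phi_k.sum(dim=-2)                 # (d,):  normalizer statistics
    return (phi_q @ kv) / (phi_q @ z + eps).unsqueeze(-1)
```

The `kv` and `z` summaries play the role of the associative sums in the formula above and are independent of the query index.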
Directly substituting linear attention in every layer is not viable: shallow layers may tolerate linearization, but others incur severe quality drops due to linear attention's limited expressiveness and weakened long-range dependencies, which are especially critical in spatiotemporal modeling.
2. Selective Transfer: Automated Layer Linearization
The selective transfer mechanism frames the decision to replace a given attention layer as a binary classification for each module. For layer $l$, a learnable scalar $\alpha_l \in [0, 1]$ mixes conventional softmax attention and linear attention:
$$\mathrm{Attn}_l(x) = (1 - \alpha_l)\, \mathrm{SoftAttn}_l(x) + \alpha_l\, \mathrm{LinAttn}_l(x).$$
If after optimization $\alpha_l \approx 0$, the layer remains as quadratic softmax attention; otherwise, it is converted to linear attention.
Targeting a precise budget of $r$ linearized layers, a constraint loss is imposed:
$$\mathcal{L}_{\mathrm{cons}} = \Big( \sum_{l} \alpha_l - r \Big)^2,$$
and a regularization loss encourages binary values:
$$\mathcal{L}_{\mathrm{reg}} = \sum_{l} \alpha_l (1 - \alpha_l).$$
A straight-through estimator (STE) enables hard $\{0, 1\}$ decisions for $\alpha_l$ during training while keeping gradients smooth, resulting in progressive adaptation. Empirically, early/shallow layers are more replaceable, while linearizing the first layer often severely degrades quality. A schematic sketch of this gating follows.
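The sketch below illustrates how such a gated layer and its budget losses could look in PyTorch; the class and function names, the sigmoid parameterization, and the 0.5 threshold are illustrative assumptions rather than details from the paper:

```python
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    """Mixes softmax and linear attention with a learnable gate alpha."""

    def __init__(self, soft_attn, lin_attn):
        super().__init__()
        self.soft_attn = soft_attn
        self.lin_attn = lin_attn
        self.logit = nn.Parameter(torch.zeros(1))  # alpha = sigmoid(logit)

    def forward(self, x):
        alpha = torch.sigmoid(self.logit)
        hard = (alpha > 0.5).float()
        # Straight-through estimator: hard 0/1 mix in the forward pass,
        # smooth sigmoid gradient in the backward pass.
        alpha = hard + (alpha - alpha.detach())
        return (1 - alpha) * self.soft_attn(x) + alpha * self.lin_attn(x)

def budget_losses(alphas, r):
    """Constraint loss pins the expected count of linearized layers to r;
    the regularizer pushes each alpha toward 0 or 1."""
    alphas = torch.stack(alphas)
    l_cons = (alphas.sum() - r) ** 2
    l_reg = (alphas * (1 - alphas)).sum()
    return l_cons, l_reg
```

In training, `l_cons` and `l_reg` would be added with weighting coefficients to the transfer objective (the ADM loss below); once the $\alpha_l$ values saturate, each layer is hard-committed to one attention mode.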
3. Anytime Distribution Matching (ADM) Objective
The ADM objective overcomes two limitations of prior transfer objectives for attention replacement:
- Conventional MSE or few-step distillation only aligns the final sample, often leading to temporal artifacts (flicker/jitter).
- LinVideo’s ADM explicitly aligns the output distributions at every sampling timestep $t$, ensuring robust alignment of the full diffusion trajectory.
For an original model $p_{\theta}$ and linearized model $p_{\phi}$, the ADM loss is
$$\mathcal{L}_{\mathrm{ADM}} = \mathbb{E}_{t \sim \mathcal{U}[0, 1]}\, D_{\mathrm{KL}}\!\big( p_{\phi, t} \,\|\, p_{\theta, t} \big).$$
The gradient of $\mathcal{L}_{\mathrm{ADM}}$ is efficiently estimated by the difference of their rectified-flow scores:
$$\nabla_{\phi} \mathcal{L}_{\mathrm{ADM}} \approx \mathbb{E}_{t, x_t} \big[ \big( s_{\phi}(x_t, t) - s_{\theta}(x_t, t) \big)\, \nabla_{\phi} x_t \big],$$
where $s_{\theta}$ and $s_{\phi}$ are the score functions of the original and linearized networks. This structure allows efficient and accurate optimization of the objective during the selective transfer process.
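A heavily simplified sketch of a distribution-matching-style estimate of this gradient (the interpolation convention, function names, and surrogate-loss trick are assumptions for illustration, not LinVideo's actual implementation):

```python
import torch

def adm_surrogate_loss(x0, t, teacher_score, student_score):
    """ADM-style surrogate: its gradient w.r.t. x0 is the score difference.

    x0:            samples from the linearized (student) model, carrying
                   gradients back to its parameters.
    t:             (batch,) timesteps sampled uniformly in [0, 1].
    teacher_score: frozen score function of the original model.
    student_score: score function of the linearized model (detached here).
    """
    noise = torch.randn_like(x0)
    tt = t.view(-1, *([1] * (x0.dim() - 1)))
    x_t = (1 - tt) * x0 + tt * noise  # rectified-flow interpolation (assumed)
    with torch.no_grad():
        diff = student_score(x_t, t) - teacher_score(x_t, t)
    # Multiplying the detached score gap by x_t yields the desired gradient
    # through x_t (and hence through x0) when backpropagated.
    return (diff * x_t).mean()
```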
4. Quantitative Results and Performance Metrics
LinVideo achieves a 1.25–2.00× speedup in inference for video diffusion models such as Wan 1.3B, maintaining or surpassing baseline quality on multiple imaging and consistency metrics:
- Imaging Quality (FVD, LPIPS)
- Subject and Scene Consistency
- Temporal Smoothness
Aggressive few-step distillation yields a 15.92× latency reduction with minimal drop in visual quality. The selective transfer scheduler, driven by the learnable $\alpha_l$ parameters, outperforms manual or heuristic layer selection. ADM optimization recovers trajectory-level distribution matching more effectively than MSE or few-step objectives.
5. Architectural and Practical Implications
LinVideo’s post-training, data-free character means it can retrofit existing video diffusion models without retraining on large-scale video data. This makes O(n) inference practical even for long-duration or high-resolution video synthesis.
Because it operates orthogonally to other acceleration techniques (e.g., sparse attention), LinVideo can be composed with them to further enhance efficiency. The method is particularly suitable for real-time or low-latency applications, video editing, or large-scale generative deployments where computational cost is a critical bottleneck.
The approach:
- Dramatically reduces inference time and memory consumption
- Preserves (and sometimes improves) visual and temporal quality
- Automates the linearization strategy, maximizing efficiency–quality trade-offs in a model- and data-adaptive manner
6. Limitations and Scope
Not all attention layers are equally replaceable; certain layers must retain quadratic attention to preserve global modeling capacity. The method is designed for post-training application and does not modify the original network’s training (weights are kept fixed apart from the linear attention modules and the selection parameters). While LinVideo recovers the original model’s performance across a wide range of benchmarks, some pathological cases may still require custom tuning or hybrid strategies for best results.
7. Future Directions
A plausible implication is that LinVideo’s selective transfer and ADM principles could inspire analogous post-training efficiency solutions in multi-modal or spatiotemporally intricate models beyond video diffusion. Extensions to hybrid attention architectures, further integration with compression or sparse computation, and adaptation to streaming or online generation are immediate areas for investigation.
In conclusion, LinVideo provides a rigorous, automation-friendly framework for accelerating video diffusion models via selective attention linearization, achieving significant speedups with negligible sacrifice in quality and without retraining requirements (Huang et al., 9 Oct 2025).