Analysis of Attention in Video Diffusion Transformers (2504.10317v1)

Published 14 Apr 2025 in cs.CV

Abstract: We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns across different VDiTs exhibit similar structure across different prompts, and that we can make use of the similarity of attention patterns to unlock video editing via self-attention map transfer. Sparsity: We study attention sparsity in VDiTs, finding that proposed sparsity methods do not work for all VDiTs, because some layers that are seemingly sparse cannot be sparsified. Sinks: We make the first study of attention sinks in VDiTs, comparing and contrasting them to attention sinks in LLMs. We propose a number of future directions that can make use of our insights to improve the efficiency-quality Pareto frontier for VDiTs.

Summary

  • The paper reveals that VDiTs exhibit structured attention patterns driven by spatial-temporal locality, enabling zero-shot video editing through attention transfer.
  • The paper finds that while most layers tolerate significant sparsity, a few sensitive layers require precise temperature tuning to maintain video quality.
  • The paper identifies notable attention sinks in later layers and demonstrates that targeted retraining can eliminate these sinks and improve efficient sparsity implementation.

This paper analyzes the behavior of attention mechanisms within Video Diffusion Transformers (VDiTs), large models used for text-to-video generation. VDiTs treat video frames flattened across space and time as a single long sequence, making self-attention computationally dominant and expensive. The analysis identifies three key properties of attention in these models: Structure, Sparsity, and Sinks.

Structured Attention

Attention maps in VDiTs exhibit consistent structures driven by spatial-temporal locality. Patches primarily attend to their spatial neighbors (forming diagonal patterns in the attention matrix) and temporal neighbors (forming off-diagonal stripes whose strength decreases with temporal distance). This structure is observed across different VDiT models (Mochi, HunyuanVideo, Wan2.1).
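As a rough illustration of how this structure can be inspected, the sketch below averages a flattened (space × time) attention map into a frame-to-frame grid. The tensor shapes and the frame_block_means helper are illustrative assumptions for demonstration, not the paper's analysis code.

```python
import torch
import matplotlib.pyplot as plt

# Hypothetical latent-video layout: F frames of H x W patches, flattened
# into a single sequence of length F * H * W for self-attention.
F_FRAMES, H, W = 8, 16, 16
TOKENS_PER_FRAME = H * W

def frame_block_means(attn: torch.Tensor) -> torch.Tensor:
    """Average a (seq, seq) attention map into an (F, F) frame-to-frame grid."""
    attn = attn.reshape(F_FRAMES, TOKENS_PER_FRAME, F_FRAMES, TOKENS_PER_FRAME)
    return attn.mean(dim=(1, 3))  # mean over patch positions within each frame pair

# A random map stands in for an attention matrix captured via a forward hook
# on one of the model's self-attention layers.
seq_len = F_FRAMES * TOKENS_PER_FRAME
attn = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)

grid = frame_block_means(attn)
# Spatial locality shows up as a strong main diagonal (same-frame attention);
# temporal locality shows up as off-diagonal bands that fade with frame distance.
plt.imshow(grid, cmap="viridis")
plt.xlabel("key frame"); plt.ylabel("query frame")
plt.title("Frame-to-frame attention mass")
plt.savefig("attention_structure.png")
```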

Practical Application: Self-Attention Transfer for Video Editing

This structural consistency enables a novel form of zero-shot video editing. By computing the attention maps during the generation of a video for a source prompt and then reusing these exact attention maps during the generation process for a target prompt (with the same initial noise seed), the structural and cinematic elements (like camera movement) from the source video can be imposed onto the target video.

  • Implementation: This technique was demonstrated using the Wan2.1-T2V-1.3B model, chosen because its architecture separates self-attention over visual tokens from cross-attention with text tokens. The process involves two generation passes (a minimal code sketch follows this list):

    1. Generate video V1 with prompt P1, saving the self-attention maps A1 for each layer and timestep.
    2. Generate video V2 with prompt P2 (and the same seed), but instead of computing self-attention maps A2, directly use the saved maps A1.
  • Results:

    • Transferring between significantly different prompts (e.g., "car driving" source to "dog running" target) forces the target video's structure to mimic the source, largely ignoring the target prompt's content.
    • Transferring between semantically similar prompts (e.g., "A car is driving" to "A red car is driving") enables fine-grained editing. The resulting video maintains the original cinematography but incorporates the change specified in the target prompt (the car becomes red).
    • This works well for attribute changes (color) and background changes (adding "winter"). It struggles when the core object changes significantly (e.g., "car" to "truck").
  • Layer-Specific Roles: Analysis revealed that transferring attention from specific layers has distinct effects. For instance, in Wan2.1, transferring only layer 3's attention map significantly influenced the video's structure (camera angle), while transferring layer 0 or 19 produced results closer to the original source video structure. This suggests specific layers might control different generation aspects.
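The sketch below illustrates the record-and-replay idea behind attention transfer. The module and attribute names (SelfAttentionWithTransfer, saved_maps, replay_idx) are illustrative assumptions, not Wan2.1's actual implementation; in practice one would wrap the model's existing self-attention blocks.

```python
import torch

class SelfAttentionWithTransfer(torch.nn.Module):
    """Self-attention block that can record attention maps (source pass, P1)
    and replay them verbatim (target pass, P2)."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads = heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.record = False       # set True while generating the source video V1
        self.saved_maps = []      # one map per forward call (per denoising step)
        self.replay_idx = None    # set to 0 before generating V2 to replay A1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        if self.replay_idx is not None:
            attn = self.saved_maps[self.replay_idx]    # reuse A1 from prompt P1
            self.replay_idx += 1
        else:
            attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            if self.record:
                self.saved_maps.append(attn.detach())  # store A1 for later transfer
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```

Generating with P1 (record=True) fills saved_maps; generating with P2 and the same noise seed (replay_idx=0) then reuses those maps, so the target video inherits the source's structure and camera motion.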

Finding: First Text Token Dominance

In models like Mochi and HunyuanVideo, visual tokens predominantly attend to the first text token (often a start-of-sequence token like <s>). Experiments showed that generating a video using only the features of this first token produced results very similar to using the full prompt, suggesting the initial token captures a summary sufficient for generation, potentially due to the bidirectional nature of the text encoder.
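A minimal sketch of this probe, assuming the conditioning interface accepts a (batch, tokens, dim) tensor of text-encoder outputs; the function name and shapes are illustrative rather than the models' actual APIs.

```python
import torch

def first_token_only(text_embeds: torch.Tensor) -> torch.Tensor:
    """Reduce the prompt conditioning to the first text token's embedding.

    text_embeds: (batch, num_text_tokens, dim) output of the text encoder.
    Returns a (batch, 1, dim) tensor used in place of the full prompt; because
    the text encoder is bidirectional, this token can carry a prompt summary.
    """
    return text_embeds[:, :1, :]
```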

Attention Sparsity

While attention maps visually appear sparse, naively enforcing sparsity by masking out the lowest k% attention weights during generation (even for small k like 10%) causes significant degradation in video quality (e.g., pixelation).

  • Finding: Layer Sensitivity: This degradation isn't uniform. Layer-wise masking experiments on Mochi revealed that most layers tolerate 20% sparsity well, but two specific layers (44 and 45) are highly sensitive, causing quality drops even at 20% sparsity.
  • Result: When these two sensitive layers (44 and 45) were excluded from masking, the model could tolerate much higher sparsity (up to 70%) across all other layers while maintaining reasonable generation quality. This implies most of the attention is indeed sparse, but a few critical layers rely on low-magnitude attention weights.
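A sketch of the masking experiment described above, assuming post-softmax attention maps are accessible; the function name and the renormalization choice are illustrative, not the paper's exact procedure.

```python
import torch

def mask_lowest_k_percent(attn: torch.Tensor, k: float) -> torch.Tensor:
    """Zero out the lowest k fraction of attention weights in each query row,
    then renormalize so rows still sum to 1.

    attn: (..., queries, keys) post-softmax attention map.
    k:    fraction in [0, 1), e.g. 0.2 masks the smallest 20% of weights.
    """
    n_keys = attn.shape[-1]
    n_drop = int(k * n_keys)
    if n_drop == 0:
        return attn
    # k-th smallest weight per query row; everything at or below it is dropped.
    thresh = attn.kthvalue(n_drop, dim=-1, keepdim=True).values
    masked = torch.where(attn <= thresh, torch.zeros_like(attn), attn)
    return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-8)

# Per the layer-sensitivity finding, this would be applied to every layer
# except the sensitive ones (44 and 45 in Mochi), which are left unmasked.
```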

Practical Application: Temperature Tuning

Applying a temperature T directly in the attention score computation, softmax(QK^T / (√d_k · T)), modulates sparsity.

  • Implementation: Temperature was applied within the self-attention module, specifically testing its effect on single layers.
  • Results: Modifying the temperature of just one sensitive layer (e.g., layer 44 in Mochi) dramatically changed the output video. Temperatures > 1 often introduced artifacts, while lower temperatures (e.g., T=0.2) often produced high-quality results, sometimes improving upon the original (T=1.0). This suggests temperature can be a control knob for generation style or quality, especially when targeted at specific layers.
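A minimal sketch of temperature-scaled attention, with the temperature applied inside the softmax as described; the function is illustrative and would be substituted into a single target layer.

```python
import torch

def attention_with_temperature(q: torch.Tensor, k: torch.Tensor,
                               v: torch.Tensor, temperature: float = 1.0):
    """Scaled dot-product attention with an extra temperature knob T.

    Computes softmax(QK^T / (sqrt(d_k) * T)) V. T < 1 sharpens (sparsifies)
    the attention map; T > 1 flattens it. In the experiments described above,
    only a single layer (e.g. layer 44 in Mochi) uses T != 1.
    """
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5 * temperature)
    return torch.softmax(scores, dim=-1) @ v
```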

Attention Sinks

Similar to LLMs, some attention heads in VDiTs exhibit "attention sinks," where most query tokens attend heavily to a single key token position. This was observed consistently in Mochi (especially later layers), occasionally in Hunyuan, but not in Wan or CogVideoX.

  • Characteristics in VDiTs vs. LLMs:
    • Low Contribution: Sink tokens in VDiTs have significantly smaller value norms compared to other tokens, suggesting they contribute little information, similar to sinks in LLMs. Skipping (zeroing the output of) sink heads during generation had minimal impact on Mochi's output quality, while skipping random non-sink heads caused degradation.
    • Layer/Head Consistency: Unlike LLMs where sinks can appear anywhere, VDiT sinks (in Mochi) are concentrated in the final layers (44-47), especially the last two (46, 47), and consistently appear in the same specific heads across different prompts and denoising steps.
    • Spatial Distribution: Sink token positions appear randomly scattered across the spatial dimensions within a frame, with no consistent pattern across prompts.
    • Temporal Bias: Sink tokens are much more likely to appear in the first latent frame compared to later frames.
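One way such sink heads could be detected is sketched below; the 0.5 mass threshold and tensor layout are illustrative assumptions rather than the paper's exact criterion.

```python
import torch

def find_sink_heads(attn: torch.Tensor, values: torch.Tensor,
                    mass_threshold: float = 0.5):
    """Flag heads whose attention mass collapses onto a single key position.

    attn:   (heads, queries, keys) post-softmax map for one layer.
    values: (heads, keys, head_dim) value vectors for the same layer.
    A head is reported as a sink head if, averaged over queries, more than
    `mass_threshold` of its attention lands on one key token.
    """
    per_key_mass = attn.mean(dim=1)                    # (heads, keys)
    top_mass, sink_pos = per_key_mass.max(dim=-1)      # strongest key per head
    sink_heads = (top_mass > mass_threshold).nonzero(as_tuple=True)[0]
    # Sink tokens in VDiTs tend to have small value norms, i.e. they
    # contribute little information to the head's output.
    sink_value_norms = values[sink_heads, sink_pos[sink_heads]].norm(dim=-1)
    return sink_heads.tolist(), sink_pos[sink_heads].tolist(), sink_value_norms
```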

Mitigation via Retraining

The issues of sparsity-sensitive layers (44, 45) and sink-prone layers (46, 47) in Mochi were addressed by targeted retraining.

  • Implementation: The final four transformer blocks (layers 44-47) of a pre-trained Mochi variant were reinitialized and retrained on a video dataset, while the rest of the model parameters were kept frozen. Standard diffusion model training practices (AdamW, constant LR, EMA) were used.
  • Result: After retraining, these final layers exhibited sparser attention patterns similar to earlier layers, and attention sinks were eliminated. Importantly, these layers could now tolerate 20% attention masking without degrading output quality, improving the model's potential for efficient sparse attention implementations.
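A sketch of this partial-retraining setup, assuming the transformer blocks are exposed as model.blocks; the initialization scheme, learning rate, and EMA decay shown are placeholder choices, not the paper's exact hyperparameters.

```python
import copy
import torch
from torch.optim import AdamW

def prepare_partial_retrain(model, retrain_ids=(44, 45, 46, 47), lr=1e-4):
    """Freeze the whole model, then reinitialize and unfreeze the chosen blocks."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = []
    for i in retrain_ids:
        block = model.blocks[i]
        for m in block.modules():                     # reinitialize the block
            if isinstance(m, torch.nn.Linear):
                torch.nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    torch.nn.init.zeros_(m.bias)
        for p in block.parameters():
            p.requires_grad_(True)
            trainable.append(p)
    optimizer = AdamW(trainable, lr=lr)               # constant learning rate
    ema_model = copy.deepcopy(model)                  # EMA copy of the weights
    return optimizer, ema_model

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Exponential moving average of weights, updated after each optimizer step."""
    for ep, p in zip(ema_model.parameters(), model.parameters()):
        ep.mul_(decay).add_(p, alpha=1 - decay)
```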

Practical Insights and Future Directions

  1. Efficient Attention: The consistent structure suggests potential for fixed sparse attention patterns or initializing models with these structures to improve efficiency.
  2. Temperature Control: Layer-specific temperature tuning offers a way to control generation without retraining. Learnable temperature parameters could aid fine-tuning.
  3. Layer-Wise Sparsity: Sparsification techniques should consider layer sensitivity; uniform sparsity may be detrimental. Identifying and potentially preserving sensitive layers (or retraining them as shown) is crucial.
  4. Attention Transfer Editing: Self-attention transfer is a promising direction for fine-grained video editing. Targeting specific layers known to control certain aspects (like camera angle) could enable more precise control.
  5. Fine-grained Text Control: While focusing on the first text token is efficient, ensuring models attend appropriately to the entire prompt might be necessary for complex or nuanced generations.