Enhance-A-Video: Temporal Coherence Boost
- The paper introduces a novel, training-free method that re-weights cross-frame attention in DiT-based video diffusion models to improve temporal consistency and visual quality.
- It leverages a scalar temperature parameter to amplify under-utilized cross-frame attention while clipping to preserve intra-frame integrity.
- Empirical evaluations demonstrate improved benchmark scores and human preference ratings, with minimal computational overhead and broad compatibility.
Enhance-A-Video refers to a category of video enhancement methods unified by the objective of improving generated or compressed videos’ perceptual quality, consistency, and robustness via plug-and-play, context-aware adaptation or attention-based post-processing. Notably, the term “Enhance-A-Video” corresponds directly to a training-free method for DiT-based video diffusion models that re-weights cross-frame temporal attention during inference to boost temporal consistency and visual quality, without any retraining or fine-tuning (Luo et al., 11 Feb 2025). The approach and its empirical evaluation illuminate broader principles for post-hoc enhancement of generative and compressed video streams.
1. Motivation and Background
State-of-the-art video generation frameworks employing Diffusion Transformers (DiT), such as HunyuanVideo, CogVideoX, LTX-Video, and Open-Sora, rely on interleaved spatial-temporal self-attention layers to synthesize coherent video clips across frames. In practice, these models exhibit an excessive dominance of intra-frame (diagonal) attention components, which attenuates the cross-frame dependencies critical for motion smoothness and visual coherence. This imbalance often manifests as temporal flickering or object artifacts in generated sequences. Enhance-A-Video directly addresses this by amplifying under-utilized cross-frame attention mass at inference.
2. Core Algorithmic Principle: Cross-Frame Attention Re-weighting
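For each DiT block’s temporal attention map $A \in \mathbb{R}^{F \times F}$, where $F$ is the number of frames, Enhance-A-Video introduces a scalar temperature parameter $\tau$ that re-scales the average non-diagonal (“cross-frame”) attention, while clipping the scale factor to a minimum of 1 to avoid attenuating cross-frame signals. The Cross-Frame Intensity (CFI) is computed as

$$\mathrm{CFI} = \frac{1}{F(F-1)} \sum_{i=1}^{F} \sum_{j \neq i} A_{ij}.$$

The enhanced intensity is

$$\mathrm{CFI}_{\text{enhanced}} = \max\bigl(1,\ (\tau + F) \cdot \mathrm{CFI}\bigr).$$

The attention output in residual form is replaced by

$$O = H + \mathrm{CFI}_{\text{enhanced}} \cdot \mathrm{Attn}(H),$$

where $H$ is the block’s input hidden state, and $\mathrm{Attn}(H)$ is the self-attention output.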
Distinctively, Enhance-A-Video (i) acts only on attention outputs—not on the softmax attention weights, (ii) is triggered by the actual off-diagonal (cross-frame) mass at each block, and (iii) preserves intra-frame structure by clipping.
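The re-weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the attention map is a small nested list rather than a tensor, and the scaling form `(tau + F) * CFI` is an assumption about how the temperature enters the enhanced intensity.

```python
def cross_frame_intensity(attn):
    """Mean of the off-diagonal entries of a square F x F temporal attention map."""
    num_frames = len(attn)
    if num_frames < 2:
        raise ValueError("need at least two frames to have cross-frame attention")
    off_diag_sum = sum(
        attn[i][j]
        for i in range(num_frames)
        for j in range(num_frames)
        if i != j
    )
    return off_diag_sum / (num_frames * (num_frames - 1))


def enhanced_intensity(cfi, num_frames, tau):
    """Scale the CFI and clip it from below at 1.

    Assumption: the temperature enters as (tau + F) * CFI.
    """
    return max(1.0, (tau + num_frames) * cfi)


def enhanced_block_output(hidden, attn_out, scale):
    """Residual update H + scale * Attn(H), applied element-wise."""
    return [h + scale * a for h, a in zip(hidden, attn_out)]


# A map dominated by its diagonal: each frame attends mostly to itself.
diagonal_heavy = [
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]
cfi = cross_frame_intensity(diagonal_heavy)
scale = enhanced_intensity(cfi, num_frames=3, tau=1.2)  # clipped to 1.0 here
print(cfi, scale)
print(enhanced_block_output([1.0, 2.0], [0.5, -0.5], scale))
```

Because the scale multiplies the attention output rather than the softmax weights, the attention map itself is left untouched, and the clip at 1 means the block never receives less attention signal than the baseline.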
3. Model Integration and Compatibility
Enhance-A-Video resides entirely within the inference-time forward pass and thus does not require any retraining, fine-tuning, or modification of pretrained weights/buffers. It is broadly compatible with DiT architectures that use either full 3D attention (as in HunyuanVideo, CogVideoX, LTX-Video) or decomposed spatial-temporal schemes (Open-Sora, Open-Sora-Plan) (Luo et al., 11 Feb 2025). For 3D attention, the method isolates per-spatial-token blocks, then computes the CFI and applies the enhancement analogously. Memory and computational overhead is negligible (0.8–2.1%), as only a single scalar multiplication per block is added.
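For the full 3D attention case, one plausible reading of "isolating per-spatial-token blocks" is sketched below: for each spatial position, gather that position's query and key vectors across all frames, form a small frame-by-frame attention map, and average the off-diagonal mass over positions. The `q[f][s]` layout, the function names, and the choice to recompute a softmax over frames only are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(queries, keys):
    """F x F attention among the F per-frame vectors of one spatial position."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def mean_cfi_over_tokens(q, k):
    """Average off-diagonal attention over all spatial positions.

    q[f][s] and k[f][s] hold the query/key vector of spatial token s in
    frame f (assumed layout, not taken from any specific model).
    """
    num_frames, num_tokens = len(q), len(q[0])
    total = 0.0
    for s in range(num_tokens):
        attn = temporal_attention(
            [q[f][s] for f in range(num_frames)],
            [k[f][s] for f in range(num_frames)],
        )
        off_diag = sum(
            attn[i][j]
            for i in range(num_frames)
            for j in range(num_frames)
            if i != j
        )
        total += off_diag / (num_frames * (num_frames - 1))
    return total / num_tokens


# 4 frames, 2 spatial tokens, 3-dim vectors of zeros: attention is uniform.
zeros = [[[0.0] * 3 for _ in range(2)] for _ in range(4)]
print(mean_cfi_over_tokens(zeros, zeros))
```

The resulting scalar can then be scaled and clipped exactly as in the decomposed temporal-attention case, so the only extra work for 3D attention is this regrouping step.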
4. Empirical Validation
The framework has been validated across multiple video diffusion models and datasets using both human preference studies and the VBench suite:
| Model | VBench score (original) | VBench score (enhanced) |
|---|---|---|
| CogVideoX | 77.27 | 77.34 |
| Open-Sora | 79.04 | 79.16 |
| LTX-Video | 71.93 | 72.04 |
Human studies (n=110) revealed majority preferences for Enhance-A-Video versions on temporal consistency and overall quality, with more than 60% favoring the enhanced output on these criteria. Ablations demonstrate that $\tau \approx 1.15$–$1.2$ achieves the optimal trade-off between coherence and sharpness, and that omitting the clipping step introduces visible blur/artifacts.
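To put the benchmark numbers in perspective, the per-model gain can be computed directly from the table. The short sketch below only does arithmetic on the scores reported above; the dictionary and function names are illustrative.

```python
# (original, enhanced) VBench scores as reported in the table above.
vbench = {
    "CogVideoX": (77.27, 77.34),
    "Open-Sora": (79.04, 79.16),
    "LTX-Video": (71.93, 72.04),
}


def vbench_gains(scores):
    """Absolute score gain per model, rounded to two decimals."""
    return {
        model: round(enhanced - original, 2)
        for model, (original, enhanced) in scores.items()
    }


gains = vbench_gains(vbench)
for model, gain in gains.items():
    print(f"{model}: +{gain:.2f}")
```

The gains fall between 0.07 and 0.12 points, so the VBench improvements are consistent in direction but small in magnitude; the human preference results carry more of the evidence for perceptual quality.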
5. Application Examples and Qualitative Improvements
Enhance-A-Video produces pronounced qualitative gains in exemplar scenarios:
- HunyuanVideo: With the “antique car” prompt, the car now drives consistently forward; baseball players no longer show duplicated limbs or head flicker.
- CogVideoX: The “balloon full of water” prompt achieves correct object retention rather than disassociated motion artifacts.
- LTX-Video: Enhanced texture sharpness is observed in snow peaks, along with improved water definition in canyon scenes.
- Open-Sora variants: Smoother object motion and refined spatial details (flowers, waterfalls, cakes).
These results generalize across both prompt-driven and arbitrary-motion scenarios, supporting the technique’s broad utility in text-to-video synthesis pipelines.
6. Implementation Details and Limitations
Key implementation hyperparameters include:
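- Temperature $\tau$: Typically set in [1.0, 1.3]; best results at 1.15–1.2.
- Clipping: Strict enforcement of a lower bound of 1 on the enhanced intensity ($\mathrm{CFI}_{\text{enhanced}} \geq 1$) to avoid pathological smoothing.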
- No modification to attention masks or positional encodings.
- Plug-and-play activation at inference, requiring only minor engineering effort.
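The hyperparameters listed above could be bundled into a small inference-time settings object. The sketch below is hypothetical: `EnhanceConfig` and `attention_scale` are illustrative names, the strict range check is a design choice rather than a requirement of the method, and the `(tau + F) * CFI` scaling form is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnhanceConfig:
    """Hypothetical inference-time settings; not an API of any real library."""

    enabled: bool = True
    tau: float = 1.2  # inside the 1.15-1.2 range reported as best

    def __post_init__(self):
        # Strict by choice: reject values outside the typical [1.0, 1.3] range.
        if not 1.0 <= self.tau <= 1.3:
            raise ValueError(f"tau={self.tau} is outside the typical range [1.0, 1.3]")


def attention_scale(cfi, num_frames, config):
    """Scalar applied to a block's attention output.

    Returns 1.0 (baseline behavior) when enhancement is disabled.
    Assumption: the temperature enters as (tau + F) * CFI.
    """
    if not config.enabled:
        return 1.0
    # Clip from below at 1 so the attention output is never attenuated.
    return max(1.0, (config.tau + num_frames) * cfi)


print(attention_scale(0.20, num_frames=8, config=EnhanceConfig()))
print(attention_scale(0.01, num_frames=8, config=EnhanceConfig()))  # clipped to 1.0
print(attention_scale(0.20, num_frames=8, config=EnhanceConfig(enabled=False)))
```

The second call shows the clipping at work: when the cross-frame mass is very low, the scale falls back to 1 and the block behaves exactly like the unmodified model, which is also why gains are limited in that regime.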
A potential limitation is the method’s reliance on DiT-compatible architectures with explicit temporal attention. Extremely low cross-frame mass or uninformative off-diagonal patterns may limit the gains achievable. The method does not affect model generation diversity beyond its effect on temporal correlation.
7. Impact and Prospects
The introduction of Enhance-A-Video demonstrates that model-level post-processing via attention-mass re-weighting can robustly and efficiently repair temporal incoherence and visual artifacts in state-of-the-art video synthesis. As the approach is training-free, plug-and-play, and compatible with a broad spectrum of pretrained DiT models, it sets a practical precedent for future work in generative video enhancement, either via more nuanced attention metrics or adaptive conditioning schemes (Luo et al., 11 Feb 2025). Its empirical success in human and benchmark evaluations substantiates the broader claim that cross-frame attention is an underutilized leverage point in current generative architectures.
In summary, Enhance-A-Video is an inference-time method that rebalances the temporal self-attention structure of DiT-based diffusion video models by amplifying cross-frame aggregation using a clipped temperature parameter. The procedure produces reliably improved temporal coherence and visual quality, with minimal computational overhead and broad implementation compatibility (Luo et al., 11 Feb 2025).