Bayesian Active Noise Selection in Video Diffusion Models
The paper, "Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model" by Kwanyoung Kim and Sanghyun Kim, presents a novel approach to noise selection in video diffusion models. The research introduces Active Noise Selection for Generation (ANSE), leveraging Bayesian uncertainty principles integrated with model-aware attention mechanisms to optimize noise initialization, thus improving video quality and temporal coherence in text-to-video (T2V) diffusion processes.
Overview
Diffusion models have rapidly advanced in generative tasks such as T2V generation, extending Text-to-Image (T2I) frameworks with temporal components to produce coherent video sequences. The authors emphasize the critical role of noise initialization at inference time: different noise seeds for the same prompt can yield markedly different generations. Prior approaches that rely on external priors (e.g., frequency filters or inter-frame smoothing) are computationally expensive and ignore the model's internal signals, which may better indicate which noise seeds are suitable.
To address this, the authors propose ANSE, whose core component, Bayesian Active Noise Selection via Attention (BANSA), selects noise seeds according to the model's confidence. BANSA quantifies attention-based uncertainty as the entropy disagreement across stochastic attention samples. To keep scoring efficient, the authors further introduce a Bernoulli-masked approximation of BANSA that estimates the score within a single diffusion step, making ANSE practical at inference time; a minimal sketch of this scoring procedure is given below.
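To make the scoring procedure concrete, here is a minimal, hypothetical sketch of a BANSA-style acquisition in PyTorch. It is not the authors' reference implementation: the `model.single_step_attention` and `model.latent_shape` accessors are placeholders for however attention maps and latent shapes are exposed in practice, and the mask probability, number of candidates, and the choice to pick the lowest-disagreement (most confident) seed are illustrative assumptions inferred from the summary above.

```python
# Hypothetical BANSA-style noise scoring; not the authors' reference implementation.
import torch


def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean entropy of row-stochastic attention maps with shape [..., queries, keys]."""
    eps = 1e-12
    row_entropy = -(attn * (attn + eps).log()).sum(dim=-1)
    return row_entropy.mean()


def bansa_score(attn_samples: list[torch.Tensor]) -> torch.Tensor:
    """BALD-style disagreement: H(mean attention) - mean H(attention samples).

    Low values mean the stochastic attention samples agree, i.e. the model is
    confident about where to attend for this noise seed.
    """
    stacked = torch.stack(attn_samples)  # [S, ..., Q, K]
    entropy_of_mean = attention_entropy(stacked.mean(dim=0))
    mean_of_entropies = torch.stack(
        [attention_entropy(a) for a in attn_samples]
    ).mean()
    return entropy_of_mean - mean_of_entropies


def select_noise(model, prompt: str, num_candidates: int = 10,
                 num_masks: int = 4, mask_prob: float = 0.1) -> torch.Tensor:
    """Score candidate noise seeds with one denoising step each and return the
    seed the model is most confident about (lowest BANSA score)."""
    best_noise, best_score = None, float("inf")
    for _ in range(num_candidates):
        noise = torch.randn(model.latent_shape)
        samples = []
        for _ in range(num_masks):
            # One attention readout from a single denoising step; Bernoulli
            # masking injects the stochasticity that would otherwise require
            # multiple stochastic forward passes.
            attn = model.single_step_attention(noise, prompt)
            mask = torch.bernoulli(torch.full_like(attn, 1.0 - mask_prob))
            masked = attn * mask
            masked = masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-12)
            samples.append(masked)
        score = bansa_score(samples).item()
        if score < best_score:
            best_noise, best_score = noise, score
    return best_noise
```

The score has the familiar BALD structure, the entropy of the averaged attention minus the average entropy of the individual samples, so it is low when the stochastic samples agree and high when they disagree, which is what "model confidence" means in this setting.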
Key Results
Extensive experiments on CogVideoX-2B and CogVideoX-5B show that ANSE yields marked improvements in video quality and temporal coherence with only modest increases in inference time (about 8% for CogVideoX-2B and 13% for CogVideoX-5B). Quantitative VBench evaluations confirm that ANSE improves both perceptual fidelity and semantic alignment over the vanilla models without noise selection.
Implications
The implications of this research are both practical and theoretical:
- Practical Implications: ANSE improves video quality at a modest inference-time cost, enabling more efficient and reliable T2V generation in practical applications. Because noise selection is guided by the model's own attention, it avoids the computationally expensive external priors required by prior methods.
- Theoretical Implications: Integrating Bayesian uncertainty into generative modeling workflows offers a new way to assess and exploit internal model signals, potentially informing future work on optimizing generative models and scaling their performance.
Speculation on Future Developments
This work points toward improving generative models through internal model introspection rather than external manipulation. Future developments could extend attention-based uncertainty measures to other generative domains, including large-scale LLM deployments, where similar inference-time techniques could improve output reliability and coherence. Combining ANSE with other noise optimization strategies might also yield complementary benefits in more complex model architectures.
Conclusion
Overall, the paper contributes a scalable, efficient, and internally guided approach to improving the generative quality of videos. As the field continues to balance computational cost against output fidelity, inference-time strategies like ANSE may prove influential in shaping future generative model capabilities.