Balanced SNR-Aware Distillation for Guided Text-to-Audio Generation

Published 25 Dec 2023 in cs.SD and eess.AS (arXiv:2312.15628v1)

Abstract: Diffusion models have demonstrated promising results in text-to-audio generation tasks. However, their practical usability is hindered by slow sampling speeds, which limits their applicability in high-throughput scenarios. Progressive distillation methods address this challenge by producing more compact and efficient models, but they suffer from unbalanced loss weights at high and low noise levels, which can degrade the quality of generated samples. In this paper, we adapt progressive distillation to text-to-audio generation and introduce the Balanced SNR-Aware (BSA) method, an enhanced loss-weighting mechanism for diffusion distillation that balances the loss weights across both high and low noise levels. We evaluate the proposed method on the AudioCaps dataset; experimental results show superior performance during the reverse diffusion process compared to previous distillation methods at the same number of sampling steps. Furthermore, the BSA method reduces the number of sampling steps from 200 to 25 with minimal performance degradation relative to the original teacher models.
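The core idea of an SNR-aware loss weight can be illustrated with a minimal sketch. The snippet below is not the paper's exact BSA formulation; it assumes a min-SNR-style truncation at high noise-to-signal ratios plus a floor at low SNR so that very noisy timesteps still contribute, which captures the "balanced at both ends" intent described in the abstract. The names `gamma` and `eps` are illustrative hyperparameters, not taken from the paper.

```python
def snr(alpha_t: float, sigma_t: float) -> float:
    # Signal-to-noise ratio of the diffusion forward process at step t,
    # where x_t = alpha_t * x_0 + sigma_t * noise.
    return (alpha_t ** 2) / (sigma_t ** 2)

def balanced_snr_weight(snr_t: float, gamma: float = 5.0, eps: float = 0.1) -> float:
    # Hypothetical balanced weighting: cap the per-timestep loss weight
    # at high SNR (as in min-SNR weighting) and floor it at low SNR so
    # neither regime dominates or vanishes from the distillation loss.
    return max(min(snr_t, gamma), eps)
```

With this shape, a clean (high-SNR) timestep such as `snr(0.99, 0.05)` is capped at `gamma`, while a nearly pure-noise timestep such as `snr(0.05, 0.99)` is lifted to `eps`; intermediate timesteps keep their raw SNR as the weight.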

