Balanced SNR-Aware Distillation for Guided Text-to-Audio Generation
Abstract: Diffusion models have demonstrated promising results in text-to-audio generation tasks. However, their practical usability is hindered by slow sampling speeds, limiting their applicability in high-throughput scenarios. To address this challenge, progressive distillation methods have been effective in producing more compact and efficient models. Nevertheless, these methods encounter issues with unbalanced weights at both high and low noise levels, potentially impacting the quality of generated samples. In this paper, we propose the adaptation of the progressive distillation method to text-to-audio generation tasks and introduce the Balanced SNR-Aware~(BSA) method, an enhanced loss-weighting mechanism for diffusion distillation. The BSA method employs a balanced approach to weight the loss for both high and low noise levels. We evaluate our proposed method on the AudioCaps dataset and report experimental results showing superior performance during the reverse diffusion process compared to previous distillation methods with the same number of sampling steps. Furthermore, the BSA method allows for a significant reduction in sampling steps from 200 to 25, with minimal performance degradation when compared to the original teacher models.
- “Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models,” arXiv preprint arXiv:2301.12661, 2023.
- “Mou^^u\rm\hat{u}over^ start_ARG roman_u end_ARGsai: Text-to-music generation with long-context latent diffusion,” arXiv preprint arXiv:2301.11757, 2023.
- “Audit: Audio editing by following instructions with latent diffusion models,” arXiv preprint arXiv:2304.00830, 2023.
- “Text-driven foley sound generation with latent diffusion model,” arXiv preprint arXiv:2306.10359, 2023.
- “Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10219–10228.
- “Diffsound: Discrete diffusion model for text-to-sound generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
- “Audioldm: Text-to-audio generation with latent diffusion models,” arXiv preprint arXiv:2301.12503, 2023.
- “Text-to-audio generation using instruction-tuned llm and latent diffusion model,” arXiv preprint arXiv:2304.13731, 2023.
- “Progressive distillation for fast sampling of diffusion models,” arXiv preprint arXiv:2202.00512, 2022.
- “Efficient diffusion training via min-snr weighting strategy,” arXiv preprint arXiv:2303.09556, 2023.
- “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
- “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- “Parallel wavenet: Fast high-fidelity speech synthesis,” in International conference on machine learning. PMLR, 2018, pp. 3918–3926.
- “On distillation of guided diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14297–14306.
- “Muse: Text-to-image generation via masked generative transformers,” arXiv preprint arXiv:2301.00704, 2023.
- “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
- “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
- “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
- “Audiocaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 119–132.
- “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms.,” in INTERSPEECH, 2019, pp. 2350–2354.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.