SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation (2405.15338v1)
Abstract: We present SoundLoCD, a novel text-to-sound generation framework that incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be trained efficiently with limited computational resources. The integration of a contrastive learning strategy further strengthens the connection between text conditions and generated outputs, yielding coherent, high-fidelity results. Our experiments demonstrate that SoundLoCD outperforms the baseline while requiring far fewer computational resources, and a comprehensive ablation study validates the contribution of each component within SoundLoCD. Demo page: https://XinleiNIU.github.io/demo-SoundLoCD/
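To make the two techniques named in the abstract concrete, here is a minimal, hypothetical PyTorch sketch: a LoRA adapter that adds a trainable low-rank update to a frozen pretrained layer, and a generic InfoNCE-style contrastive loss that ties text embeddings to generated-audio embeddings. All names here (`LoRALinear`, `info_nce`, `r`, `alpha`, `temperature`) are illustrative assumptions, not from the paper, and SoundLoCD's actual contrastive objective is defined over discrete diffusion steps rather than this simple pairwise form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A, B of rank r << min(d_in, d_out)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay fixed
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

def info_nce(text_emb: torch.Tensor, audio_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Generic contrastive loss: matched (text, audio) pairs sit on the
    diagonal of the similarity matrix and are pushed above mismatched pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Example: wrap one projection of a pretrained backbone and train only the adapter.
layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16.0)
```

In a setup like this, only `lora_A` and `lora_B` receive gradients while the pretrained weights stay frozen, which is what makes fine-tuning feasible under tight compute budgets.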