
SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation (2405.15338v1)

Published 24 May 2024 in cs.SD and eess.AS

Abstract: We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: \url{https://XinleiNIU.github.io/demo-SoundLoCD/}.
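The abstract's key efficiency claim rests on LoRA-based adaptation: rather than training all weights of the conditional diffusion model, only small low-rank matrices are updated. The paper does not give implementation details, so the sketch below illustrates only the generic LoRA idea (a frozen weight matrix `W` plus a trainable low-rank update `B @ A`); all names and shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Linear layer with a LoRA update: y = x @ (W + (alpha/r) * B @ A).T.

    W is the frozen pretrained weight (d_out x d_in); only A (r x d_in)
    and B (d_out x r) would be trained, with rank r << min(d_out, d_in).
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)  # low-rank weight update
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-init
x = rng.standard_normal((4, d_in))

y = lora_forward(x, W, A, B)
# Zero-initialising B makes the LoRA branch a no-op at the start of
# training, so the adapted layer initially matches the frozen base layer.
assert np.allclose(y, x @ W.T)
```

The trainable parameter count here is `r * (d_in + d_out)` versus `d_in * d_out` for full fine-tuning, which is the usual source of the reduced-compute benefit the abstract refers to.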
