
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions (2407.04416v3)

Published 5 Jul 2024 in cs.SD, cs.MM, and eess.AS

Abstract: Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using an LLM. The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs enriched with details such as the order of audio events, the places where they occur, and environment information. We then demonstrate that training text-to-audio generation models with Sound-VECaps significantly improves performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks, showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online.

Authors (12)
  1. Yi Yuan (54 papers)
  2. Dongya Jia (18 papers)
  3. Xiaobin Zhuang (9 papers)
  4. Yuanzhe Chen (19 papers)
  5. Zhengxi Liu (4 papers)
  6. Zhuo Chen (319 papers)
  7. Yuping Wang (56 papers)
  8. Yuxuan Wang (239 papers)
  9. Xubo Liu (66 papers)
  10. Mark D. Plumbley (114 papers)
  11. Wenwu Wang (148 papers)
  12. Xiyuan Kang (3 papers)