
On The Open Prompt Challenge In Conditional Audio Generation (2311.00897v1)

Published 1 Nov 2023 in cs.SD, cs.CL, and eess.AS

Abstract: Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging as user-input prompts are often under-specified when compared to text descriptions used to train TTA models. In this work, we treat TTA models as a "blackbox" and address the user prompt challenge with two key insights: (1) User prompts are generally under-specified, leading to a large alignment gap between user prompts and training prompts. (2) There is a distribution of audio descriptions for which TTA models are better at generating higher quality audio, which we refer to as "audionese". To this end, we rewrite prompts with instruction-tuned models and propose utilizing text-audio alignment as feedback signals via margin ranking learning for audio improvements. On both objective and subjective human evaluations, we observed marked improvements in both text-audio alignment and music audio quality.
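The margin ranking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the margin value, and the alignment scores below are hypothetical, and the abstract does not say how the alignment scores are computed or how the loss is applied to the rewriting model. The sketch only shows the generic hinge-style ranking loss, which is zero once the preferred rewrite's alignment score leads the other by at least the margin.

```python
def margin_ranking_loss(score_preferred, score_other, margin=0.1):
    """Generic hinge-style ranking loss (illustrative, not the paper's code).

    Returns 0 when the preferred score exceeds the other score by at
    least `margin`; otherwise returns the size of the shortfall.
    """
    return max(0.0, margin - (score_preferred - score_other))


def rank_rewrites(rewrites, alignment_scores):
    """Order candidate prompt rewrites by alignment score, highest first.

    `alignment_scores` stands in for a text-audio alignment signal; how
    that signal is obtained is an assumption left outside this sketch.
    """
    paired = sorted(zip(alignment_scores, rewrites), reverse=True)
    return [text for _, text in paired]


# Hypothetical scores for two rewrites of the same user prompt.
well_ordered = margin_ranking_loss(0.8, 0.3)   # preferred rewrite already leads
mis_ordered = margin_ranking_loss(0.3, 0.8)    # preferred rewrite trails
```

In this toy setting, a pair that is already ordered correctly contributes no loss, while a mis-ordered pair contributes a penalty that grows with the gap between the scores.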

Authors (11)
  1. Ernie Chang (34 papers)
  2. Sidd Srinivasan (3 papers)
  3. Mahi Luthra (2 papers)
  4. Pin-Jie Lin (10 papers)
  5. Varun Nagaraja (9 papers)
  6. Forrest Iandola (23 papers)
  7. Zechun Liu (48 papers)
  8. Zhaoheng Ni (32 papers)
  9. Changsheng Zhao (17 papers)
  10. Yangyang Shi (54 papers)
  11. Vikas Chandra (75 papers)
Citations (2)