On The Open Prompt Challenge In Conditional Audio Generation (2311.00897v1)
Abstract: Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging because user-input prompts are often under-specified compared to the text descriptions used to train TTA models. In this work, we treat TTA models as a "blackbox" and address the user prompt challenge with two key insights: (1) User prompts are generally under-specified, leading to a large alignment gap between user prompts and training prompts. (2) There is a distribution of audio descriptions for which TTA models generate higher quality audio, which we refer to as "audionese". To this end, we rewrite prompts with instruction-tuned models and propose utilizing text-audio alignment as a feedback signal via margin ranking learning for audio improvements. On both objective and subjective human evaluations, we observed marked improvements in both text-audio alignment and music audio quality.
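The margin ranking learning mentioned in the abstract can be sketched as a hinge-style ranking loss over text-audio alignment scores. The function below is a minimal illustrative sketch, not the paper's exact formulation; the margin value and the idea of comparing a rewritten prompt's alignment score against the original prompt's are assumptions for the example.

```python
def margin_ranking_loss(score_better: float, score_worse: float, margin: float = 0.1) -> float:
    """Hinge-style margin ranking loss.

    Penalizes the model unless the prompt expected to yield better audio
    (score_better) outranks the other (score_worse) by at least `margin`.
    """
    return max(0.0, margin - (score_better - score_worse))


# Rewritten prompt's audio aligns better (0.8) than the original's (0.5):
# the ranking is satisfied by more than the margin, so the loss is zero.
print(margin_ranking_loss(0.8, 0.5))  # 0.0

# Ranking holds but by less than the margin, so a small loss remains.
print(margin_ranking_loss(0.55, 0.5))  # 0.05
```

In training, such a loss would push the prompt rewriter toward rewrites whose generated audio scores higher on text-audio alignment, using the alignment model purely as a feedback signal rather than requiring gradients through the blackbox TTA model.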
- Ernie Chang (33 papers)
- Sidd Srinivasan (3 papers)
- Mahi Luthra (2 papers)
- Pin-Jie Lin (10 papers)
- Varun Nagaraja (9 papers)
- Forrest Iandola (23 papers)
- Zechun Liu (48 papers)
- Zhaoheng Ni (32 papers)
- Changsheng Zhao (17 papers)
- Yangyang Shi (53 papers)
- Vikas Chandra (74 papers)