Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
The paper "Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model" presents an exploration of text-to-audio (TTA) generation leveraging an instruction-tuned LLM, specifically Flan-T5, combined with a latent diffusion model (LDM). This research shows improved performance over existing models, like AudioLDM, by introducing novel techniques in augmentation and model design, despite using a significantly smaller dataset.
Methodology
The methodology employs Flan-T5 as the text encoder, exploiting its instruction-based fine-tuning for stronger understanding of textual audio descriptions. The text encoder is kept frozen, diverging from earlier methods that require joint text-audio encoders or fine-tuning of the text encoder; the approach instead relies on the pre-trained model's innate ability to represent text inputs effectively.
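As a concrete illustration (not the paper's code), here is a minimal sketch of extracting frozen Flan-T5 embeddings with Hugging Face transformers; the model size and the prompt are arbitrary choices:

```python
# Sketch: frozen Flan-T5 text embeddings for conditioning a diffusion model.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")
encoder.eval()  # frozen: the text encoder is never fine-tuned

prompt = "A dog barks while a car passes by in the rain"
inputs = tokenizer(prompt, return_tensors="pt", padding=True)

with torch.no_grad():  # no gradients flow into the frozen encoder
    # Token-level embeddings of shape (batch, seq_len, hidden_size)
    text_emb = encoder(**inputs).last_hidden_state

print(text_emb.shape)  # e.g. torch.Size([1, 13, 1024]) for flan-t5-large
```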
A key component of this approach is the latent diffusion model, which constructs the audio latent from standard Gaussian noise via reverse diffusion, guided by the Flan-T5 text embeddings. Because the encoder is frozen, these embeddings condition the diffusion process without any additional trainable text layers, keeping the pipeline computationally efficient.
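To make the reverse-diffusion step concrete, below is a minimal DDPM-style ancestral sampling loop; the noise-prediction network `eps_theta`, its signature, and the schedule handling are illustrative assumptions rather than the paper's exact formulation:

```python
import torch

def reverse_diffusion(eps_theta, text_emb, latent_shape, betas):
    """DDPM-style sampling: start from Gaussian noise and iteratively
    denoise, conditioned on frozen text embeddings. `eps_theta(z_t, t, c)`
    is assumed to predict the noise added at step t."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(latent_shape)            # z_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = eps_theta(z, t, text_emb)      # predicted noise at step t
        # Posterior mean of z_{t-1} given z_t (standard DDPM update)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (z - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + torch.sqrt(betas[t]) * noise
    return z  # denoised audio latent, to be decoded into a waveform
```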
Augmentation Strategy
The researchers propose a distinctive audio-mixing augmentation based on the pressure levels of the source clips rather than random relative gains. Motivated by human auditory perception, this weighting keeps both sources audible when clips are combined, yielding more balanced training examples for complex audio scenes and, in turn, a more robust model.
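A sketch of what such pressure-weighted mixing can look like, following the widely used between-class mixing recipe; the RMS-based gain estimate and the uniform draw of the target ratio are assumptions, not necessarily the paper's exact procedure:

```python
import numpy as np

def pressure_weighted_mix(x1: np.ndarray, x2: np.ndarray, rng=None):
    """Mix two clips with a weight that compensates for their relative
    pressure levels, so a loud clip does not drown out a quiet one."""
    if rng is None:
        rng = np.random.default_rng()
    # Per-clip gain estimated from RMS energy, in decibels
    g1 = 20.0 * np.log10(np.sqrt(np.mean(x1 ** 2)) + 1e-8)
    g2 = 20.0 * np.log10(np.sqrt(np.mean(x2 ** 2)) + 1e-8)
    r = rng.uniform(1e-3, 1.0 - 1e-3)  # target proportion, kept off 0 and 1
    # Mixing coefficient that offsets the pressure-level gap
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0) * (1.0 - r) / r)
    mix = p * x1 + (1.0 - p) * x2
    # Normalize so the mixed signal's energy stays roughly constant
    return mix / np.sqrt(p ** 2 + (1.0 - p) ** 2)
```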
Results and Evaluation
The paper provides substantial evidence of the model's effectiveness through extensive experimentation on the AudioCaps dataset. Tango outperforms AudioLDM and other existing models on objective metrics such as Fréchet Audio Distance (FAD) and KL divergence, as well as in subjective human evaluations. Notably, even with a training set 63 times smaller than AudioLDM's, Tango achieves superior or comparable scores, highlighting the model's sample efficiency.
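For reference, FAD fits Gaussians to embedding sets (commonly VGGish embeddings) of real and generated audio and measures the Fréchet distance between them; a minimal computation from precomputed embeddings might look like this (a sketch, not the paper's evaluation code):

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets
    (rows are per-clip embeddings, e.g. from VGGish). Lower is better."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root
    if np.iscomplexobj(covmean):            # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```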
Implications and Future Directions
The implications of this research extend to practical applications in media production, where rapid prototyping of audio from textual inputs can expedite creative processes. From a theoretical perspective, this work reinforces the potential of using instruction-tuned models to bridge modalities effectively.
Future research could scale up the training data to assess Tango's generalization across diverse audio types. Investigating other audio tasks, such as super-resolution or inpainting, could further probe the utility of instruction-tuned LLMs in audio. The authors also acknowledge limited controllability as a weakness; fine-tuning Flan-T5 on more complex datasets might improve text-to-audio alignment.
In conclusion, this paper makes a significant contribution to leveraging instruction-tuned LLMs for TTA, offering a scalable, sample-efficient approach that pushes the boundaries of current methodologies.