BATON: Aligning Text-to-Audio Model with Human Preference Feedback (2402.00744v1)

Published 1 Feb 2024 in cs.SD, cs.CL, and eess.AS

Abstract: With the development of AI-Generated Content (AIGC), text-to-audio models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference due to the inherent information density of natural language and limited model understanding ability. To alleviate this issue, we formulate the BATON, a framework designed to enhance the alignment between generated audio and text prompt using human preference feedback. Our BATON comprises three key stages: Firstly, we curated a dataset containing both prompts and the corresponding generated audio, which was then annotated based on human feedback. Secondly, we introduced a reward model using the constructed dataset, which can mimic human preference by assigning rewards to input text-audio pairs. Finally, we employed the reward model to fine-tune an off-the-shelf text-to-audio model. The experiment results demonstrate that our BATON can significantly improve the generation quality of the original text-to-audio models, concerning audio integrity, temporal relationship, and alignment with human preference.


Aligning Text-to-Audio Model with Human Preference Feedback: An In-Depth Overview of BATON

The paper "BATON: Aligning Text-to-Audio Model with Human Preference Feedback" addresses a challenge in AI-generated content (AIGC): aligning text-to-audio (TTA) models with human preferences. The proposed framework, BATON, uses human annotations to narrow the gap between generated audio and its text prompt and to refine TTA models accordingly. This overview covers the methodology, experimental results, and broader implications of the research.

Methodology

Dataset Construction:

BATON begins by constructing a dataset for evaluating integrity and temporal relationships in generated audio. The authors created 4.8K text-audio pairs covering 200 audio event categories, organized around two tasks: integrity and temporal relationships. Human annotators then provided binary feedback, yielding 2.7K annotated samples, so that the labels reflect human preferences.
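As an illustration of what one annotated sample might look like, here is a minimal sketch. The field names, task strings, prompts, and file paths are hypothetical and are not taken from the paper; only the binary label and the two task types come from the description above.

```python
from dataclasses import dataclass

VALID_TASKS = ("integrity", "temporal")  # the two tasks named in the overview


@dataclass(frozen=True)
class AnnotatedPair:
    """One text-audio pair with binary human feedback (hypothetical schema)."""
    prompt: str
    audio_path: str
    task: str   # "integrity" or "temporal"
    label: int  # 1 = annotator accepted the audio for this prompt, 0 = rejected


def validate(pair: AnnotatedPair) -> AnnotatedPair:
    """Reject records whose task or label falls outside the expected values."""
    if pair.task not in VALID_TASKS:
        raise ValueError(f"unknown task: {pair.task!r}")
    if pair.label not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {pair.label!r}")
    return pair


# Made-up examples, for illustration only.
samples = [
    validate(AnnotatedPair("A dog barks and a car passes by", "clips/0001.wav", "integrity", 1)),
    validate(AnnotatedPair("A door slams, then a man speaks", "clips/0002.wav", "temporal", 0)),
]

# Fraction of samples that annotators accepted.
positive_rate = sum(p.label for p in samples) / len(samples)
```

Keeping the label strictly binary mirrors the feedback format described above and makes the records directly usable as targets for a binary classifier.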

Audio Reward Model:

The reward model is trained with a binary cross-entropy loss to predict human preference for a text-audio pair. Encoders from CLAP (Contrastive Language-Audio Pretraining) extract text and audio embeddings, which MLP layers then map to a reward score. The model thus approximates human judgments and supplies the reward signal used in the fine-tuning phase.
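The reward head described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the embeddings are random stand-ins rather than real CLAP outputs, and the concatenation of the two embeddings, the single hidden layer, the layer sizes, and the sigmoid output are all assumptions.

```python
import math
import random


def mlp_reward(text_emb, audio_emb, w1, b1, w2, b2):
    """Map a (text, audio) embedding pair to a score in (0, 1)."""
    x = text_emb + audio_emb  # list concatenation of the two embeddings (assumed)
    # One hidden layer with ReLU activation.
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


def bce_loss(pred, label, eps=1e-7):
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


rng = random.Random(0)
dim, hidden_dim = 4, 8  # toy sizes; real CLAP embeddings are much larger

# Randomly initialised weights (untrained), for illustration only.
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
b2 = 0.0

# Stand-ins for CLAP text and audio embeddings.
text_emb = [rng.uniform(-1, 1) for _ in range(dim)]
audio_emb = [rng.uniform(-1, 1) for _ in range(dim)]

reward = mlp_reward(text_emb, audio_emb, w1, b1, w2, b2)
loss = bce_loss(reward, label=1)  # label 1 = annotator accepted the pair
```

In training, the loss would be averaged over the annotated pairs and minimised with respect to the MLP weights, so that accepted pairs receive rewards close to 1 and rejected pairs rewards close to 0.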

Fine-Tuning:

The framework incorporates the reward model into the training objective of an off-the-shelf TTA model, TANGO. The standard training loss is augmented with a reward-weighted term, so that the model is pushed towards audio that the reward model scores as aligned with human preference. The pretraining loss is retained as a regularizer, reducing the risk of overfitting to the relatively small human-annotated dataset.
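The combined objective can be illustrated with a small numeric sketch. The exact form used in the paper is not reproduced here; the version below assumes that each sample's generative loss is scaled by its reward and that the pretraining loss is added with a weight `beta`. The function name, the value of `beta`, and all numbers are hypothetical.

```python
def fine_tune_loss(reward_batch, pretrain_losses, beta=0.5):
    """Reward-weighted loss plus a pretraining regulariser (assumed form).

    reward_batch: list of (reward, per_sample_loss) for preference-scored pairs.
    pretrain_losses: list of per-sample losses on the original training data.
    beta: weight of the pretraining term (hypothetical value).
    """
    # Samples with a high reward contribute more to the objective.
    reward_term = sum(r * l for r, l in reward_batch) / len(reward_batch)
    # The ordinary loss on pretraining data keeps the model near its starting point.
    pretrain_term = sum(pretrain_losses) / len(pretrain_losses)
    return reward_term + beta * pretrain_term


# Toy numbers: one well-aligned sample (reward 0.9), one poorly aligned (reward 0.1).
reward_batch = [(0.9, 0.20), (0.1, 0.80)]
pretrain_losses = [0.30, 0.50]

total = fine_tune_loss(reward_batch, pretrain_losses, beta=0.5)
```

With these numbers the low-reward sample contributes little to the reward term despite its larger raw loss, which is how the weighting steers updates towards preferred outputs. Setting `beta` to 0 would remove the regulariser entirely.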

Experimental Results

BATON was benchmarked against baseline models on two tasks, integrity and temporal relationships, using the objective metrics FD, FAD, IS, KL, and the CLAP score (S_CLAP). Subjective evaluations (MOS-Q and MOS-F) were also conducted.

Key Findings:

Integrity Task: BATON improved the CLAP score by 2.3% over the baseline TANGO model. Subjectively, MOS-Q scores indicated better perceived audio quality.
Temporal Task: A 6.0% increase in the CLAP score over TANGO suggests that human feedback helps with the more complex sequential relationships in audio generation. MOS-F scores also improved, reflecting closer alignment with human perception.

Comparisons with AudioLDM-L and AudioLDM2-L likewise showed BATON performing better on most metrics.

Implications and Future Developments

Practical Applications:

The BATON framework can be instrumental in enhancing user experiences in various applications demanding high-fidelity TTA generation, such as virtual assistants, audio content creation, and multimedia applications. The alignment with human preferences suggests an avenue towards more intuitive and user-friendly AI systems.

Theoretical Contributions:

From a theoretical standpoint, BATON introduces a structured paradigm to integrate human feedback into generative models effectively. The application of a reward model to modulate fine-tuning presents a scalable method to improve generative outputs without extensive manual intervention.

Future Directions:

Future research could explore extending BATON's methodology to other modalities such as text-to-image or text-to-video generation. Additionally, incorporating online reinforcement learning mechanisms to continuously adapt the model to evolving human preferences could further enhance the relevance and applicability of this approach. There is potential for exploring more complex feedback types beyond binary annotations to capture nuanced preferences in audio characteristics.

Conclusion

The BATON framework shows how human preference feedback can be systematically incorporated to fine-tune TTA models. Its experimental results suggest that the approach can improve AIGC systems, and the proposed extensions point to further ways of integrating human feedback into generative AI to improve alignment and fidelity in synthesized content.