SpeechAlign: Aligning Speech Generation to Human Preferences (2404.05600v1)

Published 8 Apr 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Speech LLMs have significantly advanced in generating realistic speech, with neural codec LLMs standing out. However, the integration of human feedback to align speech outputs to human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec LLMs, highlighting how it leads to discrepancies between the training and inference phases, which negatively affects performance. Then we explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech LLMs to human preferences. SpeechAlign involves constructing a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec LLM. This cycle of improvement is carried out iteratively to steadily convert weak models into strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech LLM. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.

References (35)
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
  2. Audiolm: a language modeling approach to audio generation, 2022.
  3. Soundstorm: Efficient parallel audio generation, 2023.
  4. Pheme: Efficient and conversational speech generation, 2024.
  5. Self-play fine-tuning converts weak language models to strong language models, 2024.
  6. Musicrl: Aligning music generation to human preferences, 2024.
  7. High fidelity neural audio compression, 2022.
  8. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  9. Aligning text-to-image models using human feedback, 2023.
  10. Groundinggpt:language enhanced multi-modal grounding model, 2024.
  11. Baton: Aligning text-to-audio model with human preference feedback, 2024.
  12. Visual instruction tuning, 2023a.
  13. Chain of hindsight aligns language models with feedback, 2023b.
  14. Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability, 2021.
  15. OpenAI. Gpt-4 technical report, 2023.
  16. Training language models to follow instructions with human feedback, 2022.
  17. Robust speech recognition via large-scale weak supervision, 2022.
  18. Direct preference optimization: Your language model is secretly a reward model, 2023.
  19. Audiopalm: A large language model that can speak and listen, 2023.
  20. Learning to summarize from human feedback, 2022.
  21. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  22. Neural codec language models are zero-shot text to speech synthesizers, 2023a.
  23. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance, 2024.
  24. Viola: Unified codec language models for speech recognition, synthesis, and translation, 2023b.
  25. Advancing translation preference modeling with rlhf: A step towards cost-effective solution, 2024.
  26. Uniaudio: An audio foundation model toward universal audio generation, 2023.
  27. Self-play fine-tuning of diffusion models for text-to-image generation, 2024.
  28. Soundstream: An end-to-end neural audio codec, 2021.
  29. Anygpt: Unified multimodal llm with discrete sequence modeling, 2024.
  30. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  15757–15773, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.1055. URL https://aclanthology.org/2023.findings-emnlp.1055.
  31. DUB: Discrete unit back-translation for speech translation. In Findings of the Association for Computational Linguistics: ACL 2023, pp.  7147–7164, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.447. URL https://aclanthology.org/2023.findings-acl.447.
  32. Speechgpt-gen: Scaling chain-of-information speech generation, 2024.
  33. The wisdom of hindsight makes language models better instruction followers, 2023c.
  34. Speechtokenizer: Unified speech tokenizer for speech large language models, 2023d.
  35. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
Summary

  • The paper demonstrates that aligning speech language models with human preferences via iterative self-improvement dramatically enhances speech quality and fidelity.
  • It employs innovative preference data collection and optimization methods, including DPO, CoH, and BoN, to bridge the gap between golden and synthetic codec tokens.
  • Empirical results show reduced Word Error Rate and increased speaker similarity, confirming the effectiveness of SpeechAlign across diverse model sizes.

Aligning Speech LLMs with Human Preferences through Iterative Self-Improvement

Introduction

Speech LLMs (SLMs) have seen remarkable progress, particularly with the introduction of neural codec LLMs that utilize discrete speech representations for generating realistic speech. Despite their advancements, a crucial aspect often overlooked is the alignment of these models' outputs with human preferences—quality, naturalness, and expressiveness. The paper introduces SpeechAlign, an innovative iterative self-improvement strategy designed to address this gap. By constructing a preference codec dataset and employing preference optimization strategies, SpeechAlign iteratively refines speech LLMs to more closely align with human preferences, demonstrating its efficacy through both subjective and objective evaluations.

Analysis of Distribution Gaps in Codec LLMs

The paper begins by identifying a fundamental issue in current SLMs: the distribution gap between golden (ground-truth) and synthetic (model-generated) codec tokens. This gap arises because the model is trained on golden tokens but conditions on its own synthetic tokens during inference, which significantly hampers performance. Through a series of preliminary experiments, including t-SNE visualization and evaluation with objective metrics such as Word Error Rate (WER) and Speaker Similarity (SIM), the paper delineates this distribution gap and its detrimental effects on speech quality and model fidelity.
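
The two objective metrics lend themselves to a compact evaluation script. The sketch below is a minimal illustration, assuming Whisper is used for ASR-based WER and that speaker embeddings of the voice prompt and the generated speech are precomputed; the model choices and helper names are assumptions, not necessarily the paper's exact pipeline.

```python
# Minimal WER/SIM evaluation sketch (illustrative; model choices are assumptions).
import torch
import whisper            # openai-whisper, used here for ASR-based WER
from jiwer import wer     # standard WER implementation

asr = whisper.load_model("base")

def word_error_rate(reference_text: str, generated_wav_path: str) -> float:
    # Transcribe the generated speech and score it against the input text.
    hypothesis = asr.transcribe(generated_wav_path)["text"]
    return wer(reference_text.lower(), hypothesis.lower())

def speaker_similarity(prompt_emb: torch.Tensor, generated_emb: torch.Tensor) -> float:
    # Cosine similarity between speaker embeddings of the voice prompt and the
    # generated speech (higher means better voice consistency).
    return torch.nn.functional.cosine_similarity(prompt_emb, generated_emb, dim=-1).item()
```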

SpeechAlign: Aligning Models with Human Preferences

Preference Data Collection

SpeechAlign approaches the alignment problem by first constructing a preference codec dataset that contrasts golden and synthetic codec tokens, circumventing the need for direct human annotation of the difficult-to-interpret numerical codec tokens. To verify that this dataset reflects human preferences, the tokens are converted back to speech and compared side by side by human listeners.
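
A minimal sketch of this data-collection step is shown below. It assumes a codec language model exposing a `generate` method and a dataset of text prompts paired with golden codec tokens; all names are illustrative, not the authors' actual API.

```python
# Sketch of preference-pair construction: golden tokens are "chosen",
# the current model's own generations are "rejected" (all names illustrative).
def build_preference_dataset(model, dataset):
    preference_pairs = []
    for example in dataset:
        chosen = example["golden_codec_tokens"]             # from encoding ground-truth speech
        rejected = model.generate(example["text_prompt"])   # synthetic tokens from the model
        preference_pairs.append(
            {"prompt": example["text_prompt"], "chosen": chosen, "rejected": rejected}
        )
    return preference_pairs
```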

Preference Optimization Strategies

Several optimization strategies are then explored to align the model's outputs with human preferences, including Direct Preference Optimization (DPO), Chain-of-Hindsight (CoH), and Best-of-N Sampling (BoN). The iterative nature of SpeechAlign permits continuous refinement, improving speech generation as demonstrated by decreasing WER and increasing SIM scores across iterations.
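
Of these, DPO can be written down compactly. The sketch below shows a generic DPO loss over codec-token sequences, assuming summed sequence log-probabilities under the policy being trained and a frozen reference model; it illustrates the objective rather than reproducing the paper's training code.

```python
# Generic DPO loss sketch (PyTorch); logp_* are summed log-probabilities of the
# chosen/rejected codec-token sequences under the policy and reference models.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: how much more likely the policy makes each sequence
    # compared to the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the golden (chosen) sequence to win against the synthetic (rejected) one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```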

Empirical Validation

The empirical analysis presents a comparative study of SpeechAlign iterations against a baseline model across two datasets, highlighting significant improvements in speech naturalness and expressiveness. SpeechAlign not only outperforms the baseline but also shows robust generalization to unseen speakers. These results underscore the potential of iterative self-improvement and preference alignment in developing more human-centric speech LLMs.

Iterative Self-Improvement and Model Generalizability

The paper further investigates the impact of preference data size and model size on SpeechAlign's effectiveness. Results indicate that while increasing preference data size yields improvements up to a certain threshold, iterative optimization continues to offer benefits, underscoring the method's scalability and adaptability. Additionally, experiments with smaller models reveal SpeechAlign's capability to significantly improve speech quality, suggesting wide applicability across various model architectures.
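
Putting the pieces together, the iterative cycle can be sketched as below, reusing the hypothetical `build_preference_dataset` helper from earlier and an assumed `dpo_update` training step; this is an illustration of the loop, not the authors' training code.

```python
# Sketch of a SpeechAlign-style self-improvement loop (helpers are hypothetical).
def speechalign_iterations(model, golden_dataset, num_iterations: int = 3):
    for _ in range(num_iterations):
        # 1. Re-collect preference pairs using the *current* model's generations,
        #    so the "rejected" side tracks the model as it improves.
        preference_pairs = build_preference_dataset(model, golden_dataset)
        # 2. Preference-optimize the model (e.g., with the DPO loss above).
        model = dpo_update(model, preference_pairs)
        # 3. The improved model becomes the generator for the next iteration.
    return model
```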

Bridging the Distribution Gap

The final sections delve into the successful mitigation of the distribution gap, a central contribution of SpeechAlign. Visual representations post-optimization illustrate the alignment of golden and synthetic token distributions, affirming the approach's efficacy in reconciling training-inference disparities. This alignment is shown to directly correlate with enhanced speech generation, highlighting the importance of distribution gap mitigation in achieving model improvements.
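
The before/after visualizations described here can be reproduced with a standard t-SNE projection. The sketch below assumes `golden_feats` and `synthetic_feats` are arrays of hidden representations for golden and model-generated codec tokens; the variable names and plotting details are illustrative.

```python
# t-SNE sketch for comparing golden vs. synthetic codec-token representations.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_token_distributions(golden_feats: np.ndarray, synthetic_feats: np.ndarray) -> None:
    feats = np.concatenate([golden_feats, synthetic_feats], axis=0)
    # Project both token populations into 2D with the same t-SNE fit.
    coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(feats)
    n = len(golden_feats)
    plt.scatter(coords[:n, 0], coords[:n, 1], s=4, label="golden codec tokens")
    plt.scatter(coords[n:, 0], coords[n:, 1], s=4, label="synthetic codec tokens")
    plt.legend()
    plt.title("Golden vs. synthetic codec-token distributions")
    plt.show()
```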

Conclusion and Future Directions

SpeechAlign represents a significant step forward in the integration of human feedback into speech LLMs, addressing the previously neglected aspect of aligning model outputs with human preferences. While promising, the paper also discusses potential avenues for further research, including the exploration of more fine-grained human feedback and the extension of preference optimization to non-autoregressive models. SpeechAlign's iterative self-improvement framework paves the way for future advancements in speech technology, emphasizing the critical role of human preferences in shaping the development of more natural and expressive speech generation models.
