SpeechAlign: Aligning Speech Generation to Human Preferences (2404.05600v1)

Published 8 Apr 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Speech LLMs have significantly advanced in generating realistic speech, with neural codec LLMs standing out. However, the integration of human feedback to align speech outputs to human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec LLMs, highlighting how it leads to discrepancies between the training and inference phases, which negatively affects performance. Then we explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech LLMs to human preferences. SpeechAlign involves constructing a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec LLM. This cycle of improvement is carried out iteratively to steadily convert weak models into strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech LLM. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.

References (35)
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
  2. Audiolm: a language modeling approach to audio generation, 2022.
  3. Soundstorm: Efficient parallel audio generation, 2023.
  4. Pheme: Efficient and conversational speech generation, 2024.
  5. Self-play fine-tuning converts weak language models to strong language models, 2024.
  6. Musicrl: Aligning music generation to human preferences, 2024.
  7. High fidelity neural audio compression, 2022.
  8. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  9. Aligning text-to-image models using human feedback, 2023.
  10. Groundinggpt:language enhanced multi-modal grounding model, 2024.
  11. Baton: Aligning text-to-audio model with human preference feedback, 2024.
  12. Visual instruction tuning, 2023a.
  13. Chain of hindsight aligns language models with feedback, 2023b.
  14. Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability, 2021.
  15. OpenAI. Gpt-4 technical report, 2023.
  16. Training language models to follow instructions with human feedback, 2022.
  17. Robust speech recognition via large-scale weak supervision, 2022.
  18. Direct preference optimization: Your language model is secretly a reward model, 2023.
  19. Audiopalm: A large language model that can speak and listen, 2023.
  20. Learning to summarize from human feedback, 2022.
  21. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  22. Neural codec language models are zero-shot text to speech synthesizers, 2023a.
  23. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance, 2024.
  24. Viola: Unified codec language models for speech recognition, synthesis, and translation, 2023b.
  25. Advancing translation preference modeling with rlhf: A step towards cost-effective solution, 2024.
  26. Uniaudio: An audio foundation model toward universal audio generation, 2023.
  27. Self-play fine-tuning of diffusion models for text-to-image generation, 2024.
  28. Soundstream: An end-to-end neural audio codec, 2021.
  29. Anygpt: Unified multimodal llm with discrete sequence modeling, 2024.
  30. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  15757–15773, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.1055. URL https://aclanthology.org/2023.findings-emnlp.1055.
  31. DUB: Discrete unit back-translation for speech translation. In Findings of the Association for Computational Linguistics: ACL 2023, pp.  7147–7164, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.447. URL https://aclanthology.org/2023.findings-acl.447.
  32. Speechgpt-gen: Scaling chain-of-information speech generation, 2024.
  33. The wisdom of hindsight makes language models better instruction followers, 2023c.
  34. Speechtokenizer: Unified speech tokenizer for speech large language models, 2023d.
  35. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
Summary

  • The paper demonstrates that aligning speech language models with human preferences via iterative self-improvement dramatically enhances speech quality and fidelity.
  • It employs innovative preference data collection and optimization methods, including DPO, CoH, and BoN, to bridge the gap between golden and synthetic codec tokens.
  • Empirical results show reduced Word Error Rate and increased speaker similarity, confirming the effectiveness of SpeechAlign across diverse model sizes.

Aligning Speech LLMs with Human Preferences through Iterative Self-Improvement

Introduction

Speech LLMs (SLMs) have seen remarkable progress, particularly with the introduction of neural codec LLMs that utilize discrete speech representations for generating realistic speech. Despite their advancements, a crucial aspect often overlooked is the alignment of these models' outputs with human preferences—quality, naturalness, and expressiveness. The paper introduces SpeechAlign, an innovative iterative self-improvement strategy designed to address this gap. By constructing a preference codec dataset and employing preference optimization strategies, SpeechAlign iteratively refines speech LLMs to more closely align with human preferences, demonstrating its efficacy through both subjective and objective evaluations.

Analysis of Distribution Gaps in Codec LLMs

The paper begins by identifying a fundamental issue in current SLMs: the distribution gap between golden (ground-truth) and synthetic (model-generated) codec tokens. This gap arises because the model is trained on golden tokens but conditions on its own synthetic tokens during inference, which significantly hampers performance. Through a series of preliminary experiments, including t-SNE visualization and evaluation with objective metrics such as Word Error Rate (WER) and Speaker Similarity (SIM), the paper delineates this distribution gap and its detrimental effects on speech quality and model fidelity.
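
The two objective metrics lend themselves to a compact evaluation script. The sketch below is a minimal illustration, assuming Whisper is used for ASR-based WER and that speaker embeddings of the voice prompt and the generated speech are precomputed; the model choices and helper names are assumptions, not necessarily the paper's exact pipeline.

```python
# Minimal WER/SIM evaluation sketch (illustrative; model choices are assumptions).
import torch
import whisper            # openai-whisper, used here for ASR-based WER
from jiwer import wer     # standard WER implementation

asr = whisper.load_model("base")

def word_error_rate(reference_text: str, generated_wav_path: str) -> float:
    # Transcribe the generated speech and score it against the input text.
    hypothesis = asr.transcribe(generated_wav_path)["text"]
    return wer(reference_text.lower(), hypothesis.lower())

def speaker_similarity(prompt_emb: torch.Tensor, generated_emb: torch.Tensor) -> float:
    # Cosine similarity between speaker embeddings of the voice prompt and the
    # generated speech (higher means better voice consistency).
    return torch.nn.functional.cosine_similarity(prompt_emb, generated_emb, dim=-1).item()
```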

SpeechAlign: Aligning Models with Human Preferences

Preference Data Collection

SpeechAlign approaches the alignment problem by first constructing a preference codec dataset that contrasts golden and synthetic codec tokens, circumventing the need for direct human annotation of the difficult-to-interpret numerical codec tokens. To verify that this dataset reflects human preferences, the tokens are converted back to speech and compared side by side by human listeners.
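
A minimal sketch of this data-collection step is shown below. It assumes a codec language model exposing a `generate` method and a dataset of text prompts paired with golden codec tokens; all names are illustrative, not the authors' actual API.

```python
# Sketch of preference-pair construction: golden tokens are "chosen",
# the current model's own generations are "rejected" (all names illustrative).
def build_preference_dataset(model, dataset):
    preference_pairs = []
    for example in dataset:
        chosen = example["golden_codec_tokens"]             # from encoding ground-truth speech
        rejected = model.generate(example["text_prompt"])   # synthetic tokens from the model
        preference_pairs.append(
            {"prompt": example["text_prompt"], "chosen": chosen, "rejected": rejected}
        )
    return preference_pairs
```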

Preference Optimization Strategies

Several optimization strategies are then explored to align the model's outputs with human preferences, including Direct Preference Optimization (DPO), Chain-of-Hindsight (CoH), and Best-of-N Sampling (BoN). The iterative nature of SpeechAlign permits continuous refinement, improving speech generation as demonstrated by decreasing WER and increasing SIM scores across iterations.
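
Of these, DPO can be written down compactly. The sketch below shows a generic DPO loss over codec-token sequences, assuming summed sequence log-probabilities under the policy being trained and a frozen reference model; it illustrates the objective rather than reproducing the paper's training code.

```python
# Generic DPO loss sketch (PyTorch); logp_* are summed log-probabilities of the
# chosen/rejected codec-token sequences under the policy and reference models.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: how much more likely the policy makes each sequence
    # compared to the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the golden (chosen) sequence to win against the synthetic (rejected) one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```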

Empirical Validation

The empirical analysis presents a comparative study of SpeechAlign iterations against a baseline model across two datasets, highlighting significant improvements in speech naturalness and expressiveness. SpeechAlign not only outperforms the baseline but also shows robust generalization to unseen speakers. These results underscore the potential of iterative self-improvement and preference alignment in developing more human-centric speech LLMs.

Iterative Self-Improvement and Model Generalizability

The paper further investigates the impact of preference data size and model size on SpeechAlign's effectiveness. Results indicate that while increasing preference data size yields improvements up to a certain threshold, iterative optimization continues to offer benefits, underscoring the method's scalability and adaptability. Additionally, experiments with smaller models reveal SpeechAlign's capability to significantly improve speech quality, suggesting wide applicability across various model architectures.
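
Putting the pieces together, the iterative cycle can be sketched as below, reusing the hypothetical `build_preference_dataset` helper from earlier and an assumed `dpo_update` training step; this is an illustration of the loop, not the authors' training code.

```python
# Sketch of a SpeechAlign-style self-improvement loop (helpers are hypothetical).
def speechalign_iterations(model, golden_dataset, num_iterations: int = 3):
    for _ in range(num_iterations):
        # 1. Re-collect preference pairs using the *current* model's generations,
        #    so the "rejected" side tracks the model as it improves.
        preference_pairs = build_preference_dataset(model, golden_dataset)
        # 2. Preference-optimize the model (e.g., with the DPO loss above).
        model = dpo_update(model, preference_pairs)
        # 3. The improved model becomes the generator for the next iteration.
    return model
```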

Bridging the Distribution Gap

The final sections delve into the successful mitigation of the distribution gap, a central contribution of SpeechAlign. Visual representations post-optimization illustrate the alignment of golden and synthetic token distributions, affirming the approach's efficacy in reconciling training-inference disparities. This alignment is shown to directly correlate with enhanced speech generation, highlighting the importance of distribution gap mitigation in achieving model improvements.
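
The before/after visualizations described here can be reproduced with a standard t-SNE projection. The sketch below assumes `golden_feats` and `synthetic_feats` are arrays of hidden representations for golden and model-generated codec tokens; the variable names and plotting details are illustrative.

```python
# t-SNE sketch for comparing golden vs. synthetic codec-token representations.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_token_distributions(golden_feats: np.ndarray, synthetic_feats: np.ndarray) -> None:
    feats = np.concatenate([golden_feats, synthetic_feats], axis=0)
    # Project both token populations into 2D with the same t-SNE fit.
    coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(feats)
    n = len(golden_feats)
    plt.scatter(coords[:n, 0], coords[:n, 1], s=4, label="golden codec tokens")
    plt.scatter(coords[n:, 0], coords[n:, 1], s=4, label="synthetic codec tokens")
    plt.legend()
    plt.title("Golden vs. synthetic codec-token distributions")
    plt.show()
```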

Conclusion and Future Directions

SpeechAlign represents a significant step forward in the integration of human feedback into speech LLMs, addressing the previously neglected aspect of aligning model outputs with human preferences. While promising, the paper also discusses potential avenues for further research, including the exploration of more fine-grained human feedback and the extension of preference optimization to non-autoregressive models. SpeechAlign's iterative self-improvement framework paves the way for future advancements in speech technology, emphasizing the critical role of human preferences in shaping the development of more natural and expressive speech generation models.
