SpeechAlign: Aligning Speech Generation to Human Preferences (2404.05600v1)
Abstract: Speech LLMs have advanced significantly in generating realistic speech, with neural codec LLMs standing out. However, the integration of human feedback to align speech outputs with human preferences is often neglected. This paper addresses that gap by first analyzing the distribution gap in codec LLMs, highlighting how it leads to discrepancies between the training and inference phases that degrade performance. We then explore learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech LLMs with human preferences. SpeechAlign constructs a preference codec dataset contrasting golden codec tokens against synthetic tokens, then applies preference optimization to improve the codec LLM. This cycle of improvement is carried out iteratively to steadily convert weak models into strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech LLM. Moreover, SpeechAlign exhibits robust generalization and also works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.
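The core of each SpeechAlign round is a preference-optimization update on pairs of golden versus synthetic codec tokens. The sketch below is illustrative only, assuming a DPO-style objective (the abstract says "preference optimization" without fixing the exact loss); all function and variable names are hypothetical and not taken from the released codebase. It contrasts the policy model's log-probabilities on the two token sequences against a frozen reference model.

```python
# Minimal sketch (not the authors' released code) of one preference-optimization
# step, assuming a DPO-style objective over per-sequence log-probabilities.
# "Chosen" = golden codec tokens encoded from real speech;
# "rejected" = synthetic codec tokens sampled from the current codec LLM.

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023) on per-sequence log-probs.

    Each tensor has shape (batch,): the summed log-probability that the
    trainable policy / frozen reference model assigns to the chosen or
    rejected codec-token sequence given the same text prompt.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    reference_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    pol_c, pol_r = torch.randn(4), torch.randn(4)
    ref_c, ref_r = torch.randn(4), torch.randn(4)
    print(dpo_loss(pol_c, pol_r, ref_c, ref_r))
```

In the paper's iterative scheme, the updated model then regenerates synthetic codec tokens, new preference pairs are formed against the golden tokens, and the same optimization is applied in the next round, which is how weak models are steadily converted into stronger ones.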