Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback (2411.01834v1)
Abstract: While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning from AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework on the ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization for improving the semantics of SLMs.
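The pipeline described in the abstract (sample several continuations, score them with a semantic metric, turn the scores into preference pairs, then optimize with DPO) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `build_preference_pair`, `score_fn`, `min_gap`, and the default `beta` are hypothetical and are not taken from the paper. The loss is the standard per-example DPO objective computed from sequence log-probabilities, not the authors' implementation.

```python
import math
from typing import Callable, List, Optional, Tuple


def build_preference_pair(
    prompt: str,
    continuations: List[str],
    score_fn: Callable[[str, str], float],
    min_gap: float = 0.0,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from scored continuations, or None.

    score_fn is a hypothetical stand-in for a semantic metric applied to
    the prompt and a (transcribed) continuation. min_gap is an assumed
    filter that drops pairs whose scores are too close to be informative.
    """
    if len(continuations) < 2:
        return None
    scored = sorted(
        ((score_fn(prompt, c), c) for c in continuations),
        key=lambda item: item[0],
    )
    low_score, rejected = scored[0]
    high_score, chosen = scored[-1]
    if high_score - low_score <= min_gap:
        return None
    return chosen, rejected


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch the loss decreases as the policy assigns relatively more probability to the chosen continuation than the frozen reference model does, which is the behavior preference optimization relies on. In a real system the log-probabilities would come from the SLM over speech-unit sequences rather than being passed in as plain floats.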
- Guan-Ting Lin
- Prashanth Gurunath Shivakumar
- Aditya Gourav
- Yile Gu
- Ankur Gandhe
- Hung-yi Lee
- Ivan Bulyko