Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback (2411.01834v1)

Published 4 Nov 2024 in cs.CL and eess.AS

Abstract: While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based LLMs in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.

Authors (7)
  1. Guan-Ting Lin (21 papers)
  2. Prashanth Gurunath Shivakumar (18 papers)
  3. Aditya Gourav (8 papers)
  4. Yile Gu (25 papers)
  5. Ankur Gandhe (30 papers)
  6. Hung-yi Lee (327 papers)
  7. Ivan Bulyko (23 papers)

Summary

An Examination of the Align-SLM Framework for Textless Spoken Language Models

The paper under examination, "Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback," proposes a framework aimed at improving the semantic performance of Spoken Language Models (SLMs). The authors identify a significant gap in semantic coherence and relevance between textless SLMs and their text-based counterparts (LLMs), and the Align-SLM framework leverages preference optimization inspired by Reinforcement Learning from AI Feedback (RLAIF) to narrow it.

Motivation and Approach

Textless SLMs have emerged as promising tools for end-to-end speech-to-speech modeling. However, modeling semantics without textual input is inherently difficult: text-based LLMs maintain semantic coherence across continuations far better, whereas SLMs trained to predict speech tokens often produce repetitive phrases and grammatical errors. The paper posits that alternative optimization strategies could alleviate some of these limitations.

Align-SLM addresses these limitations through Direct Preference Optimization (DPO). A pre-trained SLM, specifically the open-source TWIST model, generates multiple speech continuations for each prompt. These continuations are then scored with automatic semantic metrics derived from LLM-guided feedback, reducing the dependency on costly human evaluation. Applying a preference-optimization framework originally developed for text LLMs in this way demonstrates that SLM semantics can be improved without integrating text tokens into the model.
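To make the data-construction step concrete, below is a minimal sketch of how such preference pairs might be assembled. The callables generate and score, the sample count, and the score-gap filter are illustrative assumptions rather than the authors' exact recipe:

    from typing import Any, Callable, Dict, List

    def build_preference_pairs(
        generate: Callable[[Any], Any],   # samples one speech continuation from the SLM
        score: Callable[[Any], float],    # automatic semantic metric (the AI feedback)
        prompts: List[Any],
        n_samples: int = 4,
        min_gap: float = 0.0,
    ) -> List[Dict[str, Any]]:
        """Sample several continuations per prompt, score them, and keep a
        (chosen, rejected) pair when the score gap is wide enough to be a
        reliable preference signal."""
        pairs = []
        for prompt in prompts:
            continuations = [generate(prompt) for _ in range(n_samples)]
            scores = [score(c) for c in continuations]
            hi, lo = max(scores), min(scores)
            if hi - lo > min_gap:
                pairs.append({
                    "prompt": prompt,
                    "chosen": continuations[scores.index(hi)],
                    "rejected": continuations[scores.index(lo)],
                })
        return pairs

Pairs built this way can be fed to any standard DPO trainer; raising min_gap across training rounds yields the curriculum described below.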

Align-SLM further integrates DPO with curriculum learning, iteratively tightening the criteria used to select preference data as training proceeds, which improves performance beyond a single round of DPO. The framework thereby preserves pure speech-to-speech modeling while strengthening semantic integrity, offering a potentially faster and more inclusive approach that avoids any text-to-speech intermediate step.
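For reference, DPO (Rafailov et al., 2023) optimizes the following objective over those preference pairs, where y_w and y_l are the chosen and rejected continuations for prompt x, π_θ is the SLM being trained, π_ref is the frozen reference model, and β controls how far the policy may drift from the reference; the curriculum simply repeats this optimization over successively stricter preference sets:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
      -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]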

Evaluation and Results

The Align-SLM framework is evaluated on several benchmarks: ZeroSpeech 2021 for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and additional speech generation metrics, including the GPT-4o score and human evaluation. Together, these benchmarks test the framework's ability to bridge the semantic gap observed in SLMs.

Experimental results show significant improvements over existing models. Notably, Align-SLM achieves state-of-the-art performance for SLMs on the ZeroSpeech 2021 sWUGGY task and the spoken StoryCloze benchmark, demonstrating marked advances in semantic understanding and speech generation. The preference optimization framework not only enhances semantic fidelity but also improves syntactic measures such as sBLIMP.

The work presents a mechanism for turning automated semantic feedback into effective preference data, setting a precedent for speech models that employ LLM-guided feedback. This mitigates the cost and difficulty of collecting human feedback and introduces a scalable paradigm for SLM optimization.
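As a rough illustration of what such LLM-guided feedback could look like, the sketch below transcribes a generated continuation with an ASR system and asks a text LLM to rate its coherence. The judge prompt, the 1-to-5 scale, and the fallback handling are assumptions for exposition, not the paper's exact protocol:

    from typing import Any, Callable

    JUDGE_PROMPT = (
        "Prompt transcript:\n{prompt}\n\n"
        "Continuation transcript:\n{continuation}\n\n"
        "On a scale of 1 (incoherent) to 5 (highly coherent and relevant), rate "
        "how well the continuation follows the prompt. Reply with a single number."
    )

    def semantic_feedback(
        asr: Callable[[Any], str],        # e.g. a Whisper-style transcriber
        llm_judge: Callable[[str], str],  # text LLM acting as the automatic judge
        prompt_audio: Any,
        continuation_audio: Any,
    ) -> float:
        """Score a speech continuation by transcribing it and asking an LLM judge."""
        query = JUDGE_PROMPT.format(
            prompt=asr(prompt_audio), continuation=asr(continuation_audio)
        )
        reply = llm_judge(query).strip()
        try:
            return float(reply)
        except ValueError:
            return 1.0  # treat unparsable replies as the lowest score

A scorer of this shape plugs into build_preference_pairs above as the score callable (with the prompt audio fixed via a closure or functools.partial).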

Future Directions and Implications

The implications of this research are manifold. The framework establishes a solid foundation for future SLM advancements by leveraging reinforcement learning to train models for superior semantic continuity without relying on intermediate text prediction. Align-SLM's success in improving semantic content through a direct reinforcement learning approach encourages expansion into broader applications across diverse languages, especially those lacking comprehensive written resources.

Future research could consider evaluating Align-SLM's adaptability across more extensive datasets, assessing its compatibility with emerging speech models, and refining the semantic feedback loop to further align speech generation with nuanced human interactions.

The paper contributes to a progressive narrative in SLM research, exploring the theoretical and practical potential of applying DPO. As AI continues to evolve, the insights from this work could influence the development of real-time, inclusive spoken dialogue systems capable of supporting a broader spectrum of languages and dialects.

By addressing both the semantic and practical constraints that current SLMs face, Align-SLM marks a pivotal step in the evolution of end-to-end speech modeling frameworks, bringing them closer to their text-based counterparts in performance and application.
