Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations (2402.12786v2)
Abstract: In spoken dialogue, two current turns with identical wording can still call for different responses when they are spoken in different styles. Speaking style, which carries paralinguistic and prosodic information, marks the most significant difference between the text and speech modalities. When text-only LLMs are used to model spoken dialogue, they cannot vary their responses according to the speaking style of the current turn. In this paper, we focus on enabling LLMs to listen to speaking styles and respond properly. Our goal is to teach the LLM that "even if the sentences are identical, if they are spoken in different styles, their corresponding responses might be different". Since no suitable dataset exists for this goal, we collect a speech-to-speech dataset, StyleTalk, with the following desired characteristic: when two current-turn speeches have the same content but are spoken in different styles, their responses differ. To teach LLMs to understand and respond properly to speaking styles, we propose the Spoken-LLM framework, which models both the linguistic content and the speaking style. We train Spoken-LLM on the StyleTalk dataset and devise a two-stage training pipeline that helps Spoken-LLM better learn speaking styles. Extensive experiments show that Spoken-LLM outperforms text-only baselines and prior speech LLM methods.
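The core idea of the abstract can be sketched in a few lines: the response is conditioned on both the linguistic content and a speaking-style descriptor, so two turns with identical wording but different styles produce different conditioning contexts. This is only an illustrative sketch; the names and the prompt format below are assumptions, and the actual framework uses a speech encoder feeding an LLM, not string concatenation.

```python
# Hypothetical sketch of style-conditioned dialogue modeling, as motivated
# by the paper's abstract. All identifiers here are illustrative, not the
# authors' implementation.

from dataclasses import dataclass


@dataclass(frozen=True)
class SpokenTurn:
    text: str   # linguistic content (e.g., from ASR)
    style: str  # paralinguistic/prosodic descriptor (e.g., emotion, speed)


def build_context(turn: SpokenTurn) -> str:
    # A text-only LLM sees only turn.text; a style-aware model additionally
    # injects the speaking style into the conditioning context.
    return f"[style: {turn.style}] {turn.text}"


# Identical content, different styles -> different contexts, hence the
# model can produce different responses.
a = SpokenTurn("I just got the results.", style="cheerful, fast")
b = SpokenTurn("I just got the results.", style="sad, slow")

assert a.text == b.text
assert build_context(a) != build_context(b)
```

A text-only baseline collapses `a` and `b` into the same input, which is exactly the failure mode StyleTalk is designed to expose.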
- Guan-Ting Lin
- Cheng-Han Chiang
- Hung-yi Lee