Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue (2312.15316v2)
Abstract: Large language models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which is essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose the Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes both text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken dialogue. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multimodal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation, with autoregressive conditioning across the three tasks. We use the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset. Experimental results indicate that the proposed serialized multitasking method outperforms typical sequence classification techniques on both current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.
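The serialized multitasking idea above can be illustrated with a minimal sketch: the decoder emits, in one autoregressive sequence, the current utterance's sentiment, then the predicted response sentiment, then the response text, so each later task is conditioned on the earlier predictions. The delimiter tokens and function names below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the serialized target sequence described in the abstract.
# The special tokens <cur_sent>, <resp_sent>, <resp_text> are hypothetical
# placeholders for whatever delimiters the real model's tokenizer would use.

def build_serialized_target(current_sentiment: str,
                            response_sentiment: str,
                            response_text: str) -> str:
    """Concatenate the three sub-tasks into one autoregressive sequence:
    current sentiment -> response sentiment -> response text."""
    return (f"<cur_sent> {current_sentiment} "
            f"<resp_sent> {response_sentiment} "
            f"<resp_text> {response_text}")

def parse_serialized_output(decoded: str) -> tuple[str, str, str]:
    """Recover the three predictions from a decoded sequence."""
    _, rest = decoded.split("<cur_sent>", 1)
    cur_sent, rest = rest.split("<resp_sent>", 1)
    resp_sent, resp_text = rest.split("<resp_text>", 1)
    return cur_sent.strip(), resp_sent.strip(), resp_text.strip()

if __name__ == "__main__":
    seq = build_serialized_target("positive", "neutral",
                                  "Yeah, that sounds good.")
    print(parse_serialized_output(seq))
```

Because the sequence is generated left to right, the response-sentiment tokens are conditioned on the already-emitted current sentiment, and the response text is conditioned on both, which is the autoregressive conditioning the abstract describes.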
Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko