Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion (2401.14717v1)
Published 26 Jan 2024 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS
Abstract: We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with an LLM. Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge of the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural conversational interaction between humans and speech-enabled AI agents.
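The core idea of the fusion — combining per-frame acoustic embeddings with LLM-derived context embeddings and classifying each frame as continue / turn-shift / backchannel — can be sketched as below. This is a minimal illustration, not the paper's actual architecture: all dimensions, the late-fusion-by-concatenation design, the three-class label set, and the random stand-in weights are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
D_ACOUSTIC = 64   # per-frame acoustic embedding size (e.g., from a HuBERT-style encoder)
D_TEXT = 128      # LLM context embedding size, aligned to audio frames
N_CLASSES = 3     # e.g., continue-speaking, turn-shift, backchannel
T = 10            # number of audio frames

def fuse_and_classify(acoustic, text, W, b):
    """Late fusion sketch: concatenate per-frame embeddings, then apply a
    linear layer with softmax to get per-frame event probabilities."""
    fused = np.concatenate([acoustic, text], axis=-1)        # (T, D_ACOUSTIC + D_TEXT)
    logits = fused @ W + b                                   # (T, N_CLASSES)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)             # rows sum to 1

# Random stand-ins for real encoder outputs and trained classifier weights.
acoustic = rng.standard_normal((T, D_ACOUSTIC))
text = rng.standard_normal((T, D_TEXT))
W = rng.standard_normal((D_ACOUSTIC + D_TEXT, N_CLASSES)) * 0.1
b = np.zeros(N_CLASSES)

probs = fuse_and_classify(acoustic, text, W, b)
print(probs.shape)  # (10, 3): one probability distribution per frame
```

Because the classifier runs on every frame rather than only at detected pauses, predictions of this form support the continuous, streaming decisions the abstract describes.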
Authors: Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran