Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion (2401.14717v1)

Published 26 Jan 2024 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS

Abstract: We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural and conversational interaction between humans and speech-enabled AI agents.
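To make the fusion idea concrete, below is a minimal PyTorch sketch of one possible late-fusion setup: frame-level acoustic embeddings (e.g., from a HuBERT-style encoder) and LLM hidden states aligned to the same frames are projected into a shared space, concatenated, and classified per frame into turn-taking/backchannel labels. This is an illustrative assumption, not the paper's published architecture; the feature dimensions, label set, and frame alignment are placeholders.

```python
import torch
import torch.nn as nn


class AcousticLLMFusion(nn.Module):
    """Hypothetical late-fusion classifier for per-frame turn-taking /
    backchannel prediction. Dimensions and label set are placeholders,
    not the paper's exact configuration."""

    def __init__(self, acoustic_dim=768, llm_dim=2048, hidden_dim=256, num_classes=3):
        super().__init__()
        # Project each modality's frame-level features into a shared space.
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        self.llm_proj = nn.Linear(llm_dim, hidden_dim)
        # Per-frame classifier over fused features,
        # e.g. {keep speaking, turn switch, backchannel}.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, num_classes),
        )

    def forward(self, acoustic_feats, llm_feats):
        # acoustic_feats: (batch, frames, acoustic_dim), e.g. HuBERT outputs
        # llm_feats:      (batch, frames, llm_dim), LLM states pre-aligned to frames
        fused = torch.cat(
            [self.acoustic_proj(acoustic_feats), self.llm_proj(llm_feats)], dim=-1
        )
        return self.classifier(fused)  # (batch, frames, num_classes) logits


if __name__ == "__main__":
    model = AcousticLLMFusion()
    a = torch.randn(2, 100, 768)   # dummy acoustic frames
    t = torch.randn(2, 100, 2048)  # dummy LLM states aligned to the same frames
    print(model(a, t).shape)       # torch.Size([2, 100, 3])
```

The paper's additional multi-task instruction fine-tuning of the LLM, which the abstract credits with further gains, is not reflected in this sketch.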

Authors (9)
  1. Jinhan Wang (9 papers)
  2. Long Chen (395 papers)
  3. Aparna Khare (12 papers)
  4. Anirudh Raju (20 papers)
  5. Pranav Dheram (7 papers)
  6. Di He (108 papers)
  7. Minhua Wu (12 papers)
  8. Andreas Stolcke (57 papers)
  9. Venkatesh Ravichandran (12 papers)
Citations (4)