Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion (2401.14717v1)
Published 26 Jan 2024 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS
Abstract: We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with an LLM. Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge of the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural conversational interaction between humans and speech-enabled AI agents.
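The core idea of the fusion — combining per-frame acoustic embeddings with LLM-derived context embeddings and classifying each frame as continue / turn-shift / backchannel — can be sketched as below. This is a minimal illustration, not the paper's actual architecture: all dimensions, the late-fusion-by-concatenation design, the three-class label set, and the random stand-in weights are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
D_ACOUSTIC = 64   # per-frame acoustic embedding size (e.g., from a HuBERT-style encoder)
D_TEXT = 128      # LLM context embedding size, aligned to audio frames
N_CLASSES = 3     # e.g., continue-speaking, turn-shift, backchannel
T = 10            # number of audio frames

def fuse_and_classify(acoustic, text, W, b):
    """Late fusion sketch: concatenate per-frame embeddings, then apply a
    linear layer with softmax to get per-frame event probabilities."""
    fused = np.concatenate([acoustic, text], axis=-1)        # (T, D_ACOUSTIC + D_TEXT)
    logits = fused @ W + b                                   # (T, N_CLASSES)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)             # rows sum to 1

# Random stand-ins for real encoder outputs and trained classifier weights.
acoustic = rng.standard_normal((T, D_ACOUSTIC))
text = rng.standard_normal((T, D_TEXT))
W = rng.standard_normal((D_ACOUSTIC + D_TEXT, N_CLASSES)) * 0.1
b = np.zeros(N_CLASSES)

probs = fuse_and_classify(acoustic, text, W, b)
print(probs.shape)  # (10, 3): one probability distribution per frame
```

Because the classifier runs on every frame rather than only at detected pauses, predictions of this form support the continuous, streaming decisions the abstract describes.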
Authors: Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran