An Expert Review of "Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics"
The paper "Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics" presents a comprehensive study of how Audio Foundation Models (FMs) handle turn-taking dynamics in conversation. The research highlights the importance of conversational fluidity in speech-based systems, focusing specifically on the execution and timing of turn-taking events.
The core contributions of the paper are the introduction of a novel evaluation protocol that assesses the turn-taking capabilities of spoken dialogue systems, empirical insights into existing dialogue systems, and the evaluation of various open-source and proprietary audio FMs.
Evaluation Protocol for Turn-Taking
The proposed evaluation protocol improves significantly on previous methodologies by focusing on the timing of turn-taking events. Traditional metrics largely analyzed global distributions of these events without addressing the specific temporal contexts in which they occur. The authors introduce a supervised model, trained on human-human conversations, that predicts the timing of turn changes, backchannels, interruptions, and other conversational events. This predictive model is then used to automatically evaluate how well audio FMs manage turns, reducing the need for labor-intensive and costly human judgments.
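The core idea of the protocol can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the event names, the `predictor_score` stand-in (a toy lookup table in place of the trained supervised model), and the `judge` function are all hypothetical.

```python
# Sketch of using a supervised turn-taking predictor as an automatic judge.
# All names and the toy scoring logic are illustrative assumptions; the
# real protocol uses a model trained on human-human conversations.

from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # e.g. "turn_change", "backchannel", "interruption"
    time_s: float  # when the dialogue system produced the event

def predictor_score(kind: str, time_s: float) -> float:
    """Stand-in for the trained predictor: probability that a human
    speaker would produce `kind` at `time_s`. A hard-coded lookup
    table replaces the real supervised model here."""
    appropriate_windows = {
        "turn_change": [(1.8, 2.4), (5.0, 5.6)],
        "backchannel": [(3.1, 3.5)],
    }
    for lo, hi in appropriate_windows.get(kind, []):
        if lo <= time_s <= hi:
            return 0.9
    return 0.1

def judge(system_events: list[Event], threshold: float = 0.5) -> float:
    """Fraction of the system's events the predictor deems well-timed."""
    if not system_events:
        return 0.0
    ok = sum(predictor_score(e.kind, e.time_s) >= threshold
             for e in system_events)
    return ok / len(system_events)

events = [Event("turn_change", 2.0), Event("backchannel", 4.0)]
print(judge(events))  # 0.5: one of the two events falls in a plausible window
```

The point of this design is that, once the predictor is trained, any dialogue system's turn-taking behavior can be scored automatically and consistently, without per-conversation human annotation.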
Insights into Current Systems
The paper reveals critical insights into the turn-taking practices of prominent spoken dialogue systems, including Moshi, a full-duplex end-to-end spoken dialogue system, and a VAD-based cascaded system. These assessments unveiled substantial deficiencies: both systems failed to reliably detect when to initiate or yield turns, rarely produced appropriate backchannel cues, and Moshi in particular interrupted conversations unpredictably. These findings indicate a significant gap between the natural conversational abilities of human speakers and those of current audio FMs.
Evaluation of Audio FMs on Turn-Taking Dynamics
The paper includes an evaluation of well-known audio FMs using a curated test benchmark from the Switchboard corpus to determine their proficiency in understanding and predicting turn-taking events. Results demonstrate that while some systems like Whisper+GPT-4o show relative strengths in predicting turn changes, all models reveal substantial shortcomings, particularly in recognizing backchannels and managing interruptions. This highlights a clear direction for further research and development in enhancing conversational proficiency in audio FMs.
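A per-event-type breakdown like the one reported can be computed straightforwardly. The sketch below is a hedged illustration only: the item format (event type, model answer, gold answer) and the example data are assumptions, not the benchmark's released format or actual results.

```python
# Illustrative per-event-type scoring of a model's turn-taking judgments.
# Item format and example data are hypothetical, not the paper's benchmark.

from collections import defaultdict

def per_event_accuracy(items):
    """items: list of (event_type, model_answer, gold_answer) triples.
    Returns accuracy broken down by event type, which exposes the kind
    of category-specific weaknesses (e.g. backchannels) the paper reports."""
    correct, total = defaultdict(int), defaultdict(int)
    for kind, pred, gold in items:
        total[kind] += 1
        correct[kind] += int(pred == gold)
    return {k: correct[k] / total[k] for k in total}

items = [
    ("turn_change", "yes", "yes"),
    ("turn_change", "no", "yes"),
    ("backchannel", "no", "yes"),
    ("interruption", "yes", "yes"),
]
print(per_event_accuracy(items))
# {'turn_change': 0.5, 'backchannel': 0.0, 'interruption': 1.0}
```

Reporting accuracy per event type rather than a single aggregate is what makes the benchmark's finding visible: a model can look competent on turn changes while failing almost entirely on backchannels and interruptions.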
Implications and Future Directions
This paper provides critical foundations for advancing conversational AI through improved turn management strategies. Practically, it suggests pathways for developing more naturally interactive voice assistants. Theoretically, it emphasizes the need for more sophisticated multimodal dialogue models capable of nuanced event prediction. Future research might refine supervised approaches, expand turn-taking datasets across languages, and explore novel architectures for real-time conversational engagement.
Overall, the contribution of this paper lies in its methodological advances and its analysis of the current challenges and potential improvements in the domain of audio foundation models. The open-sourcing of the evaluation platform promises to foster further research and development in conversational AI systems.