
Applying General Turn-taking Models to Conversational Human-Robot Interaction

Published 15 Jan 2025 in cs.CL and cs.RO (arXiv:2501.08946v1)

Abstract: Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with an LLM for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.

Summary

  • The paper introduces a dual-model system combining TurnGPT and VAP to improve natural turn-taking in HRI.
  • It achieves faster responses (1.5s median) and significantly reduces interruption rates compared to traditional models.
  • User studies confirm enhanced conversational fluency and human-likeness in interactions with the proposed system.

Applying General Turn-taking Models to Conversational Human-Robot Interaction

Introduction

The paper "Applying General Turn-taking Models to Conversational Human-Robot Interaction" explores advanced turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to enhance conversational dynamics in Human-Robot Interaction (HRI). Traditional HRI systems rely on silence-based models which result in unnatural pauses and interruptions. The paper introduces a method to use these models in tandem to improve response generation and turn-taking accuracy, thereby significantly reducing response delays and interruptions. The proposed system is evaluated against a traditional baseline through empirical analysis using the Furhat robot and interactions with 39 adult participants.

System Architecture

The architectural design integrates the TurnGPT and VAP models with autonomous response generation to improve turn-taking dynamics in robots (Figure 1).

Figure 1: System architecture. New components in proposed system shown in green.
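
To make the tandem design concrete, the sketch below shows one way the two predictions could be combined in a single decision step: TurnGPT flags a pragmatically complete utterance so response generation can start early, while VAP confirms from the audio that the user is actually yielding the turn. The class, thresholds, and function names are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a tandem turn-taking decision, assuming the system exposes
# a TurnGPT TRP probability (from the ASR transcript) and a VAP turn-shift
# probability (from the audio stream). Thresholds and names are hypothetical.

from dataclasses import dataclass


@dataclass
class TurnDecision:
    prepare_response: bool  # start generating the LLM response speculatively
    take_turn: bool         # commit: start speaking
    yield_turn: bool        # user is interrupting: stop speaking


def decide(turngpt_trp_prob: float,
           vap_shift_prob: float,
           robot_is_speaking: bool,
           prepare_threshold: float = 0.4,
           take_threshold: float = 0.6,
           interrupt_threshold: float = 0.7) -> TurnDecision:
    """Combine verbal (TurnGPT) and acoustic (VAP) cues into one decision."""
    if robot_is_speaking:
        # While the robot speaks, VAP watches for the user taking over.
        return TurnDecision(False, False, vap_shift_prob > interrupt_threshold)

    # TurnGPT flags a pragmatically complete utterance early, so response
    # generation can start before the turn is actually taken.
    prepare = turngpt_trp_prob > prepare_threshold
    # VAP confirms from prosody and timing that the user is really yielding.
    take = prepare and vap_shift_prob > take_threshold
    return TurnDecision(prepare, take, False)


print(decide(turngpt_trp_prob=0.8, vap_shift_prob=0.7, robot_is_speaking=False))
```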

TurnGPT: Verbal Domain Predictions

TurnGPT is an LLM based on GPT-2 that incorporates syntactic and semantic cues to predict when a speaker is likely to yield the turn. It analyzes the dialogue text for pragmatic completeness, identifying transition-relevant places (TRPs) by assigning turn-completion probabilities; the model used here is trained on the SODA dataset. Because it operates on text alone, TurnGPT can make near real-time predictions, although it is limited to the verbal and semantic aspects of the signal.
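
As an illustration of the TRP mechanism, the sketch below estimates a turn-shift probability with a GPT-2-style model and a special turn-shift token. The `<ts>` token, the `gpt2` checkpoint, and the interface are assumptions for illustration only and do not reproduce the actual TurnGPT package; with an untrained `<ts>` embedding the printed numbers are of course not meaningful.

```python
# Illustrative sketch of TRP estimation with a GPT-2-style model, in the spirit
# of TurnGPT. The "<ts>" turn-shift token and checkpoint name are assumptions.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<ts>"]})
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # a trained TurnGPT-style model would already include <ts>
model.eval()

ts_id = tokenizer.convert_tokens_to_ids("<ts>")


def trp_probability(dialogue_so_far: str) -> float:
    """Probability that the next token is a turn shift (a transition-relevant place)."""
    ids = tokenizer(dialogue_so_far, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # distribution over the next token
    return torch.softmax(logits, dim=-1)[ts_id].item()


print(trp_probability("do you want coffee"))      # a trained model assigns a high TRP here
print(trp_probability("do you want coffee or"))   # low TRP: the utterance is incomplete
```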

VAP: Acoustic Domain Predictions

VAP is a transformer-based acoustic model that continuously predicts conversational dynamics from raw audio input, forecasting the voice activity of both speakers up to two seconds ahead to detect turn shifts and interruptions. Trained on large dialogue datasets such as the Fisher and Switchboard corpora, VAP complements TurnGPT by capturing prosodic and timing cues, thus addressing the limitations of text-based predictions.
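
The sketch below shows how such continuous, frame-level predictions could be consumed in a streaming loop. The `VAPModel` stand-in and its output field are invented for illustration; a real VAP model is a trained transformer operating on stereo 16 kHz audio with a richer output.

```python
# Hypothetical streaming loop around a VAP-style model. The VAPModel stand-in
# and its output are placeholders, not the published VAP code.

import numpy as np


class VAPModel:
    """Placeholder for a trained VAP model."""

    def step(self, frame: np.ndarray) -> dict:
        # A real model predicts the voice activity of both speakers over the
        # next ~2 seconds. This stand-in just maps low energy (the user going
        # quiet) to a higher probability that the turn shifts to the robot.
        energy = float(np.abs(frame).mean())
        p_shift = 1.0 - min(energy * 20.0, 1.0)
        return {"p_shift_to_robot": p_shift}


def stream_shift_probs(model: VAPModel, audio: np.ndarray, frame_len: int = 320):
    """Feed 20 ms frames (320 samples at 16 kHz) and yield turn-shift probabilities."""
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        yield model.step(audio[start:start + frame_len])["p_shift_to_robot"]


if __name__ == "__main__":
    mic = (np.random.randn(16000) * 0.01).astype(np.float32)  # 1 s of fake near-silence
    probs = list(stream_shift_probs(VAPModel(), mic))
    print(f"max turn-shift probability over 1 s: {max(probs):.2f}")
```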

Evaluation Results

The proposed turn-taking system was evaluated on metrics including response time and interruption rate, showing significant improvements over the traditional baseline system (Figure 2).

Figure 2: Histogram of response times.

Response Time and Interruption Rate

The proposed system achieved faster response times, with a median of 1.5 seconds compared to the baseline's 2.7 seconds, and significantly reduced the interruption rate (6.9% vs. 16.6%). The continuous predictions of TurnGPT and VAP enable a more natural conversational flow and reduce user-perceived interruptions.
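
As a sketch of how such metrics could be derived from interaction logs: the field names and example numbers below are invented for illustration and are not the study's data.

```python
# Minimal sketch: median response time and interruption rate from turn logs.

from statistics import median


def response_metrics(turns: list[dict]) -> dict:
    """turns: one dict per robot turn, e.g.
    {"response_delay": seconds between user turn end and robot speech onset,
     "interrupted_user": True if the robot started while the user was still speaking}
    """
    delays = [t["response_delay"] for t in turns]
    interruptions = sum(t["interrupted_user"] for t in turns)
    return {
        "median_response_time_s": median(delays),
        "interruption_rate": interruptions / len(turns),
    }


# Example with made-up numbers:
log = [{"response_delay": 1.4, "interrupted_user": False},
       {"response_delay": 1.6, "interrupted_user": True},
       {"response_delay": 1.5, "interrupted_user": False}]
print(response_metrics(log))  # {'median_response_time_s': 1.5, 'interruption_rate': 0.33...}
```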

User Study

User preferences overwhelmingly favored the proposed system, highlighting its perceived fluency, human-likeness, and the reduced effort required to maintain the conversation (Figure 3).

Figure 3: Answers to the questionnaire. Significance levels indicate Bonferroni-corrected Wilcoxon signed-rank tests (*p < 0.05, **p < 0.01, ***p < 0.001).
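
For reference, paired Wilcoxon signed-rank tests with Bonferroni correction across questionnaire items can be run as in the sketch below; the ratings are invented for illustration and are not the study's data.

```python
# Sketch of the questionnaire analysis: paired Wilcoxon signed-rank tests with
# a Bonferroni correction over the number of items. Data here is made up.

import numpy as np
from scipy.stats import wilcoxon

items = {
    # questionnaire item -> (ratings for proposed system, ratings for baseline)
    "fluency": (np.array([5, 4, 5, 4, 5, 5, 4, 5]),
                np.array([3, 3, 4, 2, 3, 3, 4, 3])),
    "human-likeness": (np.array([4, 5, 4, 4, 5, 4, 5, 4]),
                       np.array([3, 2, 3, 3, 4, 3, 3, 2])),
}

n_tests = len(items)
for name, (proposed, baseline) in items.items():
    stat, p = wilcoxon(proposed, baseline)
    p_corrected = min(p * n_tests, 1.0)  # Bonferroni correction
    print(f"{name}: W={stat:.1f}, corrected p={p_corrected:.3f}")
```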

Discussion

The research demonstrates that general turn-taking models trained on human-human dialogue can be effectively adapted to HRI, with potential for further improvement from data specific to the interaction setting and from additional cues such as gaze and multi-party dynamics. Future studies could explore refined response-preparation techniques and more flexible turn-taking configurations to further reduce delays and enhance dialogue coherence.

Conclusion

The integration of TurnGPT and VAP models significantly enhances turn-taking dynamics in HRI systems, reducing both response delays and interruption rates. The study indicates a significant preference among participants for interactions with the proposed system, paving the way for more natural and efficient human-robot conversations in diverse settings. The paper establishes foundational research for deploying more human-like interaction models in autonomous systems without domain-specific fine-tuning requirements.
