
Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection (2401.04868v1)

Published 10 Jan 2024 in cs.CL, cs.HC, cs.SD, and eess.AS

Abstract: A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.

Authors (5)
  1. Koji Inoue
  2. Bing'er Jiang
  3. Erik Ekstedt
  4. Tatsuya Kawahara
  5. Gabriel Skantze
Citations (6)

Summary

Introduction to Turn-taking Prediction

Turn-taking is an integral part of communication: predicting when a speaker will start or stop speaking is key to smooth dialogue exchanges. Traditional spoken dialogue systems (SDSs) often rely on fixed silence thresholds to decide speaker transitions, which can cause interruptions or delayed responses. Such thresholds overlook the complexity of human conversational patterns, in which a pause does not necessarily signal the end of a turn.

Voice Activity Projection Model

To address these challenges, the authors build on voice activity projection (VAP), a framework that directly maps stereo dialogue audio to future voice activities. The model encodes each audio channel with contrastive predictive coding (CPC) and self-attention transformers, fuses the two channels with a cross-attention transformer, and is trained in a multitask fashion on both voice activity projection (predicting near-future speaking probability) and voice activity detection (identifying current voice activity).
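The pipeline described above can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' released implementation: the small convolutional stack stands in for the pretrained CPC encoder, and the layer sizes, head counts, and the 256-way discretization of future voice-activity states are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class VAPSketch(nn.Module):
    """Illustrative VAP-style model (not the authors' exact architecture).

    Each channel of the stereo dialogue audio is encoded (a small conv
    stack stands in for the pretrained CPC encoder), refined by a shared
    self-attention transformer, fused by cross-attention between the two
    speakers, and fed to two heads: a "projection" head over discretized
    future voice-activity states and a per-channel VAD head.
    """

    def __init__(self, d_model=64, n_states=256):
        super().__init__()
        # Stand-in for the frozen CPC feature extractor (assumption).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=10, stride=5),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=8, stride=4),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.self_attn = nn.TransformerEncoder(enc_layer, num_layers=1)
        # Cross-attention: each channel attends to the other one.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.proj_head = nn.Linear(2 * d_model, n_states)  # future VA states
        self.vad_head = nn.Linear(2 * d_model, 2)          # current VA per channel

    def forward(self, stereo):  # stereo: (batch, 2, samples)
        a = self.encoder(stereo[:, :1]).transpose(1, 2)  # (B, T, d)
        b = self.encoder(stereo[:, 1:]).transpose(1, 2)
        a, b = self.self_attn(a), self.self_attn(b)
        a2, _ = self.cross_attn(a, b, b)  # speaker A attends to B
        b2, _ = self.cross_attn(b, a, a)  # speaker B attends to A
        h = torch.cat([a2, b2], dim=-1)   # (B, T, 2d)
        return self.proj_head(h), self.vad_head(h)
```

Training would pair the projection head with a discretized window of future voice activity per frame and the VAD head with frame-level activity labels, summing the two losses.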

The VAP System in Action

The paper examines the performance of the VAP system and showcases its potential in real-time applications. Crucially, the model is confirmed to run in real time on a CPU alone. Experiments with various input context lengths show that a balance can be struck between prediction accuracy and the processing speed needed for real-time interaction. Through an explanatory GUI, the VAP model indicates the likelihood that a speaker will continue their turn or yield it to the other participant, including moments of uncertainty in which the next speaker is not immediately clear.
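The real-time operation hinges on bounding the audio context the model sees: incoming chunks are appended to a fixed-length window, and old audio falls off the back. A minimal sketch of such a streaming driver is below; the chunk size, sample rate, and the `model_fn` callable are placeholder assumptions, not the authors' inference code.

```python
from collections import deque

SAMPLE_RATE = 16000   # assumed sample rate
CHUNK = 800           # 50 ms of audio per callback at 16 kHz (assumption)

class StreamingPredictor:
    """Sliding-window driver for a frame-level turn-taking model.

    `model_fn` is any callable mapping the buffered samples to a
    prediction; a real system would call the VAP model here. The deque's
    maxlen caps the input context, trading accuracy for speed.
    """

    def __init__(self, model_fn, context_sec=10):
        self.model_fn = model_fn
        # deque silently discards samples older than the context window
        self.buffer = deque(maxlen=context_sec * SAMPLE_RATE)

    def push(self, chunk):
        # Append the newest audio chunk, then re-run prediction on the
        # (bounded) context. Old samples beyond maxlen are dropped.
        self.buffer.extend(chunk)
        return self.model_fn(list(self.buffer))
```

Shortening `context_sec` reduces per-step compute at some cost in prediction quality, which is exactly the trade-off the paper quantifies.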

Future Directions and Conclusion

The presented work demonstrates that SDSs can be given a level of turn-taking nuance unattainable with simple silence timeouts. Because performance remains robust even with reduced input context lengths, the VAP model emerges as a promising component for real-time SDSs. Future work includes integrating the VAP system into a variety of dialogue systems and evaluating it through dialogue experiments.
