Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection (2401.04868v1)
Published 10 Jan 2024 in cs.CL, cs.HC, cs.SD, and eess.AS
Abstract: A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.
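The abstract's point about input context length can be made concrete with a minimal sketch of a streaming loop that keeps only the most recent few seconds of two-channel audio and re-runs a predictor on each new chunk. Everything named here (`ContextWindow`, `placeholder_model`, `stream_predictions`, the sample rate and window length) is hypothetical and chosen for illustration. The predictor is a trivial energy ratio standing in for a trained model; it is not the VAP model and makes no claim about the authors' implementation.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple

# One chunk of dialogue audio: (speaker 1 samples, speaker 2 samples).
StereoChunk = Tuple[List[float], List[float]]


class ContextWindow:
    """Keeps only the most recent `context_sec` seconds of two-channel audio."""

    def __init__(self, sample_rate: int, context_sec: float) -> None:
        # Assumed parameters: neither value is taken from the paper.
        self.max_samples = int(sample_rate * context_sec)
        self.ch1: Deque[float] = deque(maxlen=self.max_samples)
        self.ch2: Deque[float] = deque(maxlen=self.max_samples)

    def push(self, chunk: StereoChunk) -> None:
        """Append a new chunk; the oldest samples fall out automatically."""
        first, second = chunk
        if len(first) != len(second):
            raise ValueError("both channels must have the same number of samples")
        self.ch1.extend(first)
        self.ch2.extend(second)

    def snapshot(self) -> StereoChunk:
        """Return the current context as plain lists, one per speaker."""
        return list(self.ch1), list(self.ch2)


def placeholder_model(context: StereoChunk) -> Tuple[float, float]:
    """Stand-in predictor (NOT the VAP model): each channel's share of energy."""
    e1 = sum(x * x for x in context[0])
    e2 = sum(x * x for x in context[1])
    total = e1 + e2
    if total == 0.0:
        return 0.5, 0.5
    return e1 / total, e2 / total


def stream_predictions(
    chunks: Iterable[StereoChunk],
    window: ContextWindow,
    model: Callable[[StereoChunk], Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Run the model once per incoming chunk on the bounded context."""
    outputs = []
    for chunk in chunks:
        window.push(chunk)
        outputs.append(model(window.snapshot()))
    return outputs
```

Because the window is bounded, the per-step cost of the predictor depends on the chosen context length rather than on how long the conversation has run, which is the trade-off between context length and real-time operation that the abstract refers to.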
Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze