Multilingual Turn-taking Prediction Using Voice Activity Projection (2403.06487v3)

Published 11 Mar 2024 in cs.CL, cs.SD, and eess.AS

Abstract: This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two audio encoders: contrastive predictive coding (CPC) pre-trained on English, and a recent model based on multilingual wav2vec 2.0 (MMS).
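
To make "predicts the upcoming voice activities" concrete, the following is a minimal sketch of how a VAP-style prediction target could be built for one time step. It assumes the discretization commonly described for VAP (a 2-second look-ahead per speaker, split into four bins of 0.2, 0.4, 0.6, and 0.8 seconds, giving 2 x 4 = 8 binary values and 256 possible states). The abstract does not state these details, so the bin sizes, the 50 Hz frame rate, the 0.5 threshold, and the function name `vap_label` are all illustrative assumptions, not the authors' code.

```python
from typing import Sequence

# Assumed bin durations in seconds (0.2 + 0.4 + 0.6 + 0.8 = 2.0 s look-ahead).
BIN_SECONDS = (0.2, 0.4, 0.6, 0.8)


def vap_label(
    future_va: Sequence[Sequence[int]],
    frame_hz: int = 50,
    threshold: float = 0.5,
) -> int:
    """Encode the future voice activity of two speakers as one of 256 states.

    future_va[s][t] is 1 if speaker s is speaking in future frame t, else 0.
    Each speaker needs at least 2 seconds of future frames.
    """
    label = 0
    for speaker_frames in future_va:
        start = 0
        for seconds in BIN_SECONDS:
            n = round(seconds * frame_hz)  # number of frames in this bin
            window = speaker_frames[start:start + n]
            # A bin counts as "active" if the speaker talks in at least
            # `threshold` of its frames (hypothetical rule for this sketch).
            bit = 1 if sum(window) / n >= threshold else 0
            # Pack bits so the first bin of speaker 0 is the most significant.
            label = (label << 1) | bit
            start += n
    return label


if __name__ == "__main__":
    silent = [0] * 100   # 2 s of silence at 50 Hz
    talking = [1] * 100  # 2 s of continuous speech at 50 Hz
    print(vap_label([silent, talking]))
```

In this sketch, a model trained on such targets would output a distribution over the 256 states at every frame, and turn-taking decisions would be derived by comparing the probability mass assigned to each speaker's near-future bins.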

Authors (5)
  1. Koji Inoue
  2. Bing'er Jiang
  3. Erik Ekstedt
  4. Tatsuya Kawahara
  5. Gabriel Skantze