Voice Activity Projection: Self-supervised Learning of Turn-taking Events (2205.09812v1)

Published 19 May 2022 in eess.AS and cs.SD

Abstract: The modeling of turn-taking in dialog can be viewed as the modeling of the dynamics of the interlocutors' voice activity. We extend prior work and define the predictive task of Voice Activity Projection, a general, self-supervised objective, as a way to train turn-taking models without the need for labeled data. We highlight a theoretical weakness in prior approaches and argue that the dependencies between voice activity events in the projection window need to be modeled. We propose four zero-shot tasks, related to the prediction of upcoming turn-shifts and backchannels, and show that the proposed model outperforms prior work.
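
To make the idea of a projection window concrete, here is a minimal Python sketch of one way the joint future voice activity of two speakers could be turned into a single self-supervised target. The abstract does not specify the encoding, so everything here is an assumption for illustration: the function name `projection_label`, the bin lengths in `bin_frames`, and the rule that a bin counts as active when at least half of its frames are active. The point it tries to show is that treating the whole bit pattern as one class keeps the dependency between events in the window, rather than predicting each bin independently.

```python
from typing import List, Sequence


def projection_label(
    va: Sequence[Sequence[int]],
    t: int,
    bin_frames: Sequence[int] = (2, 4, 6, 8),  # hypothetical bin lengths, in frames
) -> int:
    """Encode both speakers' voice activity after frame t as one integer.

    va[s][i] is 1 if speaker s is speaking in frame i, else 0.
    Assumption: a bin is active if at least half of its frames are active.
    """
    bits: List[int] = []
    for speaker_va in va:
        start = t
        for n in bin_frames:
            window = speaker_va[start:start + n]
            if len(window) < n:
                raise ValueError("not enough future frames for the projection window")
            bits.append(1 if 2 * sum(window) >= n else 0)
            start += n
    # Read the bits as a binary number so the joint pattern is a single class.
    label = 0
    for b in bits:
        label = (label << 1) | b
    return label


# Speaker A talks for 6 frames and then stops; speaker B starts when A stops.
speaker_a = [1] * 6 + [0] * 14
speaker_b = [0] * 6 + [1] * 14
print(projection_label([speaker_a, speaker_b], t=0))  # 195, i.e. 0b11000011
```

With four bins per speaker this sketch yields 2^8 = 256 possible classes, and no manual annotation is needed because the labels come directly from the voice activity itself. In the example, the pattern `11000011` describes a turn-shift from A to B inside the window.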

Authors (2)
  1. Erik Ekstedt (8 papers)
  2. Gabriel Skantze (29 papers)
Citations (22)