Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention (2110.08250v2)

Published 15 Oct 2021 in cs.CL, cs.SD, and eess.AS

Abstract: We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which the generation of the translation is independent of intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which a sequence of discrete representations learned in an unsupervised manner, rather than continuous spectrogram features, is predicted by the model and passed directly to a vocoder for on-the-fly speech synthesis. We also introduce variational monotonic multihead attention (V-MMA) to address the challenge of inefficient policy learning in simultaneous speech translation. The simultaneous policy then operates on source speech features and target discrete units. We carry out empirical studies comparing the cascaded and direct approaches on the Fisher Spanish-English and MuST-C English-Spanish datasets. The direct simultaneous model is shown to outperform the cascaded model by achieving a better trade-off between translation quality and latency.
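To make the described pipeline concrete, below is a minimal, hypothetical sketch of the simultaneous read/write loop the abstract outlines: a policy alternates between reading source speech chunks and writing target discrete units, and each emitted unit is passed to a vocoder for on-the-fly synthesis. The `FakePolicy`, `FakeUnitDecoder`, and `fake_vocoder` stand-ins, the wait-2 read heuristic, and all identifiers are illustrative assumptions, not the paper's V-MMA model or code.

```python
# Hypothetical sketch of a simultaneous speech-to-speech decoding loop with
# discrete units. All components here are toy stand-ins, not the paper's model.

from typing import List

READ, WRITE = "read", "write"
EOS = -1  # hypothetical end-of-sequence unit id


class FakePolicy:
    """Toy stand-in for a simultaneous policy: read two source chunks
    for every target unit written (a simple wait-2 heuristic)."""

    def decide(self, n_read: int, n_written: int) -> str:
        return READ if n_read < 2 * (n_written + 1) else WRITE


class FakeUnitDecoder:
    """Toy stand-in for the discrete-unit decoder: emits a fixed-length
    sequence of unit ids regardless of the source content."""

    def __init__(self, n_units: int = 5):
        self.n_units = n_units

    def next_unit(self, source_chunks: List[int], units: List[int]) -> int:
        return len(units) if len(units) < self.n_units else EOS


def fake_vocoder(unit: int) -> bytes:
    """Toy stand-in for a unit-to-waveform vocoder."""
    return bytes([unit % 256]) * 10


def simul_s2st(source_stream: List[int]) -> bytes:
    """Interleave reading source speech chunks with writing target units,
    synthesizing audio for each unit as soon as it is emitted."""
    policy, decoder = FakePolicy(), FakeUnitDecoder()
    read_chunks: List[int] = []
    units: List[int] = []
    audio = b""
    source_iter = iter(source_stream)
    source_exhausted = False

    while True:
        action = policy.decide(len(read_chunks), len(units))
        if action == READ and not source_exhausted:
            try:
                read_chunks.append(next(source_iter))
            except StopIteration:
                source_exhausted = True
        else:
            unit = decoder.next_unit(read_chunks, units)
            if unit == EOS:
                break
            units.append(unit)
            audio += fake_vocoder(unit)  # on-the-fly synthesis per unit
    return audio


if __name__ == "__main__":
    wav = simul_s2st(list(range(12)))
    print(f"synthesized {len(wav)} bytes of (fake) audio")
```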

Authors (9)
  1. Xutai Ma (23 papers)
  2. Hongyu Gong (44 papers)
  3. Danni Liu (23 papers)
  4. Ann Lee (29 papers)
  5. Yun Tang (42 papers)
  6. Peng-Jen Chen (26 papers)
  7. Wei-Ning Hsu (76 papers)
  8. Philipp Koehn (1 paper)
  9. Juan Pino (51 papers)
Citations (8)
