
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation (2406.06937v2)

Published 11 Jun 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
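The CTC decoding the abstract refers to collapses the decoder's blank and repeated tokens into the final output, which is what lets the model adjust its latency. A minimal sketch of this standard CTC greedy collapse rule follows; the blank id and token values are illustrative assumptions, not details from the paper.

```python
BLANK = 0  # assumed id for the CTC blank token (illustrative)

def ctc_greedy_collapse(token_ids):
    """Standard CTC rule: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in token_ids:
        if t != prev:          # a new run starts: keep its token once
            if t != BLANK:     # unless it is the blank symbol
                out.append(t)
        prev = t
    return out

# Raw chunk-level emissions with blanks and repeats -> collapsed output
print(ctc_greedy_collapse([0, 7, 7, 0, 0, 3, 3, 3, 0, 7]))  # [7, 3, 7]
```

By emitting more blanks for a chunk, the decoder can defer committing to output tokens, which is the mechanism the paper uses to trade latency for quality.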

Authors (6)
  1. Zhengrui Ma (18 papers)
  2. Qingkai Fang (19 papers)
  3. Shaolei Zhang (36 papers)
  4. Shoutao Guo (17 papers)
  5. Yang Feng (230 papers)
  6. Min Zhang (630 papers)
Citations (7)