StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning
The paper "StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning" presents a novel approach to simultaneous speech-to-speech translation (Simul-S2ST). Simul-S2ST is a critical technology for low-latency communication, enabling the translation of speech in real-time scenarios such as international conferences and live broadcasts. StreamSpeech proposes a direct Simul-S2ST model that leverages multi-task learning to jointly optimize translation and simultaneous policy.
Overview of StreamSpeech
StreamSpeech is designed to output target speech while still receiving streaming speech input, addressing the dual challenge of deciding what to translate and when to translate it. Its multi-task learning framework lets a single unified model handle both offline and simultaneous tasks, including automatic speech recognition (ASR), speech-to-text translation (S2TT), and text-to-unit generation (T2U).
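To make the joint objective concrete, the sketch below shows one plausible way the per-task losses could be combined into a single training loss. The task names follow the paper, but the dataclass, function name, and equal default weights are illustrative assumptions, not the paper's exact recipe.

```python
from dataclasses import dataclass

@dataclass
class TaskLosses:
    s2tt: float      # autoregressive speech-to-text translation (cross-entropy)
    asr: float       # auxiliary ASR task (CTC)
    nar_s2tt: float  # auxiliary non-autoregressive S2TT task (CTC)
    t2u: float       # text-to-unit generation (CTC)

def total_loss(l: TaskLosses, w_asr: float = 1.0,
               w_nar: float = 1.0, w_t2u: float = 1.0) -> float:
    # Weighted sum of the per-task losses; all tasks share the same
    # streaming encoder, so one backward pass updates it jointly.
    # NOTE: equal default weights are an assumption for illustration,
    # not the paper's reported configuration.
    return l.s2tt + w_asr * l.asr + w_nar * l.nar_s2tt + w_t2u * l.t2u
```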
Key Components:
- Streaming Speech Encoder: Uses a chunk-based Conformer to encode streaming input, with bidirectional attention inside each local chunk and causal attention across chunks, so audio can be processed incrementally as it arrives (see the attention-mask sketch after this list).
- Simultaneous Text Decoder: Autoregressively generates target text from the source speech received so far. CTC alignments from the auxiliary ASR and non-autoregressive speech-to-text translation (NAR-S2TT) tasks tell the decoder when enough source content has arrived to emit the next tokens, serving as the read/write policy (a sketch of this decision rule follows the list).
- Non-autoregressive Text-to-Unit Generation: Converts the text decoder's hidden states into discrete acoustic units with a T2U encoder and a unit-level CTC decoder. Decoding all unit positions in parallel keeps generation fast, and the resulting units are synthesized into a waveform by a HiFi-GAN vocoder (see the unit-decoding sketch below).
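To illustrate the chunk-based encoding, here is a minimal sketch of the attention mask such an encoder could use. The function name and the use of PyTorch are assumptions, not code from the paper.

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Self-attention mask for a chunk-based streaming encoder.

    Frame i may attend to frame j iff j lies in the same chunk as i or
    in an earlier chunk: attention is bidirectional inside each chunk
    and causal across chunks, so no future chunk is ever consulted.
    """
    chunk_ids = torch.arange(seq_len) // chunk_size
    # allowed[i, j] is True when attention from i to j is permitted
    return chunk_ids.unsqueeze(1) >= chunk_ids.unsqueeze(0)

# e.g. chunk_attention_mask(6, 2) lets frame 3 see frames 0-3
# (its own chunk plus the previous one) but not frames 4-5.
```

Varying the chunk size at inference time trades context for latency: smaller chunks react faster but see less input per step.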
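The read/write decision of the simultaneous text decoder can be pictured as follows: count how many target tokens the NAR-S2TT CTC alignment over the speech received so far already accounts for, and write only when that count exceeds what has been emitted. The function names and the exact decision rule here are illustrative assumptions, not the paper's implementation.

```python
def ctc_collapse(alignment: list[int], blank: int = 0) -> list[int]:
    """Collapse a frame-level CTC alignment: merge repeats, drop blanks."""
    out, prev = [], blank
    for tok in alignment:
        if tok != blank and tok != prev:
            out.append(tok)
        prev = tok
    return out

def read_or_write(nar_s2tt_alignment: list[int], num_written: int) -> str:
    """If the CTC alignment over the received speech already contains
    more target tokens than have been emitted, writing the next token
    is considered safe; otherwise keep reading speech."""
    ready = len(ctc_collapse(nar_s2tt_alignment))
    return "WRITE" if ready > num_written else "READ"
```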
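For the non-autoregressive text-to-unit stage, a greedy CTC unit decoder might look like the sketch below; the collapse step is the same CTC post-processing as in the policy sketch, and again the names are assumptions.

```python
import torch

def generate_units(unit_logits: torch.Tensor, blank: int = 0) -> list[int]:
    """Greedy non-autoregressive unit decoding.

    unit_logits: (T, V) frame-level scores from the unit CTC decoder.
    Every frame is decoded in parallel with a single argmax; repeats
    are then collapsed and blanks removed. The resulting discrete
    units are what a vocoder such as HiFi-GAN turns into a waveform.
    """
    frames = unit_logits.argmax(dim=-1).tolist()
    units, prev = [], blank
    for u in frames:
        if u != blank and u != prev:
            units.append(u)
        prev = u
    return units
```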
Experimental Results
The paper evaluates StreamSpeech extensively on the CVSS-C benchmark, covering the French-to-English, Spanish-to-English, and German-to-English language pairs. The results show that StreamSpeech achieves state-of-the-art performance on both offline S2ST and Simul-S2ST tasks. Notable findings include:
- Offline S2ST Performance: StreamSpeech outperforms the state-of-the-art UnitY model by an average of 1.5 BLEU.
- Simul-S2ST Performance: StreamSpeech shows significant advantages over baseline models, especially under low-latency conditions. It surpasses the fixed wait-k policy (sketched below) and cascaded systems such as ASR+HMT+TTS, highlighting the effectiveness of its direct translation approach.
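For reference, the fixed wait-k baseline follows a content-agnostic schedule. A minimal sketch (treating the speech stream as fixed-size chunks, which is a simplification for illustration) is:

```python
def wait_k_writes(num_chunks_read: int, k: int) -> int:
    """Fixed wait-k schedule: after reading t source chunks, at most
    max(0, t - k + 1) target tokens may have been written. The policy
    never inspects the content, which is why fixed schedules tend to
    lag behind adaptive, alignment-driven policies at low latency."""
    return max(0, num_chunks_read - k + 1)

# With k = 3: nothing is written for the first two chunks, one token
# after the third, two after the fourth, and so on.
```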
Implications of Multi-task Learning
Multi-task learning in StreamSpeech offers several benefits:
- Intermediate Supervision: The integration of ASR and NAR-S2TT tasks provides intermediate supervision, enhancing the model's overall translation quality.
- Policy Guidance: The CTC alignments between source speech and text indicate when sufficient source content has arrived to translate, guiding the decision of when to emit target tokens and effectively balancing the trade-off between latency and translation quality.
- Unified Optimization: The end-to-end training approach ensures joint optimization of translation and policy, making the model adaptable to different latency requirements.
Future Developments
While StreamSpeech sets a new benchmark for Simul-S2ST, future research could explore several areas:
- Voice Cloning: Integrating voice cloning capabilities to preserve the speaker's voice characteristics in the target speech, enhancing the authenticity of communication.
- Different Languages and Domains: Extending the model to other languages and domains to generalize its applicability.
- Further Reducing Latency: Exploring advanced architectures and optimization techniques to further minimize latency without compromising translation quality.
Conclusion
StreamSpeech represents a significant advancement in simultaneous speech-to-speech translation, providing a robust solution for real-time communication. Its multi-task learning framework efficiently tackles the challenges of translation and policy, delivering high-quality translation and synthesis with low latency. The model's adaptability to various latency scenarios and its ability to present intermediate results underscore its practical value, paving the way for future innovations in the field of AI-driven real-time translation.