Translatotron 2: Advancements in Direct Speech-to-Speech Translation with Voice Preservation
The paper "Translatotron 2: High-quality direct speech-to-speech translation with voice preservation" by Ye Jia et al., introduces a sophisticated neural model for direct S2ST systems, aiming at bridging the gap between the traditional cascade S2ST systems and direct S2ST models. The research outlines the significant improvements of Translatotron 2 over its predecessor, Translatotron, particularly in translation quality and naturalness of the translated speech.
Architectural Design of Translatotron 2
Translatotron 2 integrates a speech encoder, a linguistic decoder, and an acoustic synthesizer, connected by a single attention module. The model is trained end-to-end with a speech-to-phoneme translation objective. The encoder employs Conformer blocks to extract features from the input speech, while the decoder generates the phoneme sequence of the target-language speech.
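To make this layout concrete, here is a minimal PyTorch sketch of how the four components might fit together. Everything here is an illustrative assumption, not the authors' implementation: `Translatotron2Sketch`, the layer sizes, and the use of a generic TransformerEncoder (standing in for the paper's Conformer blocks) are all invented for exposition.

```python
import torch
import torch.nn as nn

class Translatotron2Sketch(nn.Module):
    """Illustrative skeleton: speech encoder -> linguistic decoder ->
    acoustic synthesizer, tied together by one shared attention module."""

    def __init__(self, n_mels=80, d_model=512, n_phonemes=100):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)
        # Speech encoder: generic self-attention stack standing in for the
        # Conformer encoder described in the paper.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        # Linguistic decoder: autoregressive states that predict the
        # target-language phoneme sequence.
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        # The single attention module connecting encoder output to both the
        # decoder's predictions and the synthesizer's input.
        self.attention = nn.MultiheadAttention(d_model, num_heads=8,
                                               batch_first=True)
        # Acoustic synthesizer: produces target mel frames from decoder
        # states plus the shared attention context.
        self.synthesizer = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, src_mels, dec_inputs):
        enc = self.encoder(self.input_proj(src_mels))   # (B, T_src, d)
        dec_out, _ = self.decoder(dec_inputs)           # (B, T_tgt, d)
        context, _ = self.attention(dec_out, enc, enc)  # shared attention
        phonemes = self.phoneme_head(dec_out)           # phoneme logits
        syn_out, _ = self.synthesizer(torch.cat([dec_out, context], dim=-1))
        return phonemes, self.mel_head(syn_out)         # logits, spectrogram
```

The key structural point the sketch captures is that the phoneme objective and the spectrogram output share one attention pass over the encoder, rather than each component re-attending independently.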
The duration-based autoregressive synthesizer, inspired by Non-Attentive Tacotron, mitigates robustness issues such as over-generation that plagued earlier attention-based models. This design sidesteps the difficulty of modeling long spectrogram sequences with attention alone and keeps the linguistic and acoustic outputs synchronized during translation.
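The following is a hedged sketch of the duration-based idea, assuming a simple per-state duration predictor; `DurationUpsampler` and its layer choices are hypothetical, and Non-Attentive Tacotron's actual Gaussian upsampling is more refined than the hard repetition shown here.

```python
import torch
import torch.nn as nn

class DurationUpsampler(nn.Module):
    """Predict a duration per decoder state, then repeat each state for that
    many frames. Since total output length is the sum of predicted durations,
    the synthesizer cannot loop or babble past the end of the input, which is
    the over-generation failure mode this design removes."""

    def __init__(self, d_model=512):
        super().__init__()
        self.duration_predictor = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1), nn.Softplus())  # strictly positive durations

    def forward(self, states):  # states: (1, T, d); batch of 1 for simplicity
        durations = self.duration_predictor(states).squeeze(-1)  # (1, T)
        frames = durations.round().clamp(min=1).long()
        # Repeat each state frames[t] times along the time axis.
        upsampled = torch.repeat_interleave(states[0], frames[0], dim=0)
        return upsampled.unsqueeze(0), durations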
Performance Evaluation
Translatotron 2 demonstrates substantial gains in translation quality, with improvements of up to +15.5 BLEU over its predecessor across datasets including Fisher Spanish-English and a multilingual Conversational corpus. The model narrows the performance gap to cascade S2ST systems, reducing the BLEU difference to as little as 0.4 in some settings.
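For context, translation quality in S2ST is commonly scored by transcribing the synthesized audio with an ASR system and computing BLEU on the transcripts. A minimal sketch of that pipeline, where `transcribe` is a hypothetical placeholder for any ASR system (not an API from the paper):

```python
import sacrebleu

def asr_bleu(translated_audio, reference_texts, transcribe):
    """Transcribe each translated waveform with ASR, then compute corpus
    BLEU of the transcripts against the reference translations."""
    hypotheses = [transcribe(wav) for wav in translated_audio]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score
```

One consequence of this protocol is that the reported BLEU also depends on the ASR system's error rate, so scores are comparable only when evaluated with the same transcriber.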
The paper also presents a thorough evaluation of speech naturalness and robustness. Translatotron 2 matches or closely approaches cascade systems in MOS ratings, indicating high speech synthesis quality. Furthermore, a reduced unaligned duration ratio (UDR) reflects significantly less over-generation than the original Translatotron.
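The paper's exact UDR procedure is not reproduced here, but a metric in this spirit can be sketched as follows, assuming word-level spans from a forced aligner (`aligned_spans` is a hypothetical input representation):

```python
def unaligned_duration_ratio(total_sec, aligned_spans):
    """Fraction of the audio duration not covered by any aligned word.
    `aligned_spans` is a list of (start_sec, end_sec) word alignments from a
    forced aligner; higher values suggest over-generated (babbled) audio."""
    aligned = sum(end - start for start, end in aligned_spans)
    return max(total_sec - aligned, 0.0) / total_sec

# e.g. 10 s of audio with words aligned over 8.3 s -> UDR of 0.17
print(unaligned_duration_ratio(10.0, [(0.5, 4.0), (4.2, 9.0)]))
```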
Voice Preservation Advancements
A cornerstone of this research is a new method for preserving speakers' voices during S2ST. Unlike previous methods that required speaker segmentation, Translatotron 2 preserves voices even across speaker turns without any explicit speaker or embedding information. The approach synthesizes the training targets with a cross-lingual TTS model capable of voice transfer, so voice preservation is controlled entirely by the training data; because the trained model accepts no reference audio at inference time, it cannot be repurposed for voice spoofing, a meaningful privacy safeguard.
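A hedged sketch of this training-data strategy, with `cross_lingual_tts` as a hypothetical stand-in for a voice-transferring TTS model (the paper uses its own model, not this API):

```python
def build_training_pair(source_wav, translated_text, cross_lingual_tts):
    """Synthesize the target side of a training example so that it carries
    the source speaker's voice. The S2ST model trained on such pairs learns
    voice preservation implicitly, with no speaker embedding input at
    inference time (and hence no handle for voice spoofing)."""
    target_wav = cross_lingual_tts(text=translated_text,
                                   reference=source_wav)
    return source_wav, target_wav
```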
Implications and Future Directions
This research carries two main implications: enabling communication between people who speak different languages, and preserving speaker identity in the translated speech. Practically, models like Translatotron 2 can lead to more natural and seamless translation applications in real-world settings. Theoretically, the framework established by this work opens avenues for future end-to-end S2ST research, particularly simultaneous translation and support for unwritten languages. Moreover, leveraging self-supervised pre-training and weakly supervised data could further improve performance and extend applicability across linguistic contexts.
Overall, Translatotron 2 marks a convergence point where the quality of direct S2ST begins to align with traditional systems, setting a foundation for future exploration in neural translation models. The paper underscores the importance of preserving speaker identity while ensuring high-quality translation, a pivotal aspect for future AI-driven communication tools.