Translatotron 2: Advancements in Direct Speech-to-Speech Translation with Voice Preservation
The paper "Translatotron 2: High-quality direct speech-to-speech translation with voice preservation" by Ye Jia et al., introduces a sophisticated neural model for direct S2ST systems, aiming at bridging the gap between the traditional cascade S2ST systems and direct S2ST models. The research outlines the significant improvements of Translatotron 2 over its predecessor, Translatotron, particularly in translation quality and naturalness of the translated speech.
Architectural Design of Translatotron 2
Translatotron 2 integrates a speech encoder, a linguistic decoder, and an acoustic synthesizer, connected by a single attention module. The model is trained end-to-end with a speech-to-phoneme translation objective. The encoder employs Conformer blocks to extract features from the input speech, while the decoder generates the phoneme sequence of the target-language speech.
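To make this layout concrete, here is a minimal PyTorch sketch of how the four components might fit together. Everything here is an illustrative assumption, not the authors' implementation: `Translatotron2Sketch`, the layer sizes, and the use of a generic TransformerEncoder (standing in for the paper's Conformer blocks) are all invented for exposition.

```python
import torch
import torch.nn as nn

class Translatotron2Sketch(nn.Module):
    """Illustrative skeleton: speech encoder -> linguistic decoder ->
    acoustic synthesizer, tied together by one shared attention module."""

    def __init__(self, n_mels=80, d_model=512, n_phonemes=100):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)
        # Speech encoder: generic self-attention stack standing in for the
        # Conformer encoder described in the paper.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        # Linguistic decoder: autoregressive states that predict the
        # target-language phoneme sequence.
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        # The single attention module connecting encoder output to both the
        # decoder's predictions and the synthesizer's input.
        self.attention = nn.MultiheadAttention(d_model, num_heads=8,
                                               batch_first=True)
        # Acoustic synthesizer: produces target mel frames from decoder
        # states plus the shared attention context.
        self.synthesizer = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, src_mels, dec_inputs):
        enc = self.encoder(self.input_proj(src_mels))   # (B, T_src, d)
        dec_out, _ = self.decoder(dec_inputs)           # (B, T_tgt, d)
        context, _ = self.attention(dec_out, enc, enc)  # shared attention
        phonemes = self.phoneme_head(dec_out)           # phoneme logits
        syn_out, _ = self.synthesizer(torch.cat([dec_out, context], dim=-1))
        return phonemes, self.mel_head(syn_out)         # logits, spectrogram
```

The key structural point the sketch captures is that the phoneme objective and the spectrogram output share one attention pass over the encoder, rather than each component re-attending independently.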
The duration-based autoregressive synthesizer, inspired by Non-Attentive Tacotron, mitigates robustness issues such as over-generation that plagued earlier attention-based models. This design sidesteps the difficulty of modeling long spectrogram sequences with attention alone and keeps the linguistic and acoustic outputs synchronized during translation.
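The following is a hedged sketch of the duration-based idea, assuming a simple per-state duration predictor; `DurationUpsampler` and its layer choices are hypothetical, and Non-Attentive Tacotron's actual Gaussian upsampling is more refined than the hard repetition shown here.

```python
import torch
import torch.nn as nn

class DurationUpsampler(nn.Module):
    """Predict a duration per decoder state, then repeat each state for that
    many frames. Since total output length is the sum of predicted durations,
    the synthesizer cannot loop or babble past the end of the input, which is
    the over-generation failure mode this design removes."""

    def __init__(self, d_model=512):
        super().__init__()
        self.duration_predictor = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1), nn.Softplus())  # strictly positive durations

    def forward(self, states):  # states: (1, T, d); batch of 1 for simplicity
        durations = self.duration_predictor(states).squeeze(-1)  # (1, T)
        frames = durations.round().clamp(min=1).long()
        # Repeat each state frames[t] times along the time axis.
        upsampled = torch.repeat_interleave(states[0], frames[0], dim=0)
        return upsampled.unsqueeze(0), durations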
Performance Evaluation
Translatotron 2 demonstrates substantial gains in translation quality, with improvements of up to +15.5 BLEU over its predecessor across datasets including Fisher Spanish-English and a multilingual Conversational corpus. The model narrows the performance gap to cascade S2ST systems, reducing the BLEU difference to as little as 0.4 in some settings.
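For context, translation quality in S2ST is commonly scored by transcribing the synthesized audio with an ASR system and computing BLEU on the transcripts. A minimal sketch of that pipeline, where `transcribe` is a hypothetical placeholder for any ASR system (not an API from the paper):

```python
import sacrebleu

def asr_bleu(translated_audio, reference_texts, transcribe):
    """Transcribe each translated waveform with ASR, then compute corpus
    BLEU of the transcripts against the reference translations."""
    hypotheses = [transcribe(wav) for wav in translated_audio]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score
```

One consequence of this protocol is that the reported BLEU also depends on the ASR system's error rate, so scores are comparable only when evaluated with the same transcriber.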
The paper also presents a thorough evaluation of speech naturalness and robustness. Translatotron 2 matches or closely approaches cascade systems in MOS ratings, indicating high speech synthesis quality. Furthermore, a reduced unaligned duration ratio (UDR) reflects significantly less over-generation than the original Translatotron.
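The paper's exact UDR procedure is not reproduced here, but a metric in this spirit can be sketched as follows, assuming word-level spans from a forced aligner (`aligned_spans` is a hypothetical input representation):

```python
def unaligned_duration_ratio(total_sec, aligned_spans):
    """Fraction of the audio duration not covered by any aligned word.
    `aligned_spans` is a list of (start_sec, end_sec) word alignments from a
    forced aligner; higher values suggest over-generated (babbled) audio."""
    aligned = sum(end - start for start, end in aligned_spans)
    return max(total_sec - aligned, 0.0) / total_sec

# e.g. 10 s of audio with words aligned over 8.3 s -> UDR of 0.17
print(unaligned_duration_ratio(10.0, [(0.5, 4.0), (4.2, 9.0)]))
```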
Voice Preservation Advancements
A cornerstone of this research is a new method for preserving speakers' voices during S2ST. Unlike previous methods that required speaker segmentation, Translatotron 2 preserves voices even across speaker turns without any explicit speaker or embedding information. The approach synthesizes the training targets with a cross-lingual TTS model capable of voice transfer, so voice preservation is controlled entirely by the training data; because the trained model accepts no reference audio at inference time, it cannot be repurposed for voice spoofing, a meaningful privacy safeguard.
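A hedged sketch of this training-data strategy, with `cross_lingual_tts` as a hypothetical stand-in for a voice-transferring TTS model (the paper uses its own model, not this API):

```python
def build_training_pair(source_wav, translated_text, cross_lingual_tts):
    """Synthesize the target side of a training example so that it carries
    the source speaker's voice. The S2ST model trained on such pairs learns
    voice preservation implicitly, with no speaker embedding input at
    inference time (and hence no handle for voice spoofing)."""
    target_wav = cross_lingual_tts(text=translated_text,
                                   reference=source_wav)
    return source_wav, target_wav
```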
Implications and Future Directions
This research carries two main implications: enabling communication between people who speak different languages, and preserving speaker identity in the translated speech. Practically, models like Translatotron 2 can lead to more natural and seamless translation applications in real-world settings. Theoretically, the framework established by this work opens avenues for future end-to-end S2ST research, particularly simultaneous translation and support for unwritten languages. Moreover, leveraging self-supervised pre-training and weakly supervised data could further improve performance and extend applicability across linguistic contexts.
Overall, Translatotron 2 marks a convergence point where the quality of direct S2ST begins to align with traditional systems, setting a foundation for future exploration in neural translation models. The paper underscores the importance of preserving speaker identity while ensuring high-quality translation, a pivotal aspect for future AI-driven communication tools.