Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
The paper “Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation” by Sravya Popuri et al. explores techniques to improve the performance and efficiency of direct speech-to-speech translation (S2ST) systems, primarily addressing the data scarcity that limits this domain. The authors introduce a framework that combines self-supervised pre-training with data augmentation strategies to improve translation quality.
Overview
Traditional S2ST systems typically rely on a cascaded approach that chains automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS). Cascades benefit from mature components and abundant training data for each stage, but they suffer from error propagation across stages and higher latency. Direct S2ST systems bypass intermediate text generation, promising faster inference and potential application to languages without a writing system. However, the development of effective direct S2ST models is impeded by the lack of extensive parallel S2ST data. The authors propose employing self-supervised pre-training on unlabeled speech data alongside data augmentation to mitigate this issue.
Methodology
The paper leverages the speech-to-unit translation (S2UT) framework, encoding target speech into discrete units rather than traditional speech waveforms. This framework facilitates advanced pre-training techniques, previously effective in speech-to-text (S2T) translation, to be adapted for S2UT. The paper particularly investigates:
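The core idea of the S2UT representation can be illustrated with a minimal, hypothetical sketch: continuous speech features from an encoder are assigned to the nearest centroid in a learned k-means codebook, producing a sequence of discrete unit IDs, and consecutive duplicates are typically collapsed into a reduced unit sequence. The feature vectors and codebook below are toy numbers, not real model outputs.

```python
# Toy sketch of discrete-unit extraction as used in S2UT-style pipelines:
# nearest-centroid (k-means) assignment followed by collapsing repeats.
# Values are illustrative, not taken from any real encoder or codebook.

def quantize(features, codebook):
    """Assign each feature vector to the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

def collapse_repeats(units):
    """Merge consecutive duplicate units into a reduced unit sequence."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]              # toy 3-unit codebook
features = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.1), (2.1, 0.1)]  # toy frame features
units = quantize(features, codebook)       # [0, 0, 1, 2]
reduced = collapse_repeats(units)          # [0, 1, 2]
```

Predicting discrete units instead of raw waveforms lets the translation model use a standard sequence-to-sequence architecture and loss, with a separate unit-to-speech vocoder handling waveform generation.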
- Model Pre-training: The speech encoder is initialized from wav2vec 2.0, pre-trained on unlabeled speech with a contrastive objective that learns robust audio representations; the decoder is initialized from a unit mBART, an mBART variant trained on discrete units rather than text. Both initializations improve the transferability of learned features to the downstream S2ST task.
- Data Augmentation: The paper utilizes multiple strategies for data augmentation, employing MT and TTS models to create supplementary training data, enhancing S2ST model robustness against the backdrop of limited parallel data.
- Finetuning Techniques: Several finetuning strategies are explored, including partial finetuning that updates only the layer normalization and attention modules (LNA), to improve the convergence of pre-trained models on S2ST tasks.
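The data-augmentation step above can be sketched as a simple pseudo-labeling flow: weakly supervised S2ST pairs are created from ASR data (speech plus transcript) by machine-translating the transcript and converting the translation to target discrete units. In the hypothetical sketch below, `mt_translate` and `text_to_units` are toy stubs standing in for real MT and text-to-unit (TTS-style) models; the file name and phrase table are invented for illustration.

```python
# Hypothetical sketch of creating weakly supervised S2ST training pairs
# from ASR data. `mt_translate` and `text_to_units` are toy stand-ins
# for real MT and text-to-unit models.

def mt_translate(src_text):
    """Toy MT stub: looks up a fixed phrase table."""
    table = {"hola mundo": "hello world"}
    return table.get(src_text, src_text)

def text_to_units(tgt_text):
    """Toy text-to-unit stub: one fake unit ID per non-space character."""
    return [ord(c) % 100 for c in tgt_text if c != " "]

def augment(asr_corpus):
    """Turn (speech, transcript) pairs into (speech, target-unit) pairs."""
    pseudo_pairs = []
    for speech, transcript in asr_corpus:
        tgt_text = mt_translate(transcript)           # MT step
        pseudo_pairs.append((speech, text_to_units(tgt_text)))  # TTS/unit step
    return pseudo_pairs

corpus = [("audio_001.wav", "hola mundo")]
augmented = augment(corpus)
```

The resulting pseudo-parallel pairs can then be mixed with the genuine parallel data during S2UT training.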
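LNA partial finetuning can likewise be sketched as a simple parameter-selection rule: only parameters whose names indicate layer-normalization or attention modules remain trainable, and everything else is frozen. The parameter names below are illustrative, loosely following common Transformer naming conventions, and are not taken from the paper's actual codebase.

```python
# Hypothetical sketch of LNA (LayerNorm-and-Attention) partial finetuning:
# keep only layer-norm and attention parameters trainable, freeze the rest.
# Name patterns are assumptions modeled on typical Transformer codebases.

LNA_PATTERNS = ("layer_norm", "self_attn", "encoder_attn")

def select_trainable(param_names):
    """Return the subset of parameter names to finetune under LNA."""
    return [n for n in param_names
            if any(p in n for p in LNA_PATTERNS)]

params = [
    "encoder.layers.0.self_attn.q_proj.weight",
    "encoder.layers.0.fc1.weight",
    "decoder.layers.0.encoder_attn.k_proj.weight",
    "decoder.layers.0.final_layer_norm.weight",
]
trainable = select_trainable(params)
# fc1 (feed-forward) stays frozen; attention and layer-norm weights train
```

Because only a small fraction of the parameters is updated, this style of partial finetuning reduces memory and compute while still letting the pre-trained model adapt to the S2ST task.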
Results
The experiments focus on Spanish-English translation, showing BLEU improvements of 6.6 to 12.1 over baseline systems trained with multitask learning. Using weakly supervised data produced by the augmentation pipeline yielded up to a further 3.1 BLEU gain. Notably, pre-trained models with augmented data surpassed cascaded ASR+MT+TTS systems in several configurations, demonstrating particular viability in low-resource setups.
Implications
The proposed methodologies have substantial implications for the development of S2ST technologies. Pre-training makes it possible to leverage large amounts of unlabeled speech, which is critical when labeled datasets are scarce. Additionally, the discrete-unit representation could enable translation systems for languages without a writing system, substantially broadening the reach of cross-linguistic communication technologies.
Future Directions
This investigation points to multiple trajectories for future research, including refining discrete unit representations for improved semantic retention and exploring cross-linguistic transfer in multilingual setups. Augmenting the expressive capabilities of S2ST models and enhancing the audio quality of outputs, particularly concerning prosody and speaker characteristics, remains a crucial challenge. Additionally, investigating finer-grained pre-training strategies and adaptive finetuning mechanisms could further enhance model performance in diverse linguistic contexts.
In conclusion, through methodical experimentation and comprehensive evaluation, this paper contributes a significant step forward in the optimization of direct speech-to-speech translation systems, presenting a valuable foundation for future breakthroughs in the field of multilingual communication technologies.