Enhancing Speech Translation with Discrete Speech Units Pretraining
Introduction to Compact Speech Translation
In the evolving field of Speech-to-Text Translation (ST), leveraging Self-Supervised Learning (SSL) models for initialization has become a standard approach for achieving state-of-the-art results. However, the substantial memory footprint of these models limits their practical applications, especially for on-device deployment. This paper presents an innovative approach that utilizes Discrete Speech Units (DSU) pretraining to condense the knowledge of large SSL models into more compact and efficient ST models. By pretraining on DSUs, the proposed method not only reduces the model size but also enhances its robustness to tokenization variations and makes it more suitable for low-resource settings.
Methodology
The proposed methodology has two stages: pretraining and fine-tuning. During pretraining, two encoder-decoder models are trained, one on Filterbank-to-DSU data and one on DSU-to-Translation data. The encoder of the first model and the decoder of the second are then used to initialize a compact model, which is fine-tuned on limited speech-translation data (a sketch of this scheme follows the list below). The DSUs serve as an intermediate representation bridging the speech and text modalities, effectively condensing the knowledge of the SSL model into a form the compact model can learn from. This approach provides several advantages:
- Reduced model size: The compact model is significantly smaller than the SSL model whose knowledge it condenses.
- Robustness: Because DSUs serve only as pretraining targets rather than as model inputs, the compact model avoids the lengthy DSU-extraction pipeline at inference time and is more robust to tokenization variations.
- Low-resource applicability: Since the method does not require transcripts for pretraining, it becomes viable for low-resource languages.
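The following PyTorch sketch illustrates the two-stage scheme described above. The architecture, dimensions, and the elided training loops are illustrative assumptions rather than the paper's exact configuration; only the encoder/decoder recombination step reflects the described method.

```python
import torch.nn as nn


class EncoderDecoder(nn.Module):
    """Generic encoder-decoder wrapper used for both pretraining stages."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder


def make_transformer(d_model: int = 256, num_layers: int = 6) -> EncoderDecoder:
    """Build a small Transformer encoder-decoder (placeholder architecture)."""
    enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
    return EncoderDecoder(nn.TransformerEncoder(enc_layer, num_layers),
                          nn.TransformerDecoder(dec_layer, num_layers))


# Stage 1: pretrain on Filterbank-to-DSU data (speech features -> discrete units).
fbk_to_dsu = make_transformer()
# ... train fbk_to_dsu on (filterbank, DSU) pairs ...

# Stage 2: pretrain on DSU-to-Translation data (discrete units -> target text).
dsu_to_translation = make_transformer()
# ... train dsu_to_translation on (DSU, translation) pairs ...

# Initialize the compact ST model with the speech encoder from stage 1 and the
# text decoder from stage 2, then fine-tune it on the limited ST data.
compact_st = EncoderDecoder(encoder=fbk_to_dsu.encoder,
                            decoder=dsu_to_translation.decoder)
```

Because the compact model consumes filterbank features directly, no DSU extraction is needed at inference time.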
To further improve performance and mitigate the modality gap that arises because the encoder and decoder are pretrained on different data, the paper also explores Connectionist Temporal Classification (CTC) regularization during both DSU pretraining and translation fine-tuning.
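As a rough illustration of how such regularization can be wired in, the sketch below interpolates the usual cross-entropy loss with a CTC auxiliary loss computed on the encoder states. The interpolation weight, vocabulary size, and tensor shapes are assumptions made for the example, not values from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 256, 1000   # assumed sizes; vocab is the CTC label set (e.g. the DSUs)
ctc_weight = 0.3                  # assumed interpolation weight
ctc_head = nn.Linear(d_model, vocab_size + 1)           # +1 for the CTC blank symbol
ctc_criterion = nn.CTCLoss(blank=vocab_size, zero_infinity=True)


def joint_loss(encoder_out, enc_lens, ctc_targets, ctc_target_lens,
               decoder_logits, ce_targets):
    """Cross-entropy on the decoder output plus a CTC auxiliary loss on the
    encoder states, applicable during both DSU pretraining and ST fine-tuning."""
    # CTC branch: log-probabilities shaped (T, B, V+1) over the label vocabulary.
    log_probs = F.log_softmax(ctc_head(encoder_out), dim=-1).transpose(0, 1)
    l_ctc = ctc_criterion(log_probs, ctc_targets, enc_lens, ctc_target_lens)

    # Standard cross-entropy branch on the decoder output, shaped (B, T, vocab).
    l_ce = F.cross_entropy(decoder_logits.flatten(0, 1), ce_targets.flatten())

    return (1 - ctc_weight) * l_ce + ctc_weight * l_ctc
```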
Experimental Results
The methodology was evaluated on CoVoST-2 X-En, encompassing 21 language directions, and revealed noteworthy improvements over existing methods:
- Models pretrained on DSUs outperformed direct fine-tuning of SSL models by more than 0.5 BLEU, despite having half the model size.
- The approach performed on par with ASR pretraining while requiring no transcripts, so it remains applicable in settings where ASR pretraining is not feasible.
Moreover, an analysis of tokenization effects confirmed the method's robustness across different DSU tokenization strategies, reinforcing the case for DSU pretraining as a route to compact and efficient ST models.
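For context, DSU sequences are often post-processed before being used as pretraining data; two common choices are collapsing consecutive duplicate units and applying subword (BPE) modelling over the units. The sketch below assumes these generic strategies and a hypothetical corpus file `dsu_corpus.txt`; it is not necessarily the exact set of tokenizations studied in the paper.

```python
import sentencepiece as spm


def deduplicate(units: list[int]) -> list[int]:
    """Collapse runs of identical units, e.g. [5, 5, 5, 9, 9] -> [5, 9]."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


def units_to_text(units: list[int]) -> str:
    """Map unit IDs to string tokens so a text tokenizer can be trained on them."""
    return " ".join(f"u{u}" for u in units)


# Train a BPE model over a (hypothetical) file of space-separated unit tokens,
# then encode a new DSU sequence with it.
spm.SentencePieceTrainer.train(input="dsu_corpus.txt", model_prefix="dsu_bpe",
                               vocab_size=4000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="dsu_bpe.model")
pieces = sp.encode(units_to_text(deduplicate([5, 5, 5, 9, 9, 12])), out_type=str)
```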
Future Directions
The promising results of this paper open up several avenues for future research. Investigating different clustering sizes and alternative acoustic encoders could further optimize the pretraining phase, and extracting DSUs from other layers or from stronger SSL models may yield incremental gains while keeping the model compact.
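As a reference point for these knobs, DSUs are commonly obtained by k-means clustering of hidden states from an SSL model. The sketch below assumes the standard HuBERT-plus-k-means recipe (via torchaudio and scikit-learn), with the layer index and cluster count as the tunable choices mentioned above; the specific values are illustrative, not the paper's.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

LAYER = 6    # which SSL layer to take features from (illustrative choice)
K = 1000     # clustering size, i.e. the number of discrete units (illustrative)

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()


@torch.inference_mode()
def layer_features(waveform: torch.Tensor) -> torch.Tensor:
    """Return (frames, dim) hidden states of the chosen HuBERT layer for one utterance."""
    feats, _ = hubert.extract_features(waveform, num_layers=LAYER)
    return feats[-1].squeeze(0)


# Fit k-means on features pooled from (a subset of) the training speech; the
# cluster index assigned to each frame is then its discrete speech unit.
kmeans = MiniBatchKMeans(n_clusters=K, batch_size=10_000)
# kmeans.fit(pooled_features)                       # pooled_features: (N_frames, dim)
# units = kmeans.predict(layer_features(waveform).numpy())
```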
Concluding Remarks
This paper presents an efficacious strategy for creating compact speech translation models through DSU pretraining, addressing significant limitations of existing methods in terms of model size and on-device deployment capabilities. Its ability to provide robust performance across various tokenizations, coupled with its suitability for low-resource settings, marks a substantial advancement in the field of speech-to-text translation. The implications of this research extend both practically, in enhancing the usability of ST models, and theoretically, in deepening our understanding of efficient model pretraining and knowledge distillation techniques.