Enhancing Speech Translation with Discrete Speech Units Pretraining
Introduction to Compact Speech Translation
In the evolving field of Speech-to-Text Translation (ST), leveraging Self-Supervised Learning (SSL) models for initialization has become a standard approach for achieving state-of-the-art results. However, the substantial memory footprint of these models limits their practical applications, especially for on-device deployment. This paper presents an innovative approach that utilizes Discrete Speech Units (DSU) pretraining to condense the knowledge of large SSL models into more compact and efficient ST models. By pretraining on DSUs, the proposed method not only reduces the model size but also enhances its robustness to tokenization variations and makes it more suitable for low-resource settings.
Methodology
The proposed methodology has two stages: pretraining and fine-tuning. During pretraining, two encoder-decoder models are trained, one on Filterbank-to-DSU data and one on DSU-to-Translation data. The encoder of the first model and the decoder of the second are then used to initialize a compact model, which is fine-tuned on limited speech-translation data (a sketch of this scheme follows the list below). The DSUs serve as an intermediate representation bridging the speech and text modalities, effectively condensing the knowledge of the SSL model into a form the compact model can learn from. This approach provides several advantages:
- Reduced model size: The compact model is significantly smaller than the SSL model whose knowledge it condenses.
- Robustness: Because DSUs serve only as pretraining targets rather than as model inputs, the compact model avoids the lengthy DSU-extraction pipeline at inference time and is more robust to tokenization variations.
- Low-resource applicability: Since the method does not require transcripts for pretraining, it becomes viable for low-resource languages.
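The following PyTorch sketch illustrates the two-stage scheme described above. The architecture, dimensions, and the elided training loops are illustrative assumptions rather than the paper's exact configuration; only the encoder/decoder recombination step reflects the described method.

```python
import torch.nn as nn


class EncoderDecoder(nn.Module):
    """Generic encoder-decoder wrapper used for both pretraining stages."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder


def make_transformer(d_model: int = 256, num_layers: int = 6) -> EncoderDecoder:
    """Build a small Transformer encoder-decoder (placeholder architecture)."""
    enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
    return EncoderDecoder(nn.TransformerEncoder(enc_layer, num_layers),
                          nn.TransformerDecoder(dec_layer, num_layers))


# Stage 1: pretrain on Filterbank-to-DSU data (speech features -> discrete units).
fbk_to_dsu = make_transformer()
# ... train fbk_to_dsu on (filterbank, DSU) pairs ...

# Stage 2: pretrain on DSU-to-Translation data (discrete units -> target text).
dsu_to_translation = make_transformer()
# ... train dsu_to_translation on (DSU, translation) pairs ...

# Initialize the compact ST model with the speech encoder from stage 1 and the
# text decoder from stage 2, then fine-tune it on the limited ST data.
compact_st = EncoderDecoder(encoder=fbk_to_dsu.encoder,
                            decoder=dsu_to_translation.decoder)
```

Because the compact model consumes filterbank features directly, no DSU extraction is needed at inference time.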
To further improve performance and mitigate the modality gap that arises because the encoder and decoder are pretrained on different data, the paper also explores Connectionist Temporal Classification (CTC) regularization during both DSU pretraining and translation fine-tuning.
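As a rough illustration of how such regularization can be wired in, the sketch below interpolates the usual cross-entropy loss with a CTC auxiliary loss computed on the encoder states. The interpolation weight, vocabulary size, and tensor shapes are assumptions made for the example, not values from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 256, 1000   # assumed sizes; vocab is the CTC label set (e.g. the DSUs)
ctc_weight = 0.3                  # assumed interpolation weight
ctc_head = nn.Linear(d_model, vocab_size + 1)           # +1 for the CTC blank symbol
ctc_criterion = nn.CTCLoss(blank=vocab_size, zero_infinity=True)


def joint_loss(encoder_out, enc_lens, ctc_targets, ctc_target_lens,
               decoder_logits, ce_targets):
    """Cross-entropy on the decoder output plus a CTC auxiliary loss on the
    encoder states, applicable during both DSU pretraining and ST fine-tuning."""
    # CTC branch: log-probabilities shaped (T, B, V+1) over the label vocabulary.
    log_probs = F.log_softmax(ctc_head(encoder_out), dim=-1).transpose(0, 1)
    l_ctc = ctc_criterion(log_probs, ctc_targets, enc_lens, ctc_target_lens)

    # Standard cross-entropy branch on the decoder output, shaped (B, T, vocab).
    l_ce = F.cross_entropy(decoder_logits.flatten(0, 1), ce_targets.flatten())

    return (1 - ctc_weight) * l_ce + ctc_weight * l_ctc
```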
Experimental Results
The methodology was evaluated on CoVoST-2 X-En, encompassing 21 language directions, and revealed noteworthy improvements over existing methods:
- Models pretrained on DSUs outperformed direct fine-tuning of SSL models by more than 0.5 BLEU, despite having half the model size.
- The approach performed on par with ASR pretraining while requiring no transcripts, so it remains applicable in settings where ASR pretraining is not feasible.
Moreover, an analysis of tokenization effects confirmed the method's robustness across different DSU tokenization strategies, reinforcing the case for DSU pretraining as a route to compact and efficient ST models.
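For context, DSU sequences are often post-processed before being used as pretraining data; two common choices are collapsing consecutive duplicate units and applying subword (BPE) modelling over the units. The sketch below assumes these generic strategies and a hypothetical corpus file `dsu_corpus.txt`; it is not necessarily the exact set of tokenizations studied in the paper.

```python
import sentencepiece as spm


def deduplicate(units: list[int]) -> list[int]:
    """Collapse runs of identical units, e.g. [5, 5, 5, 9, 9] -> [5, 9]."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


def units_to_text(units: list[int]) -> str:
    """Map unit IDs to string tokens so a text tokenizer can be trained on them."""
    return " ".join(f"u{u}" for u in units)


# Train a BPE model over a (hypothetical) file of space-separated unit tokens,
# then encode a new DSU sequence with it.
spm.SentencePieceTrainer.train(input="dsu_corpus.txt", model_prefix="dsu_bpe",
                               vocab_size=4000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="dsu_bpe.model")
pieces = sp.encode(units_to_text(deduplicate([5, 5, 5, 9, 9, 12])), out_type=str)
```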
Future Directions
The promising results of this paper open up several avenues for future research. Investigating different clustering sizes and alternative acoustic encoders could further optimize the pretraining phase, and extracting DSUs from other layers or from stronger SSL models may yield incremental gains while keeping the model compact.
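As a reference point for these knobs, DSUs are commonly obtained by k-means clustering of hidden states from an SSL model. The sketch below assumes the standard HuBERT-plus-k-means recipe (via torchaudio and scikit-learn), with the layer index and cluster count as the tunable choices mentioned above; the specific values are illustrative, not the paper's.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

LAYER = 6    # which SSL layer to take features from (illustrative choice)
K = 1000     # clustering size, i.e. the number of discrete units (illustrative)

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()


@torch.inference_mode()
def layer_features(waveform: torch.Tensor) -> torch.Tensor:
    """Return (frames, dim) hidden states of the chosen HuBERT layer for one utterance."""
    feats, _ = hubert.extract_features(waveform, num_layers=LAYER)
    return feats[-1].squeeze(0)


# Fit k-means on features pooled from (a subset of) the training speech; the
# cluster index assigned to each frame is then its discrete speech unit.
kmeans = MiniBatchKMeans(n_clusters=K, batch_size=10_000)
# kmeans.fit(pooled_features)                       # pooled_features: (N_frames, dim)
# units = kmeans.predict(layer_features(waveform).numpy())
```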
Concluding Remarks
This paper presents an efficacious strategy for creating compact speech translation models through DSU pretraining, addressing significant limitations of existing methods in terms of model size and on-device deployment capabilities. Its ability to provide robust performance across various tokenizations, coupled with its suitability for low-resource settings, marks a substantial advancement in the field of speech-to-text translation. The implications of this research extend both practically, in enhancing the usability of ST models, and theoretically, in deepening our understanding of efficient model pretraining and knowledge distillation techniques.