
Compact Speech Translation Models via Discrete Speech Units Pretraining (2402.19333v2)

Published 29 Feb 2024 in cs.CL, cs.SD, and eess.AS

Abstract: We propose a pretraining method that uses a Self-Supervised Speech (SSS) model to create more compact Speech-to-text Translation (ST) models. In contrast to using the SSS model for initialization, our method is more suitable for memory-constrained scenarios such as on-device deployment. The method is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, it pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data, respectively; the DSU thus serve as the distillation interface for the smaller models. Subsequently, the encoder of the Fbk-to-DSU model and the decoder of the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is fine-tuned on the paired Fbk-Trl data. In addition to being compact, the method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization at inference time and is more robust to DSU tokenization. Evaluation on CoVoST-2 (X-En) shows consistent improvements over the baseline on three metrics while the model remains compact, i.e., only half the size of the SSS model.

Enhancing Speech Translation with Discrete Speech Units Pretraining

Introduction to Compact Speech Translation

In the evolving field of Speech-to-Text Translation (ST), leveraging Self-Supervised Learning (SSL) models for initialization has become a standard approach for achieving state-of-the-art results. However, the substantial memory footprint of these models limits their practical applications, especially for on-device deployment. This paper presents an innovative approach that utilizes Discrete Speech Units (DSU) pretraining to condense the knowledge of large SSL models into more compact and efficient ST models. By pretraining on DSUs, the proposed method not only reduces the model size but also enhances its robustness to tokenization variations and makes it more suitable for low-resource settings.
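
For concreteness, the following is a minimal sketch of how DSUs are typically obtained from an SSL model: features from an intermediate layer are extracted and clustered with k-means, and each speech frame is mapped to its cluster index. The checkpoint name, layer index, and cluster count below are illustrative assumptions, not necessarily the paper's configuration.

```python
# Sketch: extracting Discrete Speech Units (DSU) by clustering hidden states
# of a self-supervised speech model. Checkpoint, layer, and k are assumptions.
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model.eval()

def hidden_features(waveform, layer=6):
    """Return frame-level features from one intermediate layer (one vector per ~20 ms frame)."""
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)          # (frames, dim)

# Fit k-means on features pooled from a sample of training utterances, e.g.:
kmeans = MiniBatchKMeans(n_clusters=1000, batch_size=10_000)
# kmeans.fit(torch.cat([hidden_features(w) for w in sample_waveforms]).numpy())

def to_dsu(waveform):
    """Map an utterance to its DSU sequence (cluster index per frame); requires the fit above."""
    return kmeans.predict(hidden_features(waveform).numpy())
```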

Methodology

The proposed method has two stages: pretraining and fine-tuning. During pretraining, two smaller encoder-decoder models are trained, one on Filterbank-to-DSU data and one on DSU-to-Translation data. The encoder of the first model and the decoder of the second are then taken to initialize a compact model, which is fine-tuned on the (limited) paired speech-translation data. The DSUs serve as an intermediate representation that bridges the speech and text modalities, effectively condensing the knowledge of the SSL model into a form the compact model can learn from. This approach provides several advantages (a minimal code sketch of the recipe follows the list below):

  • Reduced model size: The compact model is only about half the size of the SSL model it is distilled from.
  • Robustness: Because the compact model consumes filterbank features rather than DSUs at inference time, it avoids the lengthy speech-discretization pipeline and is less sensitive to how the DSUs are tokenized.
  • Low-resource applicability: The pretraining requires no transcripts, so the method remains viable for low-resource languages.
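
The sketch below illustrates the two-stage recipe described above: pretrain a Fbk-to-DSU model and a DSU-to-Trl model, then assemble the compact ST model from the speech encoder of the former and the text decoder of the latter before fine-tuning. The architectures, dimensions, unit and vocabulary sizes, and the `train_seq2seq` helper are placeholders, not the paper's exact setup.

```python
# Sketch of the two-stage pretraining and assembly; all sizes are placeholders.
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Filterbank frames -> contextual hidden states."""
    def __init__(self, d_model=256):
        super().__init__()
        self.proj = nn.Linear(80, d_model)     # assumes 80-dim log-Mel filterbanks
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=6)
    def forward(self, fbk):                    # fbk: (B, T, 80)
        return self.layers(self.proj(fbk))

class TokenEncoder(nn.Module):
    """DSU token ids -> contextual hidden states."""
    def __init__(self, n_units=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=6)
    def forward(self, dsu):                    # dsu: (B, T) int ids
        return self.layers(self.embed(dsu))

class TextDecoder(nn.Module):
    """Autoregressive decoder over a target vocabulary (causal masking omitted for brevity)."""
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=6)
        self.out = nn.Linear(d_model, vocab_size)
    def forward(self, tgt, memory):            # tgt: (B, U) ids, memory: (B, T, d)
        return self.out(self.layers(self.embed(tgt), memory))

class CompactST(nn.Module):
    """Compact ST model assembled from the two pretrained halves."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
    def forward(self, fbk, tgt):
        return self.decoder(tgt, self.encoder(fbk))

# Step 1a: pretrain speech encoder + DSU decoder on Fbk-to-DSU data.
fbk_encoder, dsu_decoder = SpeechEncoder(), TextDecoder(vocab_size=1000)
# train_seq2seq(fbk_encoder, dsu_decoder, fbk_to_dsu_pairs)     # placeholder loop

# Step 1b: pretrain DSU encoder + translation decoder on DSU-to-Trl data.
dsu_encoder, trl_decoder = TokenEncoder(), TextDecoder(vocab_size=8000)
# train_seq2seq(dsu_encoder, trl_decoder, dsu_to_trl_pairs)     # placeholder loop

# Step 2: keep the speech encoder of (1a) and the text decoder of (1b),
# then fine-tune the assembled model on paired Fbk-to-translation data.
st_model = CompactST(fbk_encoder, trl_decoder)
# train_seq2seq(st_model.encoder, st_model.decoder, fbk_to_trl_pairs)
```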

To further improve the model's performance and mitigate the modality gap inherent in pretraining, the paper also explores the use of Connectionist Temporal Classification (CTC) regularization during both the DSU pretraining and the translation fine-tuning stages.
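
A hedged sketch of how such an auxiliary CTC term can be combined with the usual cross-entropy objective is shown below, reusing the assembled model from the previous sketch. The loss weight, padding index, `ctc_head` projection, and the choice of CTC target units are illustrative assumptions.

```python
# Sketch: auxiliary CTC regularization on encoder outputs plus seq2seq cross-entropy.
import torch.nn as nn

PAD_ID = 1                                     # assumed padding index
ce_loss = nn.CTCLoss(blank=0, zero_infinity=True) if False else None  # placeholder removed below
ce_loss = nn.CrossEntropyLoss(ignore_index=PAD_ID)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def training_step(model, ctc_head, batch, ctc_weight=0.3):
    """One step with an auxiliary CTC term; ctc_head is e.g. nn.Linear(d_model, ctc_vocab)."""
    enc = model.encoder(batch["fbk"])                          # (B, T, d_model)
    logits = model.decoder(batch["tgt_in"], enc)               # (B, U, vocab)

    ce = ce_loss(logits.transpose(1, 2), batch["tgt_out"])     # seq2seq cross-entropy

    log_probs = ctc_head(enc).log_softmax(-1).transpose(0, 1)  # (T, B, ctc_vocab)
    ctc = ctc_loss(log_probs, batch["ctc_targets"],
                   batch["enc_lengths"], batch["ctc_lengths"])

    return ce + ctc_weight * ctc                               # weight is an assumption
```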

Experimental Results

The methodology was evaluated on CoVoST-2 X-En, encompassing 21 language directions, and revealed noteworthy improvements over existing methods:

  • Models pretrained on DSUs outperformed directly fine-tuning the SSL model by more than 0.5 BLEU, despite being only half its size.
  • The approach is on par with ASR pretraining while requiring no transcripts, so it remains applicable in settings where ASR pretraining is not feasible.

Moreover, the exploration of tokenization effects underscored the method's robustness across different tokenization strategies, further emphasizing the advantages of DSU pretraining in creating compact and efficient ST models.
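
As an illustration of what tokenizing DSU sequences can mean in practice, the sketch below renders unit indices as symbols and learns a subword model over the resulting corpus; the symbol format, vocabulary size, and the use of SentencePiece BPE here are assumptions for illustration, not the paper's exact tokenization.

```python
# Sketch: one way to tokenize DSU sequences before pretraining (all choices illustrative).
import sentencepiece as spm

def dsu_to_text(units):
    """Render a DSU index sequence as a whitespace-separated symbol string."""
    return " ".join(f"u{u}" for u in units)

# Write dsu_to_text(...) lines for the training corpus into dsu_corpus.txt, then:
spm.SentencePieceTrainer.train(
    input="dsu_corpus.txt",          # hypothetical corpus of rendered DSU lines
    model_prefix="dsu_bpe",
    vocab_size=4000,                 # robustness is studied across such choices
    model_type="bpe",
)
tok = spm.SentencePieceProcessor(model_file="dsu_bpe.model")
# tok.encode(dsu_to_text([12, 12, 87, 87, 5]), out_type=int)
```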

Future Directions

The promising results of this paper open up several avenues for future research. Investigating the impact of varying clustering sizes and the potential of other acoustic encoders could further optimize the pretraining phase. Additionally, exploring other layers or stronger SSL models for extracting DSUs holds the potential to incrementally improve the method's effectiveness while maintaining a compact model size.

Concluding Remarks

This paper presents an effective strategy for creating compact speech translation models through DSU pretraining, addressing significant limitations of existing methods in terms of model size and on-device deployment capabilities. Its ability to provide robust performance across various tokenizations, coupled with its suitability for low-resource settings, marks a substantial advancement in the field of speech-to-text translation. The implications of this research extend both practically, in enhancing the usability of ST models, and theoretically, in deepening our understanding of efficient model pretraining and knowledge distillation techniques.

Authors (3)
  1. Tsz Kin Lam
  2. Alexandra Birch
  3. Barry Haddow