Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models (2309.12763v2)

Published 22 Sep 2023 in eess.AS, cs.CL, and cs.SD

Abstract: Self-supervised representation learning (SSRL) has demonstrated superior performance to supervised models for tasks including phoneme recognition. Training SSRL models poses a challenge for low-resource languages, where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition, and we evaluate phoneme recognition. Our comparisons found that a combined synthetic-augmentation (noise/pitch) strategy outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data needed to achieve performance equivalent to a model pre-trained with target-domain speech. Our findings suggest that for resource-constrained languages, combined augmentations can be a more viable option than the other augmentations.
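The two synthetic augmentations the abstract names, noise addition and pitch variation, can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names and SNR handling are assumptions, and the resampling-based pitch perturbation shown here changes duration along with pitch (closer to speed perturbation), whereas the paper's exact implementation is not specified on this page.

```python
import numpy as np

def add_noise(clean, noise, snr_db, rng=None):
    """Mix a random segment of `noise` into `clean` at a target SNR in dB.

    The noise is tiled if shorter than the speech, then scaled so that
    the clean-to-noise power ratio over the mixed segment equals snr_db.
    """
    rng = rng or np.random.default_rng(0)
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def pitch_perturb(signal, factor):
    """Crude pitch/speed perturbation by linear resampling.

    factor > 1 raises pitch (and shortens the signal); factor < 1 lowers it.
    """
    n_out = int(round(len(signal) / factor))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)
```

In an augmentation pipeline for low-resource pre-training, each utterance would typically be passed through one or both of these transforms before being fed to the SSRL model, multiplying the effective size of the pre-training corpus.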
