Reduce, Reuse, Recycle: Is Perturbed Data Better than Other-Language Augmentation for Low-Resource Self-Supervised Speech Models (2309.12763v2)
Abstract: Self-supervised representation learning (SSRL) has demonstrated superior performance to supervised models on tasks such as phoneme recognition. Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition, and we evaluate phoneme recognition. Our comparisons found that a combined synthetic augmentation (noise/pitch) strategy outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data required to achieve performance equivalent to a model pre-trained with target-domain speech. Our findings suggest that, for resource-constrained languages, combined augmentations are a more viable option than accent or other-language augmentations.
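To make the combined synthetic augmentation strategy concrete, below is a minimal sketch of pitch variation plus noise addition applied to raw waveforms before SSRL pre-training. The library choices (librosa, numpy), the semitone range, the SNR range, and the file names are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a combined pitch/noise augmentation pipeline (assumed parameters).
import numpy as np
import librosa

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise segment into speech at a target signal-to-noise ratio (dB)."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def augment(speech: np.ndarray, noise: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Apply a random pitch shift followed by noise addition."""
    n_steps = np.random.uniform(-2.0, 2.0)   # semitones; assumed range
    shifted = librosa.effects.pitch_shift(speech, sr=sr, n_steps=n_steps)
    snr_db = np.random.uniform(5.0, 20.0)    # assumed SNR range
    return add_noise(shifted, noise, snr_db)

if __name__ == "__main__":
    speech, sr = librosa.load("speech.wav", sr=16000)
    noise, _ = librosa.load("noise.wav", sr=16000)  # e.g. a noise-corpus clip
    augmented = augment(speech, noise, sr)
```

Each perturbed copy of an utterance can then be added to the pre-training pool, which is how such augmentations let the effective dataset size scale beyond the available target-language recordings.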