A Comparison of Speech Data Augmentation Methods Using S3PRL Toolkit (2303.00510v2)

Published 27 Feb 2023 in cs.SD, cs.AI, and eess.AS

Abstract: Data augmentations are known to improve robustness in speech-processing tasks. In this study, we summarize and compare different data augmentation strategies using the S3PRL toolkit. We explore how HuBERT and wav2vec perform using different augmentation techniques (SpecAugment, Gaussian Noise, Speed Perturbation) for Phoneme Recognition (PR) and Automatic Speech Recognition (ASR) tasks. We evaluate model performance in terms of phoneme error rate (PER) and word error rate (WER). From the experiments, we observed that SpecAugment slightly improves the performance of HuBERT and wav2vec on the original dataset. We also show that models trained on the Gaussian Noise and Speed Perturbation datasets are more robust when tested with augmented test sets.
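The three augmentation strategies named in the abstract can be sketched with torchaudio. Below is a minimal illustration, assuming 16 kHz mono input; the masking widths, SNR, and speed factor are illustrative assumptions rather than the authors' configuration, and the SpecAugment sketch omits the time-warping step of the original recipe.

```python
import torch
import torchaudio

def spec_augment(waveform: torch.Tensor) -> torch.Tensor:
    """SpecAugment (without time warping): mask random frequency and time bands."""
    spec = torchaudio.transforms.Spectrogram()(waveform)  # (..., freq, time)
    spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=30)(spec)
    spec = torchaudio.transforms.TimeMasking(time_mask_param=40)(spec)
    return spec

def add_gaussian_noise(waveform: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Add white Gaussian noise scaled to a target signal-to-noise ratio."""
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return waveform + torch.randn_like(waveform) * noise_power.sqrt()

def speed_perturb(waveform: torch.Tensor, sample_rate: int = 16000,
                  factor: float = 1.1) -> torch.Tensor:
    """Speed perturbation: resample so that playback at the original rate
    is `factor` times faster (both duration and pitch change)."""
    return torchaudio.functional.resample(
        waveform, orig_freq=sample_rate, new_freq=int(sample_rate / factor))
```

Note that SpecAugment operates on spectrogram features, while the Gaussian Noise and Speed Perturbation transforms act directly on the raw waveform.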
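Both reported metrics are normalized edit distances: WER counts word-level substitutions, insertions, and deletions against the reference length, and PER is the same computation over phoneme sequences. A minimal sketch of the standard Levenshtein formulation (not tied to the paper's evaluation code):

```python
def error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """(substitutions + insertions + deletions) / len(reference).
    Pass words for WER, phoneme symbols for PER."""
    # Dynamic-programming Levenshtein distance between token sequences.
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i                               # all deletions
    for j in range(len(hypothesis) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(reference)][len(hypothesis)] / len(reference)

# Example: one substitution in three words gives a WER of 1/3.
assert abs(error_rate("the cat sat".split(), "the cat sit".split()) - 1 / 3) < 1e-9
```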

References (39)
  1. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453, 2019.
  2. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
  3. The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines. arXiv preprint arXiv:1803.10609, 2018.
  4. Audio ALBERT: A lite BERT for self-supervised learning of audio representation. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 344–350. IEEE, 2021.
  5. Attention-based models for speech recognition. Advances in neural information processing systems, 28, 2015.
  6. Ville-Veikko Eklund. Data augmentation techniques for robust audio analysis. Master's thesis, Tampere University, 2019.
  7. The application of hidden Markov models in speech recognition. Foundations and Trends® in Signal Processing, 1(3):195–304, 2008.
  8. Awni Hannun. Sequence modeling with CTC. Distill, 2(11):e8, 2017.
  9. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
  10. Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5901–5905. IEEE, 2019.
  11. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  12. Elastic spectral distortion for low resource speech recognition with deep neural networks. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 309–314. IEEE, 2013.
  13. Audio augmentation for speech recognition. In Sixteenth annual conference of the international speech communication association, 2015.
  14. A study on data augmentation of reverberant speech for robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5220–5224. IEEE, 2017.
  15. Jinyu Li. Recent advances in end-to-end automatic speech recognition. APSIPA Transactions on Signal and Information Processing, 11(1), 2022.
  16. Rethinking evaluation in ASR: Are our models robust enough? arXiv preprint arXiv:2010.11745, 2020.
  17. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6419–6423. IEEE, 2020.
  18. TERA: Self-supervised learning of transformer encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2351–2366, 2021.
  19. WHAMR!: Noisy and reverberant single-channel speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 696–700. IEEE, 2020.
  20. Speech data augmentation for improving phoneme transcriptions of aphasic speech using wav2vec 2.0 for the PSST challenge. In 13th Language Resources and Evaluation Conference (LREC), pages 62–70, 2022.
  21. Improving multimodal speech recognition by data augmentation and speech representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4579–4588, 2022.
  22. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964.
  23. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
  24. A comparison of hybrid and end-to-end ASR systems for the IberSpeech-RTVE 2020 speech-to-text transcription challenge. Applied Sciences, 12(2):903, 2022.
  25. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.
  26. Data augmentation for low resource languages. In INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association, pages 810–814. International Speech Communication Association (ISCA), 2014.
  27. Unsupervised pretraining transfers well across languages. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7414–7418. IEEE, 2020.
  28. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862, 2019.
  29. An investigation of deep neural networks for noise robust speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7398–7402. IEEE, 2013.
  30. Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282, 2017.
  31. Suramya Tomar. Converting video formats with FFmpeg. Linux Journal, 2006(146):10, 2006.
  32. Revisiting recurrent neural networks for robust ASR. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4085–4088. IEEE, 2012.
  33. An overview of end-to-end automatic speech recognition. Symmetry, 11(8):1018, 2019.
  34. A comparison on data augmentation methods based on deep learning for audio classification. In Journal of Physics: Conference Series, volume 1453, page 012085. IOP Publishing, 2020.
  35. SUPERB: Speech Processing Universal PERformance Benchmark. In Proc. Interspeech 2021, pages 1194–1198, 2021. doi: 10.21437/Interspeech.2021-1775.
  36. Generalized data augmentation for low-resource translation. arXiv preprint arXiv:1906.03785, 2019.
  37. WeMix: How to better utilize data augmentation. arXiv preprint arXiv:2010.01267, 2020.
  38. SUPERB: Speech Processing Universal PERformance Benchmark. arXiv preprint arXiv:2105.01051, 2021.
  39. TorchAudio: Building blocks for audio and speech processing. arXiv preprint arXiv:2110.15018, 2021.
Authors (3)
  1. Mina Huh (10 papers)
  2. Ruchira Ray (7 papers)
  3. Corey Karnei (2 papers)
Citations (3)