
Speech Robust Bench: A Robustness Benchmark For Speech Recognition (2403.07937v3)

Published 8 Mar 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 114 input perturbations that simulate a heterogeneous range of corruptions that ASR models may encounter when deployed in the wild. We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices, such as the use of discrete representations or self-training, appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and males and females. Our results reveal noticeable disparities in the models' robustness across subgroups. We believe that SRB will significantly facilitate future research towards robust ASR models by making it easier to conduct comprehensive and comparable robustness evaluations.
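
To make the idea of a robustness evaluation concrete, the following is a minimal sketch, not SRB's actual code. It assumes a generic additive Gaussian noise perturbation at a chosen signal-to-noise ratio and uses word error rate (WER) as the score; the function names `word_error_rate` and `add_gaussian_noise` are hypothetical, and the abstract does not say which perturbations or metrics SRB uses. A robustness check would transcribe the clean and the perturbed audio with the same ASR model and compare the two WER values.

```python
import math
import random


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db` dB SNR.

    Assumption for illustration: `signal` is a non-empty list of float samples
    with non-zero power.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# One second of a 440 Hz tone at 16 kHz stands in for a speech waveform.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_gaussian_noise(clean, snr_db=10)

# With a real ASR model, `clean` and `noisy` would each be transcribed and
# scored against the same reference transcript; the strings below are made up.
wer_clean = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
wer_noisy = word_error_rate("the cat sat on the mat", "the cat on mat")
degradation = wer_noisy - wer_clean
```

The gap between the two WER values is one simple way to summarize how much a perturbation hurts a model; averaging such gaps over many perturbations and severities would give a benchmark-style score.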
