On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR (2104.01393v2)

Published 3 Apr 2021 in cs.CL and eess.AS

Abstract: We propose an on-the-fly data augmentation method for automatic speech recognition (ASR) that uses alignment information to generate effective training samples. Our method, called Aligned Data Augmentation (ADA) for ASR, replaces transcribed tokens and the corresponding speech representations in an aligned manner to generate previously unseen training pairs. The speech representations are sampled from an audio dictionary extracted from the training corpus and inject speaker variations into the training examples. The transcribed tokens are either predicted by a language model, so that the augmented pairs remain semantically close to the original data, or randomly sampled. Both strategies yield training pairs that improve robustness in ASR training. Our experiments on a sequence-to-sequence architecture show that ADA can be applied on top of SpecAugment, achieving about 9-23% and 4-15% relative WER improvements over SpecAugment alone on the LibriSpeech 100h and LibriSpeech 960h test sets, respectively.
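
The abstract describes the core mechanism: a transcript token and the audio frames aligned to it are swapped together, with replacement audio drawn from an audio dictionary and replacement tokens proposed either randomly or by a language model. The sketch below is a minimal illustration of that idea under stated assumptions; the names (`aligned_data_augmentation`, `audio_dict`, `candidate_fn`) and the feature shapes are hypothetical and do not reflect the authors' released implementation, and the token-to-frame alignments are assumed to come from an external forced aligner.

```python
import random
import numpy as np

def aligned_data_augmentation(features, tokens, alignments, audio_dict,
                              candidate_fn, replace_prob=0.2, rng=None):
    """Illustrative sketch of aligned data augmentation (not the paper's code).

    features   : np.ndarray (T, F)        framewise speech features
    tokens     : list[str]                transcript tokens
    alignments : list[(start, end)]       frame span per token from a forced aligner
    audio_dict : dict[str, list[ndarray]] token -> audio segments from the corpus
    candidate_fn : callable(tokens, i) -> list[str]  replacement proposals
                   (random sampling or a masked language model)
    """
    rng = rng or random.Random()
    new_tokens = list(tokens)
    segments = []
    for i, (tok, (start, end)) in enumerate(zip(tokens, alignments)):
        segment = features[start:end]
        if rng.random() < replace_prob:
            for cand in candidate_fn(tokens, i):
                if audio_dict.get(cand):
                    # Replace the token AND its aligned audio segment together,
                    # drawing the segment from another utterance in the dictionary
                    # so the new pair also injects speaker variation.
                    new_tokens[i] = cand
                    segment = rng.choice(audio_dict[cand])
                    break
        segments.append(segment)
    return np.concatenate(segments, axis=0), new_tokens

# Toy usage with random replacement candidates; the paper additionally uses a
# masked language model (e.g. RoBERTa) to propose semantically close tokens.
vocab = ["cat", "dog", "sat", "ran"]
audio_dict = {w: [np.random.randn(12, 80)] for w in vocab}
feats = np.random.randn(48, 80)
toks = ["the", "cat", "sat", "down"]
aligns = [(0, 12), (12, 24), (24, 36), (36, 48)]
new_feats, new_toks = aligned_data_augmentation(
    feats, toks, aligns, audio_dict,
    candidate_fn=lambda t, i: random.sample(vocab, k=2),
    replace_prob=0.5)
```

Because the augmentation operates on features and transcripts at load time, it can be applied on the fly per mini-batch and combined with feature-level methods such as SpecAugment, as the abstract notes.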

Authors (4)
  1. Tsz Kin Lam (13 papers)
  2. Mayumi Ohta (5 papers)
  3. Shigehiko Schamoni (10 papers)
  4. Stefan Riezler (44 papers)
Citations (26)
