D4AM: A General Denoising Framework for Downstream Acoustic Models (2311.16595v1)

Published 28 Nov 2023 in cs.SD, cs.LG, and eess.AS

Abstract: The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. In this study, we propose a general denoising framework, D4AM, for various downstream acoustic models. Our framework fine-tunes the SE model with the backward gradient of a specific acoustic model and its classification objective. In addition, our method treats the regression objective as an auxiliary loss so that the SE model generalizes to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients rather than performing a grid search that incurs additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. The experimental results show that D4AM consistently and effectively improves various unseen acoustic models and outperforms other combination setups. Specifically, when evaluated on the Google ASR API with real noisy data completely unseen during SE training, D4AM achieves a relative WER reduction of 24.65% compared with directly feeding noisy input. To our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems. Our code is available at https://github.com/ChangLee0903/D4AM.
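The gradient-calibration idea sketched in the abstract, resolving conflicts between the classification (ASR) gradient and the auxiliary regression (denoising) gradient before combining them, can be illustrated with a minimal sketch. This is not the paper's exact algorithm: it assumes an A-GEM-style projection for the calibration step, and the fixed coefficient `alpha` stands in for the weighting coefficient that D4AM estimates automatically; the function names are hypothetical.

```python
def calibrate_gradient(g_cls, g_reg):
    """If the classification gradient conflicts with the regression
    gradient (negative inner product), project out the conflicting
    component so the combined update does not increase the
    regression loss. Gradients are plain lists of floats."""
    dot = sum(a * b for a, b in zip(g_cls, g_reg))
    if dot >= 0:
        return list(g_cls)  # no conflict: leave the gradient untouched
    reg_sq = sum(b * b for b in g_reg)
    scale = dot / reg_sq
    return [a - scale * b for a, b in zip(g_cls, g_reg)]

def combined_update(g_cls, g_reg, alpha):
    """Sum the calibrated classification gradient with the weighted
    regression gradient; alpha is a stand-in for the coefficient
    that the adjustment scheme would estimate."""
    g_cal = calibrate_gradient(g_cls, g_reg)
    return [c + alpha * r for c, r in zip(g_cal, g_reg)]

# Example: the second component of g_cls opposes g_reg, so it is
# projected away before the two objectives are combined.
g = combined_update([1.0, -2.0], [0.0, 1.0], alpha=0.5)
```

After calibration, the classification gradient is guaranteed to have a non-negative inner product with the regression gradient, which is the property the paper's scheme uses to keep the auxiliary denoising objective from being undone by the ASR fine-tuning signal.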
