Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information (2011.07442v5)

Published 15 Nov 2020 in cs.SD, cs.LG, and eess.AS

Abstract: Previous studies have confirmed that augmenting acoustic features with place/manner-of-articulation features can guide the speech enhancement (SE) process to consider the broad phonetic properties of the input speech, thereby improving performance. In this paper, we explore the contextual information of articulatory attributes as additional information to further benefit SE. More specifically, we propose to improve SE performance by leveraging losses from an end-to-end automatic speech recognition (E2E-ASR) model that predicts the sequence of broad phonetic classes (BPCs). We also develop multi-objective training with ASR and perceptual losses to train the SE system based on a BPC-based E2E-ASR. Experimental results on speech denoising, speech dereverberation, and impaired-speech enhancement tasks confirm that contextual BPC information improves SE performance. Moreover, the SE model trained with the BPC-based E2E-ASR outperforms the one trained with a phoneme-based E2E-ASR. The results suggest that phoneme misclassifications by the ASR system may lead to imperfect feedback, and that BPCs could be a better target choice. Finally, we note that merging the most confusable phonetic targets into the same BPC when computing the additional objective can effectively improve SE performance.
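The multi-objective training described in the abstract combines a signal-level SE loss with an ASR loss computed by an E2E-ASR model that predicts BPC sequences. Below is a minimal PyTorch sketch of that idea; the module names, the phoneme-to-BPC grouping, and the loss weights are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of BPC-guided multi-objective SE training (PyTorch).
# Assumptions (not from the paper's code): the SE model maps noisy features
# to enhanced ones, and a *frozen* E2E-ASR model emits per-frame
# log-probabilities over BPC labels, scored with a CTC objective.
import torch
import torch.nn as nn

# Hypothetical coarse grouping: confusable phonemes merged into one BPC.
PHONEME_TO_BPC = {
    "b": "stop", "d": "stop", "g": "stop", "p": "stop", "t": "stop", "k": "stop",
    "s": "fricative", "z": "fricative", "f": "fricative", "v": "fricative",
    "m": "nasal", "n": "nasal", "ng": "nasal",
    "aa": "vowel", "iy": "vowel", "uw": "vowel", "eh": "vowel",
}

class MultiObjectiveSELoss(nn.Module):
    """SE reconstruction loss + BPC-based ASR (CTC) loss on enhanced speech."""

    def __init__(self, bpc_asr: nn.Module, alpha: float = 1.0, beta: float = 0.1):
        super().__init__()
        self.bpc_asr = bpc_asr.eval()      # frozen BPC-predicting E2E-ASR
        self.bpc_asr.requires_grad_(False)
        self.alpha, self.beta = alpha, beta  # illustrative loss weights
        self.recon = nn.L1Loss()
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, enhanced, clean, bpc_targets, input_lens, target_lens):
        # Signal-level loss between enhanced and clean features.
        l_se = self.recon(enhanced, clean)
        # The ASR runs on the *enhanced* signal, so its gradient flows back
        # into the SE model only (the ASR weights stay fixed).
        log_probs = self.bpc_asr(enhanced)   # (T, B, n_bpc) log-probabilities
        l_asr = self.ctc(log_probs, bpc_targets, input_lens, target_lens)
        return self.alpha * l_se + self.beta * l_asr
```

A perceptual loss term (e.g., a distance in the feature space of a pretrained audio model) could be added to the same weighted sum; the abstract indicates the authors jointly train with ASR and perceptual losses in this fashion.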
