
AraSpot: Arabic Spoken Command Spotting (2303.16621v2)

Published 29 Mar 2023 in cs.CL and cs.AI

Abstract: Spoken keyword spotting (KWS) is the task of identifying a keyword in an audio stream and is widely used in smart devices at the edge to activate voice assistants and perform hands-free tasks. The task is challenging because such systems must achieve high accuracy while running efficiently on low-power devices with possibly limited computational capabilities. This work presents AraSpot, an Arabic keyword spotting system trained on 40 Arabic keywords, using different online data augmentation techniques and introducing the ConformerGRU model architecture. Finally, we further improve the model's performance by training a text-to-speech model for synthetic data generation. AraSpot achieved a state-of-the-art (SOTA) result of 99.59%, outperforming previous approaches.
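
As a rough illustration of what "online" data augmentation means in this setting, the sketch below perturbs a waveform afresh each time it is drawn, so the model rarely sees the same example twice. The abstract does not say which augmentations AraSpot uses; the random time shift and additive noise at a random signal-to-noise ratio shown here are assumed, commonly used choices, and the function names, shift range, and SNR range are all hypothetical.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a waveform given as a list of floats."""
    return sum(x * x for x in samples) / len(samples)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` and add it to `signal` so the mix has the requested SNR (dB)."""
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def random_time_shift(signal, max_shift, rng):
    """Shift the waveform by a random number of samples, zero-padding the gap."""
    shift = rng.randint(-max_shift, max_shift)
    n = len(signal)
    if shift > 0:
        return [0.0] * shift + signal[: n - shift]
    if shift < 0:
        return signal[-shift:] + [0.0] * (-shift)
    return list(signal)


def augment_online(signal, rng, max_shift=1600, snr_range_db=(5.0, 20.0)):
    """Apply a fresh random shift and noise level; call once per training draw.

    The defaults are illustrative assumptions (0.1 s shift at 16 kHz,
    5 to 20 dB SNR), not values taken from the paper.
    """
    shifted = random_time_shift(signal, max_shift, rng)
    noise = [rng.gauss(0.0, 1.0) for _ in shifted]
    snr_db = rng.uniform(*snr_range_db)
    return add_noise_at_snr(shifted, noise, snr_db)


if __name__ == "__main__":
    rng = random.Random(0)
    # One second of a 440 Hz tone at 16 kHz stands in for a recorded keyword.
    clip = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    first = augment_online(clip, rng)
    second = augment_online(clip, rng)
    print(len(first), first[:3] != second[:3])
```

Because the perturbation is sampled at draw time rather than precomputed, storage stays at the size of the original dataset while the effective variety of training examples grows with the number of epochs.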

