AraSpot: Arabic Spoken Command Spotting (2303.16621v2)
Abstract: Spoken keyword spotting (KWS) is the task of identifying a keyword in an audio stream. It is widely used in smart devices at the edge to activate voice assistants and perform hands-free tasks. The task is challenging because such systems must achieve high accuracy while still running efficiently on low-power devices with possibly limited computational capabilities. This work presents AraSpot, an Arabic keyword spotting system trained on 40 Arabic keywords using several online data augmentation techniques and a newly introduced ConformerGRU model architecture. We further improve the model's performance by training a text-to-speech model to generate synthetic data. AraSpot achieves a state-of-the-art (SOTA) result of 99.59%, outperforming previous approaches.
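As an illustration of what "online" data augmentation means in this setting, the following minimal Python sketch perturbs a waveform afresh each time it is drawn for training, rather than storing augmented copies on disk. The abstract does not list the specific transforms used, so the two shown here (a random time shift and additive white noise at a random signal-to-noise ratio) are assumptions chosen for illustration. The function name `augment_online` and its parameters are hypothetical.

```python
import numpy as np


def augment_online(wave, rng, max_shift=1600, snr_db_range=(5.0, 20.0)):
    """Return a randomly perturbed copy of a 1-D waveform.

    Hypothetical helper: the transforms below are illustrative assumptions,
    not the augmentations reported for AraSpot.
    """
    wave = np.asarray(wave, dtype=np.float32)

    # Random time shift: move the signal left or right, padding with zeros.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(wave)
    if shift > 0:
        shifted[shift:] = wave[:-shift]
    elif shift < 0:
        shifted[:shift] = wave[-shift:]
    else:
        shifted[:] = wave

    # Additive white Gaussian noise at an SNR drawn uniformly from the range.
    snr_db = rng.uniform(*snr_db_range)
    signal_power = float(np.mean(shifted ** 2)) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return shifted + noise.astype(np.float32)


# Usage: the same clip yields a different training example on every call.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0                      # 1 second at 16 kHz (assumed rate)
clip = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
first = augment_online(clip, rng)
second = augment_online(clip, rng)
```

Because the perturbation is sampled at load time, the model rarely sees exactly the same input twice, which is the practical difference from offline augmentation.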