Boosting keyword spotting through on-device learnable user speech characteristics (2403.07802v1)
Abstract: Keyword spotting systems for always-on, TinyML-constrained applications require on-site tuning to boost the accuracy of offline-trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, which are often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, which are unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture composed of a pretrained backbone and a user-aware embedding that learns the user's speech characteristics. The resulting features are fused and used to classify the input utterance. For domain shifts caused by unseen speakers, we measure relative error rate reductions of up to 19%, from 30.1% to 24.3%, on the 35-class problem of the Google Speech Commands dataset, achieved through the inexpensive update of the user projections. We also demonstrate the few-shot learning capabilities of the proposed architecture in sample- and class-scarce learning conditions. With 23.7k parameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.
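The idea described in the abstract, a frozen pretrained backbone whose features are fused with a small learnable user embedding, can be illustrated with a toy sketch. This is not the authors' implementation. The layer sizes, the single-layer `backbone`, fusion by concatenation, the frozen linear classifier, and the names `predict` and `adapt_user_embedding` are all assumptions made for illustration, and the data is random. The only point shown is that gradients are computed and applied for the user embedding alone, while every other parameter stays fixed.

```python
import math
import random

random.seed(0)

# Toy sizes chosen for illustration only; they are not taken from the paper.
N_IN, N_BACKBONE, N_USER, N_CLASSES = 8, 6, 3, 4


def rand_matrix(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


# Frozen parameters: hypothetical stand-ins for a pretrained backbone and classifier.
W_BACKBONE = rand_matrix(N_BACKBONE, N_IN)
W_CLASSIFIER = rand_matrix(N_CLASSES, N_BACKBONE + N_USER)


def backbone(x):
    """Frozen feature extractor (assumed here to be a single tanh layer)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_BACKBONE]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(x, user_emb):
    """Fuse backbone features with the user embedding (concatenation is an assumption)."""
    fused = backbone(x) + user_emb  # list concatenation
    logits = [sum(w * f for w, f in zip(row, fused)) for row in W_CLASSIFIER]
    return softmax(logits)


def mean_loss(samples, user_emb):
    """Mean cross-entropy over the labelled samples."""
    return sum(-math.log(predict(x, user_emb)[y]) for x, y in samples) / len(samples)


def adapt_user_embedding(samples, user_emb, lr=0.1, epochs=20):
    """Full-batch gradient descent that updates the user embedding only."""
    emb = list(user_emb)
    for _ in range(epochs):
        grad = [0.0] * N_USER
        for x, y in samples:
            p = predict(x, emb)
            for c in range(N_CLASSES):
                err = p[c] - (1.0 if c == y else 0.0)  # d(loss)/d(logit_c)
                for j in range(N_USER):
                    grad[j] += err * W_CLASSIFIER[c][N_BACKBONE + j]
        emb = [e - lr * g / len(samples) for e, g in zip(emb, grad)]
    return emb


# A few labelled utterances from one hypothetical target user (random toy data).
samples = [([random.uniform(-1, 1) for _ in range(N_IN)], random.randrange(N_CLASSES))
           for _ in range(5)]
initial_emb = [0.0] * N_USER
adapted_emb = adapt_user_embedding(samples, initial_emb)
loss_before = mean_loss(samples, initial_emb)
loss_after = mean_loss(samples, adapted_emb)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In this sketch the per-user trainable state is only `N_USER` values, so the backward pass never touches the backbone. That is the property the abstract contrasts with backbone update schemes.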