
Boosting keyword spotting through on-device learnable user speech characteristics (2403.07802v1)

Published 12 Mar 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding learning the user's speech characteristics. The so-generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19% from 30.1% to 24.3% based on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 kparameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.


Summary

  • The paper introduces a novel architecture that fuses a frozen backbone with a learnable embedding layer to personalize keyword spotting.
  • It demonstrates up to a 19% reduction in error rates with as few as four labeled samples per class for efficient adaptation.
  • The system achieves practical deployment on edge devices with under 4KB memory usage and minimal energy overhead per learning epoch.

Enhancing Keyword Spotting with On-Device Learnable User Speech Characteristics

Introduction

Keyword spotting (KWS) systems play a crucial role in enabling voice-based interaction with technology. Their effectiveness increases significantly when they can adapt to the individual speech characteristics of their users, such as accent, pitch, or speech impairments. However, the traditional train-once, deploy-everywhere approach offers no such personalization, particularly on always-on, battery-powered devices constrained in computation, memory, and energy. To address this challenge, Cristian Cioflan, Lukas Cavigelli, and Luca Benini introduce a novel on-device learning architecture that tunes keyword spotting systems to their target users without significant resource overhead.

User-Aware Keyword Spotting Architecture

At the heart of their proposal is a system that combines a lightweight, frozen backbone for generic keyword spotting with an adaptable, user-aware embedding layer. This layer learns the unique speech characteristics of the user and integrates them into the KWS process, improving accuracy for the target user without retraining the entire model. Specifically, the user's speech features are encoded into embeddings through a low-cost, on-device learnable mechanism; these embeddings are then fused with the backbone's features to classify the spoken keywords more accurately. This setup markedly reduces error rates even when only a few samples are available for adaptation, underscoring the efficacy of personalizing KWS systems through user embeddings.
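As a rough illustration of this design, the sketch below pairs a frozen feature extractor with a per-user embedding combined by elementwise (multiplicative) fusion before a frozen classifier head. The shapes, the random stand-in backbone, and the all-ones identity initialization are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 35   # keyword classes in Google Speech Commands
FEAT_DIM = 64    # assumed backbone feature width

# Frozen, pretrained backbone: stood in here by a fixed random projection
# from a flattened 49x10 MFCC-like input to a FEAT_DIM feature vector.
W_backbone = rng.standard_normal((49 * 10, FEAT_DIM)) * 0.05

# Learnable user embedding: the ONLY parameters updated on-device.
user_embedding = np.ones(FEAT_DIM)  # identity init: no user shift at first

# Frozen classifier head.
W_head = rng.standard_normal((FEAT_DIM, N_CLASSES)) * 0.05

def forward(mfcc, user_emb):
    """Fuse backbone features with the user embedding, then classify."""
    feats = np.maximum(mfcc.reshape(-1) @ W_backbone, 0.0)  # ReLU features
    fused = feats * user_emb                                # multiplicative fusion
    return fused @ W_head                                   # class logits

utterance = rng.standard_normal((49, 10))
logits = forward(utterance, user_embedding)
pred = int(np.argmax(logits))
```

Because the embedding starts as the identity under multiplication, the adapted model initially behaves exactly like the speaker-agnostic pretrained one, and personalization only perturbs it as user data arrives.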

Evaluation and Results

The architecture was evaluated on the Google Speech Commands dataset, demonstrating error-rate reductions of up to 19% for unseen speakers. The system also exhibits strong few-shot learning capabilities, requiring as few as four labeled samples per class for effective adaptation. The paper further compares several embedding fusion techniques, with multiplicative fusion yielding the best results across scenarios. Notably, the architecture fits TinyML constraints: the full model counts 23.7 k parameters, on-device training costs roughly 1 MFLOP per epoch, and adaptation needs less than 4 KB of memory, for an exceptionally low energy overhead per learning epoch.
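The inexpensive update of the user projections can be sketched as plain gradient descent on the embedding alone, with the backbone and classifier head frozen. The softmax cross-entropy loss, learning rate, and stand-in backbone features below are illustrative assumptions, not the authors' exact training recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT_DIM, N_CLASSES = 64, 35

W_head = rng.standard_normal((FEAT_DIM, N_CLASSES)) * 0.05  # frozen head
user_emb = np.ones(FEAT_DIM)    # start from the speaker-agnostic identity

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adapt_step(feats, label, emb, lr=0.01):
    """One gradient step on the user embedding only (backbone and head frozen)."""
    p = softmax((feats * emb) @ W_head)
    p[label] -= 1.0                    # d(loss)/d(logits) for cross-entropy
    grad_emb = (W_head @ p) * feats    # chain rule through multiplicative fusion
    return emb - lr * grad_emb

# Few-shot adaptation on a labeled utterance from the target speaker.
feats = np.abs(rng.standard_normal(FEAT_DIM))  # stand-in backbone features
label = 7
loss_before = -np.log(softmax((feats * user_emb) @ W_head)[label])
for _ in range(50):
    user_emb = adapt_step(feats, label, user_emb)
loss_after = -np.log(softmax((feats * user_emb) @ W_head)[label])
```

Since only the FEAT_DIM-sized embedding receives gradients, the per-step compute and the state that must persist across epochs both stay tiny, which is what makes the scheme plausible for battery-powered microcontrollers.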

Practical Implications and Future Perspectives

This research opens up new avenues for deploying highly efficient and personalized KWS systems on edge devices. By enabling on-device adaptation with minimal resource overheads, the proposed architecture makes it feasible to deploy voice-activated systems in a wider range of applications and devices, including those severely constrained by power and computational resources. Looking ahead, the ongoing refinement of such models could further improve their efficiency and adaptability, potentially incorporating more complex user characteristics or achieving more granular personalization.

Moreover, the methodology laid out for user-aware feature learning in KWS could inspire similar approaches in other domains of edge AI, pushing the boundaries of what's possible with on-device machine learning. As edge devices continue to proliferate, the demand for personalized and adaptive systems will only grow, making research in this field ever more relevant.

Acknowledgments

This work was partly supported by the Swiss National Science Foundation under grant No. 207913, TinyTrainer: On-chip Training for TinyML devices.

In conclusion, the introduction of on-device learnable user speech characteristics represents a significant step forward in making keyword spotting systems more user-friendly, accurate, and accessible. By efficiently leveraging the unique speech features of individual users, this research not only enhances user interaction with technology but also sets a precedent for future innovations in the field of personalized edge computing.