
Boosting keyword spotting through on-device learnable user speech characteristics (2403.07802v1)

Published 12 Mar 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding learning the user's speech characteristics. The so-generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19% from 30.1% to 24.3% based on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 kparameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.


Summary

  • The paper introduces a novel architecture that fuses a frozen backbone with a learnable embedding layer to personalize keyword spotting.
  • It demonstrates up to a 19% reduction in error rates with as few as four labeled samples per class for efficient adaptation.
  • The system achieves practical deployment on edge devices with under 4KB memory usage and minimal energy overhead per learning epoch.

Enhancing Keyword Spotting with On-Device Learnable User Speech Characteristics

Introduction

Keyword spotting (KWS) systems play a crucial role in enabling voice-based interaction with technology. Their effectiveness increases significantly when they can adapt to the individual speech characteristics of their users, such as accent, pitch, or speech impairments. However, the traditional train-once, deploy-everywhere approach offers no such personalization, particularly on always-on, battery-powered devices constrained in computation, memory, and energy. To address this challenge, Cristian Cioflan, Lukas Cavigelli, and Luca Benini introduce a novel on-device learning architecture that tunes keyword spotting systems to their target users without significant resource overhead.

User-Aware Keyword Spotting Architecture

At the heart of their proposal is a system that combines a lightweight, frozen backbone for generic keyword spotting with an adaptable, user-aware embedding layer. This layer learns the unique speech characteristics of the user and integrates them into the KWS process, improving accuracy for the target user without retraining the entire model. Specifically, the user's speech features are encoded into embeddings through a low-cost, on-device learnable mechanism; these embeddings are then fused with the backbone's features to classify the spoken keywords more accurately. This setup markedly reduces error rates even when only a few samples are available for adaptation, underscoring the efficacy of personalizing KWS systems through user embeddings.
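As a rough illustration of this design, the sketch below pairs a frozen feature extractor with a per-user embedding combined by elementwise (multiplicative) fusion before a frozen classifier head. The shapes, the random stand-in backbone, and the all-ones identity initialization are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 35   # keyword classes in Google Speech Commands
FEAT_DIM = 64    # assumed backbone feature width

# Frozen, pretrained backbone: stood in here by a fixed random projection
# from a flattened 49x10 MFCC-like input to a FEAT_DIM feature vector.
W_backbone = rng.standard_normal((49 * 10, FEAT_DIM)) * 0.05

# Learnable user embedding: the ONLY parameters updated on-device.
user_embedding = np.ones(FEAT_DIM)  # identity init: no user shift at first

# Frozen classifier head.
W_head = rng.standard_normal((FEAT_DIM, N_CLASSES)) * 0.05

def forward(mfcc, user_emb):
    """Fuse backbone features with the user embedding, then classify."""
    feats = np.maximum(mfcc.reshape(-1) @ W_backbone, 0.0)  # ReLU features
    fused = feats * user_emb                                # multiplicative fusion
    return fused @ W_head                                   # class logits

utterance = rng.standard_normal((49, 10))
logits = forward(utterance, user_embedding)
pred = int(np.argmax(logits))
```

Because the embedding starts as the identity under multiplication, the adapted model initially behaves exactly like the speaker-agnostic pretrained one, and personalization only perturbs it as user data arrives.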

Evaluation and Results

The architecture was evaluated on the Google Speech Commands dataset, demonstrating error-rate reductions of up to 19% for unseen speakers. The system also exhibits strong few-shot learning capabilities, requiring as few as four labeled samples per class for effective adaptation. The paper further compares several embedding fusion techniques, with multiplicative fusion yielding the best results across scenarios. Notably, the architecture fits TinyML constraints: the full model counts 23.7 k parameters, on-device training costs roughly 1 MFLOP per epoch, and adaptation needs less than 4 KB of memory, for an exceptionally low energy overhead per learning epoch.
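The inexpensive update of the user projections can be sketched as plain gradient descent on the embedding alone, with the backbone and classifier head frozen. The softmax cross-entropy loss, learning rate, and stand-in backbone features below are illustrative assumptions, not the authors' exact training recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT_DIM, N_CLASSES = 64, 35

W_head = rng.standard_normal((FEAT_DIM, N_CLASSES)) * 0.05  # frozen head
user_emb = np.ones(FEAT_DIM)    # start from the speaker-agnostic identity

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adapt_step(feats, label, emb, lr=0.01):
    """One gradient step on the user embedding only (backbone and head frozen)."""
    p = softmax((feats * emb) @ W_head)
    p[label] -= 1.0                    # d(loss)/d(logits) for cross-entropy
    grad_emb = (W_head @ p) * feats    # chain rule through multiplicative fusion
    return emb - lr * grad_emb

# Few-shot adaptation on a labeled utterance from the target speaker.
feats = np.abs(rng.standard_normal(FEAT_DIM))  # stand-in backbone features
label = 7
loss_before = -np.log(softmax((feats * user_emb) @ W_head)[label])
for _ in range(50):
    user_emb = adapt_step(feats, label, user_emb)
loss_after = -np.log(softmax((feats * user_emb) @ W_head)[label])
```

Since only the FEAT_DIM-sized embedding receives gradients, the per-step compute and the state that must persist across epochs both stay tiny, which is what makes the scheme plausible for battery-powered microcontrollers.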

Practical Implications and Future Perspectives

This research opens up new avenues for deploying highly efficient and personalized KWS systems on edge devices. By enabling on-device adaptation with minimal resource overheads, the proposed architecture makes it feasible to deploy voice-activated systems in a wider range of applications and devices, including those severely constrained by power and computational resources. Looking ahead, the ongoing refinement of such models could further improve their efficiency and adaptability, potentially incorporating more complex user characteristics or achieving more granular personalization.

Moreover, the methodology laid out for user-aware feature learning in KWS could inspire similar approaches in other domains of edge AI, pushing the boundaries of what's possible with on-device machine learning. As edge devices continue to proliferate, the demand for personalized and adaptive systems will only grow, making research in this field ever more relevant.

Acknowledgments

This work was partly supported by the Swiss National Science Foundation under grant No. 207913, TinyTrainer: On-chip Training for TinyML devices.

In conclusion, the introduction of on-device learnable user speech characteristics represents a significant step forward in making keyword spotting systems more user-friendly, accurate, and accessible. By efficiently leveraging the unique speech features of individual users, this research not only enhances user interaction with technology but also sets a precedent for future innovations in the field of personalized edge computing.