Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting (1703.05390v3)

Published 15 Mar 2017 in cs.CL, cs.AI, and cs.LG

Abstract: Keyword spotting (KWS) constitutes a major component of human-technology interfaces. Maximizing the detection accuracy at a low false alarm (FA) rate, while minimizing the footprint size, latency and complexity are the goals for KWS. Towards achieving them, we study Convolutional Recurrent Neural Networks (CRNNs). Inspired by large-scale state-of-the-art speech recognition systems, we combine the strengths of convolutional layers and recurrent layers to exploit local structure and long-range context. We analyze the effect of architecture parameters, and propose training strategies to improve performance. With only ~230k parameters, our CRNN model yields acceptably low latency, and achieves 97.71% accuracy at 0.5 FA/hour for 5 dB signal-to-noise ratio.

Citations (178)

Summary

  • The paper demonstrates a CRNN model that achieves 97.71% accuracy at 0.5 FA/hr using only 230k parameters.
  • The integration of convolutional and recurrent layers efficiently captures both local and contextual audio features.
  • Hard negative mining with PCEN mel spectrograms significantly improves noise robustness and reduces false alarms.

An Evaluation of Convolutional Recurrent Neural Networks for Efficient Keyword Spotting

This paper presents an in-depth analysis and development of Convolutional Recurrent Neural Networks (CRNNs) tailored for small-footprint Keyword Spotting (KWS) systems. Keyword spotting is a critical function for human-technology interaction, requiring high accuracy with minimal computational resources. This paper successfully integrates convolutional layers and recurrent units within a compact neural architecture to optimize both detection performance and resource efficiency.

CRNN Architecture Optimization

The paper details the CRNN architecture, which combines convolutional layers that capture local spectral structure with recurrent layers that model long-range temporal context in the audio signal. The choice of training strategy also matters: the authors favor cross-entropy (CE) loss over connectionist temporal classification (CTC) loss, finding it better suited to small-footprint models. The CRNN analyzed in this paper contains approximately 230k parameters, compact enough for practical deployment on smartphones and smart-home devices.
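As a rough illustration of this layer composition, here is a minimal CRNN keyword classifier sketched in PyTorch. All sizes (filter counts, kernel shapes, GRU width, input feature dimensions) are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal CRNN keyword-spotting classifier (sketch).
# Layer sizes are illustrative assumptions, not the paper's exact config.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_filters=32, rnn_hidden=64, n_classes=2):
        super().__init__()
        # Convolution over (time, frequency) captures local spectral structure.
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(20, 5), stride=(8, 2))
        self.relu = nn.ReLU()
        # Recurrent layers model long-range temporal context.
        freq_out = (n_mels - 5) // 2 + 1
        self.rnn = nn.GRU(input_size=n_filters * freq_out,
                          hidden_size=rnn_hidden,
                          num_layers=2, batch_first=True)
        # Fully connected layer produces per-utterance class logits.
        self.fc = nn.Linear(rnn_hidden, n_classes)

    def forward(self, x):            # x: (batch, 1, time, n_mels)
        h = self.relu(self.conv(x))  # (batch, filters, time', freq')
        h = h.permute(0, 2, 1, 3)    # (batch, time', filters, freq')
        h = h.flatten(2)             # (batch, time', filters * freq')
        out, _ = self.rnn(h)
        return self.fc(out[:, -1])   # classify from the final time step
```

Training a classifier like this with CE loss requires only keyword/non-keyword labels per window, which avoids the sequence-alignment machinery that CTC brings and is part of why CE is attractive at this model scale.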

A key result is that the CRNN model achieves 97.71% accuracy at 0.5 false alarms per hour (FA/hr) at a signal-to-noise ratio (SNR) of 5 dB. This is a notable improvement over previous CNN-based KWS models under similar conditions and highlights how well CRNNs balance accuracy against model size. The parameter selection process, delineated in Table 1, compared various configurations and confirmed that increasing the number of convolution filters and recurrent hidden units is more beneficial than merely adding recurrent layers.

Training Methodology

The training data comprised the keyword "TalkType," with samples collected from over 5,000 speakers and augmented with representative background noise. Per-channel energy normalized (PCEN) mel spectrograms proved superior as input features when the model size is limited. The paper emphasizes the importance of a well-aligned training dataset, obtained through heuristic alignment algorithms and further improved by hard negative mining, a technique that reduces false alarms by curating challenging negative samples.
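A PCEN front end of this kind can be computed with standard tools. The sketch below uses librosa; the filename is hypothetical and the frame parameters (25 ms windows, 10 ms hop, 40 mel bands) are common illustrative choices, not necessarily the paper's settings.

```python
# PCEN mel-spectrogram front end (sketch); parameter values are
# illustrative, not the paper's settings.
import librosa

y, sr = librosa.load("talktype_sample.wav", sr=16000)  # hypothetical file

# Magnitude (power=1) mel spectrogram, as recommended for PCEN input.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                   hop_length=160, n_mels=40, power=1.0)

# Per-channel energy normalization: an AGC-style smoother plus root
# compression, which suppresses stationary background noise.
features = librosa.pcen(S, sr=sr, hop_length=160)
print(features.shape)  # (n_mels, n_frames)
```

The automatic gain control behavior of PCEN is what makes it attractive over log-mel features in noisy conditions, since stationary background energy is normalized away per channel before it reaches the network.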

Interestingly, while adding more positive samples does not significantly change performance, a limit the authors attribute to the model's constrained capacity, the paper reports substantial gains from diversifying negative samples through hard mining.
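Hard negative mining of the kind described reduces to a simple loop: run the current model over keyword-free audio and feed its most confident false alarms back into the negative training set. The sketch below assumes hypothetical helpers (`model.score` returning a keyword probability, and a supply of keyword-free windows); it is not the paper's code.

```python
# Hard negative mining loop (sketch). `model.score` and the data
# pipeline are hypothetical placeholders, not the paper's code.
def mine_hard_negatives(model, negative_windows, threshold=0.5, top_k=1000):
    """Collect keyword-free windows that the current model (wrongly)
    scores as the keyword; these become new training negatives."""
    scored = []
    for window in negative_windows:       # audio known to contain no keyword
        p = model.score(window)           # hypothetical: P(keyword | window)
        if p > threshold:                 # a confident false alarm
            scored.append((p, window))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [w for _, w in scored[:top_k]] # hardest negatives first
```

Retraining on the union of the original negatives and these mined windows concentrates model capacity on exactly the confusable inputs that drive the false alarm rate.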

Implications and Future Directions

The CRNN approach demonstrates robust noise handling, particularly at lower SNRs. Far-field performance is initially diminished, but the degradation can be mitigated by augmenting training to account for impulse response variability. This flexibility suggests that CRNN models can adapt to diverse real-world applications beyond mobile devices, such as smart-home systems where far-field conditions prevail.
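One common way to realize this kind of impulse-response augmentation is to convolve clean training utterances with recorded room impulse responses (RIRs). A minimal sketch with scipy and soundfile follows; the file paths are hypothetical, and any measured RIR set would serve.

```python
# Far-field augmentation via room-impulse-response convolution (sketch).
# File paths are hypothetical; any recorded RIR collection would do.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

clean, sr = sf.read("clean_keyword.wav")    # near-field training sample
rir, sr_rir = sf.read("room_impulse.wav")   # measured room impulse response
assert sr == sr_rir, "sample rates must match"

# Convolving with the RIR simulates reverberation / far-field capture.
farfield = fftconvolve(clean, rir)[: len(clean)]

# Renormalize so the augmented sample matches the original energy.
farfield *= np.sqrt(np.sum(clean**2) / (np.sum(farfield**2) + 1e-12))
sf.write("farfield_keyword.wav", farfield, sr)
```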

The findings point to further gains from extending the model's learning capacity, either through architectural improvements or through richer augmentation. Given the adaptability the CRNN demonstrates in the paper, this direction could approach near-human-level performance.

Researchers may also explore extending CRNNs beyond KWS to broader audio signal processing tasks, given the architecture's effective handling of both temporal and frequency-domain dependencies.

This paper offers a pragmatic foundation for further innovations in efficient keyword spotting architectures, balancing the trade-offs between performance, complexity, and model size. The results, methodology, and discussions provide a scaffold for future inquiry into optimizing CRNN architectures for enriched human-technology interfaces.