- The paper demonstrates that applying ResNet architectures to keyword spotting achieves a top accuracy of 95.8%, significantly outperforming earlier CNN models.
- It employs dilated convolutions across residual blocks to efficiently process one-second audio inputs while reducing model parameters and computational cost.
- The approach enables accurate on-device keyword spotting, addressing privacy concerns and fitting the constraints of performance-limited devices.
Deep Residual Learning for Small-Footprint Keyword Spotting
This paper explores deep residual learning and dilated convolutions for keyword spotting on the Google Speech Commands Dataset. The authors, Tang and Lin, leverage residual networks (ResNets) to outperform previous convolutional neural network (CNN) architectures in accuracy while maintaining a compact model footprint.
Core Contributions
The primary contribution of this paper is the application of ResNet architectures to keyword spotting, a departure from the plain CNN approaches used previously. The best ResNet model reported here reaches 95.8% accuracy, significantly surpassing Google's prior best CNN at 91.7%. This gain in accuracy is accompanied by substantial reductions in parameter count and multiply operations, which is crucial for deployment on performance-limited devices.
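To make the core idea concrete, here is a minimal PyTorch sketch of a residual block over 2D time-frequency features. The layer ordering and bias-free convolutions follow generic ResNet conventions and are assumptions here, not necessarily the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: two 3x3 convolutions plus an identity skip."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        # The identity skip is what lets much deeper stacks train without
        # the degradation seen in plain CNNs of the same depth.
        return torch.relu(x + y)
```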
Methodology
The researchers evaluated several ResNet variants, exploring both depth and width. The base model, res15, comprises six residual blocks and achieves the best accuracy with 238K parameters and 894M multiplies per inference. At the compact end, res8-narrow reaches competitive accuracy (90.1%) with only 19.9K parameters.
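For orientation, the depth and width knobs behind the variant names can be summarized as below. The block and parameter figures restate the numbers above; the feature-map widths (wide vs. narrow) and res8-narrow's block count are assumptions inferred from the paper's naming convention, not quoted values.

```python
# Hypothetical variant map; values beyond those quoted above are assumptions.
VARIANTS = {
    "res15":       {"residual_blocks": 6, "feature_maps": 45, "dilated": True,  "params": "238K"},
    "res8-narrow": {"residual_blocks": 3, "feature_maps": 19, "dilated": False, "params": "19.9K"},
}
```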
Dilated convolutions were employed to widen receptive fields, letting the model cover an entire one-second input with fewer layers. As sketched below, this extends the network's view of the input without a corresponding increase in depth or parameter count.
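The receptive-field arithmetic is easy to verify. In the sketch below, stacking 3x3 convolutions whose dilation doubles at each layer grows the receptive field exponentially with depth, which is how a shallow stack can span a full one-second input. The dilation schedule and the 101x40 MFCC input shape are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

# Four 3x3 conv layers with dilations 1, 2, 4, 8; padding=dilation keeps
# the time-frequency shape constant.
stack = nn.Sequential(*[
    nn.Conv2d(1, 1, kernel_size=3, padding=d, dilation=d, bias=False)
    for d in (1, 2, 4, 8)
])

x = torch.randn(1, 1, 101, 40)   # (batch, channel, time frames, MFCC bands)
print(stack(x).shape)            # torch.Size([1, 1, 101, 40])

# Receptive field after n such layers:
#   r = 1 + 2 * (1 + 2 + ... + 2^(n-1)) = 2^(n+1) - 1
print(1 + 2 * sum(2 ** k for k in range(4)))  # 31 frames, vs. 9 without dilation
```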
Evaluation
The paper evaluates the models on the Google Speech Commands Dataset, a widely used benchmark, reporting accuracy and receiver operating characteristic (ROC) curves that trade false alarms against false rejections. Both show significant improvements over prior models; notably, the res8 variant outperforms Google's tpool2 model while using a more compact footprint.
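As a sketch of that evaluation protocol, the snippet below computes accuracy and an ROC curve from keyword posteriors. The data are synthetic stand-ins, and the false-reject rate the paper plots against false alarms is obtained as 1 minus the true-positive rate.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic stand-ins: `labels` marks keyword presence, `scores` plays the
# role of model posteriors for the keyword class.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(labels * 0.6 + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)

accuracy = ((scores > 0.5) == labels).mean()
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr  # false-reject rate, plotted against false-alarm rate (fpr)
print(f"accuracy = {accuracy:.3f}")
```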
Implications
The findings matter for the development of speech-based interfaces, especially in privacy-sensitive applications. Accurate on-device keyword spotting removes the need to stream audio to the cloud, addressing the privacy issues inherent in transferring audio data.
Future Directions
The paper suggests exploring recurrent neural network architectures for keyword spotting and comparing their efficacy against the presented ResNet models. Since public implementations and benchmarks for such architectures are lacking, this work opens avenues for that comparison.
Conclusion
This research underscores the potential of residual networks for compact, efficient keyword spotting systems. By releasing open-source models and benchmarks, it supports continued progress on low-power devices and extends the practical reach of deep learning in speech recognition tasks.