Hello Edge: Keyword Spotting on Microcontrollers (1711.07128v3)

Published 20 Nov 2017 in cs.SD, cs.CL, cs.LG, cs.NE, and eess.AS

Abstract: Keyword spotting (KWS) is a critical component for enabling speech based user interactions on smart devices. It requires real-time response and high accuracy for good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms. Due to its always-on nature, KWS application has highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability. The design of neural network architecture for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers. We train various neural network architectures for keyword spotting published in literature to compare their accuracy and memory/compute requirements. We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than the DNN model with similar number of parameters.

Authors (4)

Yundong Zhang (7 papers)
Naveen Suda (13 papers)
Liangzhen Lai (21 papers)
Vikas Chandra (75 papers)

Citations (408)

Summary

The paper explores DS-CNN models that reach 95.4% accuracy for keyword spotting on resource-constrained microcontrollers.
It compares neural architectures like DNNs, CNNs, RNNs, and CRNNs based on memory usage, operation counts, and performance under specific hardware limits.
It validates deployment on an Arm Cortex-M7 using 8-bit quantization, enabling efficient real-time inference within just 70 KB of memory.

Analysis of "Hello Edge: Keyword Spotting on Microcontrollers"

Overview

The paper "Hello Edge: Keyword Spotting on Microcontrollers" addresses the challenges and possibilities of deploying keyword spotting (KWS) systems on microcontrollers, which are constrained by limited memory and computational power. Keyword spotting is crucial for speech-based interactions in consumer electronics, allowing devices to respond to specific command words efficiently, even while operating in an always-on state.

Neural Network Architectures and Constraints

The researchers compare various neural network architectures from existing literature, such as DNNs, CNNs, RNNs, and CRNNs. They focus on evaluating these models based on their accuracy, memory footprint, and computational demand when applied to keyword spotting on microcontroller hardware. The paper reveals that:

DNNs: Though they require fewer operations, they are memory-intensive and achieve lower accuracy.
CNNs: Provide higher accuracy but at the expense of increased memory and operation count.
RNNs (including LSTM and GRU): Offer a balance between accuracy and resource usage, leveraging temporal dependencies effectively.
CRNNs: Combine the benefits of CNNs and RNNs, achieving superior accuracy with moderate resource demand.
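
The trade-off between memory and operation count in the list above comes down to how often each weight is reused. The sketch below counts parameters and multiply-accumulates (MACs) for one fully connected layer and one convolutional layer. The layer sizes are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def dense_layer_cost(n_in, n_out):
    """Return (parameters, MACs) for a fully connected layer."""
    params = n_in * n_out + n_out  # weights plus biases
    macs = n_in * n_out            # each weight is used once per inference
    return params, macs


def conv2d_layer_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """Return (parameters, MACs) for a standard 2D convolution layer."""
    params = k_h * k_w * c_in * c_out + c_out          # kernels plus biases
    macs = k_h * k_w * c_in * c_out * out_h * out_w    # kernels reused at every output position
    return params, macs


# Hypothetical layer sizes, for illustration only.
dense_params, dense_macs = dense_layer_cost(250, 144)
conv_params, conv_macs = conv2d_layer_cost(3, 3, 32, 32, 10, 10)

print(f"dense: {dense_params} params, {dense_macs} MACs")
print(f"conv:  {conv_params} params, {conv_macs} MACs")
```

In this example the dense layer performs about one MAC per stored parameter, while the convolution performs about a hundred, which is why fully connected models tend to be bound by memory and convolutional models by compute.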

Introduction of DS-CNN

A significant contribution of the paper is the exploration of Depthwise Separable Convolutional Neural Networks (DS-CNNs), inspired by MobileNet. A depthwise separable convolution replaces a standard convolution with a per-channel (depthwise) spatial filter followed by a 1x1 (pointwise) convolution, which cuts both parameters and operations and so allows deeper networks within a microcontroller's limited resources. The DS-CNN reaches an accuracy of 95.4%, about 10% higher than a DNN with a similar number of parameters.
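
To make the saving from depthwise separable convolutions concrete, the following sketch compares the multiply-accumulate (MAC) count of a standard convolution with that of a depthwise convolution followed by a 1x1 pointwise convolution. The kernel size, channel counts, and output size are hypothetical values chosen for illustration, not the paper's configuration.

```python
def standard_conv_macs(k, c_in, c_out, out_h, out_w):
    """MACs for a standard k x k convolution."""
    return k * k * c_in * c_out * out_h * out_w


def ds_conv_macs(k, c_in, c_out, out_h, out_w):
    """MACs for a depthwise k x k convolution plus a 1x1 pointwise convolution."""
    depthwise = k * k * c_in * out_h * out_w   # one k x k filter per input channel
    pointwise = c_in * c_out * out_h * out_w   # 1x1 convolution mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 -> 64 channels, 25x5 output feature map.
std = standard_conv_macs(3, 64, 64, 25, 5)
ds = ds_conv_macs(3, 64, 64, 25, 5)

print(f"standard:  {std} MACs")
print(f"separable: {ds} MACs ({ds / std:.1%} of standard)")
```

The ratio works out to 1/c_out + 1/k^2, so for a 3x3 kernel with many output channels the separable form needs roughly one eighth to one ninth of the operations.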

Resource-Constrained Architecture Exploration

The paper explores network configurations under three hardware budgets typical of microcontroller systems and groups the resulting models by the budget they fit:

Small (S): Limit of 80 KB memory and 6 MOps.
Medium (M): Limit of 200 KB memory and 20 MOps.
Large (L): Limit of 500 KB memory and 80 MOps.

The DS-CNN models scale across all three budgets, offering the strongest accuracy for the memory and operations available in each class.
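
As a rough illustration of how the three budgets can be applied, the sketch below returns the smallest class whose memory and operation limits a candidate model satisfies. The limits are the ones listed above; the function name and the candidate figures in the usage lines are hypothetical.

```python
# (class name, memory limit in KB, operation limit in MOps), smallest first.
BUDGETS = [("S", 80, 6), ("M", 200, 20), ("L", 500, 80)]


def smallest_fitting_class(memory_kb, mops):
    """Return the smallest budget class the model fits, or None if it fits none."""
    for name, max_kb, max_mops in BUDGETS:
        if memory_kb <= max_kb and mops <= max_mops:
            return name
    return None


# Hypothetical candidate models, for illustration only.
print(smallest_fitting_class(70, 5))    # fits the small budget
print(smallest_fitting_class(150, 6))   # too much memory for S, fits M
print(smallest_fitting_class(600, 10))  # exceeds every memory limit
```

Both limits must hold at once, so a model with few operations but a large weight footprint is still pushed into a larger class.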

Quantization and Deployment

To further align with microcontroller capabilities, the authors quantize both weights and activations to 8 bits. Compared with 32-bit floating-point values, this cuts storage per value by a factor of four and permits fast integer arithmetic, with minimal loss in accuracy.
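
The summary does not spell out the quantization scheme, so the following is only a minimal sketch of one common approach: symmetric 8-bit fixed-point quantization with a power-of-two scale. The function names and the sample weights are hypothetical, and the choice of a power-of-two scale is an assumption rather than a confirmed detail of the authors' method.

```python
def choose_frac_bits(values):
    """Pick the largest number of fractional bits that keeps every value in int8 range."""
    max_abs = max(abs(v) for v in values)
    frac_bits = 7
    while round(max_abs * 2 ** frac_bits) > 127:
        frac_bits -= 1
    return frac_bits


def quantize(values, frac_bits):
    """Scale by 2**frac_bits, round, and clamp to the signed 8-bit range."""
    return [max(-128, min(127, round(v * 2 ** frac_bits))) for v in values]


def dequantize(q_values, frac_bits):
    """Map 8-bit integers back to approximate real values."""
    return [q * 2.0 ** -frac_bits for q in q_values]


# Hypothetical weights, for illustration only.
weights = [0.5, -0.25, 0.9, -1.3]
frac_bits = choose_frac_bits(weights)
q_weights = quantize(weights, frac_bits)
restored = dequantize(q_weights, frac_bits)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(frac_bits, q_weights, restored, max_error)
```

With a power-of-two scale, rescaling between layers reduces to bit shifts, which is cheap on a microcontroller; the rounding error per value is bounded by half of one quantization step.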

A practical implementation is demonstrated on an Arm Cortex-M7 microcontroller, where the entire keyword spotting application, including memory for weights, activations, and feature extraction, requires around 70 KB of memory, running efficiently at 10 inferences per second.
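
The 10 inferences per second figure fixes a time budget for each inference. Assuming the operation limits of the Small, Medium, and Large classes are counted per inference (an assumption; the summary does not state the unit explicitly), the sustained throughput each class demands can be sketched as:

```python
INFERENCES_PER_SECOND = 10

# Time available for one inference, in milliseconds.
period_ms = 1000 / INFERENCES_PER_SECOND

# Operation limits per class in MOps, assumed to be per inference.
OPS_BUDGET_MOPS = {"S": 6, "M": 20, "L": 80}

# Sustained throughput needed to keep up, in MOps per second.
sustained_mops_per_s = {
    name: mops * INFERENCES_PER_SECOND for name, mops in OPS_BUDGET_MOPS.items()
}

print(f"time budget per inference: {period_ms:.0f} ms")
for name, rate in sustained_mops_per_s.items():
    print(f"{name}: {rate} MOps/s")
```

Any time left within the 100 ms window after inference is available for feature extraction or for putting the core to sleep.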

Implications and Future Directions

The research delineates feasible strategies for deploying sophisticated neural network models on resource-constrained devices, underscoring the importance of architectural innovation in edge AI applications.

Future developments could involve:

Further optimization techniques for microcontrollers.
Exploration of hybrid models that integrate additional neural architectures.
Real-world deployment in diverse consumer electronics to gather more extensive usability data.

Conclusion

"Hello Edge: Keyword Spotting on Microcontrollers" delivers substantial insights into optimizing neural networks for keyword spotting in constrained environments, showcasing DS-CNNs’ potential. The outcomes emphasize the role of tailored network architectures in advancing efficient AI applications on microcontroller platforms, paving the way for more adaptive and pervasive smart devices.

GitHub

GitHub - ARM-software/ML-KWS-for-MCU: Keyword spotting on Arm Cortex-M Microcontrollers (1,185 stars)