Broadcasted Residual Learning for Efficient Keyword Spotting (2106.04140v4)
Abstract: Keyword spotting is an important research field because it plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to minimize errors while operating efficiently on resource-constrained devices such as mobile phones. We present a broadcasted residual learning method that achieves high accuracy with a small model size and low computational load. Our method configures most of the residual functions as 1D temporal convolutions while still allowing 2D convolutions, using a broadcasted residual connection that expands the temporal output to the frequency-temporal dimension. This residual mapping enables the network to represent useful audio features effectively with much less computation than conventional convolutional neural networks. We also propose a novel network architecture, the broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describe how to scale the model up according to the target device's resources. BC-ResNets achieve state-of-the-art top-1 accuracy of 98.0% and 98.7% on Google Speech Commands datasets v1 and v2, respectively, and consistently outperform previous approaches while using fewer computations and parameters. Code is available at https://github.com/Qualcomm-AI-research/bcresnet.
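The sketch below illustrates the broadcasted residual idea from the abstract: a 2D feature map is averaged over frequency, a 1D temporal residual function is applied, and the result is broadcast back over the frequency axis before the residual addition. This is a minimal PyTorch approximation, not the authors' implementation; the layer sizes, the frequency-wise depthwise convolution, the normalization layers, and the `BroadcastedResidualBlock` name are all illustrative assumptions. The official code at https://github.com/Qualcomm-AI-research/bcresnet is the authoritative reference.

```python
import torch
import torch.nn as nn


class BroadcastedResidualBlock(nn.Module):
    """Illustrative block: 2D frequency feature, frequency-averaged 1D
    temporal residual function, and broadcast back to 2D. Details are
    assumptions, not the paper's exact BC-ResNet block."""

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise 2D convolution over the frequency axis (assumed form).
        self.freq_dw = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(1, 0), groups=channels)
        self.bn2d = nn.BatchNorm2d(channels)
        # 1D temporal residual function applied after frequency pooling.
        self.temporal = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1,
                      groups=channels),                 # depthwise temporal conv
            nn.BatchNorm1d(channels),
            nn.Conv1d(channels, channels, kernel_size=1),  # pointwise conv
            nn.SiLU(),                                   # Swish activation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frequency, time)
        y = self.bn2d(self.freq_dw(x))   # 2D frequency-temporal feature
        z = y.mean(dim=2)                # average over frequency -> (B, C, T)
        z = self.temporal(z)             # 1D temporal residual function
        # Broadcast the temporal output over the frequency dimension:
        # (B, C, T) -> (B, C, 1, T) expands against x and y on addition.
        return x + y + z.unsqueeze(2)


if __name__ == "__main__":
    # Quick shape check on a log-Mel-like input (40 bins, 101 frames).
    block = BroadcastedResidualBlock(channels=16)
    out = block(torch.randn(2, 16, 40, 101))
    print(out.shape)  # torch.Size([2, 16, 40, 101])
```

Note the efficiency argument: the temporal convolutions in `self.temporal` run on a tensor whose frequency axis has been collapsed, so their cost does not scale with the number of frequency bins, while the broadcast addition restores the full 2D feature map at negligible cost.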