Broadcasted Residual Learning for Efficient Keyword Spotting (2106.04140v4)

Published 8 Jun 2021 in cs.SD, cs.LG, and eess.AS

Abstract: Keyword spotting is an important research field because it plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to minimize errors while operating efficiently on devices with limited resources such as mobile phones. We present a broadcasted residual learning method to achieve high accuracy with small model size and computational load. Our method configures most of the residual functions as 1D temporal convolutions while still allowing 2D convolutions, via a broadcasted residual connection that expands the temporal output to the frequency-temporal dimension. This residual mapping enables the network to effectively represent useful audio features with much less computation than conventional convolutional neural networks. We also propose a novel network architecture, the broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describe how to scale up the model according to the target device's resources. BC-ResNets achieve state-of-the-art 98.0% and 98.7% top-1 accuracy on Google Speech Commands datasets v1 and v2, respectively, and consistently outperform previous approaches using fewer computations and parameters. Code is available at https://github.com/Qualcomm-AI-research/bcresnet.

References (29)
  1. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  2. M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in CVPR. IEEE Computer Society, 2018, pp. 4510–4520.
  3. X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in CVPR. IEEE Computer Society, 2018, pp. 6848–6856.
  4. M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in ICML, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 6105–6114.
  5. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR. IEEE Computer Society, 2016, pp. 770–778.
  6. F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in CVPR. IEEE Computer Society, 2017, pp. 1800–1807.
  7. S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha, “Temporal convolution for real-time keyword spotting on mobile devices,” in INTERSPEECH. ISCA, 2019, pp. 3372–3376.
  8. X. Li, X. Wei, and X. Qin, “Small-footprint keyword spotting with multi-scale temporal convolution,” in INTERSPEECH. ISCA, 2020, pp. 1987–1991.
  9. S. Majumdar and B. Ginsburg, “MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition,” in INTERSPEECH. ISCA, 2020, pp. 3356–3360.
  10. R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in ICASSP. IEEE, 2018, pp. 5484–5488.
  11. M. Xu and X. Zhang, “Depthwise separable convolutional ResNet with squeeze-and-excitation blocks for small-footprint keyword spotting,” in INTERSPEECH. ISCA, 2020, pp. 2547–2551.
  12. P. Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
  13. S. Chang, H. Park, J. Cho, H. Park, S. Yun, and K. Hwang, “Subspectral normalization for neural audio data processing,” in ICASSP. IEEE, 2021, pp. 850–854.
  14. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 37. JMLR.org, 2015, pp. 448–456.
  15. P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” in ICLR (Workshop). OpenReview.net, 2018.
  16. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in CVPR. IEEE Computer Society, 2018, pp. 8697–8710.
  17. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in AAAI. AAAI Press, 2019, pp. 4780–4789.
  18. D. C. de Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, “A neural attention model for speech command recognition,” CoRR, vol. abs/1808.08929, 2018.
  19. M. Lee, J. Lee, H. J. Jang, B. Kim, W. Chang, and K. Hwang, “Orthogonality constrained multi-head attention for keyword spotting,” in ASRU. IEEE, 2019, pp. 86–92.
  20. O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo, “Streaming keyword spotting on mobile devices,” in INTERSPEECH. ISCA, 2020, pp. 2277–2281.
  21. Y. Zhuang, X. Chang, Y. Qian, and K. Yu, “Unrestricted vocabulary keyword spotting using LSTM-CTC,” in INTERSPEECH. ISCA, 2016, pp. 938–942.
  22. B. Kim, M. Lee, J. Lee, Y. Kim, and K. Hwang, “Query-by-example on-device keyword spotting,” in ASRU. IEEE, 2019, pp. 532–538.
  23. D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in INTERSPEECH. ISCA, 2019, pp. 2613–2617.
  24. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
  25. I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in ICLR (Poster). OpenReview.net, 2017.
  26. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR. IEEE Computer Society, 2018, pp. 7132–7141.
  27. S. Lee, S. Chang, and N. Kwak, “URNet: User-resizable residual networks with conditional gating module,” in AAAI. AAAI Press, 2020, pp. 4569–4576.
  28. S. Chang, J. Yang, S. Park, and N. Kwak, “Broadcasting convolutional network for visual relational reasoning,” in ECCV, 2018, pp. 754–769.
  29. J. Lin, K. Kilgour, D. Roblek, and M. Sharifi, “Training keyword spotters with limited and synthesized speech data,” in ICASSP. IEEE, 2020, pp. 7474–7478.
Summary

The paper "Broadcasted Residual Learning for Efficient Keyword Spotting" introduces a novel method called broadcasted residual learning, designed to enhance the performance of convolutional neural network (CNN)-based approaches in keyword spotting (KWS). The primary objective is to achieve high accuracy with a minimal model size and computational overhead, which is crucial for applications deployed on resource-constrained devices such as mobile phones.

Broadcasted Residual Learning

This method combines the complementary strengths of 1D temporal and 2D frequency-temporal convolutions for KWS. In conventional CNN approaches, using either type of convolution alone forces a trade-off: 2D convolutions capture frequency-domain structure but are computationally expensive, while 1D temporal convolutions are efficient but frequency-blind. Broadcasted residual learning integrates both within a single residual function: a frequency-wise 2D convolution is applied first, its output is averaged over the frequency axis to obtain 1D temporal features, and after lightweight 1D processing those features are broadcast back across the frequency dimension in the residual mapping. The model thereby gains the frequency awareness of 2D convolutions while retaining most of the computational efficiency of 1D convolutions.
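
To make the data flow concrete, here is a minimal PyTorch sketch of the broadcasted residual computation. It is a simplification under stated assumptions: it keeps only the frequency-wise 2D depthwise convolution, the frequency averaging, the 1D temporal convolutions, and the broadcast addition, and omits the normalization, activation, and dropout of the full BC-ResBlock. Module and variable names are illustrative, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class BroadcastedResidual(nn.Module):
    """Simplified sketch of a broadcasted residual block (the full
    BC-ResBlock's normalization, activation, and dropout are omitted)."""

    def __init__(self, channels: int):
        super().__init__()
        # f2: frequency-wise 2D depthwise convolution (3x1 kernel over frequency)
        self.freq_dw = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(1, 0), groups=channels)
        # f1: 1D temporal depthwise + pointwise convolution
        self.temp_dw = nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frequency, time)
        f2 = self.freq_dw(x)                       # frequency-aware 2D features
        pooled = f2.mean(dim=2)                    # average over frequency -> (B, C, T)
        f1 = self.pointwise(self.temp_dw(pooled))  # cheap 1D temporal features
        # Broadcast the 1D output back over the frequency axis, add residuals
        return x + f2 + f1.unsqueeze(2)            # (B, C, 1, T) expands to (B, C, F, T)

# Shape check on a dummy log-Mel batch: 2 examples, 16 channels, 40 bins, 100 frames
out = BroadcastedResidual(16)(torch.randn(2, 16, 40, 100))
print(out.shape)  # torch.Size([2, 16, 40, 100])
```

Note how the expensive 2D convolution appears only once per block, while the per-frequency work after pooling collapses to 1D; this is the source of the computational savings the paper reports.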

The BC-ResNet Architecture

Building on broadcasted residual learning, the authors propose a new network architecture, the broadcasting-residual network (BC-ResNet), which scales efficiently to device constraints. BC-ResNets are organized into stages of BC-ResBlocks, which combine depthwise separable convolutions with subspectral normalization to preserve frequency-aware learning. The architecture leverages the computational advantages of 1D operations while deploying 2D convolutions only where they are most beneficial.
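
Subspectral normalization (SSN, reference 13) replaces batch normalization by splitting the frequency axis into sub-bands and normalizing each sub-band with its own statistics, which is what keeps the 2D path frequency-aware. The sketch below is an illustrative implementation under the assumption that the frequency dimension divides evenly into the chosen number of sub-bands; it is not the authors' code.

```python
import torch
import torch.nn as nn

class SubSpectralNorm(nn.Module):
    """Sketch of subspectral normalization: split the frequency axis into
    sub-bands and batch-normalize each sub-band with its own statistics."""

    def __init__(self, channels: int, sub_bands: int):
        super().__init__()
        self.sub_bands = sub_bands
        # One set of batch-norm statistics per (channel, sub-band) pair
        self.bn = nn.BatchNorm2d(channels * sub_bands)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape  # frequency dim must be divisible by sub_bands
        x = x.view(b, c * self.sub_bands, f // self.sub_bands, t)
        x = self.bn(x)
        return x.view(b, c, f, t)

# Example: normalize a (2, 16, 40, 100) feature map with 5 frequency sub-bands
y = SubSpectralNorm(16, 5)(torch.randn(2, 16, 40, 100))
print(y.shape)  # torch.Size([2, 16, 40, 100])
```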

As a scalable framework, BC-ResNet spans small networks with roughly 10k parameters up to much larger models, covering diverse computational budgets. Scaling is done simply by multiplying the channel width of every stage by a common factor, so the model adapts to different resource constraints without structural redesign.
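
As a toy illustration of this width-multiplier rule, the snippet below derives a model family from a set of per-stage base widths. The base widths used here are an assumption for the sketch (the paper's architecture table fixes the actual configuration), and the printed names merely mirror the paper's BC-ResNet-τ naming convention.

```python
# Illustrative width scaling for a BC-ResNet-style model family. The base
# widths below are assumptions for this sketch; the paper defines the
# exact per-stage configuration.
BASE_STAGE_CHANNELS = [8, 12, 16, 20]  # assumed channels per stage at tau = 1

def scaled_channels(tau: float) -> list:
    """Scale every stage's channel count by the width multiplier tau."""
    return [int(round(c * tau)) for c in BASE_STAGE_CHANNELS]

for tau in (1, 1.5, 3, 8):
    print(f"BC-ResNet-{tau}: {scaled_channels(tau)}")
# BC-ResNet-1:   [8, 12, 16, 20]
# BC-ResNet-1.5: [12, 18, 24, 30]
# BC-ResNet-3:   [24, 36, 48, 60]
# BC-ResNet-8:   [64, 96, 128, 160]
```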

Evaluation and Results

On Google Speech Commands datasets v1 and v2, BC-ResNets achieve state-of-the-art top-1 accuracies of 98.0% and 98.7%, respectively, while requiring significantly less computation than prior models such as TC-ResNet, TENet, and MatchboxNet. For instance, BC-ResNet-1 matches TC-ResNet's performance with an order of magnitude fewer parameters and a similar computational load.

Implications and Future Directions

The results highlight the potential of broadcasted residual learning to improve the efficiency of neural networks on edge devices. By balancing model capacity against computational cost, BC-ResNets point toward more adaptive and scalable architectures for keyword spotting and, potentially, other audio processing tasks.

Future research could apply broadcasted residual learning to domains and tasks beyond KWS, and could combine it with complementary techniques such as attention mechanisms or pruning to produce even more efficient models for resource-limited applications. Because the broadcasted residual connection is a drop-in residual mapping, it is a natural candidate for reuse in other convolutional architectures.
