Broadcasted Residual Learning for Efficient Keyword Spotting (2106.04140v4)

Published 8 Jun 2021 in cs.SD, cs.LG, and eess.AS

Abstract: Keyword spotting is an important research field because it plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to minimize errors while operating efficiently on devices with limited resources such as mobile phones. We present a broadcasted residual learning method to achieve high accuracy with small model size and computational load. Our method configures most of the residual functions as 1D temporal convolutions while still allowing 2D convolutions, via a broadcasted residual connection that expands the temporal output to the frequency-temporal dimension. This residual mapping enables the network to effectively represent useful audio features with much less computation than conventional convolutional neural networks. We also propose a novel network architecture, the broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describe how to scale up the model according to the target device's resources. BC-ResNets achieve state-of-the-art 98.0% and 98.7% top-1 accuracy on Google Speech Commands datasets v1 and v2, respectively, and consistently outperform previous approaches using fewer computations and parameters. Code is available at https://github.com/Qualcomm-AI-research/bcresnet.

References (29)
  1. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  2. M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in CVPR. IEEE Computer Society, 2018, pp. 4510–4520.
  3. X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in CVPR. IEEE Computer Society, 2018, pp. 6848–6856.
  4. M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in ICML, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 6105–6114.
  5. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR. IEEE Computer Society, 2016, pp. 770–778.
  6. F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in CVPR. IEEE Computer Society, 2017, pp. 1800–1807.
  7. S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha, “Temporal convolution for real-time keyword spotting on mobile devices,” in INTERSPEECH. ISCA, 2019, pp. 3372–3376.
  8. X. Li, X. Wei, and X. Qin, “Small-footprint keyword spotting with multi-scale temporal convolution,” in INTERSPEECH. ISCA, 2020, pp. 1987–1991.
  9. S. Majumdar and B. Ginsburg, “MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition,” in INTERSPEECH. ISCA, 2020, pp. 3356–3360.
  10. R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in ICASSP. IEEE, 2018, pp. 5484–5488.
  11. M. Xu and X. Zhang, “Depthwise separable convolutional ResNet with squeeze-and-excitation blocks for small-footprint keyword spotting,” in INTERSPEECH. ISCA, 2020, pp. 2547–2551.
  12. P. Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
  13. S. Chang, H. Park, J. Cho, H. Park, S. Yun, and K. Hwang, “Subspectral normalization for neural audio data processing,” in ICASSP. IEEE, 2021, pp. 850–854.
  14. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 37. JMLR.org, 2015, pp. 448–456.
  15. P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” in ICLR (Workshop). OpenReview.net, 2018.
  16. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in CVPR. IEEE Computer Society, 2018, pp. 8697–8710.
  17. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in AAAI. AAAI Press, 2019, pp. 4780–4789.
  18. D. C. de Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, “A neural attention model for speech command recognition,” CoRR, vol. abs/1808.08929, 2018.
  19. M. Lee, J. Lee, H. J. Jang, B. Kim, W. Chang, and K. Hwang, “Orthogonality constrained multi-head attention for keyword spotting,” in ASRU. IEEE, 2019, pp. 86–92.
  20. O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo, “Streaming keyword spotting on mobile devices,” in INTERSPEECH. ISCA, 2020, pp. 2277–2281.
  21. Y. Zhuang, X. Chang, Y. Qian, and K. Yu, “Unrestricted vocabulary keyword spotting using LSTM-CTC,” in INTERSPEECH. ISCA, 2016, pp. 938–942.
  22. B. Kim, M. Lee, J. Lee, Y. Kim, and K. Hwang, “Query-by-example on-device keyword spotting,” in ASRU. IEEE, 2019, pp. 532–538.
  23. D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in INTERSPEECH. ISCA, 2019, pp. 2613–2617.
  24. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
  25. I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in ICLR (Poster). OpenReview.net, 2017.
  26. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR. IEEE Computer Society, 2018, pp. 7132–7141.
  27. S. Lee, S. Chang, and N. Kwak, “URNet: User-resizable residual networks with conditional gating module,” in AAAI. AAAI Press, 2020, pp. 4569–4576.
  28. S. Chang, J. Yang, S. Park, and N. Kwak, “Broadcasting convolutional network for visual relational reasoning,” in ECCV, 2018, pp. 754–769.
  29. J. Lin, K. Kilgour, D. Roblek, and M. Sharifi, “Training keyword spotters with limited and synthesized speech data,” in ICASSP. IEEE, 2020, pp. 7474–7478.
Summary

The paper "Broadcasted Residual Learning for Efficient Keyword Spotting" introduces a novel method called broadcasted residual learning, designed to enhance the performance of convolutional neural network (CNN)-based approaches in keyword spotting (KWS). The primary objective is to achieve high accuracy with a minimal model size and computational overhead, which is crucial for applications deployed on resource-constrained devices such as mobile phones.

Broadcasted Residual Learning

This method combines the complementary strengths of 1D temporal and 2D frequency-temporal convolutions for KWS. In conventional CNN approaches, using either type of convolution alone forces a trade-off: 2D convolutions capture frequency-domain structure but are computationally expensive, while 1D temporal convolutions are efficient but frequency-blind. Broadcasted residual learning integrates both within a single residual function: a frequency-wise 2D convolution is applied first, its output is averaged over the frequency axis to obtain 1D temporal features, and after lightweight 1D processing those features are broadcast back across the frequency dimension in the residual mapping. The model thereby gains the frequency awareness of 2D convolutions while retaining most of the computational efficiency of 1D convolutions.
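
To make the data flow concrete, here is a minimal PyTorch sketch of the broadcasted residual computation. It is a simplification under stated assumptions: it keeps only the frequency-wise 2D depthwise convolution, the frequency averaging, the 1D temporal convolutions, and the broadcast addition, and omits the normalization, activation, and dropout of the full BC-ResBlock. Module and variable names are illustrative, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class BroadcastedResidual(nn.Module):
    """Simplified sketch of a broadcasted residual block (the full
    BC-ResBlock's normalization, activation, and dropout are omitted)."""

    def __init__(self, channels: int):
        super().__init__()
        # f2: frequency-wise 2D depthwise convolution (3x1 kernel over frequency)
        self.freq_dw = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(1, 0), groups=channels)
        # f1: 1D temporal depthwise + pointwise convolution
        self.temp_dw = nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frequency, time)
        f2 = self.freq_dw(x)                       # frequency-aware 2D features
        pooled = f2.mean(dim=2)                    # average over frequency -> (B, C, T)
        f1 = self.pointwise(self.temp_dw(pooled))  # cheap 1D temporal features
        # Broadcast the 1D output back over the frequency axis, add residuals
        return x + f2 + f1.unsqueeze(2)            # (B, C, 1, T) expands to (B, C, F, T)

# Shape check on a dummy log-Mel batch: 2 examples, 16 channels, 40 bins, 100 frames
out = BroadcastedResidual(16)(torch.randn(2, 16, 40, 100))
print(out.shape)  # torch.Size([2, 16, 40, 100])
```

Note how the expensive 2D convolution appears only once per block, while the per-frequency work after pooling collapses to 1D; this is the source of the computational savings the paper reports.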

The BC-ResNet Architecture

Building on broadcasted residual learning, the authors propose a new network architecture, the broadcasting-residual network (BC-ResNet), which scales efficiently to device constraints. BC-ResNets are organized into stages of BC-ResBlocks, which combine depthwise separable convolutions with subspectral normalization to preserve frequency-aware learning. The architecture leverages the computational advantages of 1D operations while deploying 2D convolutions only where they are most beneficial.
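
Subspectral normalization (SSN, reference 13) replaces batch normalization by splitting the frequency axis into sub-bands and normalizing each sub-band with its own statistics, which is what keeps the 2D path frequency-aware. The sketch below is an illustrative implementation under the assumption that the frequency dimension divides evenly into the chosen number of sub-bands; it is not the authors' code.

```python
import torch
import torch.nn as nn

class SubSpectralNorm(nn.Module):
    """Sketch of subspectral normalization: split the frequency axis into
    sub-bands and batch-normalize each sub-band with its own statistics."""

    def __init__(self, channels: int, sub_bands: int):
        super().__init__()
        self.sub_bands = sub_bands
        # One set of batch-norm statistics per (channel, sub-band) pair
        self.bn = nn.BatchNorm2d(channels * sub_bands)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape  # frequency dim must be divisible by sub_bands
        x = x.view(b, c * self.sub_bands, f // self.sub_bands, t)
        x = self.bn(x)
        return x.view(b, c, f, t)

# Example: normalize a (2, 16, 40, 100) feature map with 5 frequency sub-bands
y = SubSpectralNorm(16, 5)(torch.randn(2, 16, 40, 100))
print(y.shape)  # torch.Size([2, 16, 40, 100])
```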

As a scalable framework, BC-ResNet spans small networks with roughly 10k parameters up to much larger models, covering diverse computational budgets. Scaling is done simply by multiplying the channel width of every stage by a common factor, so the model adapts to different resource constraints without structural redesign.
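
As a toy illustration of this width-multiplier rule, the snippet below derives a model family from a set of per-stage base widths. The base widths used here are an assumption for the sketch (the paper's architecture table fixes the actual configuration), and the printed names merely mirror the paper's BC-ResNet-τ naming convention.

```python
# Illustrative width scaling for a BC-ResNet-style model family. The base
# widths below are assumptions for this sketch; the paper defines the
# exact per-stage configuration.
BASE_STAGE_CHANNELS = [8, 12, 16, 20]  # assumed channels per stage at tau = 1

def scaled_channels(tau: float) -> list:
    """Scale every stage's channel count by the width multiplier tau."""
    return [int(round(c * tau)) for c in BASE_STAGE_CHANNELS]

for tau in (1, 1.5, 3, 8):
    print(f"BC-ResNet-{tau}: {scaled_channels(tau)}")
# BC-ResNet-1:   [8, 12, 16, 20]
# BC-ResNet-1.5: [12, 18, 24, 30]
# BC-ResNet-3:   [24, 36, 48, 60]
# BC-ResNet-8:   [64, 96, 128, 160]
```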

Evaluation and Results

On Google Speech Commands datasets v1 and v2, BC-ResNets achieve state-of-the-art top-1 accuracies of 98.0% and 98.7%, respectively, while requiring significantly less computation than prior models such as TC-ResNet, TENet, and MatchboxNet. For instance, BC-ResNet-1 matches TC-ResNet's performance with an order of magnitude fewer parameters and a similar computational load.

Implications and Future Directions

The results highlight the potential of broadcasted residual learning to improve the efficiency of neural networks on edge devices. By balancing model capacity against computational cost, BC-ResNets point toward more adaptive and scalable architectures for keyword spotting and, potentially, other audio processing tasks.

Future research could apply broadcasted residual learning to domains and tasks beyond KWS, and could combine it with complementary techniques such as attention mechanisms or pruning to produce even more efficient models for resource-limited applications. Because the broadcasted residual connection is a drop-in residual mapping, it is a natural candidate for reuse in other convolutional architectures.
