Overview of WeKws: An End-to-End Keyword Spotting Toolkit
The paper presents WeKws, a novel end-to-end (E2E) keyword spotting (KWS) toolkit designed to facilitate the integration of speech-based user interfaces in smart devices. WeKws addresses the long-standing challenge of bridging research and production deployment in KWS systems, offering an alignment-free, production-ready, and lightweight solution.
Key Features of WeKws
The paper articulates several distinguishing features of WeKws that set it apart from existing KWS systems:
- Alignment-Free Training: WeKws removes the frame-level alignment step required by many KWS training pipelines by employing a refined max-pooling loss, which lets the model learn the keyword ending position on its own and greatly simplifies training (a minimal sketch of the idea appears after this list).
- Production-Ready Design: The toolkit is built with deployment in mind, using causal convolutions to support streaming KWS and exporting trained models via TorchScript or to the Open Neural Network Exchange (ONNX) format, which keeps them compatible with a range of deployment environments.
- Lightweight Architecture: Designed for efficiency, WeKws has PyTorch as its only dependency, and its trained models are small enough to run on embedded devices without excessive resource consumption.
- Competitive Performance: In empirical evaluations, WeKws matches or exceeds the accuracy of other leading systems on established KWS benchmarks, and it achieves these results without the complex alignment or decoding procedures that many contemporary methods rely on.
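As a concrete illustration of the alignment-free idea, below is a minimal PyTorch sketch of a max-pooling-style loss. It assumes per-frame scores for a single keyword and binary clip labels; the function name and the exact treatment of negative frames are illustrative and are not taken from the WeKws implementation, whose refined variant differs in detail.

```python
import torch

def max_pooling_loss(logits, labels, min_frame=0):
    """Illustrative max-pooling-style loss for a single keyword.

    logits: (batch, frames) per-frame keyword scores before the sigmoid.
    labels: (batch,) 1 if the clip contains the keyword, 0 otherwise.
    """
    probs = torch.sigmoid(logits)  # frame-level keyword posteriors
    losses = []
    for prob, label in zip(probs, labels):
        if label == 1:
            # Positive clip: only the single best-scoring frame is rewarded,
            # so the model is free to discover where the keyword ends.
            peak = prob[min_frame:].max()
            losses.append(-torch.log(peak + 1e-8))
        else:
            # Negative clip: suppress the highest-scoring frame so that no
            # frame fires on non-keyword speech.
            peak = prob.max()
            losses.append(-torch.log(1.0 - peak + 1e-8))
    return torch.stack(losses).mean()
```

Because only the peak frame of a positive clip contributes to the loss, the network decides for itself where the keyword ends, which is what removes the need for frame-level alignments.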
System Architecture and Loss Function
The system is organized into three stages: data preparation and feature extraction, model training and testing, and model export and deployment. A noteworthy aspect of WeKws is the refined max-pooling loss, which is what allows the model to learn keyword spotting without explicit alignment supervision.
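The export and deployment stage builds on standard PyTorch tooling (TorchScript and ONNX). The snippet below is a generic sketch of those two export paths rather than WeKws's own export script; the stand-in model, tensor shapes, and file names are placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for a trained KWS model; any torch.nn.Module is exported the same way.
model = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

dummy_feats = torch.randn(1, 100, 80)  # (batch, frames, feature dim), illustrative shape

# TorchScript export: the serialized module can be loaded from C++ (libtorch)
# or Python without the original model code.
torch.jit.script(model).save("kws_model.zip")

# ONNX export: enables inference through ONNX Runtime and other engines.
torch.onnx.export(
    model, dummy_feats, "kws_model.onnx",
    input_names=["feats"], output_names=["logits"],
    dynamic_axes={"feats": {1: "num_frames"}},  # allow variable-length input
)
```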
The model itself comprises a cepstral mean and variance normalization (CMVN) layer, a linear transformation layer, and a configurable backbone network. The backbone can be an RNN-, temporal convolutional network (TCN)-, or multi-scale depthwise temporal convolution (MDTC)-based network, so it can be matched to the application while keeping the model lightweight and responsive.
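A minimal sketch of such a layered model is given below, assuming frame-level features and a backbone that preserves the time dimension; the class name, constructor arguments, and the final classification head are illustrative and do not mirror WeKws's actual API.

```python
import torch
import torch.nn as nn

class KwsModel(nn.Module):
    """Sketch of the layered model described above (names are illustrative)."""

    def __init__(self, feat_dim, hidden_dim, backbone, num_keywords,
                 cmvn_mean, cmvn_istd):
        super().__init__()
        # Global CMVN applied as fixed, non-trainable statistics.
        self.register_buffer("cmvn_mean", cmvn_mean)   # (feat_dim,)
        self.register_buffer("cmvn_istd", cmvn_istd)   # (feat_dim,) inverse std
        self.input_proj = nn.Linear(feat_dim, hidden_dim)  # linear transformation layer
        self.backbone = backbone  # e.g. an RNN-, TCN-, or MDTC-style network
        self.classifier = nn.Linear(hidden_dim, num_keywords)  # assumed output head

    def forward(self, feats):  # feats: (batch, frames, feat_dim)
        x = (feats - self.cmvn_mean) * self.cmvn_istd
        x = self.input_proj(x)
        x = self.backbone(x)   # assumed to keep the (batch, frames, hidden_dim) shape
        return torch.sigmoid(self.classifier(x))  # per-frame keyword posteriors
```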
Experimental Analysis
WeKws was evaluated on several public KWS datasets, including Mobvoi, Snips, and Google Speech Commands. Across these benchmarks it achieves competitive false rejection rates (FRR) at fixed false-alarm operating points, matching or outperforming many prominent systems while remaining efficient in computation and parameter count.
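For reference, this kind of operating-point metric can be computed roughly as in the sketch below, which assumes one peak detection score per positive utterance and per negative segment; the actual evaluation protocol in the paper may differ (for example, sliding-window detection over continuous negative audio).

```python
import numpy as np

def frr_at_fa_per_hour(pos_scores, neg_scores, neg_hours, target_fa_per_hour=1.0):
    """Find the FRR at the loosest threshold that meets a false-alarm budget.

    pos_scores: peak detection score for each keyword utterance.
    neg_scores: peak detection score for each non-keyword segment.
    neg_hours:  total duration of the negative audio, in hours.
    """
    pos_scores = np.asarray(pos_scores)
    neg_scores = np.asarray(neg_scores)
    for threshold in np.sort(neg_scores):
        fa_per_hour = np.sum(neg_scores > threshold) / neg_hours
        if fa_per_hour <= target_fa_per_hour:
            frr = float(np.mean(pos_scores <= threshold))
            return float(threshold), frr
    return None, None  # no threshold satisfies the false-alarm budget
```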
Implications and Future Directions
The research introduces a viable open-source KWS framework that could significantly impact the deployment of smart voice-enabled systems, particularly in IoT sectors requiring on-device processing capabilities. By addressing the alignment and complexity challenges seen in other toolkits, WeKws offers a streamlined pathway from research innovation to practical application.
Future developments in this domain could benefit from enhancing the robustness of WeKws to accommodate more extensive datasets and diverse environmental noise conditions. Additionally, exploring further integration with emerging AI deployment platforms might broaden WeKws' applicability across different device ecosystems.
In conclusion, WeKws stands as a significant step forward among E2E KWS toolkits, balancing strong benchmark performance with practical, scalable deployment.