- The paper presents a compact and efficient architecture for running large vocabulary speech recognition locally on mobile devices using quantized LSTM acoustic models and compression techniques.
- Using quantization and compression, the system achieves a 13.5% word error rate and runs 7x faster than real-time, reducing the acoustic model size from 11.9 MB to 3 MB.
- Personalization using user data like contact names effectively reduces word error rates, enabling broader application in mobile, wearable, and IoT embedded systems.
An Expert Overview of Personalized Speech Recognition on Mobile Devices
The paper, "Personalized Speech Recognition on Mobile Devices," presents a compact and efficient solution for running large vocabulary speech recognition systems locally on mobile devices. This research incidentally focuses on addressing the reliability issues tied to cloud-based speech recognition, which can suffer from high latency or failure due to unstable network connections. The authors employ a quantized Long Short-Term Memory (LSTM) acoustic model along with sophisticated compression techniques to ensure the system operates within the memory and computational constraints of mobile devices, specifically the Nexus 5 Android smartphone.
The paper discusses notable methodologies applied in the development of this system, including the use of Connectionist Temporal Classification (CTC) to train LSTM models. These models are further optimized through state-level minimum Bayes risk (sMBR) techniques. A key highlight is the application of Singular Value Decomposition (SVD) to compress model size significantly. Quantization of model parameters to an 8-bit representation contributes substantially to reducing the acoustic model's footprint from 11.9 MB to 3 MB, allowing for accelerated runtime without considerable performance degradation.
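To make the compression recipe concrete, the sketch below shows how a truncated SVD factorization plus uniform 8-bit quantization shrinks a weight matrix. This is a minimal illustration, not the authors' implementation: the matrix shape, chosen rank, and NumPy-based helpers are assumptions for demonstration.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) with two smaller factors via truncated SVD.

    Replacing W by (U_r diag(S_r)) @ Vt_r cuts parameters from m*n to
    rank*(m + n), the same idea behind SVD-style projection layers.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]      # m x rank
    B = Vt[:rank, :]                # rank x n
    return A, B

def quantize_8bit(W):
    """Uniformly quantize a float32 matrix to 8-bit codes plus a scale.

    Storing uint8 values with a per-matrix scale and offset reduces memory
    roughly 4x versus float32; dequantize with W ~ scale * q + w_min.
    """
    w_min, w_max = W.min(), W.max()
    scale = (w_max - w_min) / 255.0
    q = np.round((W - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

# Example: compress a hypothetical 640 x 2048 recurrent weight matrix.
W = np.random.randn(640, 2048).astype(np.float32)
A, B = low_rank_factorize(W, rank=100)
qA, sA, mA = quantize_8bit(A)
qB, sB, mB = quantize_8bit(B)
original_bytes = W.size * 4
compressed_bytes = qA.size + qB.size  # int8 payload only, ignoring scales
print(f"compression ratio ~ {original_bytes / compressed_bytes:.1f}x")
```

Note that the roughly 4x saving from moving float32 parameters to 8-bit codes alone accounts for most of the reported drop from 11.9 MB to about 3 MB.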
Numerical Results and System Performance
The system demonstrates a 13.5% word error rate (WER) on the open-ended dictation task while decoding at a median speed seven times faster than real-time. The trade-off between model size and accuracy is tuned deliberately: larger, server-scale CTC models achieve lower WERs but are impractical for on-device deployment given their size and computational cost.
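For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal Python sketch (not tied to the paper's evaluation tooling):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("call mom on her cell", "call bob on her cell"))  # 0.2
```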
Language Modeling and Personalization Techniques
Language models are tailored to serve both the dictation and voice command domains simultaneously. The research introduces Bayesian interpolation for language model training, which improves on plain linear interpolation. On-device pronunciation modeling also benefits from an LSTM-based grapheme-to-phoneme (G2P) system that shrinks pronunciation generation from roughly 70 MB in an FST-based setup to just 500 KB.
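A rough sketch of the intuition behind Bayesian interpolation follows; the callables, signatures, and weighting scheme here are illustrative assumptions rather than the paper's exact formulation. The key difference from linear interpolation is that each domain's mixture weight adapts to how well that domain explains the current history.

```python
from typing import Callable, Sequence

def bayesian_interpolate(word: str,
                         history: str,
                         cond_probs: Sequence[Callable[[str, str], float]],
                         hist_probs: Sequence[Callable[[str], float]],
                         priors: Sequence[float]) -> float:
    """P(w|h) = sum_i P(i|h) * P_i(w|h), with P(i|h) proportional to
    prior_i * P_i(h).

    cond_probs[i](w, h) returns domain i's conditional probability of the
    word given the history; hist_probs[i](h) returns its probability of
    the history itself. Both interfaces are hypothetical stand-ins for
    whatever n-gram LMs are being mixed.
    """
    evidence = [p * ph(history) for p, ph in zip(priors, hist_probs)]
    total = sum(evidence) or 1.0
    weights = [e / total for e in evidence]
    return sum(w * pc(word, history) for w, pc in zip(weights, cond_probs))
```

Unlike a single global interpolation weight, this history-dependent weighting lets the dictation LM dominate in free-form contexts and the command LM dominate after command-like prefixes.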
Incorporating user-specific data such as contact names yields a notable reduction in WER, demonstrating the effectiveness of personalization in speech recognition. This is achieved by injecting dynamic vocabulary items and biasing the language model at request time to accommodate context-dependent information.
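As a toy illustration of dynamic vocabulary injection: in a production decoder this is done with weighted FST class replacement, but the hypothetical `$CONTACTS` non-terminal and grammar-expansion helper below convey the idea under those simplifying assumptions.

```python
def expand_class_grammar(template_sentences, contacts):
    """Expand a $CONTACTS non-terminal in the command grammar with the
    user's contact names, so those names become in-vocabulary at request
    time. A toy stand-in for dynamic class replacement in the decoder."""
    expanded = []
    for sent in template_sentences:
        if "$CONTACTS" in sent:
            expanded.extend(sent.replace("$CONTACTS", name) for name in contacts)
        else:
            expanded.append(sent)
    return expanded

templates = ["call $CONTACTS", "send a message to $CONTACTS", "open maps"]
print(expand_class_grammar(templates, ["alice", "bob"]))
# ['call alice', 'call bob', 'send a message to alice',
#  'send a message to bob', 'open maps']
```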
Implications and Future Directions
This research contributes significantly to the ongoing development of embedded speech recognition systems by offering a practical architecture optimized for mobile environments. As such platforms become increasingly prevalent, the techniques proposed may find more extensive application not only in smartphones but potentially in wearable devices and IoT setups, aligning with the trend towards decentralized, real-time processing systems.
Future avenues may explore further refinements in model quantization and compression methods, potentially leveraging emerging hardware accelerations for neural network inference. Research might also delve into enhancing personalization aspects, with adaptive learning mechanisms that seamlessly integrate user-specific contexts and preferences, thereby increasing recognition accuracy and user satisfaction.
In summary, this paper articulates both technical rigor and a nuanced approach to mobile-centric speech recognition challenges, paving the way for high-performance localized speech processing capabilities on constrained devices.