- The paper introduces WhisperKit, an on-device ASR system that uses architectural modifications, ANE acceleration, and OD-MBP compression.
- It achieves a state-of-the-art word error rate of 2% and reduces streaming latency by up to 65%, enabling efficient real-time transcription.
- The approach compresses the model file from 1.6 GB to 0.6 GB while keeping WER within 1% of the uncompressed model, making it well suited to edge devices.
This paper introduces WhisperKit, an optimized on-device inference system for real-time ASR utilizing billion-scale Transformers. The system achieves state-of-the-art accuracy and latency, outperforming leading cloud-based ASR systems. The key innovations include architectural modifications for streaming inference, ANE acceleration, and a novel compression technique called OD-MBP.
Architectural Modifications for Streaming ASR
The primary challenge in real-time ASR lies in processing streaming audio with minimal latency while maintaining high accuracy. WhisperKit addresses this by modifying the Whisper architecture for native streaming support. The Audio Encoder is adapted to process partial audio, and the Text Decoder is optimized to produce accurate output streams incrementally.
To enable streaming in the Audio Encoder, the authors employ self-distillation with a relaxed block-diagonal attention mask. Unlike strict lower-triangular masks, the block-diagonal approach, denoted as d750, balances accuracy and latency. This configuration retains WER within 1% of the original model while reducing latency by 65% (602 ms -> 218 ms). This approach also enables silence caching, where the audio encoder's output for zero-padded blocks is precomputed and reused.
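To make the relaxed mask concrete, here is a minimal NumPy sketch of a block-diagonal attention mask, assuming the d750 notation denotes a block size of 750 encoder frames (Whisper emits 1500 frames per 30-second window); the function name and block handling are illustrative, not the paper's code:

```python
import numpy as np

def block_diagonal_mask(n_frames: int, block: int = 750) -> np.ndarray:
    """Additive attention mask: each frame attends only within its own block.

    Within a block, attention is bidirectional -- the relaxation relative to a
    strict lower-triangular (causal) mask. Returns 0.0 where attention is
    allowed and -inf where it is masked, to be added to attention logits.
    """
    blk = np.arange(n_frames) // block
    allowed = blk[None, :] == blk[:, None]
    return np.where(allowed, 0.0, -np.inf)

# e.g. a 30 s Whisper window with d750: two independent 750-frame blocks
mask = block_diagonal_mask(1500, block=750)
```

Because each block attends only to itself under this mask, a fully zero-padded block always produces the same encoder output, which is what makes silence caching possible.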
The Text Decoder is optimized using the LocalAgreement policy, which confirms hypothesis text by identifying the longest common prefix between consecutive text buffers. This yields two output streams: a confirmed text stream for stable results and a hypothesis text stream for low-latency, responsive output with occasional corrections. Speculative decoding with a lightweight RNN drafter was explored but not adopted, due to the overhead of running the drafter model, particularly for the Whisper Large v3 Turbo variant.
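A minimal sketch of the LocalAgreement idea, operating on token buffers; the class and method names are illustrative, not WhisperKit's API:

```python
def longest_common_prefix(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

class LocalAgreement:
    """Promote tokens to the confirmed stream once two consecutive
    decoding passes agree on them."""

    def __init__(self):
        self.confirmed = []  # stable stream: never retracted
        self.previous = []   # hypothesis from the previous decoding pass

    def update(self, hypothesis):
        # Compare only the not-yet-confirmed suffixes of consecutive buffers.
        prev_tail = self.previous[len(self.confirmed):]
        curr_tail = hypothesis[len(self.confirmed):]
        self.confirmed += longest_common_prefix(prev_tail, curr_tail)
        self.previous = hypothesis
        return self.confirmed, hypothesis  # (confirmed, low-latency hypothesis)
```

For example, feeding the buffers `["the", "cat"]` and then `["the", "cat", "sat"]` confirms "the cat" on the second pass, while "sat" remains hypothesis text until a later pass agrees with it.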
On-Device Optimization and ANE Acceleration
WhisperKit is designed to operate efficiently on edge devices, addressing the energy and memory constraints associated with billion-scale models. The system leverages the Apple Neural Engine (ANE) for accelerated inference, building upon the ane-transformers reference implementation.
Stateful Models in Core ML are used to optimize the key-value cache of the Text Decoder, reducing latency by 45% (8.4 ms -> 4.6 ms on the M3 ANE) and energy consumption by 80% (1.5 W -> 0.3 W). These optimizations are crucial for maintaining battery life and thermal stability during on-device inference.
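As a rough illustration of the mechanism, the sketch below converts a toy decoder step whose KV cache lives in registered buffers into a stateful Core ML model, so the runtime mutates the cache in place rather than copying it in and out on every token. It assumes coremltools >= 8.0 and an iOS 18 deployment target; the shapes, the one-hot scatter trick, and the toy attention are illustrative, not WhisperKit's actual decoder:

```python
import torch
import coremltools as ct

MAX_TOKENS, DIM = 448, 64

class ToyDecoderStep(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffers mutated in place become Core ML state.
        self.register_buffer("k_cache", torch.zeros(MAX_TOKENS, DIM))
        self.register_buffer("v_cache", torch.zeros(MAX_TOKENS, DIM))

    def forward(self, q, new_k, new_v, slot):
        # `slot` is a one-hot (MAX_TOKENS, 1) column selecting this token's row.
        self.k_cache.add_(slot * new_k)
        self.v_cache.add_(slot * new_v)
        # Toy attention over the fixed-length cache (real code masks empty slots).
        attn = torch.softmax(q @ self.k_cache.T / DIM ** 0.5, dim=-1)
        return attn @ self.v_cache

example = (torch.rand(1, DIM), torch.rand(1, DIM), torch.rand(1, DIM),
           torch.zeros(MAX_TOKENS, 1))
traced = torch.jit.trace(ToyDecoderStep().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=t.shape) for t in example],
    states=[
        ct.StateType(wrapped_type=ct.TensorType(shape=(MAX_TOKENS, DIM)),
                     name="k_cache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=(MAX_TOKENS, DIM)),
                     name="v_cache"),
    ],
    minimum_deployment_target=ct.target.iOS18,
)
```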
Model Compression with Outlier-Decomposed Mixed-Bit Palettization (OD-MBP)
To reduce model size while preserving accuracy, WhisperKit employs a novel compression technique called OD-MBP. This method decomposes each weight tensor into a low-bit-precision inlier block and a float16-precision outlier block. Inliers are palettized, i.e., stored as low-bit indices into a small lookup table, while outliers are kept in float16 precision in a bit-packed sparse representation.
This approach keeps WER within 1% of the original uncompressed model while reducing the model file size from 1.6 GB to 0.6 GB. The inlier branch is implemented using the natively accelerated palettization format, and the outlier branch using the natively accelerated sparse weight format.
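The following NumPy sketch illustrates the decomposition. The magnitude-based outlier criterion, the fixed 4-bit LUT, and the plain 1-D k-means are illustrative choices; the paper's exact outlier selection and per-layer bit-width mixing are not reproduced here:

```python
import numpy as np

def od_mbp_compress(w, n_bits=4, outlier_frac=0.01, iters=20):
    """Split w into palettized low-bit inliers + float16 sparse outliers."""
    flat = w.astype(np.float32).ravel()
    # Outlier block: the largest-magnitude weights, kept at float16 precision.
    k_out = max(1, int(outlier_frac * flat.size))
    outlier_idx = np.argpartition(np.abs(flat), -k_out)[-k_out:]
    outlier_vals = flat[outlier_idx].astype(np.float16)
    inliers = flat.copy()
    inliers[outlier_idx] = 0.0
    # Inlier block: 1-D k-means palettization into a 2**n_bits-entry LUT.
    lut = np.quantile(inliers, np.linspace(0.0, 1.0, 2 ** n_bits))
    for _ in range(iters):
        codes = np.abs(inliers[:, None] - lut[None, :]).argmin(axis=1)
        for c in range(lut.size):
            members = inliers[codes == c]
            if members.size:
                lut[c] = members.mean()
    codes = np.abs(inliers[:, None] - lut[None, :]).argmin(axis=1).astype(np.uint8)
    return codes, lut.astype(np.float16), outlier_idx, outlier_vals, w.shape

def od_mbp_decompress(codes, lut, outlier_idx, outlier_vals, shape):
    flat = lut.astype(np.float32)[codes]  # expand LUT indices to values
    flat[outlier_idx] = outlier_vals      # restore float16 outliers
    return flat.reshape(shape)

w = np.random.randn(128, 128).astype(np.float32)
w_hat = od_mbp_decompress(*od_mbp_compress(w))
print(np.abs(w - w_hat).max())  # error is bounded by inlier quantization
```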
Benchmarking and Results
WhisperKit is benchmarked against leading cloud-based ASR systems, including OpenAI's gpt-4o-transcribe, Deepgram's nova-3, and Fireworks' large-v3-turbo. The evaluation focuses on both latency and accuracy, considering the impact of retrospective corrections in real-time transcription.
Latency is measured as the difference between the audio cursor and the transcript cursor, using ground-truth word-level timestamps from the TIMIT dataset. WhisperKit achieves a mean latency of 0.46 seconds for hypothesis text, matching Fireworks and outperforming Deepgram (0.83 seconds) and OpenAI (1.72 seconds for confirmed text). For confirmed text, all systems achieve similar latencies around 1.7 seconds.
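As a toy illustration of the metric, the snippet below averages the gap between the audio cursor (ground-truth word end time) and the transcript cursor (wall-clock time a word first appears in the output); the timestamps are hypothetical, not TIMIT data:

```python
def mean_streaming_latency(word_end_times, emission_times):
    """Mean audio-cursor-to-transcript-cursor gap, in seconds."""
    gaps = [emit - end for end, emit in zip(word_end_times, emission_times)]
    return sum(gaps) / len(gaps)

ends  = [0.41, 0.73, 1.10, 1.38]   # when each word ends in the audio
emits = [0.88, 1.21, 1.52, 1.84]   # when each word first appears on screen
print(mean_streaming_latency(ends, emits))  # ~0.46 s
```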
Accuracy is evaluated using WER for confirmed text and the number of retrospective corrections for hypothesis text. WhisperKit and Deepgram achieve the lowest WER at 2%, while Fireworks trails at 4.72%. OpenAI registers zero corrections simply because it does not emit hypothesis text. Fireworks issues significantly more corrections than WhisperKit and Deepgram, making it less desirable for real-time applications despite its low latency.
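For reference, a standard Levenshtein-based WER implementation of the kind used for such comparisons (a generic sketch, not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

print(wer("the cat sat on the mat", "the cat sat in the mat"))  # ~0.167
```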
Conclusion
WhisperKit presents a practical and efficient solution for on-device real-time ASR with billion-scale Transformers. The system's architectural modifications, ANE acceleration, and OD-MBP compression technique enable state-of-the-art accuracy and latency while meeting the energy and memory constraints of edge devices. The results demonstrate that WhisperKit outperforms leading cloud-based ASR systems, making it a viable solution for various commercial applications.