Whispy: Real-Time ASR with Whisper
- Whispy is a real-time ASR system that adapts offline Whisper models through a streaming pipeline, enabling low-latency transcription.
- It employs a robust suggestion mechanism based on Levenshtein distance to align overlapping outputs and maintain continuous context.
- Optimized data buffering and VAD integration yield near-offline performance with minimal latency and high accuracy on modest hardware.
Whispy is a system that enables real-time automatic speech recognition (ASR) using Whisper, a family of state-of-the-art large general-purpose transformer models for speech analysis, without altering the model’s internal architecture. Whispy addresses the fundamental limitation that Whisper models are designed for offline, batch transcription and are not directly suitable for live or low-latency applications. By employing architectural optimizations at the data pipeline and post-processing levels, Whispy facilitates streaming voice recognition while maintaining high transcription accuracy, low computational cost, and operational robustness in practical deployment environments (Bevilacqua et al., 2024).
1. System Architecture and Principles
Whispy achieves real-time speech transcription by wrapping a pretrained Whisper model—typically quantized using CTranslate2 (as in faster-whisper)—inside an engineered streaming pipeline. There are no changes to the transformer core: no additional attention mechanisms, sparse-attention kernels, or modified feed-forward components are introduced.
The audio stream is continuously appended to a FIFO buffer (data register) of fixed capacity, sized as $N \cdot L_c \cdot f_s$ samples, where $N$ is the number of chunks, $L_c$ is the per-chunk duration in seconds, and $f_s$ is the sampling rate. At every interval $L_c$, the buffer is re-transcribed as a whole (i.e., with overlapping windows); each window encompasses the most recent $N$ chunks. This overlapping context management is essential for capturing sentence boundaries and preserving transcription continuity despite the chunked inference procedure.
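A minimal sketch of such a data register, assuming chunks arrive as plain lists of samples. The class name `DataRegister` and its methods are hypothetical and are not taken from the Whispy code base; the 16 kHz default reflects the sampling rate Whisper models expect.

```python
from collections import deque


class DataRegister:
    """Fixed-capacity FIFO of audio chunks (hypothetical name and API)."""

    def __init__(self, n_chunks: int, chunk_seconds: float, sample_rate: int = 16_000):
        self.chunk_samples = int(chunk_seconds * sample_rate)
        # deque(maxlen=...) silently drops the oldest chunk once the register is full.
        self._chunks = deque(maxlen=n_chunks)

    @property
    def capacity_samples(self) -> int:
        """Total capacity: number of chunks * chunk duration * sampling rate."""
        return self._chunks.maxlen * self.chunk_samples

    def push(self, chunk: list) -> None:
        if len(chunk) != self.chunk_samples:
            raise ValueError("chunk has the wrong number of samples")
        self._chunks.append(chunk)

    def window(self) -> list:
        """Concatenate all buffered chunks, oldest first."""
        return [sample for chunk in self._chunks for sample in chunk]

    def flush(self) -> None:
        self._chunks.clear()


register = DataRegister(n_chunks=5, chunk_seconds=4.0)
print(register.capacity_samples)  # 5 chunks * 4 s * 16000 Hz = 320000
```

Using a bounded deque keeps the sliding-window behaviour implicit: pushing a sixth chunk into a five-chunk register evicts the oldest one, so `window()` always returns the most recent chunks.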
2. Context Handling: Re-Transcription and Suggestion Mechanism
To ensure coherence across partially overlapping transcriptions, Whispy employs a string-level "suggestion" algorithm based on Levenshtein distance. When a new transcription is generated for the current buffer window, the system finds the suffix of the new output ($T_t$) most similar to the transcript from the previous step ($T_{t-1}$). Specifically, for each candidate alignment, the edit distance is computed between the overlapping portions of $T_t$ and $T_{t-1}$, and the alignment with the minimum distance is selected. The optimal suffix is appended to the cumulative output. This facilitates robust recovery of full-sentence context, bridging across window boundaries without incurring significant word error rate (WER) degradation.
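The following is one possible word-level reading of this idea, not the authors' exact procedure. It tries every split point of the new transcript, scores the overlap against the tail of the previous transcript with a length-normalised Levenshtein distance, and returns only the words that extend the previous text. The function names and the `max_ratio` threshold are assumptions made for illustration.

```python
def levenshtein(a, b) -> int:
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest_suffix(previous: str, new: str, max_ratio: float = 0.5) -> str:
    """Return the part of `new` that continues `previous`.

    For each overlap length k, the first k words of `new` are compared with
    the last k words of `previous`. The overlap with the lowest normalised
    edit distance wins (ties favour the longer overlap). If no overlap scores
    at or below `max_ratio`, the whole of `new` is returned.
    """
    prev_words, new_words = previous.split(), new.split()
    best_k, best_score = 0, max_ratio
    for k in range(1, min(len(prev_words), len(new_words)) + 1):
        score = levenshtein(prev_words[-k:], new_words[:k]) / k
        if score <= best_score:
            best_k, best_score = k, score
    return " ".join(new_words[best_k:])


print(suggest_suffix("the quick brown fox jumps",
                     "brown fox jumped over the lazy dog"))
# over the lazy dog
```

Because the match is approximate, the overlap is still found even though the second pass transcribed "jumped" where the first pass produced "jumps".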
A hallucination filter—consisting of the removal of repetitive token runs (e.g., "uh uh uh uh")—is applied as a lightweight check before the suggestion step. Empirical results indicate hallucinated outputs are rare in practice (Bevilacqua et al., 2024).
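A lightweight filter of this kind could look like the sketch below, which caps runs of an identical word. The function name and the `max_run` parameter are assumptions; the source only states that repetitive token runs are removed, and a real filter may also need to handle repeated multi-word phrases.

```python
def collapse_repeats(text: str, max_run: int = 1) -> str:
    """Keep at most `max_run` consecutive copies of the same word."""
    kept = []
    run = 0
    for word in text.split():
        if kept and word.lower() == kept[-1].lower():
            run += 1          # same word as the last one kept
        else:
            run = 1           # a different word starts a new run
        if run <= max_run:
            kept.append(word)
    return " ".join(kept)


print(collapse_repeats("uh uh uh uh we should go"))  # uh we should go
```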
3. Streaming Pipeline and Computation
The core streaming algorithm operates in a loop with the following process:
- Obtain the latest audio chunk from the register.
- Apply voice activity detection (VAD) to the chunk; if VAD returns “false,” the register is flushed to avoid transcribing silence and the output is “silence.”
- If active, retrieve the entire window (all chunks in the buffer).
- Run Whisper transcription on the window.
- Apply hallucination filtering if required.
- Apply suggestion alignment with previous transcript.
- Update cumulative output and emit the inferred transcription.
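The steps above can be sketched as a generator that takes the model, the VAD, and the alignment step as injected callables. Everything here is illustrative: `stream_transcribe` and `naive_align` are hypothetical names, the Whisper call is replaced by a toy stand-in, and resetting the previous transcript after silence is an assumption rather than documented behaviour.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Sequence


def stream_transcribe(
    chunks: Iterable[Sequence],
    transcribe: Callable[[list], str],
    is_speech: Callable[[Sequence], bool],
    align: Callable[[str, str], str],
    clean: Callable[[str], str] = lambda text: text,
    n_chunks: int = 5,
) -> Iterator[str]:
    """Yield one output per incoming chunk: newly recognised text or "silence"."""
    register = deque(maxlen=n_chunks)   # oldest chunk drops out when full
    previous = ""
    for chunk in chunks:
        register.append(chunk)          # obtain the latest chunk
        if not is_speech(chunk):        # VAD gate
            register.clear()            # flush so silence is never transcribed
            previous = ""               # assumption: context restarts after silence
            yield "silence"
            continue
        window = [s for c in register for s in c]   # entire buffered window
        text = clean(transcribe(window))            # model call, then hallucination filter
        new_part = align(previous, text)            # suggestion alignment
        previous = text
        yield new_part                              # caller appends this to the output


def naive_align(previous: str, new: str) -> str:
    """Toy stand-in: drop the longest prefix of `new` that exactly ends `previous`."""
    p, n = previous.split(), new.split()
    for k in range(min(len(p), len(n)), 0, -1):
        if p[-k:] == n[:k]:
            return " ".join(n[k:])
    return new


# Toy run: each "chunk" is a list of words and an empty chunk counts as silence.
demo_chunks = [["hello"], ["world"], [], ["again"]]
outputs = list(stream_transcribe(demo_chunks, transcribe=" ".join,
                                 is_speech=bool, align=naive_align, n_chunks=2))
print(outputs)  # ['hello', 'world', 'silence', 'again']
```

Passing the components in as callables keeps the loop independent of any particular Whisper backend or VAD library, which mirrors the point that the model itself is left untouched.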
For latency, the system defines $T_{\text{total}} = T_{\text{whisper}} + T_{\text{VAD}}$, where $T_{\text{whisper}}$ is the Whisper inference time and $T_{\text{VAD}}$ is the VAD duration. There are inherent trade-offs: increasing chunk length ($L_c$) reduces VAD invocations but increases per-window compute and look-ahead latency; increasing buffer size ($N$) provides more context and marginally better WER at the expense of higher cost and small additional alignment errors.
4. Computational Complexity and Hardware Performance
Whispy reduces computational complexity relative to standard offline Whisper. Offline Whisper's self-attention scales quadratically, $O(n^2)$, in sequence length $n$. Whispy's chunked, overlapping evaluation restricts compute to $O(W \cdot (N L_c)^2)$, where $W$ is the total number of windows processed per time unit and $N L_c$ is the window length (number of chunks times chunk duration). Empirically:
- On a Tesla T4 GPU, the large-v3 model with 4 s chunks and a buffer of 5 chunks yields a total per-chunk latency (Whisper inference + VAD + suggestion) of approximately 0.88 s (ESIC) to 1.66 s (Rev16), corresponding to a real-time factor (RTF) of roughly 4 (seconds of audio per second of processing time).
- The small and base Whisper models support even faster operation (RTF ≈ 7–9, with total per-chunk latencies around 0.44–0.63 s).
- Each Whispy instance consumes approximately 2 GB of GPU RAM, with negligible additional CPU burden for audio stream handling and VAD (~5–10% of 8-vCPU).
- The streaming design thus sustains robust, multi-channel, real-time speech recognition on modest hardware (Bevilacqua et al., 2024).
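As a rough sanity check on these figures, the sketch below reads the real-time factor as chunk duration divided by per-chunk latency. That reading, and the use of 4 s chunks for the smaller models, are assumptions; the source states the chunk length only for the large-v3 configuration.

```python
def real_time_factor(chunk_seconds: float, latency_seconds: float) -> float:
    """Seconds of audio handled per second of processing (assumed definition)."""
    return chunk_seconds / latency_seconds


CHUNK_SECONDS = 4.0  # chunk length quoted for the large-v3 runs

latencies = {
    "large-v3, ESIC": 0.88,
    "large-v3, Rev16": 1.66,
    "small/base, fastest": 0.44,
    "small/base, slowest": 0.63,
}
for label, latency in latencies.items():
    print(f"{label}: RTF ~ {real_time_factor(CHUNK_SECONDS, latency):.1f}")
# large-v3, ESIC: RTF ~ 4.5
# large-v3, Rev16: RTF ~ 2.4
# small/base, fastest: RTF ~ 9.1
# small/base, slowest: RTF ~ 6.3
```

Any value above 1 under this definition means a chunk is processed faster than it is recorded, which is the condition for keeping up with a live stream.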
5. Recognition Accuracy and Robustness
Empirical evaluation across standard ASR corpora (ESIC, LibriSpeech, TED-Lium, Rev16) shows that Whispy achieves a WER within 1–2 percentage points (pp) of offline Whisper, except on very noisy streams (e.g., Rev16, where a 5–7 pp gap appears). Statistical testing (Wilcoxon test) indicates no significant difference in word accuracy between Whispy and standard Whisper for most datasets.
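For readers unfamiliar with the metric, the sketch below shows how WER and a percentage-point gap between two systems are typically computed. The sentences are made-up toy data, not results from the paper, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        new_row = [i]
        for j, h in enumerate(hyp, start=1):
            new_row.append(min(row[j] + 1,              # deletion
                               new_row[j - 1] + 1,      # insertion
                               row[j - 1] + (r != h)))  # substitution or match
        row = new_row
    return row[-1] / len(ref)


reference = "the cat sat on the mat"
offline = "the cat sat on the mat"     # toy offline output
streaming = "the cat sit on mat"       # toy streaming output: 1 substitution, 1 deletion

gap_pp = 100 * (word_error_rate(reference, streaming) - word_error_rate(reference, offline))
print(f"WER gap: {gap_pp:.1f} pp")  # WER gap: 33.3 pp
```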
Critical ablations include:
- Disabling the suggestion/Levenshtein step increases WER by approximately 3–5 pp.
- Disabling pre-chunk VAD increases compute by approximately 15% (due to silent region transcription) with no corresponding accuracy improvement.
- Larger chunk sizes or buffer sizes expose a latency/accuracy trade-off: increasing chunk length reduces WER by encoding more context but raises system latency.
6. Practical Integration and Deployment
Whispy exposes several tunable parameters to balance latency, accuracy, and compute utilization:
- For end-to-end latency below 500 ms, choose a short chunk length and a small buffer.
- For optimal WER (<10%), a longer chunk length and a larger buffer are recommended.
- Aggressive VAD settings (high threshold) suppress silence errors but may miss words, while conservative settings introduce slight latency increases.
- Beyond a certain buffer size, WER improvements diminish and alignment may be impaired.
Whispy is deployed as an HTTP+WebSocket service, compatible with RTP/FFmpeg and readily integrable into WebRTC stacks (Bevilacqua et al., 2024).
7. Relationship to Related Work and Methodological Distinctions
Whispy is distinguished by its strict non-intrusiveness with respect to the Whisper network: it introduces no architectural or parameter-level changes to Whisper and implements no sparse attention or other transformer optimizations. Its novelty resides in the streaming pipeline and the string-alignment-based suggestion mechanism, which maintains context in an online fashion. In comparison, work such as Bangla-WhisperDiar (Bhuiyan et al., 6 May 2026) fine-tunes Whisper variants for a specific language and for long-form ASR, frequently involving substantial data augmentation, parameter updates, and external model modifications. Whispy, by contrast, relies solely on external data flow and pre-/post-processing optimizations, resulting in near-original model accuracy in low-latency, real-time scenarios.
Table: Key Parameters and Operational Trade-offs
| Parameter | Effect on System | Typical Value |
|---|---|---|
| Chunk length ($L_c$) | Shorter: ↓ latency, ↑ WER; longer: ↑ latency, ↓ WER | 2–4 s |
| Buffer size ($N$) | ↑ Context, ↑ latency, marginal WER reduction | 3–5 chunks |
| VAD aggressiveness | ↑ Silence suppression, potential ↑ word omission | High/Medium |
System performance exhibits sensitivity to these hyperparameters within the design constraints of hardware and application context.
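To make the table concrete, the sketch below bundles the three parameters into a small configuration object and derives the window length each setting implies. `StreamConfig` and both presets are hypothetical; the preset values are simply the ends of the typical ranges in the table, not configurations recommended by the authors.

```python
from dataclasses import dataclass

WHISPER_SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz audio


@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical container for the tunable parameters in the table."""
    chunk_seconds: float = 4.0
    n_chunks: int = 5
    vad_aggressiveness: str = "medium"

    @property
    def window_seconds(self) -> float:
        """Audio re-transcribed on every step."""
        return self.chunk_seconds * self.n_chunks

    @property
    def window_samples(self) -> int:
        return int(self.window_seconds * WHISPER_SAMPLE_RATE)


# Illustrative presets built from the ends of the typical ranges.
LOW_LATENCY = StreamConfig(chunk_seconds=2.0, n_chunks=3, vad_aggressiveness="high")
ACCURACY = StreamConfig(chunk_seconds=4.0, n_chunks=5)

for name, cfg in [("low latency", LOW_LATENCY), ("accuracy", ACCURACY)]:
    print(f"{name}: {cfg.window_seconds:.0f} s window, {cfg.window_samples} samples")
# low latency: 6 s window, 96000 samples
# accuracy: 20 s window, 320000 samples
```

Both windows stay below the 30 s input segment that Whisper processes at a time, so each re-transcription fits in a single model pass.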
References:
- Whispy: Adapting STT Whisper Models to Real-Time Environments (Bevilacqua et al., 2024)
- Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization (Bhuiyan et al., 6 May 2026)