
AF-Whisper: Neural Whisper Detection

Updated 14 July 2025
  • AF-Whisper is a neural whisper detection system that leverages LSTM architecture and engineered acoustic features to accurately distinguish whispered from normal speech in far-field applications.
  • It processes 64-dimensional log-filterbank energy features along with SRH, HFE, and ACMAX to capture unique spectral and temporal characteristics of whisper speech.
  • Benchmarking results show significant gains in frame accuracy and whisper recall, underscoring the system's robust performance in both lab-controlled and real-world environments.

AF-Whisper refers to a neural whisper detection system designed to distinguish whispered speech from normal phonation, especially in the context of far-field, voice-controlled devices. The system employs a Long Short-Term Memory (LSTM) architecture trained on log-filterbank energy (LFBE) features, with further enhancement provided by a suite of engineered acoustic features that capture the unique characteristics of whisper speech. AF-Whisper has been rigorously benchmarked against multilayer perceptron (MLP) models and various feature configurations, demonstrating state-of-the-art performance on both in-house and real-world datasets (1809.07832).

1. LSTM Neural Network Architecture

The core of AF-Whisper is a sequential LSTM neural network, which processes frame-level acoustic features:

  • Input: Each frame is represented by a 64-dimensional LFBE feature.
  • Network Structure: The architecture consists of two hidden LSTM layers, each with 64 memory cells.
  • Output Layer: Final predictions are made via a 2-dimensional softmax corresponding to "whisper" and "normal" classes.
  • Training: The model is trained by minimizing cross-entropy loss using stochastic gradient descent, employing backpropagation through time (BPTT) to capture both short- and long-term temporal dependencies.

This configuration enables AF-Whisper to exploit the sequential nature of audio data and to model the context-dependent differences between whisper and normal speech.
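
As a concrete sketch, the forward pass of this architecture can be written in a few lines of numpy. This is an illustrative, untrained model (random weights, hypothetical `LSTMLayer` helper), not the system's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class LSTMLayer:
    """Minimal LSTM layer (forward pass only) with `n` memory cells."""
    def __init__(self, n_in, n, seed=0):
        rng = np.random.default_rng(seed)
        # Stacked weights for the input, forget, cell, and output gates.
        self.W = rng.normal(0, 0.1, (4 * n, n_in + n))
        self.b = np.zeros(4 * n)
        self.n = n

    def forward(self, X):
        h, c, out = np.zeros(self.n), np.zeros(self.n), []
        for x in X:                        # one 64-dim LFBE frame per step
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            out.append(h)
        return np.array(out)

# Two hidden LSTM layers with 64 cells each, then a 2-way softmax
# ("whisper" vs. "normal") over every frame.
layer1, layer2 = LSTMLayer(64, 64, seed=0), LSTMLayer(64, 64, seed=1)
W_out = np.random.default_rng(2).normal(0, 0.1, (2, 64))

frames = np.random.default_rng(3).normal(size=(100, 64))   # 100 LFBE frames
posteriors = softmax(layer2.forward(layer1.forward(frames)) @ W_out.T)
```

In training, the cross-entropy loss over these frame posteriors would be minimized with SGD and BPTT; only the inference path is shown here.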

2. Log-Filterbank Energy (LFBE) and Engineered Acoustic Features

Log-Filterbank Energy (LFBE)

LFBE features are the primary input representation for the network:

  • Extraction: 25 ms analysis frames with a 10 ms frame shift, producing a 64-dimensional vector per frame.
  • Channel Mean Subtraction (CMS): Applied on a per-speaker/device basis.
  • Role: LFBE captures the spectral energy distribution; in whisper speech, this distribution highlights reduced energy in lower frequency bands, a key differentiator from normal phonation.
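
A self-contained sketch of this front end, using a standard mel filterbank built from scratch (parameter choices such as `n_fft=512` are assumptions for illustration, not values from the source):

```python
import numpy as np

def lfbe(frame, sr=16000, n_mels=64, n_fft=512, eps=1e-10):
    """64-dim log mel-filterbank energies for one 25 ms frame (400 samples at 16 kHz)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    mel = lambda f: 2595 * np.log10(1 + f / 700)           # Hz -> mel
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)          # mel -> Hz
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):                                # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return np.log(fb @ spec + eps)

# Frame a 16 kHz signal with 25 ms windows and a 10 ms shift,
# then subtract the per-recording channel mean (CMS).
sig = np.random.default_rng(0).normal(size=16000)          # 1 s of noise
win, hop = 400, 160
feats = np.array([lfbe(sig[s:s + win]) for s in range(0, len(sig) - win, hop)])
feats -= feats.mean(axis=0)                                # channel mean subtraction
```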

Engineered Features

Three domain-specific features are introduced to enhance discrimination:

  • Sum of Residual Harmonics (SRH)

$$SRH(f) = E(f) + \sum_{k=2}^{N_{\mathrm{harm}}} \left[ E(kf) - E\!\left(\left(k - \tfrac{1}{2}\right) f\right) \right]$$

where $E(f)$ is the amplitude spectrum. SRH functions as a voicing detector, which is critical since whispered speech lacks the periodic (F0) excitation of normal phonation.
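
A small numpy sketch of the SRH computation, with frequencies indexed in FFT bins over a synthetic spectrum (the `srh` helper and its defaults are illustrative, not the paper's code):

```python
import numpy as np

def srh(spectrum, f, n_harm=5):
    """Sum of Residual Harmonics at a candidate frequency f (in FFT bins):
    harmonics at k*f add evidence, inter-harmonics at (k - 1/2)*f subtract
    it, so voiced frames score high and aperiodic whispers score near zero."""
    s = spectrum[f]
    for k in range(2, n_harm + 1):
        s += spectrum[k * f] - spectrum[int((k - 0.5) * f)]
    return s

# Toy "voiced" amplitude spectrum: sharp harmonics at multiples of bin 20.
E = np.full(512, 0.01)
E[[20, 40, 60, 80, 100]] = 1.0
assert srh(E, 20) > srh(E, 33)   # the true F0 candidate scores far higher
```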

  • High Frequency Energy (HFE)
    • Ratio of energy between 6875–8000 Hz and 310–620 Hz bands.
    • Shannon entropy over the low-frequency region; higher for whisper speech due to flatter spectral energy.
  • Auto-Correlation Peak Maximum (ACMAX)
    • Maximum peak (lag and value) and mean peak distance within 80–450 Hz autocorrelation range, reflecting the absence of periodicity in whispers.

These engineered features augment the standard LFBE by explicitly encoding properties that are indicative of whispered speech.
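
HFE and ACMAX can likewise be sketched in numpy. This is a simplified illustration (ACMAX here keeps only the maximum autocorrelation peak, omitting the mean-peak-distance term; the synthetic signals and helper names are assumptions):

```python
import numpy as np

sr = 16000
t = np.arange(400) / sr                                    # one 25 ms frame

# Periodic "voiced" frame (150 Hz with decaying harmonics) vs. noise-like whisper.
voiced = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 54))
whisper = np.random.default_rng(0).normal(size=400)

def band_energy(spec, lo, hi):
    """Energy of the amplitude spectrum between lo and hi Hz."""
    res = sr / (2 * (len(spec) - 1))                       # Hz per FFT bin
    return np.sum(spec[int(lo / res):int(hi / res) + 1] ** 2)

def hfe(frame):
    """High-frequency energy ratio: 6875-8000 Hz band over 310-620 Hz band."""
    spec = np.abs(np.fft.rfft(frame))
    return band_energy(spec, 6875, 8000) / (band_energy(spec, 310, 620) + 1e-10)

def acmax(frame, f_lo=80, f_hi=450):
    """Largest normalized autocorrelation peak at lags of plausible F0."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-10)
    lags = np.arange(sr // f_hi, sr // f_lo + 1)           # ~35..200 samples
    best = lags[np.argmax(ac[lags])]
    return ac[best], sr / best                             # (peak value, implied F0)
```

On these toy frames the whisper shows a larger HFE ratio (flatter spectrum) and a much weaker ACMAX peak (no periodicity), matching the intuition behind both features.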

3. Utterance-Level Inference Strategies

Because the LSTM produces frame-level posteriors, AF-Whisper investigates several schemes for robust utterance-level classification:

  • Last-frame: Uses the posterior from the final frame.
  • Window-N: Averages posteriors over the last N frames.
  • Mean: Averages over all frames in the utterance.
  • Silence Removal: "Ignore-last-50" approach discards possible trailing silence frames to improve robustness.

Empirical results indicate that averaging over the entire utterance, especially with silence-pruning, yields the highest recall, as relying solely on the final frame can lead to errors caused by trailing silences.
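
These pooling schemes amount to a few lines over the (T, 2) matrix of frame posteriors; a minimal sketch with an illustrative trailing-silence example (function and strategy names are assumptions, not the paper's API):

```python
import numpy as np

def utterance_score(posteriors, strategy="mean", n=50):
    """Pool frame-level posteriors (T x 2, [normal, whisper]) into one score."""
    p = posteriors[:, 1]                       # whisper posterior per frame
    if strategy == "last":                     # last-frame only
        return p[-1]
    if strategy == "window":                   # average of the last n frames
        return p[-n:].mean()
    if strategy == "ignore-last-50":           # drop likely trailing silence
        return p[:-50].mean() if len(p) > 50 else p.mean()
    return p.mean()                            # default: mean over all frames

# A whispered utterance followed by 60 frames of trailing silence where the
# whisper posterior collapses: last-frame fails, silence-pruned mean does not.
p = np.concatenate([np.full(200, 0.9), np.full(60, 0.1)])
post = np.stack([1 - p, p], axis=1)
```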

4. Benchmarking and Performance

AF-Whisper’s performance is evaluated on both lab-controlled and real-world ("live traffic") datasets:

  • Frame Accuracy: With LFBE-only features, LSTM achieves 93.5% (vs. 77.1% for MLP). Inclusion of engineered features raises this to 96.0%.
  • Whisper Recall (in-house): Increases from 97.4% (LFBE) to 99.3% (LFBE + engineered).
  • False-Positive Rate: On live traffic data, using engineered features reduces FPR from 0.2% to 0.1%.
  • Comparative Insight: LSTM trained on LFBE alone matches or exceeds MLP with LFBE plus engineered features, reflecting the LSTM’s strength in modeling temporal dependencies. Engineered features nonetheless provide complementary gains.

Detailed quantitative metrics, including F1 scores, are tabulated and directly reported in the source.

5. Data Requirements and Variability

AF-Whisper’s development and evaluation utilize:

  • In-House Dataset: ~28,000 utterances, split into 23,000 for training/validation and 5,000 for testing, with strict speaker separation.
  • Live Traffic Dataset: 30,000 recorded utterances for evaluation, ensuring coverage of real-world acoustic conditions.
  • No Explicit Augmentation: Robustness is achieved through diversity in recording environments and phonation types, rather than synthetic data augmentation.
  • Standardization: All audio is processed at 16 kHz with channel mean subtraction, ensuring consistency.

This diverse data exposure is critical for real-world performance, especially in far-field, device-directed scenarios.

6. Incremental Value of Engineered Features

While LSTM models are capable of implicitly learning whisper characteristics from LFBE, the engineered features offer:

  • Absolute Error Reduction: Frame accuracy improves by 2.5 percentage points (93.5% to 96.0%); whisper recall gains approach 2 points.
  • Specificity: Clarify ambiguous cases—such as distinguishing whisper from silence or noise—where LFBE alone may be insufficient.
  • Example: Incorporating SRH helps the detector rule out low-energy, unvoiced non-speech as whisper due to unique harmonic absence in whispering.

Hence, the integration of domain knowledge via engineered features is validated as a significant enhancer of performance, especially in challenging acoustic settings.

7. System Implications and Broader Context

AF-Whisper exemplifies the integration of deep sequential modeling with feature engineering in acoustic event detection:

  • Practical Utility: Enables voice-controlled devices to recognize whispered commands, enhancing accessibility in noise-sensitive or privacy-preserving contexts.
  • System Workflow: Acoustic feature extraction (LFBE, SRH, HFE, ACMAX) → sequential LSTM processing → inference module for utterance-level classification.
  • Design Trade-Offs: LSTMs trade greater temporal modeling capacity for increased computational cost over MLPs, but this is justified by significant performance gains.
  • Future Expansion: Although not discussed in the source, the robust performance demonstrated suggests that extensions to other nonstandard phonation types or low-energy speech events could be plausible, leveraging the same engineering plus sequential modeling paradigm.

Overall, AF-Whisper represents a rigorously evaluated, feature-rich, LSTM-based system for the reliable detection of whispered speech in real-world environments, supported by a comprehensive benchmarking methodology and explicit feature engineering for maximal accuracy (1809.07832).
