iMTI-Net: Multi-target Speech Intelligibility
- The paper introduces iMTI-Net, leveraging uncertainty-aware Whisper embeddings and sLSTM to achieve superior non-intrusive speech intelligibility prediction.
- It integrates spectral, acoustic, and statistical uncertainty features to enable robust multitask learning from human scores and ASR error metrics.
- Empirical results show iMTI-Net outperforming the original MTI-Net on human intelligibility prediction and reaching high correlations on ASR error-rate and STOI targets, with improved sensitivity across diverse speech conditions.
iMTI-Net is an improved multi-target intelligibility prediction network designed for non-intrusive assessment of speech intelligibility. Leveraging uncertainty-aware Whisper embeddings, convolutional neural networks (CNNs), and a scalar long short-term memory (sLSTM) architecture within a multitask learning framework, iMTI-Net predicts both human and machine-based measures of intelligibility, including character error rates (CER) obtained from large-vocabulary automatic speech recognition (ASR) models. The architecture provides consistent improvements over the original MTI-Net across several evaluation metrics.
1. Architectural Overview
iMTI-Net integrates three complementary feature extraction modules and advances sequential modeling with a CNN-sLSTM backbone. The model processes a speech waveform via:
- Spectral Features: Extraction using the short-time Fourier transform (STFT) for time-frequency domain representations.
- Acoustic Features: Derived from learnable filter banks (LFB) implemented as a sinc-based convolutional network, yielding features learned directly from the waveform that complement the fixed STFT representation.
- Uncertainty-aware Whisper Features: Embeddings from the Whisper model are further distilled into statistical summaries—mean, standard deviation, and entropy—to provide a richer, uncertainty-informed representation.
After concatenation and adaptation (adapter network), these features are passed through the sLSTM for temporal modeling. The network features separate branches for multitask outputs, including human intelligibility scores and machine-based ASR error rates.
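The spectral branch described above can be sketched as follows. This is a minimal illustration, not the paper's front end: the function name `stft_magnitude`, the 512-sample Hann window, the 256-sample hop, and the log compression are all assumptions chosen for the example.

```python
import numpy as np

def stft_magnitude(wave, n_fft=512, hop=256):
    """Log-magnitude STFT of a 1-D waveform, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping frames and apply the window.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Real FFT per frame, then magnitude with log compression.
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spec)

# One second of a 440 Hz tone at 16 kHz as a stand-in for speech.
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
spec = stft_magnitude(wave)  # time-frequency features, one row per frame
```

Each row of `spec` is one time frame, so the array can be concatenated frame by frame with other feature streams that share the same frame rate.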
2. Uncertainty-Aware Feature Integration
A core innovation in iMTI-Net is the extraction of statistical features from Whisper embeddings to model uncertainty, in addition to global context:
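For each time frame $t$ and embedding dimension $d \in \{1, \dots, D\}$,
- $e_{t,d}$ is the Whisper embedding.
- Mean: $\mu_t = \frac{1}{D} \sum_{d=1}^{D} e_{t,d}$
- Standard deviation: $s_t = \sqrt{\frac{1}{D} \sum_{d=1}^{D} (e_{t,d} - \mu_t)^2}$
- Entropy: $H_t = -\sum_{d=1}^{D} p_{t,d} \log p_{t,d}$, where $p_{t,d}$ is $e_{t,d}$ normalized to a probability distribution over $d$ (for example, with a softmax).
The feature vector for each frame is $u_t = [\mu_t, s_t, H_t]$. An adapter layer maps $u_t$ to $\tilde{u}_t$ for downstream processing, and concatenating the adapted representations with the CNN-extracted features forms the fused input $x_t$.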
This approach complements deterministic embedding features with uncertainty proxies, supporting more robust learning under varying acoustic and speaker conditions.
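The per-frame statistics described in this section can be sketched in a few lines of NumPy. The function name `uncertainty_stats`, the random stand-in embeddings, and the softmax normalization used before the entropy are assumptions made for illustration; the source does not specify how the embedding is turned into a distribution.

```python
import numpy as np

def uncertainty_stats(emb):
    """emb: (T, D) frame-level embeddings -> (T, 3) array of [mean, std, entropy]."""
    mean = emb.mean(axis=1)
    std = emb.std(axis=1)
    # Softmax over the embedding dimension (assumed normalization),
    # shifted by the row maximum for numerical stability.
    shifted = emb - emb.max(axis=1, keepdims=True)
    p = np.exp(shifted)
    p /= p.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    return np.stack([mean, std, entropy], axis=1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))   # stand-in for Whisper embeddings (T=5, D=8)
emb[0] = 0.0                    # constant frame: softmax is uniform
stats = uncertainty_stats(emb)
```

A constant frame gives the maximum possible entropy, log(D), while a frame dominated by a few large components gives a lower value, which is the sense in which entropy acts as an uncertainty proxy here.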
3. Temporal Modeling with Scalar LSTM (sLSTM)
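To capture long-range dependencies in speech, iMTI-Net employs a scalar LSTM (sLSTM), which introduces a normalization state $n_t$ in addition to the standard cell state $c_t$. This design mitigates gradient explosion and stabilizes the propagation of long-term information.
The sLSTM cell can be described as follows (for input $x_t$ and previous hidden state $h_{t-1}$):
- $\tilde{z}_t = W_z x_t + R_z h_{t-1} + b_z$, $z_t = \varphi(\tilde{z}_t)$,
- $\tilde{i}_t = W_i x_t + R_i h_{t-1} + b_i$, $i_t = \exp(\tilde{i}_t)$,
- $\tilde{f}_t = W_f x_t + R_f h_{t-1} + b_f$, $f_t = \sigma(\tilde{f}_t)$,
- $\tilde{o}_t = W_o x_t + R_o h_{t-1} + b_o$, $o_t = \sigma(\tilde{o}_t)$,
- $c_t = f_t \odot c_{t-1} + i_t \odot z_t$,
- $n_t = f_t \odot n_{t-1} + i_t$,
- $\tilde{h}_t = c_t / n_t$,
- $h_t = o_t \odot \tilde{h}_t$.
Here, $\varphi$ is the hyperbolic tangent function, $\sigma$ is the sigmoid function, the input gate uses the exponential activation of the xLSTM formulation of sLSTM, and the gating parameters $W$, $R$, and $b$ are learned.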
This sLSTM variant achieves efficient long-memory dynamics and robust temporal modeling for intelligibility prediction tasks.
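A single sLSTM step with a cell state and a normalizer state can be sketched as below. This is a simplified illustration rather than the paper's implementation: it assumes an exponential input gate and a sigmoid forget gate, as in the xLSTM description of sLSTM, and it omits the log-domain stabilizer state that practical implementations use to keep the exponential from overflowing. The names `slstm_step` and `params` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slstm_step(x, h_prev, c_prev, n_prev, params):
    """One sLSTM step. params maps a gate name to its (W, R, b) triple."""
    pre = {g: W @ x + R @ h_prev + b for g, (W, R, b) in params.items()}
    z = np.tanh(pre["z"])      # cell input
    i = np.exp(pre["i"])       # exponential input gate (assumption, xLSTM-style)
    f = sigmoid(pre["f"])      # forget gate
    o = sigmoid(pre["o"])      # output gate
    c = f * c_prev + i * z     # cell state
    n = f * n_prev + i         # normalizer state
    h = o * (c / n)            # hidden state from the normalized cell state
    return h, c, n

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
params = {
    g: (0.1 * rng.normal(size=(d_hid, d_in)),
        0.1 * rng.normal(size=(d_hid, d_hid)),
        np.zeros(d_hid))
    for g in "zifo"
}
h = c = n = np.zeros(d_hid)
for x in rng.normal(size=(10, d_in)):   # run over a 10-frame toy sequence
    h, c, n = slstm_step(x, h, c, n, params)
```

Because `c` and `n` accumulate the same positive gate weights and the cell input lies in (-1, 1), the ratio `c / n` stays bounded in magnitude by 1 regardless of sequence length, which is the stabilizing role of the normalizer state.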
4. Multitask Learning Framework
iMTI-Net is trained end-to-end in a multitask fashion, jointly predicting:
- Human intelligibility scores, obtained from subjective listening tests,
- The objective intelligibility metric STOI,
- Machine-based intelligibility metrics, specifically character error rates (CER) from two distinct ASR systems: Google ASR and Whisper.
Each target is produced by a dedicated output head (attention, fully connected layers, global pooling). The training objective is a weighted sum of losses for each task:
$$\mathcal{L} = \sum_{k} \lambda_k \mathcal{L}_k$$
where $\mathcal{L}_k$ is the loss for task $k$ and $\lambda_k$ are scalar coefficients balancing the loss contributions.
This multitask setup leverages correlated intelligibility signals from both human and machine targets, resulting in shared feature representations that are beneficial for non-intrusive assessment.
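The weighted-sum objective can be illustrated with a small NumPy sketch. The per-task loss is taken to be mean squared error here, which is an assumption; the task names, example values, and weights are hypothetical and not taken from the paper.

```python
import numpy as np

def multitask_loss(preds, targets, weights):
    """Weighted sum of per-task MSE losses; all arguments are dicts keyed by task."""
    per_task = {
        k: float(np.mean((preds[k] - targets[k]) ** 2)) for k in preds
    }
    total = sum(weights[k] * per_task[k] for k in per_task)
    return total, per_task

# Two utterances, two tasks (hypothetical values in [0, 1]).
targets = {"human": np.array([0.9, 0.4]), "cer_whisper": np.array([0.1, 0.6])}
preds = {"human": np.array([0.8, 0.5]), "cer_whisper": np.array([0.1, 0.4])}
weights = {"human": 1.0, "cer_whisper": 0.5}

total, per_task = multitask_loss(preds, targets, weights)
```

Returning the per-task losses alongside the total makes it easier to see whether one target dominates training and to adjust the coefficients accordingly.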
5. Empirical Performance and Comparative Results
Experimental results indicate that iMTI-Net improves on the original MTI-Net for human intelligibility prediction and reaches high correlations on the CER and STOI targets, for which baseline values are not listed:
Metric | MTI-Net (Baseline) | iMTI-Net (CNN-BLSTM) | iMTI-Net (CNN-sLSTM) |
---|---|---|---|
Human Intelligibility | LCC: 0.7630, SRCC: 0.7071 | LCC: 0.7670, SRCC: 0.7110 | LCC: 0.7817, SRCC: 0.7622 |
Whisper CER | LCC: Not listed | LCC: 0.8151 | LCC: 0.8105, SRCC: 0.8222 |
Google CER | LCC: Not listed | LCC: 0.8418 | LCC: 0.8505, SRCC: 0.8403 |
STOI | LCC: Not listed | LCC: 0.9013 | LCC: 0.9051, SRCC: 0.9150 |
- LCC: Linear Correlation Coefficient
- SRCC: Spearman’s Rank Correlation Coefficient
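For reference, both correlation measures can be computed with NumPy alone, as in the sketch below. The helper names `lcc`, `rank`, and `srcc` and the toy scores are illustrative; SRCC is computed as the Pearson correlation of average ranks, which is its standard definition.

```python
import numpy as np

def lcc(x, y):
    """Linear (Pearson) correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def rank(v):
    """1-based ranks; tied values share the average of their ranks."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):
        tied = v == val
        ranks[tied] = ranks[tied].mean()
    return ranks

def srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return lcc(rank(x), rank(y))

true = np.array([0.1, 0.4, 0.5, 0.9])   # reference scores (toy values)
pred = true ** 3                        # monotonic but nonlinear predictions
linear = lcc(true, pred)
monotonic = srcc(true, pred)
```

In this toy case the predictions preserve the ordering of the references exactly, so SRCC is 1 up to floating-point error, while LCC is lower because the relationship is not linear. This is why the two measures are reported side by side.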
iMTI-Net, especially the CNN-sLSTM variant, attains the highest correlation with reference targets in nearly every case (the CNN-BLSTM variant is marginally higher only on Whisper CER LCC, 0.8151 vs. 0.8105) and is reported to achieve lower mean squared error (MSE) across all tasks. Qualitative analysis indicates that iMTI-Net predictions are more evenly distributed, suggesting improved sensitivity to the full spectrum of intelligibility conditions.
A plausible implication is that uncertainty modeling via Whisper embeddings and sLSTM temporal dynamics improves sensitivity to edge cases—i.e., highly degraded or very clear speech—not just mid-range conditions.
6. Design Significance and Context
iMTI-Net’s enhancements are technically significant in several respects:
- Integrating uncertainty-aware statistics with learned ASR representations adds robustness to out-of-distribution and noisy conditions, a common challenge in practical speech assessment.
- The multitask paradigm enables convergent learning from correlated but distinct intelligibility signals, leveraging both subjective (human) and objective (ASR) metrics.
- The use of an sLSTM, as opposed to conventional LSTM or BLSTM, provides normalization-driven stabilization for long-sequence tasks, which is critical for speech assessment scenarios where utterance duration and structure vary widely.
These architectural choices align with broader trends in speech processing, where uncertainty modeling and cross-domain multitask learning have become increasingly prominent.
7. Prospective Applications and Extensions
Within non-intrusive intelligibility assessment, iMTI-Net is positioned for deployment in:
- Automatic evaluation of speech enhancement or dereverberation systems,
- Real-time quality monitoring in VoIP or telecommunications platforms,
- Assessment tools for disordered speech in clinical or rehabilitation contexts,
- Robust post-processing for ASR lattice selection and confidence scoring in adverse environments.
This suggests broader impact in both human-centric and machine-assisted quality assessment systems. A plausible extension is adaptation for regression of other paralinguistic speech attributes, leveraging the uncertainty-aware multitask approach demonstrated by iMTI-Net.
In summary, iMTI-Net represents a distinct advancement in multitarget speech intelligibility prediction by unifying uncertainty-aware ASR features, CNN-sLSTM temporal modeling, and multitask optimization within a robust, empirically validated framework (Zezario et al., 3 Sep 2025).