iMTI-Net: Multi-target Speech Intelligibility

Updated 4 September 2025

The paper introduces iMTI-Net, leveraging uncertainty-aware Whisper embeddings and sLSTM to achieve superior non-intrusive speech intelligibility prediction.
It integrates spectral, acoustic, and statistical uncertainty features to enable robust multitask learning from human scores and ASR error metrics.
Empirical results show iMTI-Net outperforms previous models with higher correlations on measures like WER and STOI, enhancing sensitivity across diverse speech conditions.

iMTI-Net is an improved multi-target intelligibility prediction network designed for non-intrusive assessment of speech intelligibility. Leveraging uncertainty-aware Whisper embeddings, convolutional neural networks (CNNs), and a scalar long short-term memory (sLSTM) architecture within a multitask learning framework, iMTI-Net predicts both human and machine-based measures of intelligibility, including word error rates (WER) obtained from large-vocabulary automatic speech recognition (ASR) models. The architecture provides consistent improvements over the original MTI-Net across several evaluation metrics.

1. Architectural Overview

iMTI-Net integrates three complementary feature extraction modules and advances sequential modeling with a CNN-sLSTM backbone. The model processes a speech waveform $Y$ via:

Spectral Features: Extraction using the short-time Fourier transform (STFT) for time-frequency domain representations.
Acoustic Features: Derived from learnable filter banks using a sinc-based convolutional network (LFB), yielding features $C$ that capture additional signal characteristics.
Uncertainty-aware Whisper Features: Embeddings from the Whisper model are further distilled into statistical summaries—mean, standard deviation, and entropy—to provide a richer, uncertainty-informed representation.

After concatenation and adaptation by an adapter network, the combined features $\tilde{x}_t$ are passed through the sLSTM for temporal modeling. Separate output branches then produce the multitask predictions, including human intelligibility scores and machine-based ASR error rates.
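The spectral branch starts from an STFT, which can be illustrated with a minimal pure-Python sketch. The function name `stft_magnitude`, the Hann window, and the frame and hop sizes are assumptions chosen for illustration; the source does not state the analysis parameters actually used.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=16, hop=8):
    """Return a list of magnitude spectra, one per Hann-windowed frame."""
    # Periodic Hann window (an assumed, commonly used analysis window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Direct DFT of the windowed frame at frequency bin k.
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames


# Toy input: a sinusoid centred on bin 3 of a 16-point frame.
toy = [math.sin(2 * math.pi * 3 * n / 16) for n in range(64)]
spec = stft_magnitude(toy)
```

Each entry of `spec` is one time frame of the time-frequency representation. A practical system would use an FFT routine and much longer frames; the direct DFT here only keeps the sketch dependency-free.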

2. Uncertainty-Aware Feature Integration

A core innovation in iMTI-Net is the extraction of statistical features from Whisper embeddings to model uncertainty, in addition to global context:

For each time frame $t$, with $D$ denoting the embedding dimension:

$E_t \in \mathbb{R}^D$ is the Whisper embedding.
Mean: $\mu_t = \frac{1}{D} \sum_{d=1}^D E_{t,d}$
Standard deviation: $\sigma_t = \sqrt{ \frac{1}{D} \sum_{d=1}^D (E_{t,d} - \mu_t)^2 }$
Entropy: $p_t = \text{softmax}(E_t)$, $h_t = -\sum_{d=1}^D p_{t,d} \log p_{t,d}$

The feature vector for each frame is $x_t = [E_t; \mu_t; \sigma_t; h_t] \in \mathbb{R}^{D+3}$. An adapter layer maps $x_t$ for downstream processing, and the adapted representation is concatenated with the CNN-extracted features to form $\tilde{x}_t$.
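The per-frame statistics above can be sketched in plain Python as follows. The function name `uncertainty_features` and the toy embeddings are hypothetical; only the three formulas come from the text.

```python
import math


def uncertainty_features(embedding):
    """Append mean, std and softmax entropy to one frame's embedding E_t."""
    d = len(embedding)
    mean = sum(embedding) / d
    # Population standard deviation (divide by D, as in the formula above).
    std = math.sqrt(sum((e - mean) ** 2 for e in embedding) / d)
    # Softmax over the embedding dimension; subtracting the max avoids
    # overflow and does not change the result.
    peak = max(embedding)
    exps = [math.exp(e - peak) for e in embedding]
    total = sum(exps)
    probs = [v / total for v in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return list(embedding) + [mean, std, entropy]


flat = uncertainty_features([0.5, 0.5, 0.5, 0.5])     # uniform softmax
peaked = uncertainty_features([10.0, 0.0, 0.0, 0.0])  # one dominant dimension
```

A flat embedding gives the maximum entropy $\log D$, while an embedding dominated by one dimension gives an entropy near zero, which is the sense in which $h_t$ acts as an uncertainty proxy.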

This approach complements deterministic embedding features with uncertainty proxies, supporting more robust learning under varying acoustic and speaker conditions.

3. Temporal Modeling with Scalar LSTM (sLSTM)

To capture long-range dependencies in speech, iMTI-Net employs a scalar LSTM (sLSTM), which introduces a normalization state $n_t$ in addition to the standard cell state $c_t$ . This design mitigates gradient explosion and stabilizes the propagation of long-term information.

The sLSTM cell can be described as follows for input $\tilde{x}_t$ and previous hidden state $h_{t-1}$ (in this section $h$ denotes the hidden state, not the entropy $h_t$ of Section 2):

$\tilde{z}_t = w_z^\top \tilde{x}_t + r_z h_{t-1} + b_z$ , $z_t = \phi(\tilde{z}_t)$ ,
$\tilde{i}_t = w_i^\top \tilde{x}_t + r_i h_{t-1} + b_i$ , $i_t = \exp(\tilde{i}_t)$ ,
$\tilde{f}_t = w_f^\top \tilde{x}_t + r_f h_{t-1} + b_f$ , $f_t = \{\exp(\tilde{f}_t)\ \text{or}\ \sigma(\tilde{f}_t)\}$ ,
$\tilde{o}_t = w_o^\top \tilde{x}_t + r_o h_{t-1} + b_o$ , $o_t = \sigma(\tilde{o}_t)$ ,
$c_t = f_t \cdot c_{t-1} + i_t \cdot z_t$ ,
$n_t = f_t \cdot n_{t-1} + i_t$ ,
$\tilde{h}_t = c_t / n_t$ ,
$h_t = o_t \cdot \tilde{h}_t$ .

Here, $\phi$ is the hyperbolic tangent function, $\sigma$ is the sigmoid function, and the gating parameters $(w_*, r_*, b_*)$ are learned.
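One step of these recurrences, for a single scalar cell, might look like the sketch below. The function names, the zero initial states, the choice of the sigmoid option for the forget gate, and all parameter values are assumptions made for illustration, not details taken from the paper.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def slstm_step(x, h_prev, c_prev, n_prev, params):
    """One step of a single scalar sLSTM cell, following the equations above.

    params maps each gate name ("z", "i", "f", "o") to (w, r, b): an input
    weight vector, a scalar recurrent weight and a scalar bias.
    """
    pre = {}
    for gate, (w, r, b) in params.items():
        pre[gate] = sum(wk * xk for wk, xk in zip(w, x)) + r * h_prev + b
    z = math.tanh(pre["z"])      # cell input, phi = tanh
    i = math.exp(pre["i"])       # exponential input gate
    f = sigmoid(pre["f"])        # sigmoid option for the forget gate
    o = sigmoid(pre["o"])        # output gate
    c = f * c_prev + i * z       # cell state
    n = f * n_prev + i           # normalizer state
    h = o * (c / n)              # normalized, gated hidden state
    return h, c, n


# Hypothetical toy parameters and inputs, chosen only for illustration.
params = {
    "z": ([0.5, -0.3], 0.1, 0.0),
    "i": ([0.2, 0.4], -0.2, 0.0),
    "f": ([-0.1, 0.3], 0.05, 1.0),
    "o": ([0.3, 0.3], 0.0, 0.0),
}
inputs = [[1.0, 0.0], [0.5, -1.0], [-2.0, 0.7], [0.0, 0.0]]
h, c, n = 0.0, 0.0, 0.0          # assumed zero initial states
hidden = []
for x in inputs:
    h, c, n = slstm_step(x, h, c, n, params)
    hidden.append(h)
```

With zero initial states, $c_t / n_t$ is a weighted average of past $z$ values, so its magnitude stays below 1 even though the input gate is unbounded. The raw `math.exp` can still overflow for large pre-activations, so practical implementations typically compute the gates in a numerically stabilized form, which this sketch omits.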

This sLSTM variant achieves efficient long-memory dynamics and robust temporal modeling for intelligibility prediction tasks.

4. Multitask Learning Framework

iMTI-Net is trained end-to-end in a multitask fashion, jointly predicting:

Human intelligibility scores, obtained from subjective listening tests,
Machine-based intelligibility metrics, specifically character error rates (CER) from two distinct ASR systems: Google ASR and Whisper,
STOI, a standardized objective intelligibility metric that has its own term in the training objective.

Each target is produced by a dedicated output head (attention, fully connected layers, global pooling). The training objective is a weighted sum of losses for each task:

$L = \gamma_1 L_{\text{Int}} + \gamma_2 L_{\text{CER, Whisper}} + \gamma_3 L_{\text{CER, Google}} + \gamma_4 L_{\text{STOI}}$

where $\gamma_1, \ldots, \gamma_4$ are scalar coefficients balancing the loss contributions.
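The weighted objective can be sketched as follows. The per-task loss is assumed here to be mean squared error, which the text does not specify, and the task keys, weights, and scores are all hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length score lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(preds, targets, gammas):
    """Weighted sum of per-task losses: L = sum_k gamma_k * L_k."""
    return sum(gammas[task] * mse(preds[task], targets[task])
               for task in gammas)


# Hypothetical batch of two utterances; all values are made up.
preds = {"int": [0.8, 0.4], "cer_whisper": [0.1, 0.5],
         "cer_google": [0.2, 0.6], "stoi": [0.9, 0.5]}
targets = {"int": [1.0, 0.4], "cer_whisper": [0.1, 0.3],
           "cer_google": [0.2, 0.6], "stoi": [0.7, 0.5]}
gammas = {"int": 1.0, "cer_whisper": 0.5, "cer_google": 0.5, "stoi": 1.0}
loss = multitask_loss(preds, targets, gammas)
```

Raising or lowering one $\gamma_k$ shifts how strongly that task's error shapes the shared representation; setting it to zero removes the task from training altogether.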

This multitask setup leverages correlated intelligibility signals from both human and machine targets, resulting in shared feature representations that are beneficial for non-intrusive assessment.

5. Empirical Performance and Comparative Results

Experimental results indicate that iMTI-Net achieves notable improvements on multiple benchmarks when compared to the original MTI-Net:

Metric	MTI-Net (Baseline)	iMTI-Net (CNN-BLSTM)	iMTI-Net (CNN-sLSTM)
Human Intelligibility	LCC: 0.7630, SRCC: 0.7071	LCC: 0.7670, SRCC: 0.7110	LCC: 0.7817, SRCC: 0.7622
Whisper CER	Not listed	LCC: 0.8151	LCC: 0.8105, SRCC: 0.8222
Google CER	Not listed	LCC: 0.8418	LCC: 0.8505, SRCC: 0.8403
STOI	Not listed	LCC: 0.9013	LCC: 0.9051, SRCC: 0.9150

LCC: Linear Correlation Coefficient
SRCC: Spearman’s Rank Correlation Coefficient
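For reference, the two correlation measures can be computed as in the sketch below. The helper names and the example scores are made up for illustration and are unrelated to the reported results.

```python
import math


def lcc(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            result[k] = avg
        pos = end + 1
    return result


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return lcc(ranks(x), ranks(y))


# Made-up scores: predictions are monotone in the targets but not linear.
true_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
pred_scores = [0.01, 0.04, 0.09, 0.16, 0.25]
```

On this toy data the ranking is perfect, so SRCC is 1, while LCC falls slightly below 1 because the relationship is curved. This is why the two measures are reported side by side.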

iMTI-Net, especially the CNN-sLSTM variant, attains higher correlation with the reference targets on most of the reported tasks; the one exception in the table is the Whisper CER LCC, where the CNN-BLSTM variant is marginally higher (0.8151 vs. 0.8105). The source also reports lower mean squared error (MSE) for iMTI-Net. Qualitative analysis indicates that iMTI-Net predictions are more evenly distributed, suggesting improved sensitivity to the full spectrum of intelligibility conditions.

A plausible implication is that uncertainty modeling via Whisper embeddings and sLSTM temporal dynamics improves sensitivity to edge cases—i.e., highly degraded or very clear speech—not just mid-range conditions.

6. Design Significance and Context

iMTI-Net’s enhancements are technically significant in several respects:

Integrating uncertainty-aware statistics with learned ASR representations adds robustness to out-of-distribution and noisy conditions, a common challenge in practical speech assessment.
The multitask paradigm enables convergent learning from correlated but distinct intelligibility signals, leveraging both subjective (human) and objective (ASR) metrics.
The use of an sLSTM, as opposed to conventional LSTM or BLSTM, provides normalization-driven stabilization for long-sequence tasks, which is critical for speech assessment scenarios where utterance duration and structure vary widely.

These architectural choices align with broader trends in speech processing, where uncertainty modeling and cross-domain multitask learning have become increasingly prominent.

7. Prospective Applications and Extensions

Within non-intrusive intelligibility assessment, iMTI-Net is positioned for deployment in:

Automatic evaluation of speech enhancement or dereverberation systems,
Real-time quality monitoring in VoIP or telecommunications platforms,
Assessment tools for disordered speech in clinical or rehabilitation contexts,
Robust post-processing for ASR lattice selection and confidence scoring in adverse environments.

This suggests broader impact in both human-centric and machine-assisted quality assessment systems. A plausible extension is adaptation for regression of other paralinguistic speech attributes, leveraging the uncertainty-aware multitask approach demonstrated by iMTI-Net.

In summary, iMTI-Net advances multi-target speech intelligibility prediction by unifying uncertainty-aware ASR features, CNN-sLSTM temporal modeling, and multitask optimization within a single, empirically validated framework (Zezario et al., 3 Sep 2025).

References (1)

Speech Intelligibility Assessment with Uncertainty-Aware Whisper Embeddings and sLSTM (2025)
