Enhancer-Guided Intelligibility Prediction
- The paper demonstrates that integrating parallel processing of noisy and enhanced signals with transformer-based architectures significantly improves intelligibility prediction accuracy.
- It employs an ensemble of ZipEnhancer and MP-SENet, achieving lower RMSE and higher NCC than traditional non-intrusive methods.
- The framework incorporates listener-specific audiogram embedding and a 2-clips augmentation strategy to boost generalizability for real-world hearing aid evaluation.
Enhancer-guided non-intrusive intelligibility prediction refers to methodologies in which the intelligibility of speech for hearing-impaired listeners is estimated without access to clean reference signals, by leveraging the output of speech enhancement models (enhancers) alongside modern neural architectures and listener-specific adaptations. This paradigm has become critical for assessing and optimizing hearing aid performance in real-world environments, where reference conditions are unavailable and the acoustic scene is complex. Recent research demonstrates that integrating parallel enhanced-signal pathways, cross-attention mechanisms, and acoustic-audiogram fusion leads to superior prediction robustness and generalizability (Cao et al., 21 Sep 2025).
1. Architectural Principles of Enhancer-Guided Prediction
The central design principle involves parallel processing of both the noisy (real-world) and enhanced (speech enhancer output) signals. Each signal pathway is passed through a shared feature extractor, typically a Speech Foundation Model (SFM) pre-trained on large speech corpora. The extracted representations are then aligned temporally, often average-pooled and downsampled for computational tractability, before being fed into transformer-based architectures (Cao et al., 21 Sep 2025). Key blocks of these frameworks include:
- Temporal Transformer: Applies self-attention within each channel (noisy/enhanced) and cross-attention between them, allowing the model to integrate intelligibility cues derived from the enhancer with those present in the raw signal.
- Layer Transformer and Audiogram Embedding: Listener-specific audiogram profiles are projected and concatenated with the acoustic features to personalize prediction, subsequently refined by attention over the layer/channel dimension.
- Final Projection: Channel-specific outputs are averaged and projected through a linear + sigmoid layer to produce an intelligibility estimate in percent.
This approach enables enhanced signals to serve as an explicit surrogate for “clean” or more intelligible speech, guiding the model to focus on aspects relevant to the improvement or preservation of intelligibility.
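A minimal PyTorch sketch of this dual-pathway design is given below. The module names, feature dimensions, attention-head counts, and mean-pooling scheme are illustrative assumptions rather than the authors' exact implementation; the sketch only demonstrates the overall pattern of per-pathway self-attention, cross-attention between noisy and enhanced features, audiogram concatenation, and a final linear + sigmoid projection.

```python
# Illustrative sketch (not the authors' code) of a dual-pathway intelligibility
# predictor with self- and cross-attention between noisy and enhanced features.
import torch
import torch.nn as nn


class TemporalBlock(nn.Module):
    """Self-attention within one pathway plus cross-attention to the other pathway."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, other):
        # x, other: (batch, time, dim) features of this pathway and its counterpart
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.cross_attn(x, other, other, need_weights=False)[0])
        return x


class EnhancerGuidedPredictor(nn.Module):
    def __init__(self, feat_dim: int = 768, audiogram_dim: int = 8, hidden: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)          # shared projection of SFM features
        self.block_noisy = TemporalBlock(hidden)
        self.block_enhanced = TemporalBlock(hidden)
        self.audiogram_proj = nn.Linear(audiogram_dim, hidden)
        self.head = nn.Sequential(nn.Linear(3 * hidden, 1), nn.Sigmoid())

    def forward(self, noisy_feats, enhanced_feats, audiogram):
        # noisy_feats, enhanced_feats: (batch, time, feat_dim) from a frozen speech foundation model
        n = self.proj(noisy_feats)
        e = self.proj(enhanced_feats)
        n = self.block_noisy(n, e)        # noisy pathway attends to enhancer-derived cues
        e = self.block_enhanced(e, n)     # enhanced pathway attends back to the raw signal
        pooled = torch.cat(
            [n.mean(dim=1), e.mean(dim=1), self.audiogram_proj(audiogram)], dim=-1
        )
        return 100.0 * self.head(pooled).squeeze(-1)     # intelligibility estimate in percent


# Example usage with random tensors standing in for SFM features
model = EnhancerGuidedPredictor()
noisy = torch.randn(2, 150, 768)
enhanced = torch.randn(2, 150, 768)
audiogram = torch.randn(2, 8)            # hearing thresholds at 8 audiometric frequencies
print(model(noisy, enhanced, audiogram).shape)  # torch.Size([2])
```

Here, simple mean pooling stands in for the layer-transformer stage described above; the essential point illustrated is that enhanced features enter the same attention blocks as the noisy ones and guide the prediction.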
2. Speech Enhancer Selection and Ensemble Strategies
Three state-of-the-art enhancers are evaluated for their impact on downstream intelligibility prediction:
| Enhancer | WB-PESQ (↑) | Remark |
|---|---|---|
| ZipEnhancer | 3.69 | Highest single-model accuracy |
| MP-SENet | 3.60 | Good synergy in ensemble |
| FRCRN | 3.23 | Below baseline in this setup |
Prediction performance, measured by RMSE and NCC, strongly depends on the speech enhancer’s quality. The ensemble of ZipEnhancer and MP-SENet consistently yields the best results by providing richer, complementary representations for inference. The FRCRN enhancer does not perform as well in the ensemble or alone—in fact, its inclusion can degrade accuracy below prior challenge baselines (Cao et al., 21 Sep 2025).
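One simple way to realize such an ensemble at inference time, assuming prediction-level averaging (the framework may instead fuse the enhanced signals as parallel input channels within a single model), is sketched below; `predict_fn`, `zip_out`, and `mpsenet_out` are hypothetical placeholders.

```python
# Hypothetical ensemble sketch: average intelligibility predictions obtained with
# different enhancer outputs (e.g., ZipEnhancer and MP-SENet).
import numpy as np

def ensemble_predict(predict_fn, noisy_wav, enhanced_wavs, audiogram):
    """predict_fn(noisy, enhanced, audiogram) -> intelligibility score in percent."""
    scores = [predict_fn(noisy_wav, enh, audiogram) for enh in enhanced_wavs]
    return float(np.mean(scores))

# Usage (all arguments are placeholders):
# score = ensemble_predict(predict_fn, noisy, [zip_out, mpsenet_out], audiogram)
```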
3. Data Augmentation: 2-Clips Strategy for Listener Variability
A notable innovation is the “2-clips augmentation” technique (Editor’s term), which increases listener-specific data diversity. For a given hearing-impaired (HI) listener, two random utterances are concatenated with a brief silence, with the combined utterance scored as the mean of its constituents. This effectively multiplies per-listener sample counts and exposes the model to a broader distribution of acoustic contexts (Cao et al., 21 Sep 2025). Such augmentation improves the robustness of the trained predictor when deployed on new datasets with different listener profiles and recording conditions.
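A minimal sketch of the augmentation, assuming 16 kHz waveforms as NumPy arrays and a 0.5 s silence gap (the exact gap duration is an assumption), could look like this:

```python
# Sketch of the 2-clips augmentation: two utterances from the same listener are
# joined with a brief silence and labeled with the mean of their scores.
import numpy as np

def two_clips_augment(wav_a, score_a, wav_b, score_b, sr=16000, silence_s=0.5):
    """Concatenate two utterances with a short silence; return the combined
    waveform and the mean of the two intelligibility scores."""
    silence = np.zeros(int(sr * silence_s), dtype=wav_a.dtype)
    wav = np.concatenate([wav_a, silence, wav_b])
    score = 0.5 * (score_a + score_b)
    return wav, score
```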
4. Performance Metrics and Benchmarking
The main evaluation metrics are:
- RMSE (Root Mean Square Error): Assesses absolute prediction error relative to ground-truth subjective intelligibility scores.
- NCC (Normalized Pearson Cross-Correlation Coefficient): Quantifies the linear relationship between predictions and true scores.
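A short NumPy reference implementation of both metrics, as they are commonly defined, is shown below.

```python
# NumPy reference implementations of RMSE and NCC (Pearson correlation).
import numpy as np

def rmse(pred, target):
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def ncc(pred, target):
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    p, t = pred - pred.mean(), target - target.mean()
    return float(np.sum(p * t) / (np.sqrt(np.sum(p ** 2)) * np.sqrt(np.sum(t ** 2))))

# Toy example: lower RMSE and higher NCC indicate better prediction quality.
predictions, listener_scores = [80, 55, 30], [78, 60, 25]
print(rmse(predictions, listener_scores), ncc(predictions, listener_scores))
```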
In cross-dataset evaluations, the ZipEnhancer + MP-SENet ensemble within the proposed neural framework achieves lower RMSE (reductions of 0.82–0.94 points) and higher NCC (e.g., 0.73 versus 0.72) compared with leading non-intrusive baselines such as the CPC2 Champion system. These metrics confirm the ensemble's superiority in both accuracy and ranking consistency.
5. Integration of Listener-Specific Audiogram Information
Personalization is achieved by projecting an audiogram (the listener's hearing threshold vector across frequencies) and concatenating it with acoustic representations before attention pooling. This embedding enables the system to model the listener’s specific frequency-dependent perceptual deficits. The inclusion of audiogram information is critical; it ensures that prediction takes into account the hearing-impaired user’s unique auditory profile and not just the acoustics of the speech or enhancement artifacts (Cao et al., 21 Sep 2025).
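A minimal sketch of this conditioning step is shown below, assuming thresholds at eight standard audiometric frequencies and concatenation of the projected audiogram with every acoustic frame; the dimensions and fusion point are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative audiogram conditioning: project the hearing-threshold vector and
# concatenate it with the acoustic features along the channel dimension.
import torch
import torch.nn as nn

class AudiogramConditioner(nn.Module):
    def __init__(self, n_freqs: int = 8, feat_dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(n_freqs, feat_dim)   # audiogram embedding

    def forward(self, acoustic_feats, audiogram):
        # acoustic_feats: (batch, time, feat_dim); audiogram: (batch, n_freqs) in dB HL
        emb = self.embed(audiogram).unsqueeze(1)               # (batch, 1, feat_dim)
        emb = emb.expand(-1, acoustic_feats.size(1), -1)       # broadcast over time
        return torch.cat([acoustic_feats, emb], dim=-1)        # (batch, time, 2*feat_dim)

feats = torch.randn(2, 100, 256)
audiogram = torch.tensor([[20., 25, 30, 40, 55, 60, 70, 75],
                          [10., 15, 20, 25, 30, 45, 60, 65]])
print(AudiogramConditioner()(feats, audiogram).shape)  # torch.Size([2, 100, 512])
```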
6. Generalizability and Real-World Applicability
Unlike intrusive metrics (e.g., HASPI) that require a clean reference, enhancer-guided non-intrusive frameworks operate solely on noisy and enhanced test signals. The ensemble of diverse enhancers and the use of data augmentation are shown to boost cross-dataset and cross-listener robustness, addressing domain generalization—one of the leading challenges in practical deployment.
Applications include:
- Hearing Aid Evaluation: In-situ performance monitoring and adjustment without requiring reference speech.
- Clinical Assessment: Objective, scalable screening of intelligibility outcomes for HI listeners outside laboratory settings.
- Dynamic Monitoring: Continuous intelligibility assessment under realistic, dynamically changing noise environments.
7. Summary and Future Trends
Enhancer-guided non-intrusive intelligibility prediction frameworks establish new benchmarks for real-world assessment of hearing aid performance. By leveraging robust speech enhancement models, parallel neural pathways, audiogram-conditional attention, and ensemble strategies, these systems outperform traditional baselines on both accuracy (RMSE) and correlation (NCC). Further advances are anticipated in integrating more expressive auditory simulation, scalable data augmentation, and transformer-based long-context processing to continually bridge the laboratory-to-real-world gap in intelligibility assessment (Cao et al., 21 Sep 2025).