GPT-Whisper-HA Speech Intelligibility System

Updated 4 September 2025
  • GPT-Whisper-HA is a zero-shot, non-intrusive speech intelligibility assessment system that employs individualized auditory simulations, dual ASR transcription pipelines, and language model scoring to closely match subjective evaluations.
  • The system integrates MSBG and NAL-R algorithms for simulating hearing loss and utilizes dual Whisper models to provide consensus-based intelligibility scores while reducing random transcription errors.
  • Empirical results on the CPC 2023 dataset show reduced RMSE and improved LCC and SRCC, evidencing enhanced predictive reliability for personalized hearing aid assessments.

GPT-Whisper-HA is a zero-shot, non-intrusive speech intelligibility assessment system for hearing aids that incorporates individualized auditory simulations and dual automatic speech recognition pipelines, followed by LLM scoring and ensemble averaging. The system is designed to deliver intelligibility scores that correlate with subjective human evaluations, without requiring supervised task-specific data. GPT-Whisper-HA achieves this through a modular architecture that systematically adapts audio according to hearing loss profiles, evaluates speech clarity across two ASR models, and leverages prompt-engineered LLM scoring for robust prediction.

1. System Architecture and Processing Pipeline

GPT-Whisper-HA operates as a multi-stage pipeline comprising individualized auditory simulations, dual ASR transcription, and LLM scoring:

  • Auditory Simulation Modules: The input audio $A$ undergoes two transformations:
    • MSBG hearing loss simulation models the spectral shaping and perceptual deficits typically experienced by hearing-impaired users: $A_m = \mathrm{MSBG}(A)$.
    • The NAL-R amplification algorithm applies frequency-specific gain according to the target audiogram, representing standard hearing aid fitting: $A_n = \mathrm{NAL\text{-}R}(A_m)$.
  • Dual ASR Processing: The processed audio $A_n$ is transcribed independently by two Whisper models, one with lower capacity ($\mathrm{Whisper}_s$) and one with higher capacity ($\mathrm{Whisper}_l$):
    • $T_s = \mathrm{Whisper}_s(A_n)$
    • $T_l = \mathrm{Whisper}_l(A_n)$
    • The rationale for this dual-judge approach is that consensus between models with different robustness levels is a strong proxy for intelligibility.
  • LLM Scoring: Each ASR transcript is evaluated by GPT-4o using a prompt designed to rate naturalness, defined as fluency, semantic coherence, and contextual appropriateness:
    • $S_s = \mathrm{GPT4o}(T_s, P)$
    • $S_l = \mathrm{GPT4o}(T_l, P)$
    • Here, $P$ specifies the assessment rubric (e.g., "rate this transcript for naturalness of expression").
  • Score Averaging: The final intelligibility estimate is the arithmetic mean of the two scores:
    • $S_{\mathrm{GPT\text{-}Whisper}} = \mathrm{ScoreAve}(S_s, S_l)$

This structure yields robust predictions via cross-model checking and ensemble scoring; a minimal sketch of the full pipeline follows.
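The sketch below assumes the openai-whisper package; msbg_simulate, nalr_amplify, and gpt4o_score are hypothetical stand-ins for components the paper does not publish, and the specific Whisper model sizes are assumptions:

```python
import numpy as np
import whisper  # openai-whisper package

# Two ASR judges of differing capacity; the chosen sizes are assumptions.
whisper_s = whisper.load_model("base")   # lower-capacity Whisper_s
whisper_l = whisper.load_model("large")  # higher-capacity Whisper_l

def predict_intelligibility(audio: np.ndarray, audiogram: np.ndarray) -> float:
    # Stage 1: individualized auditory simulation, MSBG then NAL-R
    # (both helpers are hypothetical stand-ins).
    a_m = msbg_simulate(audio, audiogram)  # A_m = MSBG(A)
    a_n = nalr_amplify(a_m, audiogram)     # A_n = NAL-R(A_m)

    # Stage 2: dual ASR transcription of the processed signal
    # (Whisper expects 16 kHz float32 audio).
    t_s = whisper_s.transcribe(a_n.astype(np.float32))["text"]
    t_l = whisper_l.transcribe(a_n.astype(np.float32))["text"]

    # Stage 3: prompt-based naturalness scoring of each transcript
    # (hypothetical wrapper around GPT-4o; see Section 4).
    s_s = gpt4o_score(t_s)  # S_s = GPT4o(T_s, P)
    s_l = gpt4o_score(t_l)  # S_l = GPT4o(T_l, P)

    # Stage 4: ensemble averaging of the two scores.
    return (s_s + s_l) / 2.0
```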

2. Individualized Hearing Loss Simulation

The integration of MSBG and NAL-R distinguishes GPT-Whisper-HA from generic zero-shot metrics:

  • MSBG hearing loss model (after Moore, Stone, Baer, and Glasberg): Simulates core aspects of hearing loss, including frequency-dependent attenuation and impaired temporal processing, thereby approximating how users with hearing impairment perceive critical speech cues.
  • NAL-R (National Acoustic Laboratories, Revised): Implements standardized amplification tailored to a subject's audiogram, as is typical in prescription hearing aids.

Processing first with MSBG and then NAL-R ($A_m \to A_n$) ensures that all downstream evaluation—including ASR and LLM scoring—is performed on an audio signal faithful to the real-world sensory experience of hearing aid users. This approach moves beyond generic audio degradation metrics and enables individualized, ecologically valid intelligibility estimation.
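To make the fitting stage concrete, here is a simplified sketch of the NAL-R insertion-gain computation from an audiogram, using the published prescription formula (Byrne and Dillon, 1986) at a subset of audiometric frequencies. The constants follow common tabulations and should be verified against a reference implementation (e.g., pyclarity), which would also convert the per-band gains into an FIR filter:

```python
import numpy as np

# Audiometric frequencies (Hz) and NAL-R frequency constants k(f) in dB,
# as commonly tabulated; treat these exact values as an assumption.
NALR_FREQS = np.array([250, 500, 1000, 2000, 4000, 6000])
NALR_K     = np.array([-17,  -8,    1,   -1,   -2,   -2])

def nalr_gains(audiogram_db: np.ndarray) -> np.ndarray:
    """Per-band insertion gains (dB) for thresholds given at NALR_FREQS."""
    # X scales with the three-frequency average at 500/1000/2000 Hz.
    x = 0.15 * audiogram_db[[1, 2, 3]].mean()
    # Prescription: G(f) = X + 0.31 * H(f) + k(f), floored at 0 dB gain.
    return np.maximum(x + 0.31 * audiogram_db + NALR_K, 0.0)
```

For a sloping loss of [20, 30, 40, 50, 60, 60] dB HL, this prescribes roughly 0, 7.3, 19.4, 20.5, 22.6, and 22.6 dB of gain, i.e., progressively more amplification at high frequencies, as is typical of hearing aid fittings.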

3. Dual ASR Evaluation and Score Averaging

Deploying two Whisper ASR models with differing complexity is central to the architecture:

  • $\mathrm{Whisper}_s$ (small): Captures basic transcription robustness; sensitive to moderate signal degradation.
  • $\mathrm{Whisper}_l$ (large): Demonstrates superior performance in noisy or heavily processed audio conditions.

The outputs $T_s$ and $T_l$ serve as parallel probes of the speech signal's clarity. Score averaging, $(S_s + S_l)/2$, mitigates both random and systematic biases inherent in single-model prediction. Empirically, agreement between the models strongly indicates clear, intelligible speech, while divergence flags likely issues with audibility or quality after HA processing; one way to quantify that divergence is sketched below.
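As an illustration of the consensus signal (a diagnostic added here for exposition, not part of the paper's scoring), cross-model disagreement can be measured as the word error rate between the two transcripts, using the jiwer package:

```python
from jiwer import wer

def transcript_divergence(t_s: str, t_l: str) -> float:
    """WER of the small model's transcript against the large model's.

    Values near 0 indicate cross-model consensus (likely intelligible
    audio); large values flag degraded or hard-to-hear speech after
    hearing aid processing.
    """
    return wer(t_l.lower(), t_s.lower())
```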

4. Naturalness Scoring via LLM

GPT-4o is prompted with each transcript and a rubric focusing on fluency, coherence, and semantic fidelity. The output score is interpreted as a measure of transcript naturalness, which in turn is treated as a proxy for speech intelligibility. The underlying assumption is that transcripts exhibiting high naturalness (minimal grammatical issues, preserved meaning, and proper phrasing) are produced from audio that is intelligible to both ASR and human listeners. This technique, originally validated in GPT-Whisper (Zezario et al., 16 Sep 2024), is extended in GPT-Whisper-HA by applying dual ASR scoring after personalized audio processing.
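A sketch of the scoring call, assuming the openai Python client; the rubric text is illustrative, since the paper's exact prompt $P$ is not reproduced here:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative rubric standing in for the prompt P.
PROMPT = (
    "Rate the following transcript for naturalness of expression, "
    "considering fluency, semantic coherence, and contextual "
    "appropriateness, on a scale from 0 to 100. "
    "Reply with the number only.\n\nTranscript: {transcript}"
)

def gpt4o_score(transcript: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": PROMPT.format(transcript=transcript)}],
        temperature=0,  # deterministic scoring
    )
    return float(response.choices[0].message.content.strip())
```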

5. Empirical Results and Predictive Performance

Benchmarking on the CPC 2023 dataset demonstrates improved metric performance over GPT-Whisper:

System            RMSE     LCC     SRCC
GPT-Whisper       37.019   0.541   0.501
GPT-Whisper-HA    34.767   0.570   0.558

These figures correspond to an absolute RMSE reduction of 2.252 (roughly 6.1% relative), along with higher linear and rank correlation coefficients. The enhanced correlation with subjective intelligibility, together with the lower RMSE, results directly from the sequential application of MSBG and NAL-R, dual Whisper transcription, and GPT-4o scoring. The ensemble design thus advances zero-shot intelligibility prediction for HA users without additional supervised data.
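The relative RMSE reduction follows directly from the table:

$\frac{37.019 - 34.767}{37.019} \times 100\% \approx 6.08\%$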

6. Context, Implications, and Comparative Significance

GPT-Whisper-HA demonstrates that LLMs, when paired with contextually processed audio and consensus-based ASR evaluation, can rival or surpass supervised approaches in non-intrusive, zero-shot intelligibility prediction. The technique is particularly applicable in domains where labeled data is scarce and individualized assessment is required, such as personalized hearing aid fitting and speech enhancement benchmarking.

A plausible implication is that the use of multiple ASR judges after hearing loss simulation can generalize beyond audiology, potentially serving automated speech assessment in other accessibility or clinical contexts. The modular pipeline, leveraging standard audiological transformations and open-source ASR/LLM systems, offers extensibility for further research.

7. Limitations and Prospects

Open questions remain regarding the optimal choice and diversity of ASR models, the calibration of GPT-4o prompts for semantic sensitivity, and the range of hearing loss simulations included. MSBG and NAL-R are standard schemes, but future work may benefit from newer perceptual models. Simple score averaging is robust, but more sophisticated ensemble scoring mechanisms may supersede it as more models are incorporated.

This approach also assumes that naturalness in text is a sufficient proxy for intelligibility, which is plausible but may not capture all aspects of functional comprehension in HA users. Further correlation studies with human listener test batteries will provide more granular validation.
