Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches

Published 19 Jun 2025 in cs.LG (arXiv:2506.16528v1)

Abstract: Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address this, we propose a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our ASR evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and emphasizing the need to prioritize intelligibility over error-based measures.

Summary

  • The paper presents a novel integrated metric that combines phonetic, semantic, and NLI measures to better capture human intelligibility in ASR outputs.
  • It shows that traditional metrics like WER can misrepresent intelligibility, especially for dysarthric speech, while the new metric achieves a higher correlation with human judgments.
  • The study also explores LLM-based correction, demonstrating that phonetic similarity is a significant predictor of post-processing effectiveness in enhancing ASR performance.

Integrated Intelligibility Metrics for ASR: Evaluation with Human and LLM Judgments

The paper presents a rigorous examination of Automatic Speech Recognition (ASR) evaluation for dysarthric speech and proposes a new metric that combines phonetic similarity, semantic similarity, and natural language inference (NLI)-based entailment. This integrated approach is explicitly motivated by documented failures of traditional metrics such as Word Error Rate (WER) and Character Error Rate (CER) to capture the true intelligibility of ASR outputs, particularly for atypical speech where human listeners can often recover intended meaning despite significant lexical and phonetic deviations.

Limitations of Traditional ASR Metrics

Edit-distance metrics like WER and CER remain standard in ASR benchmarking but are shown to severely penalize transcripts for surface-level differences that do not impair human understanding. Dysarthric and dysphonic speech, which often contains phoneme repetitions and imprecise consonants, is particularly problematic for these metrics. Empirically, the authors highlight cases with extremely high WER that are nevertheless rated as fully intelligible by human annotators.

Semantic similarity measures such as BERTScore and BLEURT represent a partial improvement, better aligning with the information content of hypotheses and references. However, the analysis presents cases where these metrics assign high similarity to logically contradictory statements, exposing their limitations in intelligibility assessment for ASR outputs.
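
To make this contrast concrete, the following minimal sketch (using the jiwer and bert-score packages on invented transcripts, not examples from the paper) shows WER heavily penalizing a repetition-filled but fully intelligible hypothesis, while BERTScore remains high both for that hypothesis and for a negated one that contradicts the reference:

```python
# Illustrative sketch, not the paper's examples: compare WER and BERTScore on
# an invented repetition-filled hypothesis and an invented contradictory one.
# Requires: pip install jiwer bert-score
import jiwer
from bert_score import score as bertscore

reference = "i would like to book an appointment for tomorrow morning"
repeated  = "i i would li-like to to book an appointment for tomorrow morning"
negated   = "i would not like to book an appointment for tomorrow morning"

# WER counts every repetition as an error, even though meaning is preserved.
print("WER (repetitions):", round(jiwer.wer(reference, repeated), 2))

# BERTScore correctly ignores the repetitions, but also stays high for the
# negated hypothesis, which contradicts the reference.
_, _, f1 = bertscore([repeated, negated], [reference, reference], lang="en")
print("BERTScore F1 (repetitions):", round(f1[0].item(), 2))
print("BERTScore F1 (contradiction):", round(f1[1].item(), 2))
```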

Experimental Analysis on Dysarthric Speech

Utilizing the Speech Accessibility Project (SAP) corpus, a large and diverse repository of dysarthric speech, the authors benchmark multiple ASR systems, including Wav2vec 2.0 variants and Whisper. The systems are evaluated across impairment severities and against a suite of metrics capturing word-level, phonetic, and semantic similarity.
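
As an illustration of this evaluation setup, the sketch below transcribes an utterance with an off-the-shelf Whisper checkpoint and scores it against a reference transcript; the checkpoint, audio path, and reference are placeholders rather than the paper's actual SAP pipeline:

```python
# Minimal sketch of the benchmarking loop: transcribe an utterance with an
# off-the-shelf model and score the hypothesis against its reference.
# The checkpoint and audio path are placeholders, not the paper's exact setup.
# Requires: pip install transformers torch jiwer (plus ffmpeg for audio decoding)
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

samples = [
    {"audio": "sap_utterance_001.wav",  # hypothetical audio file; SAP data is access-controlled
     "reference": "please turn the volume down"},
]

for sample in samples:
    # In practice, text normalization (casing, punctuation) is applied before scoring.
    hypothesis = asr(sample["audio"])["text"].lower().strip(" .")
    print(hypothesis, "| WER:", round(jiwer.wer(sample["reference"], hypothesis), 2))
```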

Key findings include:

  • WER and phonetic similarity can diverge substantially from semantic similarity; for some ASR outputs, high phonetic similarity does not guarantee high semantic similarity, and vice versa.
  • Fine-tuning on dysarthric speech substantially improves performance, with models such as wav2vec-sap1005 exhibiting both lower WER and higher phonetic similarity than baseline models.
  • Hybrid metrics like Heval provide a modest improvement but do not substantially resolve the underlying alignment problems with human judgment.
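
The phonetic similarity used in these comparisons pairs Soundex encoding with Jaro-Winkler distance, as detailed in the metric formulation below. The sketch that follows is one plausible word-level implementation; the greedy positional alignment and length normalization are assumptions rather than the authors' exact procedure:

```python
# Sketch of one plausible Soundex + Jaro-Winkler phonetic similarity; the
# greedy positional word alignment and length normalization are assumptions,
# not the authors' exact implementation. Requires: pip install jellyfish
import jellyfish

def phonetic_similarity(reference: str, hypothesis: str) -> float:
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words or not hyp_words:
        return 0.0
    # Compare Soundex codes of position-aligned words with Jaro-Winkler
    # similarity; real implementations may use an edit-distance alignment.
    scores = [
        jellyfish.jaro_winkler_similarity(
            jellyfish.soundex(r), jellyfish.soundex(h)
        )
        for r, h in zip(ref_words, hyp_words)
    ]
    # Normalize over the longer sequence to penalize insertions/deletions.
    return sum(scores) / max(len(ref_words), len(hyp_words))

print(phonetic_similarity("turn on the light", "turm on the lite"))  # close to 1.0
```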

LLM Correction and Correctability

A core contribution of the work is an empirical study of ASR correctability via LLMs, including GPT-4 and WavLLM. LLM-based correction is examined both in aggregate and under oracle settings (where only improvements are counted).

The analysis demonstrates:

  • LLM correction can both reduce and increase WER, with improvements often stemming from the correction of phonetically close nonwords to target words, and failures resulting in hallucinated hypotheses.
  • Phonetic similarity of hypotheses is weakly but significantly predictive of LLM correctability, implying that transcripts closer in sound to the reference are more likely to be restored correctly by LLM post-processing.
  • Whisper outputs, characterized by higher semantic similarity to the reference, are less amenable to LLM correction—suggesting that LLMs add the most value when baseline ASR outputs substantially diverge at the phonetic level but are still near the semantic target.
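
A minimal sketch of this style of LLM post-correction is given below; the prompt wording, model choice, and decoding settings are illustrative assumptions and do not reproduce the paper's exact GPT-4 or WavLLM setup:

```python
# Illustrative sketch of LLM-based post-correction of an ASR hypothesis. The
# prompt wording, model choice, and temperature are assumptions; the paper's
# exact prompting setup for GPT-4 and WavLLM is not reproduced here.
# Requires: pip install openai (and an OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

def correct_transcript(hypothesis: str) -> str:
    prompt = (
        "The following automatic speech recognition transcript may contain "
        "phonetically similar nonwords, repetitions, or small errors. Rewrite "
        "it as the most likely intended sentence, changing as little as "
        "possible:\n\n" + hypothesis
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output to discourage hallucinated rewrites
    )
    return response.choices[0].message.content.strip()

print(correct_transcript("i nee-need to sh-shedule a doctor apointment"))
```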

Integrated Metric: Phonetic, Semantic, and NLI Components

The principal methodological advance is an integrated ASR evaluation metric:

  • Formulation: The metric is a weighted linear combination, Score = α·NLI + β·SemSim + γ·PhonSim, where NLI is the entailment probability from a RoBERTa-large model fine-tuned on SNLI, MNLI, FEVER, and ANLI, SemSim is BERTScore, and PhonSim is phonetic similarity computed from Soundex codes with Jaro-Winkler distance.
  • Calibration: Weights are learned via linear regression from 100 ASR-reference pairs rated by six human annotators, using 5-fold cross-validation to avoid overfitting (a sketch of this fitting step follows the list).
  • Results: The integrated metric achieves 0.890 Pearson correlation with human intelligibility judgments, substantially higher than any single constituent metric.
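
The sketch below illustrates this calibration step on synthetic placeholder data standing in for the paper's 100 annotated ASR-reference pairs:

```python
# Sketch of the calibration step: fit a linear combination of the three
# component scores to human intelligibility ratings with 5-fold cross-
# validation. The data here are synthetic placeholders, not the paper's
# 100 annotated ASR-reference pairs. Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Feature columns: NLI entailment probability, BERTScore F1, phonetic similarity.
X = rng.uniform(0.0, 1.0, size=(100, 3))
# Placeholder ratings (in practice, aggregated scores from six annotators).
y = 0.40 * X[:, 0] + 0.28 * X[:, 1] + 0.32 * X[:, 2] + rng.normal(0.0, 0.05, 100)

model = LinearRegression()
cv_r2 = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
model.fit(X, y)

print("5-fold R^2 scores:", cv_r2.round(3))
print("learned weights (alpha, beta, gamma):", model.coef_.round(2))
```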

The final learned weights assign the highest importance to the NLI score (α = 0.40), followed by phonetic similarity (γ = 0.32) and semantic similarity (β = 0.28). All coefficients are statistically significant, supporting their independent contribution to the overall score.
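
The sketch below assembles the integrated score from its three components using these reported weights; the NLI checkpoint and the premise/hypothesis ordering are assumptions, not details confirmed by the paper:

```python
# Sketch of the integrated intelligibility score using the reported weights.
# The NLI checkpoint name and the premise/hypothesis ordering are assumptions,
# not details confirmed by the paper.
# Requires: pip install transformers torch bert-score jellyfish
import jellyfish
import torch
from bert_score import score as bertscore
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_CKPT = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"  # assumed checkpoint
nli_tok = AutoTokenizer.from_pretrained(NLI_CKPT)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_CKPT)

ALPHA, BETA, GAMMA = 0.40, 0.28, 0.32  # NLI, semantic, phonetic weights from the paper

def phonetic_similarity(reference: str, hypothesis: str) -> float:
    # Same Soundex + Jaro-Winkler sketch as above, repeated for self-containment.
    rw, hw = reference.lower().split(), hypothesis.lower().split()
    if not rw or not hw:
        return 0.0
    sims = [jellyfish.jaro_winkler_similarity(jellyfish.soundex(r), jellyfish.soundex(h))
            for r, h in zip(rw, hw)]
    return sum(sims) / max(len(rw), len(hw))

def entailment_probability(reference: str, hypothesis: str) -> float:
    """Probability that the reference (premise) entails the ASR hypothesis."""
    inputs = nli_tok(reference, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1)[0]
    # Prefer the label mapping from the model config; fall back to index 0,
    # the documented entailment index for this checkpoint.
    labels = {lab.lower(): i for i, lab in nli_model.config.id2label.items()}
    return probs[labels.get("entailment", 0)].item()

def intelligibility_score(reference: str, hypothesis: str) -> float:
    nli = entailment_probability(reference, hypothesis)
    _, _, f1 = bertscore([hypothesis], [reference], lang="en")
    phon = phonetic_similarity(reference, hypothesis)
    return ALPHA * nli + BETA * f1.item() + GAMMA * phon

print(intelligibility_score("turn on the kitchen light",
                            "tuh turn on the the kitchen lite"))
```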

Implications and Future Directions

This work presents strong numerical evidence that a composite metric, integrating logical entailment (NLI), phonetic similarity, and semantic similarity, aligns more closely with human-perceived intelligibility than traditional or hybrid metrics. The findings contradict the assumption that maximizing word-level accuracy or even semantic similarity suffices for ASR evaluation in impaired speech conditions.

Practically, the integrated metric enables:

  • More accurate benchmarking of ASR systems on atypical or impaired speech, supporting both model improvement and deployment decisions in accessibility-focused applications.
  • Guidance in designing post-ASR correction pipelines, as systems producing outputs with high phonetic similarity are more effectively improved by LLMs.
  • Objective evaluation for LLM-assisted ASR workflows, offering a foundation for measuring gains from end-to-end ASR+LLM systems.

The study also illustrates that NLI-based inference is a salient, underutilized criterion in ASR evaluation, capturing logical relationships that go beyond both word and semantic similarity. This has wider implications for any task in which meaning preservation, rather than literal string match, is the evaluation target—such as machine translation, summarization, and dialogue.

Speculation on Future Developments

As LLMs become increasingly intertwined with ASR pipelines, both as post-processors and as integral components, evaluation metrics will need to further evolve. This work suggests potential for:

  • Task-specific tuning of NLI models for ASR intelligibility, with domain adaptation for specific populations or speech disorders.
  • Differentiated metrics for evaluating correctability, with explicit modeling of LLM hallucination phenomena.
  • Development of metrics that integrate closed-loop human-in-the-loop ratings, leveraging the relatively high inter-annotator agreement observed.

Finally, the focus on accessibility-oriented ASR evaluation introduces a methodological benchmark for future research targeting underserved speech communities, ensuring that progress in ASR models translates into real-world improvements in communication and inclusion for individuals with speech impairments.
