Error Level Noise (ELN) for Robust ASR

Updated 26 December 2025
  • ELN is a vector-based representation that quantifies semantic and token-level disagreement across multiple ASR hypotheses to identify noise-induced transcription errors.
  • It combines sentence-level and token-level embedding differences into a compact vector, enabling LLMs to focus effectively on unreliable transcription segments.
  • Experimental results demonstrate that ELN-enhanced error correction pipelines achieve significant reductions in Word Error Rate across varied noise conditions and languages.

Error Level Noise (ELN) is a representation developed for robust automatic speech recognition (ASR) in noisy environments, particularly targeting low-resource languages where state-of-the-art ASR models such as Whisper exhibit significant degradation under reduced signal-to-noise ratios (SNRs). ELN quantifies the semantic and token-level disagreement across multiple ASR hypotheses produced under noise. By encoding these disagreements into compact embeddings, ELN directly measures noise-induced uncertainty, enabling downstream LLMs to more effectively identify and correct unreliable segments of transcriptions when performing ASR error correction. Incorporating ELN into LLM-assisted correction pipelines yields marked reductions in Word Error Rate (WER), establishing its utility as a principled, noise-sensitive, and language-agnostic measure of hypothesis reliability (Rahmani et al., 19 Dec 2025).

1. Motivation for ELN

Noisy or low-SNR speech conditions lead even advanced ASR systems to generate divergent n-best hypotheses, causing simple fusion strategies such as majority voting (ROVER) or n-best consensus selection to break down. Such methods presume token-wise independence and majority correctness, assumptions frequently violated by noise-induced linguistic distortions. LLM-based error correction models that consume only the raw text outputs of ASR systems lack effective priors on which text regions are noise-contaminated, often leading to hallucination or amplification of ASR errors. ELN addresses this limitation by constructing explicit, vector-based representations of cross-hypothesis disagreement, thereby providing the correction model with a noise-sensitive, content-aligned signal for robust error mitigation (Rahmani et al., 19 Dec 2025).

2. Mathematical Specification of ELN

Given $n$ ASR hypotheses $\mathcal{H} = \{ H_1, \dots, H_n \}$ (with $n = 5$ in reported experiments), ELN is computed in two principal stages: sentence-level and token-level, followed by vector concatenation.

2.1 Sentence-Level ELN

Each hypothesis $H_i$ is encoded into a $d$-dimensional vector $\mathbf{e}_i$ using a pre-trained sentence embedder (e.g., Sentence-BERT):

$$\mathbf{e}_i = \mathrm{Embed}_{\mathrm{sent}}(H_i) \in \mathbb{R}^d$$

The sentence-level ELN vector is then the average pairwise squared difference:

$$\mathbf{v}_\mathrm{sent} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} (\mathbf{e}_i - \mathbf{e}_j)^{\circ 2} \in \mathbb{R}^d$$

where $(\cdot)^{\circ 2}$ denotes element-wise squaring.
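
A minimal NumPy sketch of this computation; `embed_sent` is a placeholder for whatever pre-trained sentence embedder is used (e.g., a Sentence-BERT encoder) and is not prescribed by the source:

```python
import numpy as np

def sentence_level_eln(hypotheses, embed_sent):
    """Sentence-level ELN: average element-wise squared difference of
    sentence embeddings over all hypothesis pairs.

    embed_sent: placeholder callable mapping a string to a d-dimensional
    vector (an assumption of this sketch; e.g., a Sentence-BERT encoder).
    """
    E = np.stack([embed_sent(h) for h in hypotheses])  # (n, d)
    n = E.shape[0]
    v_sent = np.zeros(E.shape[1])
    for i in range(n):
        for j in range(i + 1, n):
            v_sent += (E[i] - E[j]) ** 2               # element-wise square
    return 2.0 / (n * (n - 1)) * v_sent                # (d,)
```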

2.2 Token-Level ELN

Hypotheses are tokenized and padded to a common length $L_{\max} = \max_i L_i$. Each token $t_{i,k}$ is embedded:

$$\mathbf{t}_{i,k} = \mathrm{Embed}_{\mathrm{tok}}(t_{i,k}) \in \mathbb{R}^{d'}$$

Token-level ELN is calculated as the average pairwise squared difference over tokens and hypotheses:

$$\mathbf{v}_\mathrm{tok} = \frac{1}{L_{\max}} \sum_{k=1}^{L_{\max}} \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} (\mathbf{t}_{i,k} - \mathbf{t}_{j,k})^{\circ 2} \in \mathbb{R}^{d'}$$
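
A companion sketch for the token-level term; it assumes the padded token embeddings have already been stacked into an array of shape (n, L_max, d'), with zero vectors at padding positions (both details are assumptions of this sketch):

```python
import numpy as np

def token_level_eln(token_embs):
    """Token-level ELN from padded token embeddings.

    token_embs: array of shape (n, L_max, d') with one row of padded token
    embeddings per hypothesis (zero-vector padding is an assumption).
    """
    n, L_max, d_tok = token_embs.shape
    v_tok = np.zeros(d_tok)
    for k in range(L_max):
        for i in range(n):
            for j in range(i + 1, n):
                v_tok += (token_embs[i, k] - token_embs[j, k]) ** 2
    return v_tok * 2.0 / (n * (n - 1)) / L_max          # (d',)
```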

2.3 Final ELN Vector

The complete ELN embedding concatenates both components:

$$\mathbf{v}_\mathrm{ELN} = [\, \mathbf{v}_\mathrm{sent} \;\Vert\; \mathbf{v}_\mathrm{tok} \,] \in \mathbb{R}^{d + d'}$$

Its $L_2$-norm,

$$\|\mathbf{v}_\mathrm{ELN}\|_2 = \sqrt{ \sum_{i=1}^{d + d'} (\mathbf{v}_{\mathrm{ELN},i})^2 },$$

correlates strongly with WER, serving as an empirical indicator of utterance difficulty.
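
Continuing the sketches above, the concatenation and the difficulty proxy are one-liners (`hypotheses`, `embed_sent`, and `token_embs` are the assumed inputs from the earlier snippets):

```python
import numpy as np

# Continuing the earlier sketches.
v_sent = sentence_level_eln(hypotheses, embed_sent)   # (d,)
v_tok = token_level_eln(token_embs)                   # (d',)

v_eln = np.concatenate([v_sent, v_tok])               # (d + d',)
difficulty = np.linalg.norm(v_eln)                    # reported to correlate with per-utterance WER
```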

3. Integration of ELN Embeddings into LLM-Based Correction

The ASR correction model employs a fine-tuned LLaMA-2-7B with LoRA adapters. ELN vectors are mapped to the model’s token embedding space via an MLP:

$$\mathbf{p} = \mathrm{MLP}(\mathbf{v}_\mathrm{ELN}) \in \mathbb{R}^E$$

where $E$ is the LLM's embedding dimension. In a prefix-tuning paradigm, copies of $\mathbf{p}$ are prepended to the input token embeddings derived from the concatenated top-5 ASR hypotheses, allowing the LLM to condition decoding on the ELN-derived uncertainty signal. During training, standard teacher-forced cross-entropy loss is employed between the model output and the ground-truth transcription.
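
A schematic PyTorch sketch of this injection step; the MLP hidden width, activation, and number of prefix copies are illustrative assumptions, since the source specifies only that an MLP projects the ELN vector into the token-embedding space before prefix-style prepending:

```python
import torch
import torch.nn as nn

class ELNPrefixProjector(nn.Module):
    """Projects an ELN vector into the LLM embedding space and prepends it
    as a prefix to the token embeddings of the concatenated hypotheses."""

    def __init__(self, eln_dim, llm_embed_dim, hidden=1024, n_prefix=1):
        # hidden width and n_prefix are assumptions, not reported values
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(eln_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_embed_dim),
        )
        self.n_prefix = n_prefix

    def forward(self, v_eln, token_embeds):
        # v_eln: (batch, eln_dim); token_embeds: (batch, seq_len, llm_embed_dim)
        p = self.mlp(v_eln)                                  # (batch, E)
        prefix = p.unsqueeze(1).repeat(1, self.n_prefix, 1)  # (batch, n_prefix, E)
        return torch.cat([prefix, token_embeds], dim=1)      # prefix-conditioned input
```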

Injection Workflow

| Stage | Input(s) | Transformation |
| --- | --- | --- |
| ASR hypothesis generation | Noisy speech audio | Whisper $n$-best hypotheses |
| ELN embedding computation | $\{ H_1, \dots, H_5 \}$ | $\mathbf{v}_\mathrm{ELN}$ |
| Embedding projection | $\mathbf{v}_\mathrm{ELN}$ | Prefix $\mathbf{p}$ via MLP |
| Model input assembly | $\mathbf{p}$, token embeddings of $\mathcal{H}$ | Input embedding sequence |
| Correction via LLM | Input embeddings | Output transcription |

4. Training Protocol and Hyperparameters

  • Model: LLaMA-2-7B (4-bit quantized) with LoRA adapters.
  • Data: Top-5 hypotheses and corresponding ELN vectors, paired to ground-truth Persian text.
  • Optimization: AdamW optimizer, learning rate $2 \times 10^{-4}$, cosine scheduler, weight decay (see the configuration sketch after this list).
  • Procedure: Trained for 3 epochs with gradient accumulation and checkpointing of LoRA weights.
  • Objective: Cross-entropy loss (teacher-forcing) for transcription sequence prediction.
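
A minimal configuration sketch of this protocol using Hugging Face transformers and peft; the LoRA rank, target modules, batch size, gradient-accumulation steps, and weight-decay value are illustrative assumptions, while the optimizer, learning rate, scheduler, epoch count, and 4-bit quantization follow the settings listed above:

```python
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantization for the LLaMA-2-7B backbone, e.g. passed to
# AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
#                                      quantization_config=bnb_config).
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

# LoRA adapter configuration; rank, alpha, and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Optimization settings; batch size, accumulation steps, and the weight-decay
# value are assumptions -- the remaining values follow the reported protocol.
training_args = TrainingArguments(
    output_dir="eln-correction",
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    optim="adamw_torch",
)
```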

5. Effects on Word Error Rate (WER)

Quantitative evaluation demonstrates the impact of ELN-enhanced models across noise regimes (all values are WER, %):

| Method | Clean | Mixed Noise | SNR = 5 dB | SNR = 10 dB |
| --- | --- | --- | --- | --- |
| Raw Whisper | 24.80 | 31.10 | 42.70 | 38.30 |
| Base LLaMA2 (zero-shot) | 62.43 | 64.58 | 70.63 | 67.75 |
| Fine-tuned (no ELN) | 24.06 | 30.79 | 39.76 | 31.59 |
| Fine-tuned + ELN (Ours) | 24.39 | 24.84 | 32.34 | 28.02 |

On the Mixed Noise set, ELN conditioning reduces WER from 30.79% to 24.84%, a relative 19% reduction compared to the fine-tuned text-only baseline. For SNR = 5 dB, WER falls from 39.76% to 32.34%. A cross-lingual evaluation on VB-DEMAND (English) shows an improvement from 7.93% to 3.96%, outperforming the RobustGER baseline (from 10.70% to 13.00%). Ablation studies confirm that the performance boost is attributable to ELN: fine-tuned text-only models do not achieve similar gains. A strong correlation is observed between the norm of ELN vectors and per-utterance WER, highlighting ELN’s sensitivity to noise-induced transcriptional uncertainty.

6. Analysis, Significance, and Applications

ELN offers a principled, extensible mechanism to signal hypothesis disagreement both at the semantic and token levels, enabling LLMs to become explicitly aware of the reliability of transcriptions under noise. By integrating a compact, information-rich summary of cross-hypothesis variance into LLM-based correction models via differentiable adapters and prefix-tuning, ELN facilitates robust, scalable enhancement of noisy-speech ASR—particularly for low-resource languages and challenging acoustic scenarios. The observed reduction in WER across both Persian and English benchmarks positions ELN as a foundational building block for future research in noise-robust multimodal sequence modeling and uncertainty-aware LLM adaptation (Rahmani et al., 19 Dec 2025).

