Audio-Grounded LSEC for Speaker Label Correction

Updated 2 June 2026
  • The paper demonstrates that integrating acoustic speaker scores with lexical corrections significantly reduces word diarization error rates, with early fusion achieving up to a 41% reduction over baselines.
  • AG-LSEC employs both early and late fusion mechanisms to combine acoustic and lexical information, enhancing corrections at challenging overlap and speaker-change regions.
  • Empirical results indicate that early fusion not only delivers higher error correction rates but also minimizes new speaker-labeling errors, making it efficient even in resource-constrained scenarios.

Audio-Grounded Lexical Speaker Error Correction (AG-LSEC) is a hybrid framework designed to improve word-level speaker labeling accuracy in automated speech transcription pipelines. It extends the Lexical Speaker Error Correction (LSEC) paradigm by explicitly integrating acoustic speaker scores from a neural speaker diarization subsystem, providing additional grounding to mitigate errors introduced by timing deviations and overlapping speech. The approach combines word-level lexical cues with acoustic information via early or late fusion architectures and achieves substantial reductions in Word Diarization Error Rate (WDER) across several English conversational benchmarks (Paturi et al., 2024).

1. Background and Motivation

Speaker Diarization (SD), the task of producing a transcript marked with "who spoke what when," is commonly realized as a modular pipeline comprising an audio-only SD system and an Automatic Speech Recognition (ASR) system. The SD module yields frame-level speaker posteriors or hard labels, whereas the ASR system outputs word transcripts with corresponding time stamps. A downstream reconciliation step aligns words with the most likely speaker label based on frame overlap. However, limitations arise due to errors at speaker-change points, overlapping speech, and slight misalignment between SD and ASR outputs.
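The reconciliation step can be illustrated with a minimal sketch. It assumes hard frame-level labels, a 10 ms frame shift, and a majority-vote rule; the function name `reconcile` and these specifics are illustrative assumptions, not details given by the paper.

```python
from collections import Counter

def reconcile(words, frame_labels, frame_shift=0.01):
    """Give each ASR word the speaker whose SD frames overlap it most.

    words: list of (token, start_sec, end_sec) tuples from the ASR system.
    frame_labels: one hard speaker id per SD frame (None marks silence).
    frame_shift: assumed frame step in seconds (hypothetical value).
    """
    labeled = []
    for token, start, end in words:
        first = int(round(start / frame_shift))
        last = max(first + 1, int(round(end / frame_shift)))
        # Count speaker ids over the frames spanned by the word.
        counts = Counter(s for s in frame_labels[first:last] if s is not None)
        speaker = counts.most_common(1)[0][0] if counts else None
        labeled.append((token, speaker))
    return labeled

# Toy example: speaker A holds the first 0.5 s, speaker B the next 0.5 s.
frames = ["A"] * 50 + ["B"] * 50
words = [("hello", 0.10, 0.40), ("there", 0.42, 0.56), ("hi", 0.60, 0.90)]
result = reconcile(words, frames)
```

In this toy input the word "there" straddles the speaker change and is assigned to A by majority vote, which is the kind of boundary case where a small timing offset between SD and ASR can flip the label.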

To address reconciliation-induced speaker errors, LSEC uses a transformer-based language model (LM) to post-process and correct speaker attributions at the word level using only text. Although it exploits conversational turn-taking cues, LSEC's reliance on lexical context alone can result in over-corrections (lexically plausible but acoustically incorrect labels) or under-corrections, particularly in highly disfluent or overlapping regions.

AG-LSEC is motivated by the hypothesis that fusing acoustic speaker activity signals, specifically word-level "speaker scores" derived from the diarization system, with lexical correction models can yield more accurate assignments. This acoustic grounding is particularly beneficial near boundary and overlap regions, providing a complementary signal to the LM's lexical cues.

2. Model Structure and Acoustic-Lexical Fusion

The AG-LSEC architecture builds on the SD + ASR + LSEC pipeline by introducing two fusion mechanisms for integrating acoustic speaker scores into the lexical correction process: early fusion and late fusion.

2.1 Acoustic Speaker Score Extraction

An End-to-End Neural Diarization (EEND) system processes input acoustic features $\{\boldsymbol x_t\}_{t=1}^T$ to estimate frame-level speaker posteriors:

$$(\boldsymbol p_1,\dots,\boldsymbol p_T) = f_{\mathrm{EEND}}(\boldsymbol x_1,\dots,\boldsymbol x_T), \quad \boldsymbol p_t = \{p_{s,t}\}_{s=1}^S$$

where $S$ is the number of speakers. The diarization network is trained using permutation-free binary cross-entropy:

$$\mathcal L_{\mathrm{diar}} = \frac{1}{T S} \min_{\phi \in \mathrm{perm}(S)} \sum_{t=1}^T \mathrm{BCE}(\boldsymbol y_t^\phi, \boldsymbol p_t).$$
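The loss above can be sketched in plain Python by trying every speaker permutation and keeping the cheapest one. This assumes that BCE sums over the $S$ speakers of a frame and that labels are 0/1 activity indicators; the function names are illustrative, and the exhaustive search is only practical for small $S$.

```python
import math
from itertools import permutations

def bce(y, p, eps=1e-7):
    """Binary cross-entropy for one label/posterior pair, clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def permutation_free_bce(labels, posteriors):
    """labels, posteriors: T x S nested lists (labels are 0/1 activities).

    Returns the minimum over speaker permutations of the summed BCE,
    normalized by T * S as in the formula above.
    """
    T, S = len(posteriors), len(posteriors[0])
    best = float("inf")
    for perm in permutations(range(S)):
        total = sum(
            bce(labels[t][perm[s]], posteriors[t][s])
            for t in range(T)
            for s in range(S)
        )
        best = min(best, total)
    return best / (T * S)

# The posteriors track the right activity pattern but with speakers swapped.
labels = [[1, 0], [1, 0], [0, 1]]
swapped_labels = [[0, 1], [0, 1], [1, 0]]
posteriors = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
loss = permutation_free_bce(labels, posteriors)
loss_swapped = permutation_free_bce(swapped_labels, posteriors)
```

Because the minimum is taken over label orderings, `loss` and `loss_swapped` coincide: the network is not penalized for which output slot it uses for which speaker.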

Post-processing applies a median filter to smooth the posteriors:

$$\hat p_{s,t} = \mathrm{median}(p_{s,t-\frac{M}{2}}, \dots, p_{s,t+\frac{M}{2}})$$
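A minimal sketch of this smoothing step for a single speaker track follows. The window parameter and the edge handling (truncating the window at the sequence ends) are assumptions; the source does not state either.

```python
from statistics import median

def median_smooth(track, M=10):
    """Median-filter one speaker's posterior track.

    track: list of posteriors p_{s,t} for a fixed speaker s.
    M: window parameter; frames t - M//2 .. t + M//2 are used
       (hypothetical default, truncated at the edges by assumption).
    """
    half = M // 2
    return [
        median(track[max(0, t - half): t + half + 1])
        for t in range(len(track))
    ]

# A one-frame dropout inside an otherwise active region is removed.
noisy = [0.9, 0.9, 0.1, 0.9, 0.9]
smoothed = median_smooth(noisy, M=2)
```

The example shows the intended effect: an isolated dip in speaker activity is replaced by the surrounding value, which stabilizes the scores pooled over each word.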

Word-level pooling is performed over ASR-derived boundaries $t_{i,\mathrm{start}}$ to $t_{i,\mathrm{end}}$:

$$a_{s,i} = \frac{1}{t_{i,\mathrm{end}} - t_{i,\mathrm{start}} + 1} \sum_{t=t_{i,\mathrm{start}}}^{t_{i,\mathrm{end}}} \hat p_{s,t}$$

across all speakers, then normalized:

$$\hat a_{s,i} = \frac{a_{s,i}}{\sum_{s'=1}^S a_{s',i}}$$

The resulting $\hat{\boldsymbol a}_i = (\hat a_{1,i}, \dots, \hat a_{S,i})$ forms the set of word-level acoustic speaker scores for each word $w_i$.
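The pooling and normalization steps can be sketched as follows, assuming the smoothed posteriors are stored as a T x S nested list and word boundaries are inclusive frame indices. The uniform fallback for an all-zero word is an added assumption, not something the source specifies.

```python
def word_speaker_scores(smoothed, t_start, t_end):
    """Compute normalized acoustic speaker scores for one word.

    smoothed: T x S nested list of median-filtered posteriors.
    t_start, t_end: inclusive frame indices of the word (from ASR timing).
    """
    num_speakers = len(smoothed[0])
    num_frames = t_end - t_start + 1
    # Average each speaker's posterior over the word's frames.
    pooled = [
        sum(smoothed[t][s] for t in range(t_start, t_end + 1)) / num_frames
        for s in range(num_speakers)
    ]
    total = sum(pooled)
    if total == 0.0:
        # Assumed fallback: no speaker activity detected, use uniform scores.
        return [1.0 / num_speakers] * num_speakers
    # Normalize so the scores sum to one across speakers.
    return [a / total for a in pooled]

smoothed = [[0.8, 0.1], [0.6, 0.3], [0.2, 0.9]]
scores = word_speaker_scores(smoothed, 0, 1)  # word spans frames 0 and 1
```

For this toy word the pooled values are 0.7 and 0.2, so the normalized scores are 7/9 and 2/9.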

2.2 Early Fusion

In early fusion, each word $w_i$ is tokenized into sub-words, and context-sensitive sub-word embeddings are obtained via a pre-trained language model (e.g., RoBERTa). The word-level speaker score $\hat{\boldsymbol a}_i$ is concatenated to the embedding of the first sub-word of $w_i$, and a "don't care" vector is used for subsequent sub-words. The concatenated vectors are passed through a Transformer encoder, after which a softmax layer produces per-sub-word speaker posteriors. The final word-level speaker posteriors are taken from the first sub-word of each word.
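The input construction for early fusion can be sketched as below. The sketch covers only the concatenation step, not the Transformer encoder; the value used for the "don't care" vector (here -1.0 in every slot) and all names are assumptions made for illustration.

```python
def early_fusion_inputs(subword_embeddings, word_ids, word_scores, dont_care=-1.0):
    """Append acoustic scores to sub-word embeddings.

    subword_embeddings: one embedding (list of floats) per sub-word.
    word_ids: index of the word each sub-word belongs to.
    word_scores: per-word acoustic speaker scores (list of S floats each).
    dont_care: filler for non-initial sub-words (assumed value).
    """
    fused = []
    seen_words = set()
    for embedding, word_id in zip(subword_embeddings, word_ids):
        if word_id not in seen_words:
            # First sub-word of the word carries the real speaker scores.
            seen_words.add(word_id)
            extra = list(word_scores[word_id])
        else:
            # Later sub-words of the same word carry the filler vector.
            extra = [dont_care] * len(word_scores[word_id])
        fused.append(list(embedding) + extra)
    return fused

# Two words; the first is split into two sub-words.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
word_ids = [0, 0, 1]
scores = [[0.8, 0.2], [0.3, 0.7]]
fused = early_fusion_inputs(embeddings, word_ids, scores)
```

Each fused vector has the embedding dimension plus $S$ extra entries, so the downstream encoder sees acoustic and lexical evidence jointly for every word-initial sub-word.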

2.3 Late Fusion

Late fusion first runs standard LSEC over the transcript, yielding a lexical speaker posterior for each word. The word's acoustic speaker score $\hat{\boldsymbol a}_i$ is concatenated to this lexical posterior, and the result is fed to a small feed-forward neural network ("FusionNet"), which outputs the final speaker posterior for the word.
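A rough sketch of the late-fusion forward pass is given below. The source only says FusionNet is a small feed-forward network, so the single hidden layer, the ReLU activation, the hidden size, and the random untrained weights are all assumptions chosen to keep the example runnable.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def fusion_net(lexical_posterior, acoustic_score, params):
    """Concatenate both inputs, apply one hidden ReLU layer, then softmax."""
    x = list(lexical_posterior) + list(acoustic_score)
    hidden = [max(0.0, v) for v in linear(x, params["W1"], params["b1"])]
    return softmax(linear(hidden, params["W2"], params["b2"]))

def random_params(n_in, n_hidden, n_out, seed=0):
    """Untrained placeholder weights (hypothetical sizes and initialization)."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
        "W2": matrix(n_out, n_hidden), "b2": [0.0] * n_out,
    }

params = random_params(n_in=4, n_hidden=8, n_out=2)
fused_posterior = fusion_net([0.6, 0.4], [0.2, 0.8], params)
```

With untrained weights the output is not meaningful; the point is the data flow. Only this small network needs paired audio-text data, since the LSEC model that produces the lexical posterior is left unchanged.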

3. Training Objectives and Implementation

3.1 Loss Functions

The diarization module is trained with the permutation-free binary cross-entropy loss given above. The correction head for both fusion types uses a standard cross-entropy objective over the true speaker label of each word.

No additional regularization is applied.
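The correction objective can be sketched as an average negative log-likelihood of the true speaker under the predicted word-level posterior. Averaging over words and the clipping constant are assumptions; the source states only that a standard cross-entropy is used.

```python
import math

def word_cross_entropy(posteriors, labels, eps=1e-12):
    """Mean cross-entropy over words.

    posteriors: per-word speaker posteriors (each a list of S probabilities).
    labels: index of the true speaker for each word.
    """
    losses = [-math.log(max(p[y], eps)) for p, y in zip(posteriors, labels)]
    return sum(losses) / len(losses)

# Two words, correctly attributed with probability 0.9 and 0.8.
posteriors = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]
loss = word_cross_entropy(posteriors, labels)
```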

3.2 Training Data and Hyperparameters

Training initializes from a text-only LSEC checkpoint. Paired gold-standard data are drawn from the Fisher corpus (single-channel two-speaker conversations with gold transcripts and labels). The optimizer is Adam with learning rate […], batch size 32, and approximately 30 words per batch for 20 epochs, distributed over 8 GPUs. Training is performed on sliding windows comprising at most two locally active speakers.

4. Evaluation: Metrics, Datasets, and Empirical Performance

4.1 Word-Diarization Error Rate (WDER)

WDER jointly penalizes ASR substitution, deletion, and insertion errors while also measuring speaker-label assignment errors at the word level. It is computed after aligning the hypothesis with the reference transcript (using asclite), from the resulting counts of word-level substitutions, deletions, and insertions together with the speaker-label errors on the aligned words.

4.2 Datasets

  • Fisher: ~2000 two-speaker telephone conversations, ~100 hours, moderate overlap (~15%).
  • CallHome American English (CHAE): ~500 calls, two speakers each, ~20 hours, ~10% overlap.
  • RT03-CTS: ~40 calls, two speakers each, ~3 hours, ~12% overlap.

4.3 Empirical Results

WDER by model and test set:

Model                | Fisher | RT03-CTS | CHAE
Baseline (SD+ASR)    | 2.56   | 2.64     | 3.45
LSEC (text only)     | 2.03   | 2.10     | 2.89
AG-LSEC late fusion  | 1.80   | 1.80     | 2.72
AG-LSEC early fusion | 1.56   | 1.56     | 2.48

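Relative to the SD+ASR baseline, early fusion reduces WDER by about 39% on Fisher, 41% on RT03-CTS, and 28% on CHAE; relative to text-only LSEC, the reductions are about 23%, 26%, and 14%, respectively. Late fusion yields smaller but consistent gains (about 30%, 32%, and 21% vs. the baseline) and remains data-efficient.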
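The relative reductions follow directly from the WDER table in this subsection. The short script below recomputes them; the dictionary layout and function name are illustrative, while the numbers are copied from the table.

```python
# WDER values copied from the results table (Fisher, RT03-CTS, CHAE).
WDER = {
    "baseline": {"Fisher": 2.56, "RT03-CTS": 2.64, "CHAE": 3.45},
    "lsec": {"Fisher": 2.03, "RT03-CTS": 2.10, "CHAE": 2.89},
    "late": {"Fisher": 1.80, "RT03-CTS": 1.80, "CHAE": 2.72},
    "early": {"Fisher": 1.56, "RT03-CTS": 1.56, "CHAE": 2.48},
}

def relative_reduction(reference, new):
    """Percentage by which `new` improves on `reference`."""
    return 100.0 * (reference - new) / reference

# Early fusion compared with the SD+ASR baseline and with text-only LSEC.
early_vs_baseline = {
    name: relative_reduction(WDER["baseline"][name], WDER["early"][name])
    for name in WDER["baseline"]
}
early_vs_lsec = {
    name: relative_reduction(WDER["lsec"][name], WDER["early"][name])
    for name in WDER["lsec"]
}

for name in WDER["baseline"]:
    print(f"{name}: {early_vs_baseline[name]:.1f}% vs. baseline, "
          f"{early_vs_lsec[name]:.1f}% vs. LSEC")
```

For example, on Fisher the reduction against the baseline is (2.56 - 1.56) / 2.56, roughly 39.1%.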

Model                | Errors Corrected (%) | Errors Introduced (%)
LSEC                 | 29.20                | 8.40
AG-LSEC late fusion  | 38.18                | 9.20
AG-LSEC early fusion | 44.53                | 6.60

Early fusion achieves the highest rate of error correction with the lowest rate of new speaker-labeling errors. The largest empirical improvements occur at overlap regions and speaker-change boundaries, while over-correction in single-speaker regions is minimized with acoustic grounding.

5. Analysis and Implications

Acoustic speaker scores derived from EEND provide cues related to actual speaker activity (e.g., pitch, timbre) that are absent from the lexical input, reducing both over-correction and under-correction by the LM. Early fusion, where acoustic and lexical features are jointly re-encoded, consistently delivers the largest absolute and relative WDER reductions. Late fusion, which applies a shallow fusion network post hoc, is advantageous in scenarios with limited paired audio-text data (about 20 min suffices for a significant improvement over LSEC).

A plausible implication is that acoustic grounding with streaming or modular diarization pipelines can be efficiently retrofitted without extensive retraining, making AG-LSEC practical in resource-constrained domains.

6. Limitations and Prospective Directions

AG-LSEC is currently limited to English and supports only local two-speaker windows; utterance spans with more than two concurrent speakers are bypassed. Model gains may diminish under low signal-to-noise conditions or when diarization outputs are themselves unreliable. Future work includes extending AG-LSEC to multilingual settings, supporting dynamic speaker counting and overlapping scenarios, and joint end-to-end training of SD and LSEC components. The integration with LLMs is also highlighted as an area for enhancement. Incorporating dynamic context windows and robust speaker posteriors may further advance performance and applicability (Paturi et al., 2024).
