Audio-Grounded LSEC for Speaker Label Correction
- The paper demonstrates that integrating acoustic speaker scores with lexical corrections significantly reduces word diarization error rates, with early fusion achieving up to a 41% reduction over baselines.
- AG-LSEC employs both early and late fusion mechanisms to combine acoustic and lexical information, enhancing corrections at challenging overlap and speaker-change regions.
- Empirical results indicate that early fusion delivers the highest error-correction rate while introducing the fewest new speaker-labeling errors; late fusion remains effective when paired audio-text data are scarce.
Audio-Grounded Lexical Speaker Error Correction (AG-LSEC) is a hybrid framework designed to improve word-level speaker labeling accuracy in automated speech transcription pipelines. It extends the Lexical Speaker Error Correction (LSEC) paradigm by explicitly integrating acoustic speaker scores from a neural speaker diarization subsystem, providing additional grounding to mitigate errors introduced by timing deviations and overlapping speech. The approach combines word-level lexical cues with acoustic information via early or late fusion architectures and achieves substantial reductions in Word Diarization Error Rate (WDER) across several English conversational benchmarks (Paturi et al., 2024).
1. Background and Motivation
Speaker Diarization (SD), the task of producing a transcript marked with "who spoke what when," is commonly realized as a modular pipeline comprising an audio-only SD system and an Automatic Speech Recognition (ASR) system. The SD module yields frame-level speaker posteriors or hard labels, whereas the ASR system outputs word transcripts with corresponding time stamps. A downstream reconciliation step aligns words with the most likely speaker label based on frame overlap. However, limitations arise due to errors at speaker-change points, overlapping speech, and slight misalignment between SD and ASR outputs.
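The reconciliation step described above can be sketched as a majority vote over the diarization frames that each word covers. This is a minimal illustration, not the pipeline's actual implementation: the function name `assign_speakers`, the 10 ms frame shift, and the toy data are all assumptions.

```python
from collections import Counter

def assign_speakers(words, frame_labels, frame_shift=0.01):
    """Give each word the speaker whose frames overlap it most.

    words: list of (token, start_sec, end_sec) from the ASR system.
    frame_labels: one hard speaker label per diarization frame.
    frame_shift: frame duration in seconds (assumed value).
    """
    labeled = []
    for token, start, end in words:
        first = int(round(start / frame_shift))
        last = max(first + 1, int(round(end / frame_shift)))
        span = frame_labels[first:last]
        # Majority vote over the frames covered by the word.
        speaker = Counter(span).most_common(1)[0][0] if span else None
        labeled.append((token, speaker))
    return labeled

# Toy example: the speaker changes from A to B at 0.5 s.
frames = ["A"] * 50 + ["B"] * 50
words = [("hello", 0.10, 0.40), ("there", 0.42, 0.62), ("hi", 0.70, 0.90)]
labels = assign_speakers(words, frames)
print(labels)  # [('hello', 'A'), ('there', 'B'), ('hi', 'B')]
```

The word "there" straddles the change point (8 frames of A, 12 of B), so a small shift in either the word time stamps or the diarization boundary would flip its label. This is the kind of error the correction stage targets.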
To address reconciliation-induced speaker errors, LSEC uses a transformer-based lexical language model (LM) to post-process and correct speaker attributions at the word level using only text. Although LSEC exploits conversational turn-taking cues, its reliance on lexical context alone can result in over-corrections (lexically plausible but acoustically incorrect labels) or under-corrections, particularly in highly disfluent or overlapping regions.
AG-LSEC is motivated by the hypothesis that fusing acoustic speaker activity signals, specifically word-level "speaker scores" derived from the diarization system, with lexical correction models can yield more accurate assignments. This acoustic grounding is particularly beneficial near boundary and overlap regions, providing a complementary signal to the LM's lexical cues.
2. Model Structure and Acoustic-Lexical Fusion
The AG-LSEC architecture builds on the SD + ASR + LSEC pipeline by introducing two fusion mechanisms for integrating acoustic speaker scores into the lexical correction process: early fusion and late fusion.
2.1 Acoustic Speaker Score Extraction
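An End-to-End Neural Diarization (EEND) system processes input acoustic features to estimate frame-level speaker posteriors:
$$\mathbf{p}_t = [p_{t,1}, \dots, p_{t,K}] \in [0,1]^K, \quad t = 1, \dots, T,$$
where $K$ is the number of speakers. The diarization network is trained using permutation-free binary cross-entropy:
$$\mathcal{L}_{\mathrm{SD}} = \frac{1}{TK} \min_{\phi \in \mathrm{perm}(K)} \sum_{t=1}^{T} \mathrm{BCE}\left(\mathbf{l}_t^{\phi}, \mathbf{p}_t\right),$$
where $\mathbf{l}_t^{\phi}$ is the reference label vector at frame $t$ under speaker permutation $\phi$. Post-processing applies a median filter with half-window $m$ to smooth the posteriors:
$$\tilde{p}_{t,k} = \mathrm{median}\left(p_{t-m,k}, \dots, p_{t+m,k}\right)$$
Word-level pooling is performed over ASR-derived boundaries $t_i^{\mathrm{start}}$ to $t_i^{\mathrm{end}}$:
$$\bar{s}_{i,k} = \frac{1}{t_i^{\mathrm{end}} - t_i^{\mathrm{start}} + 1} \sum_{t = t_i^{\mathrm{start}}}^{t_i^{\mathrm{end}}} \tilde{p}_{t,k}$$
across all speakers, then normalized:
$$s_{i,k} = \frac{\bar{s}_{i,k}}{\sum_{k'=1}^{K} \bar{s}_{i,k'}}$$
The resulting $\mathbf{s}_i = [s_{i,1}, \dots, s_{i,K}]$ forms the set of word-level acoustic speaker scores for each word $w_i$.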
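The smoothing, pooling, and normalization steps above can be sketched in a few lines of plain Python. The details here are assumptions made for illustration: mean pooling over the word's frames, normalization so that scores sum to one across speakers, a half-window of one frame, and the helper names `median_filter` and `word_speaker_scores`.

```python
from statistics import median

def median_filter(track, half_window=1):
    """Median-smooth one speaker's posterior track (window clipped at the edges)."""
    n = len(track)
    return [median(track[max(0, t - half_window):min(n, t + half_window + 1)])
            for t in range(n)]

def word_speaker_scores(posteriors, word_bounds, half_window=1):
    """Turn frame-level posteriors into one normalized score vector per word.

    posteriors: one list of per-frame posteriors for each speaker.
    word_bounds: (start_frame, end_frame) per word, end frame inclusive.
    """
    smoothed = [median_filter(track, half_window) for track in posteriors]
    scores = []
    for start, end in word_bounds:
        # Mean-pool each speaker's smoothed posterior over the word (assumed pooling).
        pooled = [sum(track[start:end + 1]) / (end - start + 1) for track in smoothed]
        total = sum(pooled)
        if total > 0:
            scores.append([p / total for p in pooled])
        else:
            # No speaker active in this span: fall back to a uniform score.
            scores.append([1.0 / len(pooled)] * len(pooled))
    return scores

# Two speakers, six frames; speaker 0 has a one-frame dip and a one-frame spike.
posteriors = [[1, 1, 0, 1, 0, 0],
              [0, 0, 0, 0, 1, 1]]
scores = word_speaker_scores(posteriors, [(0, 2), (2, 5)])
print(scores)
```

In this toy input the median filter fills the dip at frame 2 and removes the spike at frame 3, so the first word is attributed entirely to speaker 0 while the second, which spans the change, receives scores of roughly one third and two thirds.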
2.2 Early Fusion
In early fusion, each word $w_i$ is tokenized into sub-words, and context-sensitive sub-word embeddings $\mathbf{e}_{i,j}$ are obtained via a pre-trained language model (e.g., RoBERTa). The word-level speaker score $\mathbf{s}_i$ is concatenated to the embedding for the first sub-word of $w_i$, and a "don't care" vector $\mathbf{d}$ is used for subsequent sub-words. The concatenated vectors
$$\mathbf{x}_{i,j} = \begin{cases} [\mathbf{e}_{i,j}; \mathbf{s}_i], & j = 1 \\ [\mathbf{e}_{i,j}; \mathbf{d}], & j > 1 \end{cases}$$
are passed through a Transformer encoder, after which a softmax layer produces per-sub-word speaker posteriors. The final word-level speaker posteriors $\hat{\mathbf{p}}_i$ are taken from the first sub-word of each word.
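The input construction for early fusion can be illustrated as follows. This sketch covers only the concatenation step, not the encoder; the function name, the use of plain lists for embeddings, and the all-zero "don't care" vector are assumptions, since the text does not specify the placeholder's value.

```python
def build_early_fusion_inputs(subword_embeddings, speaker_scores, dont_care=None):
    """Concatenate acoustic scores onto sub-word embeddings.

    subword_embeddings: per word, a list of sub-word embedding vectors.
    speaker_scores: per word, its acoustic speaker-score vector.
    Returns the flat sequence of concatenated vectors and, for each word,
    the position of its first sub-word in that sequence.
    """
    num_speakers = len(speaker_scores[0])
    if dont_care is None:
        dont_care = [0.0] * num_speakers  # assumed placeholder value
    inputs, first_index = [], []
    for embeddings, scores in zip(subword_embeddings, speaker_scores):
        first_index.append(len(inputs))
        for j, emb in enumerate(embeddings):
            # Only the first sub-word of a word carries the real score.
            extra = scores if j == 0 else dont_care
            inputs.append(list(emb) + list(extra))
    return inputs, first_index

# Two words: the first splits into two sub-words, the second into one.
embeddings = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [[0.7, 0.8, 0.9]]]
scores = [[0.9, 0.1], [0.2, 0.8]]
inputs, first_index = build_early_fusion_inputs(embeddings, scores)
print(inputs)
print(first_index)  # [0, 2]
```

The returned `first_index` marks where each word's first sub-word sits, which is where the word-level posterior would be read from after the encoder and softmax.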
2.3 Late Fusion
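Late fusion first runs standard LSEC over the transcript, yielding lexical speaker posteriors $\mathbf{p}_i^{\mathrm{lex}}$. For each word, the acoustic speaker score $\mathbf{s}_i$ is concatenated to the lexical posterior and input to a small feed-forward neural network ("FusionNet"):
$$\hat{\mathbf{p}}_i = \mathrm{FusionNet}\left([\mathbf{p}_i^{\mathrm{lex}}; \mathbf{s}_i]\right)$$
FusionNet yields the final speaker posterior $\hat{\mathbf{p}}_i$ per word.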
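A forward pass through a network of this shape might look like the sketch below. The source says only that FusionNet is a small feed-forward network, so the single hidden ReLU layer, the hidden size of 8, and the random untrained weights are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def init_params(num_speakers=2, hidden_size=8, seed=0):
    """Random, untrained weights; the layer sizes are illustrative assumptions."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W1": matrix(hidden_size, 2 * num_speakers), "b1": [0.0] * hidden_size,
        "W2": matrix(num_speakers, hidden_size), "b2": [0.0] * num_speakers,
    }

def fusion_net(lexical_posterior, speaker_score, params):
    """Concatenate the two inputs, apply one ReLU layer, then softmax over speakers."""
    x = list(lexical_posterior) + list(speaker_score)
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(params["W2"], params["b2"])]
    return softmax(logits)

params = init_params()
# Lexical model is unsure (0.55 vs 0.45); acoustic score favors speaker 1.
fused = fusion_net([0.55, 0.45], [0.10, 0.90], params)
print(fused)
```

With untrained weights the output is a valid but meaningless distribution. After training, the network would learn how much to trust each input.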
3. Training Objectives and Implementation
3.1 Loss Functions
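The diarization module is trained with permutation-free binary cross-entropy loss as above. The correction head for both fusion types uses a standard cross-entropy objective over the true speaker label $y_i$:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{N} \log \hat{p}_i(y_i)$$
where $\hat{p}_i(y_i)$ is the predicted posterior of the true speaker for word $i$ and $N$ is the number of words.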
No additional regularization is applied.
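As a small worked instance of the cross-entropy objective, the sketch below sums the negative log-posterior of the true speaker over the words in a window. Whether the loss is summed or averaged over words is not stated in the text, so the summation and the `eps` clamp are assumptions.

```python
import math

def cross_entropy(posteriors, labels, eps=1e-12):
    """Sum of negative log-probabilities assigned to the true speaker of each word."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(posteriors, labels))

# Two words, two speakers: true speakers are 0 and 1.
posteriors = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = cross_entropy(posteriors, labels)
print(round(loss, 4))  # 0.9808
```

The value is $-(\ln 0.5 + \ln 0.75) \approx 0.9808$. A perfectly confident and correct prediction for every word would give a loss of zero.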
3.2 Training Data and Hyperparameters
Training initializes from a text-only LSEC checkpoint. Paired gold-standard data are drawn from the Fisher corpus (single-channel two-speaker conversations with gold transcripts and labels). The optimizer is Adam with a batch size of 32 and approximately 30 words per batch, run for 20 epochs and distributed over 8 GPUs. Training is performed on sliding windows comprising at most two locally active speakers.
4. Evaluation: Metrics, Datasets, and Empirical Performance
4.1 Word-Diarization Error Rate (WDER)
WDER jointly penalizes ASR substitution, deletion, and insertion errors while also measuring speaker-label assignment errors at the word level. After the hypothesis is aligned with the reference transcript (using asclite), WDER is computed from the word-level alignment counts, where $S$, $D$, and $I$ denote the numbers of substitutions, deletions, and insertions, respectively, together with the speaker labels of the aligned words.
4.2 Datasets
- Fisher: roughly 2,000 two-speaker telephone conversations, roughly 100 hours, moderate overlap (about 15%).
- CallHome American English (CHAE): roughly 500 calls, two speakers each, roughly 20 hours, about 10% overlap.
- RT03-CTS: roughly 40 calls, two speakers each, roughly 3 hours, about 12% overlap.
4.3 Empirical Results
| Model | Fisher | RT03-CTS | CHAE |
|---|---|---|---|
| Baseline (SD+ASR) | 2.56 | 2.64 | 3.45 |
| LSEC (text only) | 2.03 | 2.10 | 2.89 |
| AG-LSEC late fusion | 1.80 | 1.80 | 2.72 |
| AG-LSEC early fusion | 1.56 | 1.56 | 2.48 |
Relative to the SD+ASR baseline, early fusion reduces WDER by 39–41% on Fisher and RT03-CTS and by about 28% on CHAE; relative to LSEC, the reduction is 14–26%. Late fusion yields smaller but consistent gains (21–32% vs. the baseline, 6–14% vs. LSEC) and remains data-efficient.
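The relative reductions can be recomputed directly from the WDER table. The snippet below is only arithmetic on the reported values; the dictionary layout and the helper name `relative_reduction` are illustrative.

```python
# WDER values copied from the results table.
wder = {
    "Baseline (SD+ASR)":    {"Fisher": 2.56, "RT03-CTS": 2.64, "CHAE": 3.45},
    "LSEC (text only)":     {"Fisher": 2.03, "RT03-CTS": 2.10, "CHAE": 2.89},
    "AG-LSEC late fusion":  {"Fisher": 1.80, "RT03-CTS": 1.80, "CHAE": 2.72},
    "AG-LSEC early fusion": {"Fisher": 1.56, "RT03-CTS": 1.56, "CHAE": 2.48},
}

def relative_reduction(reference, system):
    """Percentage by which `system` lowers the error rate of `reference`."""
    return 100.0 * (reference - system) / reference

for dataset in ("Fisher", "RT03-CTS", "CHAE"):
    early = wder["AG-LSEC early fusion"][dataset]
    vs_base = relative_reduction(wder["Baseline (SD+ASR)"][dataset], early)
    vs_lsec = relative_reduction(wder["LSEC (text only)"][dataset], early)
    print(f"{dataset}: {vs_base:.1f}% vs. baseline, {vs_lsec:.1f}% vs. LSEC")
```

For early fusion this gives 39.1%, 40.9%, and 28.1% against the baseline and 23.2%, 25.7%, and 14.2% against LSEC on Fisher, RT03-CTS, and CHAE, respectively.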
| Model | Errors Corrected (%) | Errors Introduced (%) |
|---|---|---|
| LSEC | 29.20 | 8.40 |
| AG-LSEC late fusion | 38.18 | 9.20 |
| AG-LSEC early fusion | 44.53 | 6.60 |
Early fusion achieves the highest rate of error correction with the lowest rate of new speaker-labeling errors. The largest empirical improvements occur at overlap regions and speaker-change boundaries, while over-correction in single-speaker regions is minimized with acoustic grounding.
5. Analysis and Implications
Acoustic speaker scores derived from EEND provide cues about actual speaker activity (e.g., pitch, timbre) that are absent from the lexical input, reducing both over-correction and under-correction by the LM. Early fusion, in which acoustic and lexical features are jointly re-encoded, consistently delivers the largest absolute and relative WDER reductions. Late fusion, which applies a shallow fusion network post hoc, is advantageous when paired audio-text data are limited (about 20 min suffices for a significant improvement over LSEC).
A plausible implication is that acoustic grounding with streaming or modular diarization pipelines can be efficiently retrofitted without extensive retraining, making AG-LSEC practical in resource-constrained domains.
6. Limitations and Prospective Directions
AG-LSEC is currently limited to English and supports only local two-speaker windows; utterance spans with more than two concurrent speakers are bypassed. Model gains may diminish under low signal-to-noise conditions or when diarization outputs are themselves unreliable. Future work includes extending AG-LSEC to multilingual settings, supporting dynamic speaker counting and overlapping scenarios, and joint end-to-end training of SD and LSEC components. The integration with LLMs is also highlighted as an area for enhancement. Incorporating dynamic context windows and robust speaker posteriors may further advance performance and applicability (Paturi et al., 2024).