- The paper introduces a unified framework that leverages speech-aware LLMs with intermediate layer augmentation to produce direct speaker-labeled transcripts.
- It demonstrates significant performance gains, with up to a 38% reduction in WDER and a 14–21% relative WER improvement over traditional diarization-plus-ASR systems.
- The study employs LoRA-based fine-tuning and synthetic multi-speaker data augmentation to efficiently integrate ASR and speaker attribution for robust real-world deployment.
Speaker-Attributed ASR via Speech-Aware LLMs
Introduction
The paper "Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMs" (2604.11269) presents a unified framework for Speaker-Attributed ASR (SAA) leveraging the Granite-speech speech-aware LLM. The work addresses the core limitations of conventional multi-stage SAA systems—which typically cascade speaker diarization (SD) and ASR—by proposing a monolithic architecture that directly produces speaker-labeled transcripts. The paper introduces new methods for improving speaker discrimination and system robustness, ablation on synthetic data augmentation, and comprehensive empirical results that illustrate substantial gains over baseline diarization+ASR pipelines.
Background and Motivation
Conventional SAA approaches rely on discrete SD and ASR modules, which typically leads to brittle speaker assignments through error propagation and the loss of synergy between recognition and speaker attribution. End-to-end LLM-based architectures have demonstrated competitive ASR performance, but their encoders, when pre-trained purely for ASR, carry minimal speaker-informative cues. Moreover, such systems are often restricted to relative speaker labels within a single session, which limits their ability to exploit large, multi-conversation datasets.
Core Contributions
The primary technical advancements in this work are as follows:
- Speaker Cluster Identification in SAA: The framework extends traditional SAA by integrating jointly trained speaker cluster identification tags. Unlike role-based (e.g., [Speaker 1]) or session-specific tags, cluster-based identifiers (e.g., [Speaker 1 cluster 42]) enable the model to leverage cross-conversation speaker-group information, yielding improved generalization and attribution accuracy.
- Intermediate Layer Augmentation for Speaker Discriminability: Since the upstream Conformer-based speech encoder is optimized for ASR via CTC, it encodes insufficient paralinguistic information for reliable speaker identification. The proposed augmentation concatenates an intermediate layer's output (layer 3 proved optimal) with the encoder's final output, injecting spectral and speaker-related cues while maintaining recognition fidelity (a sketch follows this list).
- Synthetic Data Augmentation via Artificial Multi-Speaker Conversations: Recognizing training data sparsity—especially for multi-party conversational contexts—the authors synthetically expand the dataset by concatenating turn-level utterances from existing corpora and single-speaker datasets. This augmentation strategy yields significant performance improvements, particularly for models trained to recognize longer-duration and more complex speaker structures.
- Low-Rank Adaptation (LoRA)-Based Efficient Fine-Tuning: To preserve the strong ASR capabilities of the upstream encoder, the authors freeze the audio encoder's parameters and employ LoRA to efficiently adapt only the LLM and projector.
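A minimal PyTorch sketch of the intermediate-layer augmentation idea follows. The class and parameter names (`LayerAugmentedProjector`, `tap_layer`, the dimensions) are illustrative assumptions, not the paper's implementation: an early Conformer block's hidden states are concatenated feature-wise with the final encoder output before projection into the LLM embedding space.

```python
import torch
import torch.nn as nn

class LayerAugmentedProjector(nn.Module):
    """Concatenate an early encoder layer with the final encoder output.

    Hypothetical sketch: `encoder_layers` is the list of per-layer hidden
    states from a Conformer encoder (index 0 = first block). The early
    layer carries spectral/speaker cues that the ASR-optimized top layers
    discard; concatenation preserves both before the LLM projector.
    """

    def __init__(self, enc_dim: int, llm_dim: int, tap_layer: int = 3):
        super().__init__()
        self.tap_layer = tap_layer
        # Projector maps the concatenated features into LLM embedding space.
        self.proj = nn.Linear(2 * enc_dim, llm_dim)

    def forward(self, encoder_layers: list[torch.Tensor]) -> torch.Tensor:
        early = encoder_layers[self.tap_layer]     # (B, T, enc_dim) speaker cues
        final = encoder_layers[-1]                 # (B, T, enc_dim) ASR features
        fused = torch.cat([early, final], dim=-1)  # (B, T, 2 * enc_dim)
        return self.proj(fused)                    # (B, T, llm_dim)
```

Because the encoder itself stays frozen and only the projector (plus LoRA adapters) is trained, the encoder's ASR fidelity is preserved while speaker information is re-exposed to the LLM.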
Experimental Setup
Datasets and Evaluation
Evaluation spans two-speaker (Fisher, CallHome English) and multi-speaker (AMI-SDM, GALE, NaturalVoices) domains, using both real and synthetic conversational chunks of 10–120 seconds. For speaker-attribution quality, the principal metric is Word Diarization Error Rate (WDER), which measures the fraction of words attributed to the wrong speaker. For ASR quality, conventional Word Error Rate (WER) is reported.
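For concreteness, WDER in its standard formulation (commonly attributed to Shafey et al., 2019; stated here from that definition rather than from this paper) is:

$$
\text{WDER} = \frac{S_{\text{IS}} + C_{\text{IS}}}{S + C}
$$

where $S$ and $C$ are the numbers of ASR substitutions and correctly recognized words, and the IS subscript counts those among them that carry an incorrect speaker label. Insertions and deletions are excluded, since they have no word-level correspondence to the reference.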
Baseline Systems
Comparison baselines include:
- PyAnnote+ASR: Uses pyannote.audio for diarization, followed by either Whisper or Granite-speech for ASR, then merges segment-level ASR outputs.
- NVIDIA NeMo Diarization+ASR: Uses NeMo's titanet_large speaker-embedding model for SD, paired with a Conformer-CTC ASR model, aligning word timestamps to speaker segments.
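To make the cascaded baseline concrete, a minimal pyannote.audio + Whisper pipeline might look like the sketch below. The model names, input file, and word-to-turn alignment rule are illustrative assumptions; the paper's exact configuration may differ.

```python
import whisper
from pyannote.audio import Pipeline

# Diarize first; each turn gets a session-relative label (SPEAKER_00, ...).
# The pyannote pipeline is gated and may require a Hugging Face token.
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = diarizer("meeting.wav")  # hypothetical input file

# Transcribe once with word timestamps, then assign each word to the
# diarization turn that overlaps its midpoint.
asr = whisper.load_model("large-v3")
result = asr.transcribe("meeting.wav", word_timestamps=True)

def speaker_at(t: float) -> str:
    """Return the diarization label active at time t (or 'unknown')."""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "unknown"

for segment in result["segments"]:
    for word in segment.get("words", []):
        mid = 0.5 * (word["start"] + word["end"])
        print(f"[{speaker_at(mid)}] {word['word'].strip()}")
```

This word-to-turn alignment is exactly the brittle step the unified model removes: any diarization boundary error propagates directly into speaker attribution.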
Results
The paper reports compelling and consistent improvements for the unified SAA system over all diarization+ASR baselines:
| System | Fisher | CH | AMI | GALE |
|--------|--------|----|-----|------|
| PyAnnote+Whisper (WDER, %) | 11.7 | 17.1 | 23.4 | 12.7 |
| PyAnnote+Granite (WDER, %) | 11.0 | 15.1 | 19.7 | 12.7 |
| NVIDIA NeMo (WDER, %) | 4.3 | 7.1 | 13.7 | 11.5 |
| Best SAA+SID (WDER, %) | 0.9 | 2.1 | 7.8 | 12.2 |
| PyAnnote+Whisper (WER, %) | 54.6 | 66.6 | 43.2 | 29.2 |
| PyAnnote+Granite (WER, %) | 41.2 | 42.5 | 39.9 | 30.6 |
| NVIDIA NeMo (WER, %) | 20.9 | 20.3 | 52.5 | 21.5 |
| Best SAA+SID (WER, %) | 17.7 | 17.8 | 22.9 | 22.6 |
Best SAA+SID denotes the SAA model with intermediate-layer augmentation, cluster-based speaker tagging, and synthetic data augmentation. WDER reductions on Fisher, CallHome, and AMI-SDM are dramatic (relative reductions of 32–38% over the strongest baseline); only on GALE does NeMo retain a slight WDER edge, with the two systems' WERs there within about one point of each other. The improvements hold across durations up to 120 seconds and diverse speaker configurations.
The cluster-based joint training (SAA+SID) further reduces WDER compared to plain SAA, and the addition of synthetic mixes is critical, especially for multi-party recordings (a sketch of such mixing follows).
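A hedged sketch of how such synthetic mixes can be built from single-speaker material is given below; the utterance sources, pause handling, and tag format are illustrative assumptions rather than the paper's exact recipe.

```python
import random

import numpy as np

def make_synthetic_conversation(pools, sr=16000, num_turns=8, max_pause=0.5):
    """Build one synthetic multi-speaker session from single-speaker pools.

    `pools` maps a global speaker/cluster id to a list of (waveform,
    transcript) pairs; waveforms are float32 numpy arrays at rate `sr`.
    Returns the concatenated audio and a cluster-tagged reference text.
    """
    # Pick 2-4 speakers for this session and give them relative indices.
    speaker_ids = random.sample(list(pools),
                                k=min(len(pools), random.randint(2, 4)))
    audio_parts, ref_lines = [], []
    for spk in random.choices(speaker_ids, k=num_turns):
        wav, text = random.choice(pools[spk])
        audio_parts.append(wav)
        # Short random silence between turns.
        pause = np.zeros(int(sr * random.uniform(0.0, max_pause)),
                         dtype=wav.dtype)
        audio_parts.append(pause)
        rel = speaker_ids.index(spk) + 1  # session-relative speaker index
        ref_lines.append(f"[Speaker {rel} cluster {spk}] {text}")
    return np.concatenate(audio_parts), "\n".join(ref_lines)
```

Note how the reference transcript pairs the session-relative speaker index with the global cluster id, matching the tagging scheme described in the contributions.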
Importantly, the proposed augmentations do not degrade ASR accuracy; on out-of-domain corpora (AMI, GALE), the approach in fact yields a 14–21% relative WER improvement, attributable to the data augmentation.
Theoretical and Practical Implications
The integration of speaker clustering and SAA objectives into a speech-aware LLM demonstrates that ASR-optimized encoders can be repurposed for diarization with minimal cost. This presents a pathway to monolithic ASR+diarization architectures that reduce engineering and error-compounding issues inherent to pipeline approaches, with immediate practical utility in meeting transcription, call analytics, and conversational AI.
The synthetic multi-speaker data augmentation enables efficient scaling of SAA training even when real annotated multi-party data is scarce. Because the speech encoder stays frozen and only the downstream LLM components are adapted via LoRA, fine-tuning remains lightweight and deployment scalable (see the sketch below).
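A minimal sketch of this parameter-efficient setup using Hugging Face's peft library follows; the rank, target modules, and stand-in base model are assumptions, since the paper's exact configuration is not reproduced here.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in base model for illustration; the paper adapts a Granite-based
# speech-aware LLM, whose audio encoder would additionally be frozen, e.g.:
#   for p in model.audio_encoder.parameters(): p.requires_grad = False
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=16,                        # adapter rank (assumed, not from the paper)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused attention projection
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a tiny fraction of weights train
```

The design choice is the same in either case: the base weights (and the speech encoder) stay fixed, so the ASR capability learned in pre-training cannot be overwritten during speaker-attribution fine-tuning.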
Practically, this SAA framework can be generalized to conversational analytics, transcription with role attribution, or real-time call center diarization, where accuracy and latency constraints are paramount.
Future Directions
Key open challenges remain regarding the extension of this framework to longer-duration input without context fragmentation or speaker drift, more nuanced speaker clustering (potentially hierarchical or embedding-based online clustering), and adaptation to streaming or real-time inference. Additionally, integrating explicit verification or open-set speaker recognition could yield even greater robustness in the presence of unseen speakers or large speaker pools.
Conclusion
This work formalizes a unified and empirically superior pipeline for Speaker-Attributed ASR via a speech-aware LLM architecture. Through intermediate layer augmentation, explicit speaker cluster integration, and strategic synthetic data generation, the proposed method achieves substantial reductions in diarization and recognition error over state-of-the-art diarization+ASR pipelines, without loss of ASR performance. These findings establish a robust foundation for future work on scalable, integrated conversational speech understanding systems.