Speaker-Attributed ASR

Updated 13 May 2026

Speaker-Attributed ASR is a method that transcribes speech while assigning each word to its correct speaker, addressing the 'who spoke what' challenge in multi-talker settings.
It incorporates both modular and end-to-end architectures, leveraging techniques like diarization, token-level attribution, and serialized output training to manage overlapping speech and dynamic speaker inventories.
Recent research shows that joint modeling, advanced clustering, and streaming LLM-based systems significantly reduce error rates and enhance real-time speaker identification.

Speaker-Attributed Automatic Speech Recognition (SAA or SA-ASR) is the paradigm of automatic speech recognition in which each portion of transcribed text is explicitly attributed to the correct speaker, solving the “who spoke what” problem in multi-talker audio. This task, fundamental to meeting transcription, media monitoring, and conversational analytics, requires accurate transcriptions temporally aligned with speaker identities, often in the presence of overlapping speech, unknown or dynamically changing speaker inventories, and long-form audio.

1. Problem Definition and Fundamental Metrics

Speaker-Attributed ASR extends conventional ASR by requiring the predicted output to consist of both the word (or subword/character) token sequence and a speaker label for each token. Formally, given an input waveform $X$ and, optionally, a speaker profile inventory $D = \{ d_k \}$, the system produces a token sequence $Y = \{ y_n \}$ and corresponding speaker labels $S = \{ s_n \}$, with $s_n$ denoting the speaker of token $y_n$ (Kanda et al., 2020, Kanda et al., 2021, Aronowitz et al., 13 Apr 2026). Evaluation is based primarily on the speaker-attributed word error rate (SA-WER): the proportion of words that are either recognized incorrectly or attributed to the wrong speaker, typically under a best-permutation alignment between reference and hypothesis speakers (Kanda et al., 2020, Chang et al., 2021). The concatenated minimum-permutation WER (cpWER) is widely used, especially for long-form or conversational data (Chang et al., 2021, Kanda et al., 2021).
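As an illustration of the best-permutation idea behind cpWER, the following minimal sketch concatenates each speaker's utterances, tries every mapping between reference and hypothesis speakers, and keeps the mapping with the fewest word errors. The function names and the dictionary input format are assumptions made for this example. The brute-force search over permutations is only practical for a handful of speakers, and real scoring tools may differ in text normalization and in how the assignment problem is solved.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cp_wer(ref_by_spk, hyp_by_spk):
    """Both arguments map a speaker label to that speaker's list of utterance strings.

    Assumes the reference contains at least one word.
    """
    # Concatenate all utterances of each speaker into one word stream.
    refs = [" ".join(utts).split() for utts in ref_by_spk.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_spk.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref_words = sum(len(r) for r in refs)
    # Brute-force search for the speaker mapping with the fewest errors.
    best_errors = min(
        sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_words
```

Because hypothesis speaker labels are arbitrary, a transcript that is word-perfect but uses different label names still scores zero, while a correctly recognized word placed in the wrong speaker's stream is counted as an error.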

2. Modular and End-to-End Architectures

Early SAA systems adopted a modular pipeline: (a) voice activity detection (VAD) (Cui et al., 2024), (b) speaker diarization (e.g., TS-VAD, EEND, or x-vector clustering) (Morrone et al., 2024), (c) single-talker ASR on speaker-homogeneous segments, and (d) alignment/fusion to produce attributed transcripts (Yu et al., 2022, Shi et al., 2022). These pipelines are flexible, allow each module to be tuned independently, and are particularly robust in high-resource or diverse acoustic conditions. However, error propagation between modules (especially diarization-ASR boundary mismatches) and redundant computation motivate joint approaches (Kanda et al., 2021, Shi et al., 2022).

Recent research has prioritized end-to-end (E2E) modeling, in which speaker counting, transcription, and identification are performed within a single neural network. Core E2E designs are based on encoder-decoder formalisms, such as attention-based encoder-decoder (AED), transformer/conformer encoder–transformer decoder backbones, and more recently, LLM-based speech-aware decoders (Kanda et al., 2021, Aronowitz et al., 13 Apr 2026). The “serialized output training” (SOT) protocol, which interleaves a special separator token (“<sc>”/“<cc>”) at speaker changes, enables a single output stream to represent multi-speaker transcripts and is now standard in E2E SAA (Kanda et al., 2022, Li et al., 2023).
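The serialization step behind SOT can be sketched in a few lines. The example below assumes utterances are ordered by start time and joined with the “<sc>” separator. The tuple layout and function names are hypothetical, and real systems operate on subword tokens rather than raw strings.

```python
SC = "<sc>"  # separator token marking a speaker change in the serialized stream


def serialize_sot(utterances):
    """utterances: list of (start_time_sec, speaker, text) tuples.

    Returns a single training target with utterances ordered by start time.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC} ".join(text for _, _, text in ordered)


def split_sot(stream):
    """Recover the per-utterance texts from a serialized stream."""
    return [segment.strip() for segment in stream.split(SC)]


target = serialize_sot([(1.2, "B", "fine thanks"), (0.0, "A", "how are you")])
print(target)             # how are you <sc> fine thanks
print(split_sot(target))  # ['how are you', 'fine thanks']
```

Note that the serialized stream itself carries no speaker identity, only the positions of speaker changes. Attribution of each segment to a speaker is handled by the mechanisms described in the following section.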

A representative E2E model factorization for the joint output is

$\log P(Y,S \mid X,D) = \sum_{n=1}^{N} \left[ \log P(y_n \mid y_{<n}, s_{\le n}, X, D) + \gamma \log P(s_n \mid y_{<n}, s_{<n}, X, D) \right]$

where $N$ is the number of output tokens and $\gamma$ is a scaling parameter for the speaker identification loss (Kanda et al., 2021, Kanda et al., 2020). The output tokens and speaker labels may be predicted either autoregressively or, in the case of recent non-autoregressive models such as SA-Paraformer, in parallel (Li et al., 2023).
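To make the role of $\gamma$ concrete, the sketch below scores a hypothesis from per-step probabilities that a decoder would supply. The function name, the input lists, and the default value of `gamma` are assumptions for illustration and are not taken from the cited papers.

```python
import math


def joint_log_prob(token_probs, speaker_probs, gamma=0.1):
    """Compute sum_n [log P(y_n | ...) + gamma * log P(s_n | ...)].

    token_probs[n] and speaker_probs[n] are the model probabilities of the chosen
    token and speaker label at step n (hypothetical inputs for this sketch).
    """
    if len(token_probs) != len(speaker_probs):
        raise ValueError("expected one speaker probability per token")
    return sum(
        math.log(p_tok) + gamma * math.log(p_spk)
        for p_tok, p_spk in zip(token_probs, speaker_probs)
    )


# Two hypotheses with identical token scores but different speaker confidence.
confident = joint_log_prob([0.9, 0.8], [0.95, 0.9])
uncertain = joint_log_prob([0.9, 0.8], [0.5, 0.4])
print(confident > uncertain)  # True
```

With `gamma=0` the score reduces to the plain ASR log-likelihood, so the parameter controls how strongly speaker confidence influences the ranking of hypotheses.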

3. Speaker Attribution Mechanisms

Explicit speaker attribution is achieved in various ways, reflecting architectural evolution:

Modular Diarization + ASR: Speaker turns are determined by a diarization backend (classical clustering, EEND, or neural embedding clustering), and these are mapped to ASR hypotheses via alignment (token/word/segment level) (Yu et al., 2022, Morrone et al., 2024). Microsoft’s pipeline for meetings uses spectral clustering over d-vectors, with majority-voting or max-overlap for segment assignment (Kanda et al., 2021).
Word/Token-level Diarization: Recent modular models employ word-level diarization, assigning speaker probabilities to individual recognized tokens via attention to acoustic frames and enrolled embeddings, thus obviating the need for explicit time-stamp alignment (Yu et al., 2022).
Profile-based E2E: In profile-based E2E SAA, a set of speaker embeddings is available or enrolled. At each decoding step, the model attends (cosine-similarity or multi-head attention) to this speaker inventory to predict token-level speaker posteriors (Kanda et al., 2021, Cui et al., 2023).
Query-less/Clustering-based Attribution: Where no profiles are available (open set), clustering is applied to internal representations (e.g., speaker query vectors) generated by the model during decoding, using spectral or agglomerative clustering to resolve speaker identities and counts post hoc (Kanda et al., 2020, Kanda et al., 2021).
Self-Speaker Adaptation: A recent approach eschews explicit speaker embeddings in favor of dynamic speaker-wise speech activity masks, injecting per-speaker adaptation kernels (“SSA modules”) into the deep encoder to create parallel, speaker-focused recognizers operating on the same audio (Wang et al., 27 Jun 2025).
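For the modular diarization + ASR case listed first, the max-overlap rule for mapping recognized words to diarization segments can be sketched as follows. The tuple layouts and the function name are assumptions for this example. Words that overlap no segment are left unassigned (`None`), which is only one of several possible fallback policies.

```python
def assign_speakers(words, segments):
    """Attach to each word the speaker whose segment overlaps it the most.

    words:    list of (word, start_sec, end_sec) from the ASR output
    segments: list of (speaker, start_sec, end_sec) from the diarization output
    """
    attributed = []
    for word, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, s_start, s_end in segments:
            # Length of the intersection of the two time intervals (negative if disjoint).
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        attributed.append((word, best_speaker))
    return attributed


words = [("hi", 0.0, 0.4), ("there", 0.4, 0.9), ("yes", 0.8, 1.2)]
segments = [("A", 0.0, 0.85), ("B", 0.85, 1.5)]
print(assign_speakers(words, segments))
# [('hi', 'A'), ('there', 'A'), ('yes', 'B')]
```

This rule depends entirely on accurate word timestamps and segment boundaries, which is the boundary-mismatch sensitivity that word-level and end-to-end attribution aim to avoid.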

The integration of speaker information is increasingly fine-grained, leveraging joint multi-head attention over ASR and speaker representations, context-aware scoring (Cui et al., 2023), and advanced inventory handling (e.g., f-speaker and i-speaker strategies to handle unknown and irrelevant speakers in NAR models) (Li et al., 2023).
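The cosine-similarity variant of profile-based attribution can be illustrated with a small sketch: a per-token speaker query is compared against each enrolled profile and the scores are normalized with a softmax. The vectors, the function names, and the `temperature` value are hypothetical. In practice, embeddings are produced by the model and have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def speaker_posteriors(query, profiles, temperature=0.1):
    """Softmax over temperature-scaled cosine similarities to each profile."""
    scores = [cosine(query, d) / temperature for d in profiles]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


profiles = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-D enrolled profiles d_1, d_2
posteriors = speaker_posteriors([0.9, 0.1], profiles)
predicted = max(range(len(posteriors)), key=posteriors.__getitem__)
print(predicted)  # 0
```

A lower temperature sharpens the posterior toward the closest profile, while a higher one keeps the distribution softer.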

4. Training Objectives, Losses, and Optimization

SAA systems are optimized via multi-task learning, combining the main ASR loss with a cross-entropy loss for speaker classification and, in some advanced models, minimum Bayes risk (MBR) criteria explicitly tailored to speaker-attributed WER (Kanda et al., 2020). For E2E models, the joint loss typically takes the form $L_\text{total} = L_\text{ASR} + \lambda L_\text{spk}$, where $\lambda$ weights the speaker term. Variations and auxiliary criteria include:

Speaker-attributed Maximum Mutual Information (SA-MMI) (Kanda et al., 2020)
Minimum Bayes Risk (SA-MBR) over n-best hypotheses with length normalization, directly minimizing expected SA-WER (Kanda et al., 2020)
Inter-CTC auxiliary loss in intermediate encoder layers to regularize frame representations and boost token-synchronous attribution (Li et al., 2023)
Embedding alignment and discrimination loss (EAD) to align learned token-level speaker embeddings with TitaNet or other weakly labeled targets in multilingual pipelines (Nguyen et al., 2024)
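The general shape of an MBR-style criterion over an n-best list can be sketched as an expected error count: each hypothesis's speaker-attributed error count is weighted by its posterior, renormalized over the list. This is a generic illustration of the idea, not the exact loss of the cited work. The function name and input format are assumptions, and length normalization is omitted.

```python
import math


def expected_sa_errors(nbest):
    """nbest: list of (joint_log_prob, sa_error_count) pairs, one per hypothesis.

    Returns the error count expected under the posterior renormalized over the list.
    """
    top = max(log_p for log_p, _ in nbest)
    weights = [math.exp(log_p - top) for log_p, _ in nbest]  # stable exponentiation
    total = sum(weights)
    return sum(w / total * errors for w, (_, errors) in zip(weights, nbest))


nbest = [(math.log(0.6), 0), (math.log(0.3), 2), (math.log(0.1), 5)]
print(round(expected_sa_errors(nbest), 6))  # 1.1
```

Minimizing this quantity shifts probability mass toward hypotheses with fewer combined recognition and attribution errors, which is what ties the training objective to SA-WER rather than to likelihood alone.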

Cluster tags, as introduced in speech-aware LLM-based SAA, further supervise the output by encoding both speaker turn and learned speaker cluster index in the transcript (e.g., “[Speaker 2 cluster 42]:”), increasing robustness to unseen speakers (Aronowitz et al., 13 Apr 2026).
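A small sketch of how such tagged transcripts could be written and read back is given below. Only the tag shape “[Speaker 2 cluster 42]:” comes from the example above. The one-turn-per-line layout, the function names, and the tuple format are assumptions for illustration.

```python
import re

# Matches lines such as "[Speaker 2 cluster 42]: hello there".
TAG_LINE = re.compile(r"^\[Speaker (\d+) cluster (\d+)\]: (.*)$")


def tag_transcript(turns):
    """turns: list of (speaker_index, cluster_index, text) tuples."""
    return "\n".join(
        f"[Speaker {speaker} cluster {cluster}]: {text}"
        for speaker, cluster, text in turns
    )


def parse_transcript(tagged):
    """Inverse of tag_transcript; lines without a valid tag are skipped."""
    turns = []
    for line in tagged.splitlines():
        match = TAG_LINE.match(line)
        if match:
            turns.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return turns


turns = [(1, 7, "how are you"), (2, 42, "fine thanks")]
tagged = tag_transcript(turns)
print(tagged)
# [Speaker 1 cluster 7]: how are you
# [Speaker 2 cluster 42]: fine thanks
```

Parsing the tags back into structured turns is what allows the generated text to be scored with speaker-aware metrics.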

5. System Variants and Real-World Architectures

A broad taxonomy of SAA system architectures emerges from recent literature:

Modular toolkits: Systems supporting VAD, diarization (EEND or x-vector-based clustering), speaker identification (closed/open set), and ASR selection, orchestrated via user-defined YAML/JSON configurations and exposing results in web GUIs for real-world domains (Morrone et al., 2024).
Streaming SAA: t-SOT (token-level serialized output training) enables low-latency SAA, supporting sub-second attribution even under overlapping speech, and is extensible to joint speaker identification/diarization using parallel t-vector extraction (Kanda et al., 2022).
Neural Clustering Back-ends: Segment-level discriminative neural clustering (SDNC) replaces spectral clustering for assigning speaker labels, especially when combined in a parallel architecture with SOT ASR for error resilience (Zheng et al., 2024).
Linked Encoder-Decoder Models: Dual-encoder (waveform and global speaker) linked-decoder systems (e.g., DNCASR) jointly optimize ASR and speaker clustering, with link-attention tying token emissions directly to speaker turn predictions for improved overlap handling (Zheng et al., 2 Jun 2025).
Multichannel and Joint Beamforming Approaches: Multichannel SAA integrates beamforming (fixed, hybrid, or fully neural FaSNet) as a preprocessor, followed by ASR (Conformer+Transformer), with joint end-to-end optimization providing up to 9% relative WER reduction on real distant-microphone corpora (Cui et al., 2023, Cui et al., 2024, Shi et al., 2022).
LLM-based SAA: Speech-aware LLMs, such as Granite-speech adapted for SAA, generate transcripts in which speaker-attribution tags and text are interleaved. Joint training with synthetic and real conversational data, combined with explicit cluster tag supervision, yields substantial improvements over conventional diarization+ASR (Aronowitz et al., 13 Apr 2026).
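The token-level serialization used by the streaming t-SOT variant listed above can be sketched as follows, under simplifying assumptions: at most two utterances overlap, each token already carries a virtual channel index (0 or 1), the stream starts on channel 0, and a “<cc>” token marks every switch between channels. The tuple layout and function names are hypothetical.

```python
CC = "<cc>"  # channel-change token


def serialize_tsot(tokens):
    """tokens: list of (end_time_sec, channel, token) with channel in {0, 1}.

    Tokens are emitted in chronological order, and <cc> is inserted whenever
    the next token belongs to the other virtual channel.
    """
    stream, current = [], 0
    for _, channel, token in sorted(tokens, key=lambda t: t[0]):
        if channel != current:
            stream.append(CC)
            current = channel
        stream.append(token)
    return stream


def deserialize_tsot(stream):
    """Split a serialized stream back into the two virtual channels."""
    channels, current = ([], []), 0
    for token in stream:
        if token == CC:
            current = 1 - current  # toggle between channel 0 and channel 1
        else:
            channels[current].append(token)
    return channels


tokens = [(0.3, 0, "hello"), (0.6, 0, "how"), (0.7, 1, "hi"),
          (0.9, 0, "are"), (1.0, 1, "there"), (1.2, 0, "you")]
stream = serialize_tsot(tokens)
print(" ".join(stream))
# hello how <cc> hi <cc> are <cc> there <cc> you
print(deserialize_tsot(stream))
# (['hello', 'how', 'are', 'you'], ['hi', 'there'])
```

Because tokens are ordered by emission time rather than by utterance, the model does not need to wait for an utterance to finish before emitting overlapping speech, which is what makes low-latency operation possible.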

6. Empirical Findings, Practical Recommendations, and Limitations

Across converging lines of research, several robust findings and best practices have emerged:

End-to-end joint modeling outperforms modular approaches on real long-form audio after fine-tuning, yielding up to 29.9% relative cpWER reduction (Kanda et al., 2021).
Token/word-level attribution is more robust than frame-level alignment, particularly in high-overlap, rapid-turn settings (Yu et al., 2022).
Streaming and NAR models (e.g., Paraformer variants) achieve approximately 10x speedup over AR baselines at negligible loss in speaker-attributed CER, making them viable for real-world deployment (Li et al., 2023, Kanda et al., 2022).
Cluster tag supervision and synthetic multi-speaker augmentation are critical for SAA with LLMs, with absolute word diarization error rate (WDER) reductions exceeding 30% on some datasets (Aronowitz et al., 13 Apr 2026).
Speaker attribution from ASR transcripts is strikingly robust to ASR word errors; WER is only weakly coupled with true attribution quality, so optimizing for WER alone is insufficient, suggesting that multi-task losses and explicit speaker-style modeling are required for optimal SAA (Aggazzotti et al., 11 Jul 2025).
Fine-tuning segmentation and embedding extraction strategies to match test-time diarization/VAD yields up to 28% speaker error rate (SER) reduction in real meetings (Cui et al., 2024), and speaker templates derived from speaker diarization (SD) often outperform annotation-based ones.
Attention mechanisms can be leveraged for rough automatic utterance boundary/timing inference, yielding competitive diarization error rates without explicit segmentation models (Kanda et al., 2020).
Neural clustering and linked decoder strategies outperform traditional clustering or parallel architectures in overlapping meeting scenarios, taking full advantage of joint gradients and turn alignment for speaker-index prediction (Zheng et al., 2024, Zheng et al., 2 Jun 2025).

Limitations persist in domain adaptation, overlapping speech with more than three or four speakers, streaming multi-language settings, and fully open-set attribution (especially for unseen speakers or in the absence of meaningful enrollment audio). Model complexity, real-time factor requirements, and memory footprint, especially in LLM-based SAA and multichannel setups, remain active considerations for practical deployments.

7. Future Directions

The SAA research landscape is rapidly evolving toward:

Full end-to-end and streaming SAA, with tight integration of VAD, separation, diarization, and ASR components.
Universal LLM-based architectures capable of flexible instruction-following (“transcribe and denote who is speaking...”), cross-domain adaptation, and dynamic speaker counting.
Self-adaptive and query-less SAA (SSA), where diarization outputs are consumed directly as attention masks, obviating the need for speaker embeddings (Wang et al., 27 Jun 2025).
Improved clustering and attribution in the presence of highly overlapped, unsegmented, and code-switched speech.
Robust handling of long-form, large-room, and multilingual audio via joint beamforming, self-supervised encoders, and synthetic augmentation (Cui et al., 2024, Nguyen et al., 2024).
Integration of style-preserving objectives to decouple WER from true speaker-attribution fidelity (Aggazzotti et al., 11 Jul 2025).

Speaker-Attributed ASR has become a cornerstone task at the intersection of speech recognition, diarization, and conversational understanding, with state-of-the-art research now leveraging joint modeling, advanced attribution strategies, and speech-aware LLMs to deliver accurate, robust "who spoke what" transcriptions at scale.