Modular Speaker Architecture (MSA)

Updated 26 May 2026

Modular Speaker Architecture (MSA) is a framework that partitions multi-speaker speech processing into distinct modules for separation, diarization, and ASR.
It enables independent development and tuning of each component, thereby enhancing interpretability and flexible adaptation to new domains.
Empirical results show that while modular systems offer robust performance and easy error analysis, they face challenges such as error accumulation across module boundaries.

A Modular Speaker Architecture (MSA) refers to an explicitly partitioned system for speaker attribution, diarization, or interaction—most frequently encountered in multi-speaker speech processing, but also deployed in multi-agent dialogue frameworks and assistive perception. Unlike joint or monolithic networks, the modular principle isolates core sub-tasks (e.g., speech separation, diarization, speaker identification, attribution, role tracking) into distinct, separately trainable (or replaceable) components. This results in enhanced interpretability, flexible adaptation to new domains, and often greater exploitation of heterogeneous training resources. However, it also presents characteristic failure modes associated with error accumulation and limited optimization across module boundaries.

1. Fundamental Modular Structures and Their Domains

The canonical MSA for monaural speaker-attributed automatic speech recognition (SA-ASR) is structured as a cascade of three modules: Continuous Speech Separation (CSS), Speaker Diarization (embedding-based segmentation and clustering), and Single-Speaker ASR (attention-based encoder–decoder) (Kanda et al., 2021). In general, each module has clearly defined input/output domains and functional roles:

Module	Input Domain	Output Domain	Functional Role
Continuous Speech Separation (CSS)	Monaural waveform segment	Two time-aligned waveforms (one per speaker)	Up to 2-way overlap separation
Speaker Diarization	Separated waveforms or original audio	Speaker clusters of speech regions	Partitioning, speaker count
ASR	Concatenated speaker segments	Transcription per speaker	Single-speaker recognition

This staged decomposition is mirrored in multi-channel modular diarization, where spatial cues are further exploited via multi-stage refinement (CSS, cACGMM, guided source separation) (Wang et al., 2024). Outside the audio domain, it generalizes to agent-role and responsibility-context modules in multi-agent dialogue MSA frameworks (Toh et al., 1 Jun 2025). In all cases, boundaries between modules are engineered to support independent development, tuning, or replacement.

2. Detailed Network Architectures and Algorithms

2.1 Speech Separation

CSS modules typically employ deep Conformer stacks—e.g., 18-layer encoders with multi-head self-attention, sandwich feed-forward layers (1024→512), and time-frequency convolutional sublayers (Kanda et al., 2021). Output comprises time-frequency masks, optimized under permutation-invariant SI-SNR loss: $\mathcal{L}_\mathrm{sep} = -\max_{\pi \in \mathrm{Perm}(\{1,2\})} \sum_{k=1}^2 \mathrm{SI\text{-}SNR}(w_{\pi(k)}, \hat w_k)$
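The permutation-invariant objective can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the implementation from Kanda et al.: signals are plain Python lists of equal length, the SI-SNR definition is the common zero-mean projection form, and the names `si_snr` and `pit_si_snr_loss` are hypothetical.

```python
import math
from itertools import permutations


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    n = len(reference)
    # Remove the mean so the measure ignores DC offsets.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to obtain the target component.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def pit_si_snr_loss(references, estimates):
    """Negative of the best summed SI-SNR over all reference-to-output assignments."""
    best = -math.inf
    for perm in permutations(range(len(references))):
        # perm[k] is the reference index assigned to output channel k.
        total = sum(si_snr(references[perm[k]], estimates[k])
                    for k in range(len(estimates)))
        best = max(best, total)
    return -best
```

Because the maximum is taken over assignments, swapping the two output channels leaves the loss unchanged, which is the property that lets the separator emit speakers in any order.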

2.2 Speaker Embedding and Clustering

Standard practice is to extract frame-wise embeddings (e.g., Res2Net or x-vectors over 1.5s windows) and construct cosine-affinity matrices for clustering. Speaker counting proceeds via normalized eigengap analysis; spectral clustering is applied, followed by leakage filtering based on cluster centroid similarity thresholds to reject collapsed clusters (Kanda et al., 2021). Modular multi-channel systems exploit cACGMM posteriors for spatially informed masks, iteratively refined by EM (Wang et al., 2024).
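As a rough illustration of the affinity and speaker-counting steps, the sketch below builds a cosine-affinity matrix and picks the speaker count at the largest eigengap of the normalized graph Laplacian. Several details are assumptions rather than facts from the cited work: the gap is normalized by the largest eigenvalue, negative similarities are clipped to zero, and the function names and `max_speakers` parameter are hypothetical. The final clustering of the spectral rows (for example with k-means) and the leakage filter are omitted.

```python
import numpy as np


def cosine_affinity(embeddings):
    """Pairwise cosine similarity between row-wise speaker embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def count_speakers_by_eigengap(affinity, max_speakers=8):
    """Estimate the speaker count from the largest gap in the Laplacian spectrum."""
    affinity = np.clip(affinity, 0.0, None)  # keep edge weights non-negative
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    normalized_affinity = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    laplacian = np.eye(len(affinity)) - normalized_affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    limit = min(max_speakers, len(eigvals) - 1)
    # Gap between consecutive eigenvalues, scaled by the largest eigenvalue.
    gaps = (eigvals[1:limit + 1] - eigvals[:limit]) / max(eigvals[-1], 1e-10)
    n_speakers = int(np.argmax(gaps)) + 1
    # Rows of the leading eigenvectors form the spectral embedding to be clustered.
    return n_speakers, eigvecs[:, :n_speakers]
```

With two well-separated groups of embeddings, the Laplacian has two small eigenvalues followed by a jump, so the largest gap falls after the second eigenvalue.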

2.3 Single-Speaker ASR

The ASR module is typically a deep attention-based encoder–decoder; e.g., 2 strided convolutional layers (subsample ×4), 18 Conformer layers, followed by a 6-layer Transformer decoder. Input features are log-mel filterbanks, and target units are typically subword tokens; training employs standard cross-entropy loss: $\mathcal{L}_{\mathrm{ASR}} = -\sum_{n=1}^N \log p(y_n \mid y_{1:n-1}, X)$
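The cross-entropy objective above reduces to summing the negative log-probability of each reference token. A minimal sketch, assuming the decoder has already produced one vector of unnormalized vocabulary scores per target position under teacher forcing (the names `log_softmax` and `sequence_cross_entropy` are illustrative):

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def sequence_cross_entropy(step_logits, target_ids):
    """Sum of -log p(y_n | y_{1:n-1}, X) over the N target positions.

    step_logits[n] holds the decoder scores at position n, computed with the
    reference prefix y_{1:n-1} as input; target_ids[n] is the index of y_n.
    """
    loss = 0.0
    for logits, target in zip(step_logits, target_ids):
        loss -= log_softmax(logits)[target]
    return loss
```

For a uniform distribution over a vocabulary of size V, each position contributes log V, which gives a quick sanity check on any implementation.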

3. Module Connectivity, Training Protocols, and Data

Front-end segmentation is achieved with aggressive VAD to create short waveform segments (≤20s) (Kanda et al., 2021). In the presence of overlaps, modules are connected as follows:

Original waveform → CSS → VAD-split → Embedding extraction → Pooling/clustering → Concatenation → ASR per cluster.
Alternative (no CSS): direct VAD → embedding → clustering → ASR.
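The two connection schemes above can be sketched as a single function in which each module is an interchangeable callable. This is an illustrative skeleton only: the function name, its parameters, and the assumption that segments are lists that can be concatenated are all hypothetical, and real modules would operate on audio arrays.

```python
from collections import defaultdict


def run_modular_pipeline(waveform, split_by_vad, embed, cluster, recognize,
                         separate=None):
    """Run the cascade and return one transcript per speaker cluster."""
    # CSS is optional: without it, the original waveform is the only stream.
    streams = separate(waveform) if separate is not None else [waveform]
    # VAD splits every stream into short speech segments.
    segments = [seg for stream in streams for seg in split_by_vad(stream)]
    # One embedding per segment, then one cluster label per embedding.
    embeddings = [embed(seg) for seg in segments]
    labels = cluster(embeddings)
    by_speaker = defaultdict(list)
    for seg, label in zip(segments, labels):
        by_speaker[label].append(seg)
    transcripts = {}
    for label, segs in by_speaker.items():
        # Concatenate the speaker's segments and decode with single-speaker ASR.
        concatenated = [sample for seg in segs for sample in seg]
        transcripts[label] = recognize(concatenated)
    return transcripts
```

Passing `separate=None` gives the no-CSS alternative, and swapping any callable (for instance a different VAD) leaves the rest of the cascade untouched, which is the practical meaning of a module boundary here.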

Speaker embedding models (Res2Net, x-vector) are pre-trained on large-scale identification corpora (e.g., VoxCeleb1&2), while ASR modules are pre-trained and fine-tuned on large-scale and real-domain meeting data (e.g., 75k hours plus AMI-SDM), with CSS separation trained on simulated multi-speaker mixtures (Kanda et al., 2021).

4. Performance Metrics and Empirical Results

Speaker-attributed output is evaluated with concatenated minimum-permutation word error rate (cpWER) and speaker counting error (SCE). The table below summarizes cpWER on AMI-SDM (Kanda et al., 2021):

System	Sub (%)	Del (%)	Ins (%)	cpWER (%)	SCE
M1 (no CSS)	7.8	18.6	2.2	28.6	0.56
M2 (CSS)	9.2	11.6	4.0	24.8	0.00
J1 (E2E)	8.5	10.1	6.4	25.0	0.39
J2 (E2E + clustering)	8.5	9.0	5.0	22.6	0.56

Notably, CSS substantially reduces deletion errors (18.6% for M1 versus 11.6% for M2), lowering cpWER by 3.8 percentage points (28.6% to 24.8%). After small-scale fine-tuning, joint architectures outperform modular pipelines by 8.9–29.9% relative; modular systems win only before fine-tuning. Diarization errors, mainly from VAD, dominate the remaining cpWER gap (3.3% on IHM-MIX and 2.9% on SDM for J2). Oracle analyses confirm VAD as the limiting factor (Kanda et al., 2021).
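For readers unfamiliar with cpWER, the metric can be sketched as follows: each speaker's utterances are concatenated, word errors are computed for every pairing of reference and hypothesis speakers, and the pairing with the fewest total errors is kept. The code is a simplified illustration, not a reference scorer: function names are hypothetical, missing speakers are padded with empty transcripts as an assumption, the exhaustive search over permutations is only practical for a handful of speakers, and the reference is assumed to contain at least one word.

```python
from itertools import permutations


def word_errors(reference, hypothesis):
    """Levenshtein distance between two word lists (sub + del + ins)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cp_wer(references, hypotheses):
    """references, hypotheses: one concatenated transcript string per speaker."""
    refs = [r.split() for r in references]
    hyps = [h.split() for h in hypotheses]
    # Pad with empty transcripts so both sides have the same number of speakers.
    size = max(len(refs), len(hyps))
    refs += [[] for _ in range(size - len(refs))]
    hyps += [[] for _ in range(size - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    # Keep the speaker assignment that yields the fewest total word errors.
    best = min(
        sum(word_errors(ref, hyps[p]) for ref, p in zip(refs, perm))
        for perm in permutations(range(size))
    )
    return best / total_ref_words
```

Under this definition a speaker-counting mistake is penalized directly: an extra hypothesis speaker is matched against an empty reference, so all of its words count as insertions.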

5. Interpretability, Adaptability, and Error Propagation

MSA offers interpretable, debug-friendly development: each block is independently testable/diagnosable. Partial failures (e.g., missed overlap in CSS, over/under-clustering) are traceable, and individual modules can be adapted to new domains (e.g., fine-tuning ASR on target language or accent) without retraining the full system. However, the error propagation problem is intrinsic:

Upstream errors in CSS or diarization are irrecoverable for downstream ASR.
Even high-performing ASR cannot recover from mis-segmented or mis-attributed speech regions.

Joint end-to-end pipelines, though harder to interpret and less data-flexible, are optimized across the entire mapping and outperform modular systems if enough matched training data is available (Kanda et al., 2021).

6. Extensions and Contextual Generalizations

The modularity principle extends beyond strictly acoustic MSA. In multi-agent AI systems, an MSA formalism decomposes dialogue into Speaker Role, Responsibility Chain Tracker, and Contextual Integrity modules. These support the explicit modeling of turn-taking, responsibility transfer, and context drift. Structural metrics (Pragmatic Consistency, Responsibility Flow, Context Stability) are formally scored, and prototype configuration languages such as G-Code enable API control of module parameters at deployment (Toh et al., 1 Jun 2025).

Analogous advances arise in training-free modular pipelines for speaker identity assignment, where off-the-shelf SD and ASR modules are coupled with LLM-based semantic refiners to map pseudo-speaker labels to persistent real-world identities, substantially reducing diarization error rates without retraining (Chen et al., 18 Sep 2025).

Spatial modular pipelines leverage multi-channel input for staged separation (CSS, cACGMM, GSS), outperforming monolithic end-to-end models for far-field, multi-party meetings (Wang et al., 2024).

7. Practical Applications and Trade-offs

MSA underpins robust, adaptable SA-ASR deployment in meetings, healthcare, and assistive perception scenarios:

Integration into multimodal perception frameworks enables speaker “gating” for downstream attention and affect recognition (Anchan et al., 25 Nov 2025).
Exploitation of modularity allows piecemeal updates (e.g., new VAD for noisy environments).
Explicit modularity supports rapid re-targeting and error analysis, at the cost of potentially lower accuracy than tuned joint systems, especially as data grows and overlaps become more common.

In summary, the Modular Speaker Architecture remains foundational for speaker-attribution pipelines, offering interpretability, flexible adaptation, and robust domain handling at the cost of error accumulation and, in unconstrained conditions, sub-optimal global accuracy compared to jointly optimized end-to-end systems (Kanda et al., 2021; Landini, 2024).