
Spatially-Augmented S2S Neural Diarization

Updated 14 October 2025
  • The paper introduces a novel architecture that fuses DOA-based spatial cues with a sequence-to-sequence diarization model to enhance speaker discrimination.
  • SA-S2SND employs a staged training strategy with single-channel DOA augmentation and cross-channel attention to optimize performance in diverse acoustic environments.
  • Empirical evaluations demonstrate significant diarization error rate reductions and robust performance in both online and offline multi-speaker meeting applications.

Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND) refers to a family of diarization systems that combine powerful sequence-to-sequence neural architectures with explicit spatial cues (e.g., direction of arrival, or DOA) derived from multi-channel audio. This integration enhances speaker discrimination and robustness, especially in complex acoustic environments such as multi-speaker meetings. The SA-S2SND framework improves on standard S2SND models by fusing spatial features with acoustic representations during both training and inference, yielding measurable performance improvements, particularly in diarization error rate (DER), in both online and offline applications (Li et al., 10 Oct 2025).

1. Core Architecture and Spatial Augmentation

The underlying S2SND backbone comprises the following components (see the sketch after this list):

  • A ResNet-based feature extractor yielding frame-level acoustic embeddings.
  • A Conformer encoder for modeling long-range temporal dependencies.
  • Two symmetric decoders: a representation decoder to generate speaker embeddings, and a detection decoder to output framewise speaker activities.
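
A minimal PyTorch-style skeleton along these lines is shown below. The layer choices (plain convolutions and a Transformer encoder standing in for the ResNet and Conformer blocks), the speaker-slot count, and the decoder pooling are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class S2SNDBackbone(nn.Module):
    """Minimal S2SND skeleton: frame-level front-end, sequence encoder,
    and two symmetric decoders (representation + detection)."""
    def __init__(self, n_mels=80, d_model=256, n_spk=4, emb_dim=192):
        super().__init__()
        # Stand-in for the ResNet feature extractor (frame-level embeddings).
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
        )
        # Stand-in for the Conformer encoder (long-range temporal modeling).
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # Representation decoder: one embedding per speaker slot.
        self.repr_decoder = nn.Linear(d_model, n_spk * emb_dim)
        # Detection decoder: framewise activity logits per speaker slot.
        self.det_decoder = nn.Linear(d_model, n_spk)

    def forward(self, feats):                       # feats: (B, T, n_mels)
        x = self.frontend(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, D)
        x = self.encoder(x)                         # (B, T, D)
        spk_emb = self.repr_decoder(x.mean(dim=1))  # (B, n_spk * emb_dim)
        activity = self.det_decoder(x)              # (B, T, n_spk) logits
        return spk_emb, activity
```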

SA-S2SND augments these components with external DOA information:

  • DOA cues are estimated using an SRP-DNN module, which outputs a matrix $O$ encoding the per-frame DOA probability across azimuth bins (typically $5^\circ$ increments over $[-180^\circ, 180^\circ)$).
  • The DOA matrix $O$ is upsampled and projected through a linear layer to the hidden dimension $D$.
  • The spatial representation is fused into the encoder output $X$ as follows:

$$X \leftarrow X + \frac{\mathrm{Linear}(\mathrm{interpolate}(O))}{\sqrt{D}}$$

This embedding process is analogous to positional encoding: it explicitly injects spatial context into the neural backbone, improving the model's ability to resolve overlapping speakers and spatially close talkers.
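
The fusion step can be sketched in PyTorch as follows; the tensor shapes, the 72-bin azimuth grid ($5^\circ$ steps), and the module name are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DOAFusion(nn.Module):
    """Injects an SRP-DNN DOA probability matrix O into encoder features X
    as a scaled residual, analogous to positional encoding."""
    def __init__(self, n_bins=72, d_model=256):  # 72 bins = 5-degree steps
        super().__init__()
        self.proj = nn.Linear(n_bins, d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, X, O):
        # X: (B, T_enc, D) encoder output; O: (B, T_doa, n_bins) DOA posteriors.
        # Upsample the DOA track to the encoder frame rate.
        O_up = F.interpolate(O.transpose(1, 2), size=X.shape[1],
                             mode="linear", align_corners=False)
        O_up = O_up.transpose(1, 2)               # (B, T_enc, n_bins)
        # X <- X + Linear(interpolate(O)) / sqrt(D)
        return X + self.proj(O_up) / self.scale
```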

2. Staged Training Regime

SA-S2SND employs a multistage training strategy to maximize both generalizability and spatial awareness:

Part A (Single-Channel DOA-Augmented Training):

  • Single-channel audio (including simulated mixtures) is used, paired with either real (SRP-DNN estimated) or simulated DOA annotations.
  • Pseudo-DOA simulation involves randomly assigning azimuth values to active speakers, with random jitter incorporated for data diversity.
  • Training proceeds through three sub-stages: ResNet freezing, mixed data training (real and simulated), and final fine-tuning.

Part B (Multi-Channel and Channel Attention Training):

  • The model is extended to multi-channel inputs by adding a cross-channel attention branch.
  • Cross-channel attention operates over per-frame embeddings from each microphone, capturing complex spatial patterns.
  • The extended model is first trained with a frozen backbone; the entire architecture is then unfrozen and jointly optimized.

This strategic curriculum enables robust spatial feature integration, efficient learning from limited real multi-channel corpora, and smooth transition to multi-channel inference (Li et al., 10 Oct 2025).
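
One way this curriculum might be expressed with parameter freezing is sketched below; `model.frontend`, `channel_attention`, and `train_stage` are hypothetical names, and the sub-stage granularity follows the description above.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Toggle gradient updates for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def run_curriculum(model, channel_attention, train_stage):
    # Part A, sub-stage 1: freeze the ResNet front-end; train the rest on
    # single-channel audio with real (SRP-DNN) or simulated DOA labels.
    set_trainable(model.frontend, False)
    train_stage("A1: frozen front-end")

    # Part A, sub-stages 2-3: mix real and simulated data, then unfreeze
    # everything for final single-channel fine-tuning.
    set_trainable(model.frontend, True)
    train_stage("A2: mixed data")
    train_stage("A3: fine-tune")

    # Part B: train the new cross-channel attention branch with the
    # backbone frozen, then jointly optimize the full architecture.
    set_trainable(model, False)
    set_trainable(channel_attention, True)
    train_stage("B1: channel attention only")
    set_trainable(model, True)
    train_stage("B2: joint fine-tune")
```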

3. Simulated DOA Generation and Data Scalability

A distinguishing aspect of SA-S2SND is its simulated DOA generation, which mitigates the paucity of large, matched multi-channel datasets:

  • For simulated mixtures, active segments (VAD-based) are assigned DOA traces by sampling random trajectories with framewise jitter.
  • These pseudo-DOA features are then processed the same way as real DOA features.
  • The model thus becomes adept at attending to spatial cues even when trained predominantly with single-channel data.

This simulation-based augmentation ensures scalability and generalization across recording conditions and microphone geometries without overfitting to a single spatial setup.
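
A minimal sketch of such pseudo-DOA generation is given below; the one-hot encoding, jitter magnitude, and bin layout are illustrative assumptions.

```python
import numpy as np

def simulate_doa(vad, n_bins=72, jitter_deg=5.0, rng=None):
    """Build a pseudo-DOA posterior matrix for one simulated mixture.

    vad: (n_speakers, n_frames) binary voice-activity matrix.
    Returns O: (n_frames, n_bins) with mass at each active speaker's
    jittered azimuth bin.
    """
    rng = rng or np.random.default_rng()
    n_spk, n_frames = vad.shape
    bin_width = 360.0 / n_bins
    # Assign each speaker a random base azimuth in [-180, 180).
    base_az = rng.uniform(-180.0, 180.0, size=n_spk)
    O = np.zeros((n_frames, n_bins), dtype=np.float32)
    for k in range(n_spk):
        # Framewise jitter around the base trajectory for data diversity.
        az = base_az[k] + rng.normal(0.0, jitter_deg, size=n_frames)
        bins = ((az + 180.0) / bin_width).astype(int) % n_bins
        active = vad[k].astype(bool)
        O[active, bins[active]] = 1.0
    return O
```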

4. Channel Attention Mechanisms

The multi-channel version of SA-S2SND (sometimes denoted MC-S2SND or E4) incorporates a cross-channel attention mechanism:

  • For each time step, embeddings from $C$ channels are concatenated; attention is computed along the channel axis.
  • This allows the model to exploit spatial diversity and inter-channel redundancies, further disambiguating speakers in reverberant or overlapping settings.
  • Channel attention operates alongside DOA guidance, with learned inter-channel weights modulating the fused representation (a sketch follows this list).
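
A minimal sketch of attention over the channel axis, assuming per-frame self-attention with mean pooling across channels (the paper may use a different pooling or gating scheme):

```python
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    """Per-frame self-attention across C microphone channels."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, T, D) per-channel frame embeddings.
        B, C, T, D = x.shape
        # Treat each frame independently; attend along the channel axis.
        x = x.permute(0, 2, 1, 3).reshape(B * T, C, D)   # (B*T, C, D)
        fused, _ = self.attn(x, x, x)   # learned weights over channels
        fused = fused.mean(dim=1)       # pool channels -> (B*T, D)
        return fused.reshape(B, T, D)
```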

Notably, the empirical combination of explicit DOA features and channel attention yields synergistic performance gains, exceeding either approach applied in isolation. Reported improvements include a >19% relative DER reduction (compared to S2SND) in offline diarization (Li et al., 10 Oct 2025).

5. Mathematical Formulations and DOA Feature Injection

The spatial feature injection and supervision are grounded in explicit mathematical formulations.

DOA Target Generation:

  • For a given frame $n$, the target for each microphone pair $(m, m')$ is aggregated as:

$$R_{mm'}(n) = \sum_{k=1}^{K} \beta_k(n) \, r_{mm'}(\theta_k(n))$$

where $r_{mm'}(\theta_k)$ is the DP-IPD vector at azimuth $\theta_k$ and $\beta_k(n)$ is the activity probability.

SRP-DNN Spatial Spectrum:

  • The SRP-like spectrum for candidate DOA $\theta$ is computed as:

$$P'(\theta; n) = \frac{2}{M(M-1)F} \sum_{m=1}^{M-1} \sum_{m'=m+1}^{M} \Re\left\{ \hat{R}_{mm'}(n)^{\mathrm{H}} \, r_{mm'}(\theta) \right\}$$

($M$ microphones, $F$ frequency bins).
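
Both formulas can be sketched numerically as follows; the far-field DP-IPD model, sampling rate, and array geometry are placeholder assumptions.

```python
import numpy as np

def dp_ipd(theta_deg, pair_dist, n_freq=257, fs=16000, c=343.0):
    """DP-IPD vector r_mm'(theta): expected direct-path inter-channel
    phase differences over n_freq frequency bins for a far-field source
    at azimuth theta (simplified two-microphone geometry)."""
    tau = pair_dist * np.cos(np.deg2rad(theta_deg)) / c   # TDOA in seconds
    freqs = np.linspace(0.0, fs / 2.0, n_freq)            # bin centers in Hz
    return np.exp(-1j * 2.0 * np.pi * freqs * tau)        # shape (n_freq,)

def doa_target(betas, thetas, pair_dist, n_freq=257):
    """R_mm'(n) = sum_k beta_k(n) * r_mm'(theta_k(n)) for one frame."""
    return sum(b * dp_ipd(t, pair_dist, n_freq) for b, t in zip(betas, thetas))

def srp_spectrum(R_hat, pair_dists, theta_grid):
    """P'(theta; n): SRP-like spectrum over a grid of candidate azimuths.

    R_hat: dict {(m, m'): complex array of shape (n_freq,)} of estimated
    DP-IPD statistics per microphone pair; pair_dists matches its keys.
    """
    M = 1 + max(m2 for _, m2 in R_hat)       # mics, assuming 0-indexed pairs
    n_freq = next(iter(R_hat.values())).shape[0]
    P = np.zeros(len(theta_grid))
    for i, theta in enumerate(theta_grid):
        acc = 0.0
        for (m, m2), R in R_hat.items():
            r = dp_ipd(theta, pair_dists[(m, m2)], n_freq)
            acc += np.real(np.vdot(R, r))    # Re{ R^H r }
        P[i] = 2.0 * acc / (M * (M - 1) * n_freq)
    return P
```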

Spatial Cues to Encoder Fusion:

  • After upsampling and linear projection to $D$ dimensions, the DOA matrix is added as a residual to the encoder feature stream.

Loss Combination:

  • Training uses a hybrid of binary cross-entropy (BCE) loss for speaker activity (detection decoder) and ArcFace loss for speaker-embedding discrimination (representation decoder), as sketched below.
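
A sketch of this hybrid objective, with a minimal ArcFace implementation; the margin, scale, class count, and loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    """ArcFace: softmax over cosine logits with an additive angular margin."""
    def __init__(self, emb_dim, n_classes, margin=0.2, scale=32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # Cosine similarity between normalized embeddings and class centers.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.shape[1]).bool()
        # Add the angular margin only on the target class.
        logits = self.scale * torch.where(
            target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(logits, labels)

bce = nn.BCEWithLogitsLoss()                        # framewise activity
arcface = ArcFaceLoss(emb_dim=192, n_classes=1000)  # speaker classification

def diarization_loss(act_logits, act_targets, spk_emb, spk_labels, lam=1.0):
    """Hybrid objective: BCE on the detection decoder's activity logits
    plus ArcFace on the representation decoder's speaker embeddings."""
    return bce(act_logits, act_targets.float()) + lam * arcface(spk_emb, spk_labels)
```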

6. Empirical Evaluation and Application Significance

On the AliMeeting corpus, SA-S2SND demonstrates:

  • A 7.4% relative DER reduction in offline mode over the S2SND baseline (DER drops from 13.59% to 12.59%).
  • More than 19% DER improvement when channel attention is combined with spatial augmentation.
  • Notable robustness in both simple (1–2 speaker) and complex (multi-speaker, high-overlap) meeting conditions.

The framework supports both online (block-wise) and offline (global rescoring) inference by integrating spatial cues at the encoder stage—making it suitable for streaming meeting transcription as well as post-hoc diarization and speaker-attributed ASR.

7. Practical Considerations and Limitations

The SA-S2SND approach enables:

  • Enhanced separation of co-located or overlapping speakers by leveraging explicit spatial information.
  • Scalability to new multi-microphone environments via pseudo-DOA data augmentation.
  • Improved performance without requiring vast fully multi-channel annotated corpora.

Challenges include:

  • Ensuring the quality and temporal resolution of DOA estimates, whether real or simulated.
  • Balancing computational costs inherent in multi-channel feature extraction and attention modules.
  • Integrating spatial cues when true DOA estimation is unreliable due to microphone geometry or environmental conditions.

Nonetheless, empirical results confirm that spatial augmentation substantially improves neural diarization performance in realistic multi-speaker meeting scenarios (Li et al., 10 Oct 2025).


In summary, Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND) systematically incorporates DOA-derived spatial cues into an S2SND backbone using staged training, simulated spatial data augmentation, and channel attention mechanisms. This integration improves diarization accuracy, speaker-separation robustness, and application flexibility in multi-channel meeting scenarios.

References

1. Li et al., "Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND)," 10 Oct 2025.