Gated Cross-Attention for Diarization in EEND-M2F
- The paper introduces a novel gated cross-attention mechanism that dynamically masks irrelevant frames, significantly reducing Diarization Error Rate in EEND-M2F.
- The approach reconceptualizes diarization as a 1-D instance segmentation task using learnable queries and deep supervision, eliminating the need for clustering.
- Empirical results across benchmarks show that this method improves convergence, cluster purity, and overall efficiency in challenging multi-speaker scenarios.
Gated cross-attention for diarization refers to a mechanism in end-to-end neural speaker diarization systems where each latent speaker query dynamically attends only to the time frames currently predicted as relevant to its associated speaker. The concept is formalized in the EEND-M2F model, which frames speaker diarization as a 1-D instance segmentation problem analogous to Mask2Former architectures from computer vision, but adapted to sequential speech data. The gating in the cross-attention blocks explicitly excludes irrelevant frames, thereby enhancing both convergence efficiency and the fidelity of the learned speaker representations (Härkönen et al., 2024).
1. Reformulation of Diarization as 1-D Instance Segmentation
EEND-M2F reconceptualizes speaker diarization by leveraging the instance-segmentation paradigm from vision. Instead of 2-D per-pixel masks, EEND-M2F operates on 1-D per-frame speaker activity masks. The input $\mathbf{X}$, typically a sequence of log-Mel spectrogram features, is first down-sampled by convolutional layers, transformed by Conformer blocks into a low-resolution embedding, and subsequently up-sampled to produce high-resolution per-frame embeddings $\mathbf{E}$. Speaker hypotheses are realized as a fixed pool of $N$ learnable queries, each traversing $L$ stacked decoder layers comprised of cross-attention, self-attention, and feed-forward sub-blocks. This design enables parallel computation of sets of speaker activity masks, without recourse to clustering or external segmentation (Härkönen et al., 2024).
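The query-to-mask step above can be sketched in a few lines of numpy. The shapes, the names `E` and `Q`, and the dot-product-plus-sigmoid mask head are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, N = 200, 64, 8          # frames, embedding dim, query slots (assumed sizes)

# High-resolution per-frame embeddings (stand-in for the Conformer + upsampling path)
E = rng.standard_normal((T, D))
# Fixed pool of learnable speaker queries
Q = rng.standard_normal((N, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Each query's per-frame speaker-activity mask comes from a query/frame-embedding
# dot product, computed for all N queries in parallel -- no clustering step.
soft_masks = sigmoid(Q @ E.T)          # shape (N, T), values in [0, 1]
hard_masks = soft_masks > 0.5          # thresholded binary activity masks
```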
2. Gated Cross-Attention Mechanism
At each decoder layer $l$, gated (or masked) cross-attention limits each query's attention exclusively to frames predicted as relevant to that query's speaker. Letting $\mathbf{X}_{l-1}$ denote the current state of the queries and $M_{l-1}$ the binary mask from the previous layer, attention is computed as:

$$\mathbf{X}_l = \operatorname{softmax}\!\left(\mathcal{M}_{l-1} + \mathbf{Q}_l \mathbf{K}_l^{\top}\right)\mathbf{V}_l + \mathbf{X}_{l-1},$$

where $\mathbf{Q}_l$, $\mathbf{K}_l$, and $\mathbf{V}_l$ are the query, key, and value projections, and the gating term $\mathcal{M}_{l-1}(q, t)$ is set to $-\infty$ when $M_{l-1}(q, t) = 0$ (hard-masking out non-relevant frames), and $0$ otherwise. The result is a tightly focused contextualization for each query, one that reduces spurious cross-talk between speakers and prevents interference from non-target speech segments. After each layer updates the speaker queries, a per-frame soft mask is predicted, thresholded, and fed back into the next layer to recursively refine the attention gating (Härkönen et al., 2024).
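A minimal numpy sketch of the gating step, assuming single-head attention and small illustrative shapes (the actual model uses learned multi-head projections):

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, N = 50, 16, 4                   # frames, model dim, queries (assumed sizes)

X = rng.standard_normal((N, D))       # current query states X_{l-1}
K = rng.standard_normal((T, D))       # per-frame keys
V = rng.standard_normal((T, D))       # per-frame values
# Binary masks from the previous layer: which frames each query may attend to.
M_prev = rng.random((N, T)) > 0.5
M_prev[:, 0] = True                   # ensure no query is fully masked out

def masked_cross_attention(X, K, V, M):
    """Additive gate: 0 where the previous mask is active, -inf elsewhere."""
    gate = np.where(M, 0.0, -np.inf)                 # (N, T)
    scores = X @ K.T / np.sqrt(K.shape[1]) + gate    # (N, T)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(scores)                            # exp(-inf) -> exactly 0 weight
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ V + X                              # residual query update

X_next = masked_cross_attention(X, K, V, M_prev)
```

Because the gate is additive in logit space, masked frames receive exactly zero attention weight after the softmax, so each query's update is contextualized only by its hypothesized speaker's frames.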
3. Supervision, Matching, and Training Objective
Since the queries and references are unordered, permutation-invariant matching is realized by solving a Hungarian assignment between predicted masks and ground-truth speaker tracks. The loss for each matched pair $(\hat m_i, m_{\sigma(i)})$ combines binary cross-entropy and Dice loss:

$$\mathcal{L}_i = \lambda_{\mathrm{ce}}\, \mathcal{L}_{\mathrm{ce}}\!\left(\hat m_i, m_{\sigma(i)}\right) + \lambda_{\mathrm{dice}}\, \mathcal{L}_{\mathrm{dice}}\!\left(\hat m_i, m_{\sigma(i)}\right) - \log p_i,$$

where $\mathcal{L}_{\mathrm{ce}}$ denotes per-frame binary cross-entropy, $\mathcal{L}_{\mathrm{dice}}$ is the Dice-coefficient-based loss, and $p_i$ is the learned probability of query $i$ being a real speaker. The total loss sums over matched and unmatched queries, down-weighting negatives to mitigate class imbalance. Deep supervision is applied at all decoder layers, ensuring effective gradient propagation despite the gating operations (Härkönen et al., 2024).
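The matching-plus-loss step can be illustrated with a brute-force permutation search, which is equivalent to the Hungarian algorithm for small query counts; the unit loss weights and helper names here are assumptions for the sketch:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
T, N = 100, 3                         # frames, speakers (small: brute-force is fine)

pred = rng.random((N, T))             # predicted soft masks in (0, 1)
ref = (rng.random((N, T)) > 0.7).astype(float)  # ground-truth binary tracks

def bce(p, y, eps=1e-7):
    """Per-frame binary cross-entropy, averaged over frames."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def dice_loss(p, y, eps=1e-7):
    """One minus the soft Dice coefficient."""
    return 1 - (2 * (p * y).sum() + eps) / (p.sum() + y.sum() + eps)

def pair_cost(p, y, l_ce=1.0, l_dice=1.0):
    return l_ce * bce(p, y) + l_dice * dice_loss(p, y)

# Permutation-invariant matching: enumerate all assignments of predictions
# to references and keep the cheapest one.
cost = np.array([[pair_cost(pred[i], ref[j]) for j in range(N)]
                 for i in range(N)])
best = min(itertools.permutations(range(N)),
           key=lambda perm: sum(cost[i, perm[i]] for i in range(N)))
loss = sum(cost[i, best[i]] for i in range(N)) / N
```

In practice `scipy.optimize.linear_sum_assignment` solves the same assignment in polynomial time, which matters once the query pool is larger than a handful of speakers.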
4. Comparative Merits over Traditional Architectures
Standard Transformer-based diarization without gating requires speaker queries to simultaneously identify relevant and ignore irrelevant frames, an approach that degrades amid overlapping or multi-party speech due to excessive distractor signals. In contrast, the masked cross-attention structure gates out irrelevant frames at each layer, sharply focusing learning and representation on the hypothesized target speaker’s segments alone. Empirical ablation demonstrates that introducing masking reduces Diarization Error Rate (DER) by approximately four percentage points absolute; deep supervision is necessary to maintain mask quality through the network’s depth. This approach accelerates optimization, improves cluster purity, and enhances convergence properties, particularly in challenging conversational scenarios (Härkönen et al., 2024).
5. Experimental Findings and SOTA Benchmarks
EEND-M2F sets state-of-the-art performance across multiple public diarization corpora under stringent evaluation: AMI, AliMeeting, RAMC, DIHARD-III, and others. For reference, without dataset-specific tuning, reported DERs are 16.28% on DIHARD-III (versus 20.69% for EEND-EDA), 13.20% on AliMeeting-far (versus 23.3% for PyAnnote3.1), and 11.13% on RAMC (previous best: 13.58%). With dataset-specific fine-tuning, reported DERs include 16.07% on DIHARD-III and 13.40% on AliMeeting. All results are achieved with a lightweight (approximately 16 million parameters), real-time, single-model system, without clustering, oracle VAD, or an oracle speaker count. These outcomes indicate that gated cross-attention in mask-transformer architectures enables end-to-end diarization models that match or surpass traditional multi-stage pipelines (Härkönen et al., 2024).
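For context, DER aggregates missed speech, false alarms, and speaker confusion over total reference speech. A simplified frame-level sketch (ignoring the scoring collars and optimal speaker-label mapping that standard tools such as md-eval perform):

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level DER from binary (speakers x frames) activity matrices,
    assuming speaker labels are already aligned between ref and hyp."""
    ref_count = ref.sum(axis=0)        # reference speakers active per frame
    hyp_count = hyp.sum(axis=0)        # hypothesized speakers active per frame
    miss = np.maximum(ref_count - hyp_count, 0).sum()
    fa = np.maximum(hyp_count - ref_count, 0).sum()
    # Confusion: frames where activity counts agree but the speakers differ.
    correct = np.minimum(ref, hyp).sum(axis=0)
    conf = (np.minimum(ref_count, hyp_count) - correct).sum()
    return (miss + fa + conf) / ref_count.sum()

ref = np.array([[1, 1, 0, 0],          # 2 speakers, 4 frames
                [0, 1, 1, 0]])
hyp = np.array([[1, 1, 1, 0],          # frame 2 credited to the wrong speaker
                [0, 1, 0, 0]])
print(frame_der(ref, hyp))             # 1 confused frame / 4 speech frames = 0.25
```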
6. Broader Methodological Connections and Implications
The explicit adaptation of the Mask2Former architecture to diarization establishes a conceptual bridge between vision-based instance segmentation and structured prediction in speech. The central operational insight is that masking (gating) in cross-attention not only enforces modularity between speaker attractors but also alleviates optimization noise in permutation-invariant tasks by limiting the candidate evidence for each attractor at every network layer. A plausible implication is that similar architectures could apply to other sequence-modeling settings where latent causes must be assigned to sequential events in a permutation-agnostic manner. The approach demonstrates that end-to-end masked-attention transformers, when coupled with dynamic gating, obviate the need for clustering or post-processing, simplifying pipelines while achieving strong empirical performance (Härkönen et al., 2024).