MossFormer2-Separated Speech
- The paper introduces a novel hybrid architecture combining self-attention and FSMN modules to model both global and local temporal dependencies in monaural speech separation.
- Key results show an SI-SNRi improvement of +1.3 dB over MossFormer, with state-of-the-art performance on benchmarks including WSJ0-2mix/3mix, Libri2Mix, WHAM!, and WHAMR!.
- Methodologically, dense connections and gated convolutional units enable robust gradient flow and efficient computation while balancing model complexity with performance.
MossFormer2-Separated Speech refers to monaural speech separation using the MossFormer2 architecture, a hybrid neural network model designed for time-domain source separation. MossFormer2 enhances the MossFormer framework by integrating a self-attention-based module with a feedforward sequential memory network (FSMN)–based recurrent module. This combination models both long-range (coarse) and fine-scale (recurrent) temporal dependencies in speech, yielding state-of-the-art results on several benchmark datasets.
1. Architectural Overview
MossFormer2 processes a monaural speech mixture through the following pipeline:
- Encoder: A 1-D convolutional layer (kernel size 16, stride 8) followed by ReLU produces an embedding $X \in \mathbb{R}^{D \times S}$ with $D$ channels over $S$ frames.
- Separator: An $R$-layer stack of hybrid blocks, each comprising:
  - MossFormer (self-attention module): Implements joint local-global self-attention. Local heads restrict attention to context windows; global heads utilize linearized attention, achieving $O(S)$ complexity in the sequence length.
  - Recurrent Module: An RNN-free block based on dilated FSMN, employing gated convolutional units (GCUs) and dense inter-layer connections for modeling fine-scale patterns.
- Mask Estimator: A convolutional layer produces $C$ masks $M_1, \dots, M_C$ (one per source).
- Masked Embedding Application: Each mask is applied element-wise to $X$ to yield source-specific embeddings $X_c = M_c \odot X$.
- Decoder: A transposed 1-D convolutional layer (mirroring the encoder) reconstructs the time-domain separated signals $\hat{s}_1, \dots, \hat{s}_C$; a minimal end-to-end sketch of this pipeline follows the list.
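Below is a minimal PyTorch sketch of this encoder-separator-mask-decoder pipeline. It assumes two sources, an identity placeholder for the hybrid-block separator, pointwise-convolution mask estimation, and a ReLU mask nonlinearity; these choices and all module names are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class MossFormer2Pipeline(nn.Module):
    """Schematic encoder -> separator -> mask -> decoder pipeline (illustrative only)."""

    def __init__(self, emb_dim=512, kernel_size=16, stride=8, num_sources=2):
        super().__init__()
        self.num_sources = num_sources
        # Encoder: 1-D conv (kernel 16, stride 8) + ReLU produces the embedding X.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, emb_dim, kernel_size, stride=stride, bias=False),
            nn.ReLU(),
        )
        # Separator: R-layer stack of hybrid (attention + recurrent) blocks; stubbed here.
        self.separator = nn.Identity()
        # Mask estimator: pointwise conv producing one mask per source (activation assumed).
        self.mask_estimator = nn.Sequential(
            nn.Conv1d(emb_dim, emb_dim * num_sources, kernel_size=1),
            nn.ReLU(),
        )
        # Decoder: transposed 1-D conv mirroring the encoder.
        self.decoder = nn.ConvTranspose1d(emb_dim, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mixture):                        # mixture: (batch, samples)
        x = self.encoder(mixture.unsqueeze(1))         # (batch, emb_dim, frames)
        feats = self.separator(x)                      # hybrid-block stack
        masks = self.mask_estimator(feats)             # (batch, emb_dim * C, frames)
        masks = masks.view(masks.size(0), self.num_sources, -1, masks.size(-1))
        sources = [self.decoder(masks[:, c] * x) for c in range(self.num_sources)]
        return torch.stack(sources, dim=1).squeeze(2)  # (batch, C, samples)

# Usage: separate a 1-second, 8 kHz mixture into two estimated sources.
model = MossFormer2Pipeline()
print(model(torch.randn(1, 8000)).shape)  # torch.Size([1, 2, 8000])
```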
Hybrid Block Structure:
- The attention block models long-range dependencies (a simplified sketch of the joint local-global attention follows this list).
- The FSMN block, organized as bottleneck → GCU → output layers, models short-range, local recurrences. The recurrent module uses only convolutions and linear projections, enabling fully parallel sequence processing.
- The GCU comprises two Conv-U branches yielding $U$ and $V$; $U$ is passed through the dilated FSMN to produce $U'$. The output is the gated product $U' \odot V$, integrated with residual skip connections.
- Dense connections within FSMN connect all intermediate activations within each block, facilitating gradient flow and broadening receptive fields.
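The joint local-global attention referenced above can be illustrated with a single-head sketch: exact softmax attention inside fixed, non-overlapping chunks (local) plus a linearized branch over the full sequence (global). The chunk size, the ReLU feature map, and the omission of MossFormer's gating and convolutional machinery are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLocalGlobalAttention(nn.Module):
    """Single-head sketch: chunked full attention (local) + linearized attention (global)."""

    def __init__(self, dim, chunk_size=256):
        super().__init__()
        self.chunk_size = chunk_size
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                              # x: (batch, seq, dim)
        b, s, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        # Local branch: exact attention restricted to non-overlapping chunks.
        pad = (-s) % self.chunk_size                   # zero-pad the tail to a full chunk
        qp, kp, vp = (F.pad(t, (0, 0, 0, pad)) for t in (q, k, v))
        n = qp.size(1) // self.chunk_size
        qc, kc, vc = (t.view(b, n, self.chunk_size, d) for t in (qp, kp, vp))
        local = F.softmax(qc @ kc.transpose(-1, -2) / d ** 0.5, dim=-1) @ vc
        local = local.view(b, n * self.chunk_size, d)[:, :s]

        # Global branch: linearized attention, O(S) in the sequence length S.
        qg, kg = F.relu(q), F.relu(k)                  # simple positive feature map
        kv = torch.einsum("bsd,bse->bde", kg, v)       # (dim x dim) summary of the sequence
        z = 1.0 / (torch.einsum("bsd,bd->bs", qg, kg.sum(dim=1)) + 1e-6)
        global_ = torch.einsum("bsd,bde,bs->bse", qg, kv, z)

        return self.out(local + global_)

# Usage: attend over a 999-frame, 512-dim embedding sequence.
attn = JointLocalGlobalAttention(dim=512)
print(attn(torch.randn(1, 999, 512)).shape)  # torch.Size([1, 999, 512])
```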
2. Mathematical Formulation
2.1 Gated Convolutional Unit (GCU)
Given the bottleneck output $X$ (a sequence of $D'$-dimensional embeddings), the GCU computes

$$U = \mathrm{ConvU}(X), \qquad V = \mathrm{ConvU}(X), \qquad U' = \mathrm{FSMN}(U), \qquad Y = X + U' \odot V,$$

where $\odot$ denotes element-wise multiplication.
Conv-U is defined by a linear projection followed by a nonlinear activation $\sigma$ and a depthwise 1-D convolution:

$$\mathrm{ConvU}(X) = \mathrm{DWConv1D}\big(\sigma(\mathrm{Linear}(X))\big).$$
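A minimal sketch of the GCU under these definitions: the internal ordering of Conv-U (linear projection, SiLU activation, depthwise convolution) and its kernel size are assumptions of this sketch, and the dilated FSMN is stubbed (a sketch of it follows in Section 2.2).

```python
import torch
import torch.nn as nn

class ConvU(nn.Module):
    """Conv-U branch sketch: linear projection, activation, depthwise 1-D convolution."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.act = nn.SiLU()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        x = self.act(self.proj(x))
        return self.dwconv(x.transpose(1, 2)).transpose(1, 2)

class GCU(nn.Module):
    """Gated convolutional unit: U' = FSMN(ConvU(x)), V = ConvU(x), Y = x + U' * V."""

    def __init__(self, dim, fsmn=None):
        super().__init__()
        self.branch_u = ConvU(dim)
        self.branch_v = ConvU(dim)
        self.fsmn = fsmn if fsmn is not None else nn.Identity()  # dilated FSMN (stubbed)

    def forward(self, x):
        u = self.fsmn(self.branch_u(x))          # memory-enriched branch U'
        v = self.branch_v(x)                     # gating branch V
        return x + u * v                         # gated product with residual connection

# Usage at the bottleneck width D' = 256.
gcu = GCU(dim=256)
print(gcu(torch.randn(1, 999, 256)).shape)  # torch.Size([1, 999, 256])
```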
2.2 Dilated FSMN Memory Taps
After a feedforward layer producing hidden activations $h_t = \mathrm{FFN}(u_t)$, the memory output at time $t$ is

$$m_t = h_t + \sum_{i=1}^{N} a_i \odot h_{t - d \cdot i},$$

where $d$ is the dilation factor, $N$ is the number of memory taps, and the taps are implemented as a 2-D convolution over grouped channels.
Dense connections link the memory layers: the $\ell$-th layer operates on the concatenation of all earlier outputs,

$$h^{(\ell)} = \mathcal{F}^{(\ell)}\big([\, h^{(0)}; h^{(1)}; \dots; h^{(\ell-1)} \,]\big),$$

where each $\mathcal{F}^{(\ell)}$ is a padding → 2-D conv → InstanceNorm → PReLU stack and $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation.
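A sketch of the dilated, densely connected memory under this formulation: the memory taps are realized here with a dilated depthwise 1-D convolution rather than a grouped 2-D convolution, and the number of layers, kernel size, and power-of-two dilation schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseDilatedFSMN(nn.Module):
    """Dilated FSMN memory sketch with dense inter-layer connections."""

    def __init__(self, dim, num_layers=4, kernel_size=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            dilation = 2 ** i                     # illustrative power-of-two schedule
            in_dim = dim * (i + 1)                # dense: concat of input and all earlier outputs
            self.layers.append(nn.Sequential(
                nn.Conv1d(in_dim, dim, kernel_size=1),                         # mix concatenated features
                nn.Conv1d(dim, dim, kernel_size, dilation=dilation,
                          padding=dilation * (kernel_size // 2), groups=dim),  # dilated memory taps
                nn.InstanceNorm1d(dim, affine=True),
                nn.PReLU(),
            ))

    def forward(self, h):                         # h: (batch, seq, dim), the FFN output h_t
        h = h.transpose(1, 2)                     # -> (batch, dim, seq) for Conv1d
        outputs = [h]
        for layer in self.layers:
            outputs.append(layer(torch.cat(outputs, dim=1)))
        return (h + outputs[-1]).transpose(1, 2)  # memory output m_t = h_t + taps

# Usage: drop-in for the `fsmn` argument of the GCU sketch in Section 2.1.
fsmn = DenseDilatedFSMN(dim=256)
print(fsmn(torch.randn(1, 999, 256)).shape)  # torch.Size([1, 999, 256])
```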
2.3 Bottleneck and Output Layers
A bottleneck layer projects the embedding from the model dimension $D$ down to the bottleneck width (256) ahead of the GCU blocks, and an output layer projects the result back to $D$.
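A compact sketch of how these projections wrap the GCU stack; the use of linear (pointwise) projections, the identity stub for the inner blocks, and the module-level residual connection are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class RecurrentModule(nn.Module):
    """Bottleneck -> inner GCU stack -> output projection (residual is assumed)."""

    def __init__(self, dim=512, bottleneck=256, inner=None):
        super().__init__()
        self.bottleneck = nn.Linear(dim, bottleneck)                  # project D -> 256
        self.inner = inner if inner is not None else nn.Identity()    # GCU / dilated-FSMN stack
        self.output = nn.Linear(bottleneck, dim)                      # project 256 -> D

    def forward(self, x):                          # x: (batch, seq, dim)
        return x + self.output(self.inner(self.bottleneck(x)))        # assumed residual connection

# Usage: `inner` can be the GCU of Section 2.1 wrapping the dilated FSMN of Section 2.2.
module = RecurrentModule()
print(module(torch.randn(1, 999, 512)).shape)  # torch.Size([1, 999, 512])
```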
3. Objective Function and Training Paradigm
The loss function is based solely on scale-invariant SNR (SI-SNR).
Let $s_{\mathrm{target}} = \dfrac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}\, s$ and $e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}$, then

$$\mathrm{SI\text{-}SNR}(\hat{s}, s) = 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^2}{\lVert e_{\mathrm{noise}} \rVert^2},$$

and training minimizes the negative SI-SNR of each estimated source $\hat{s}$ against its reference $s$.
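A direct PyTorch implementation of this objective as a loss; the zero-mean normalization and the omission of any permutation handling across sources are choices of this sketch.

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """SI-SNR in dB for signals of shape (..., samples)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)   # zero-mean normalization
    target = target - target.mean(dim=-1, keepdim=True)
    # s_target: scaled projection of the estimate onto the reference signal.
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

# Training minimizes the negative SI-SNR, averaged over sources and the batch.
est, ref = torch.randn(2, 8000), torch.randn(2, 8000)
loss = -si_snr(est, ref).mean()
print(loss.item())
```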
No auxiliary losses or regularization terms are used beyond clipping of the gradient norm.
4. Empirical Evaluation and Results
Datasets:
- WSJ0-2mix/3mix: Clean mixtures; 30 h train, 10 h dev, 5 h test.
- Libri2Mix: Clean mixtures from LibriSpeech; 106 h train, 5.5 h dev/test.
- WHAM!: WSJ0-2mix mixed with real ambient noise recordings.
- WHAMR!: Reverberant version of WSJ0-2mix.
Training Procedure:
- Optimizer: Adam; the initial learning rate is held constant for the first 85 epochs and then halved, with training running for up to 200 epochs at batch size 1 (see the scheduler sketch after this list).
- Dynamic mixing applied for all but Libri2Mix.
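A small torch.optim sketch of the described schedule; INITIAL_LR is a placeholder assumption (the value is not specified above), and the schedule is interpreted as a single halving at epoch 85.

```python
import torch

INITIAL_LR = 1e-4  # placeholder assumption; the actual initial learning rate is not given here

model = torch.nn.Linear(512, 512)  # stand-in for the separation model
optimizer = torch.optim.Adam(model.parameters(), lr=INITIAL_LR)
# Constant learning rate for the first 85 epochs, then halved.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[85], gamma=0.5)

for epoch in range(200):  # up to 200 epochs, batch size 1
    optimizer.zero_grad()
    dummy_loss = model(torch.randn(1, 512)).pow(2).mean()  # stands in for the negative SI-SNR
    dummy_loss.backward()
    optimizer.step()
    scheduler.step()
```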
Model and Hyperparameters:
- Encoder kernel: 16, stride: 8.
- MossFormer layers $R$: large = 24, small = 25.
- Embedding dimension $D$: large = 512, small = 384.
- FSMN bottleneck width: 256.
- FSMN blocks per recurrent module: 2, each with its own dilation rate.
- Parameters: large ≈ 55.7 M (as in the results table below); the small variant is correspondingly lighter. These settings are collected in the configuration sketch after this list.
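For reference, the listed settings can be collected into a small configuration object; the dilation rates and learning rate are omitted because they are not specified here, and the field names are this sketch's own.

```python
from dataclasses import dataclass

@dataclass
class MossFormer2Config:
    """Hyperparameters listed above (dilations and learning rate intentionally omitted)."""
    encoder_kernel: int = 16
    encoder_stride: int = 8
    num_layers: int = 24      # R: 24 (large) / 25 (small)
    embed_dim: int = 512      # D: 512 (large) / 384 (small)
    fsmn_bottleneck: int = 256
    fsmn_blocks: int = 2

LARGE = MossFormer2Config()
SMALL = MossFormer2Config(num_layers=25, embed_dim=384)
```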
SI-SNRi (dB) Results:
| Model | WSJ0-2mix | WSJ0-3mix | Libri2Mix | WHAM! / WHAMR! | Params (M) | RTF (V100) |
|---|---|---|---|---|---|---|
| Conv-TasNet | 15.3 | --- | --- | --- | 5.1 | --- |
| DPRNN | 18.8 | --- | --- | --- | 2.6 | --- |
| SepFormer | 22.3 | 19.5 | 19.2 | 16.4 / 14.0 | 25.7 | --- |
| QDPN | 23.6 | --- | --- | --- / 14.4 | 200 | --- |
| SFSRNet | 24.0 | --- | 20.4 | --- | 59.0 | --- |
| MossFormer | 22.8 | 21.2 | 19.7 | 17.3 / 16.3 | 42.0 | 0.038 |
| MossFormer2 | 24.1 | 22.2 | 21.7 | 18.1 / 17.0 | 55.7 | 0.053 |
Ablation studies indicate that dilations, dense connections, and GCU are each critical to peak performance.
5. Analysis and Technical Insights
Self-attention mechanisms, as used in MossFormer, are effective at modeling global, long-range context but insufficient for representing local and recurrent speech features, such as phonemic and prosodic patterns. The RNN-free FSMN module, employing dilated, grouped convolutions with memory taps and dense connections, addresses this limitation by explicitly capturing local recurrence in a fully parallelizable fashion at $O(S)$ cost per layer.
- GCU enables dynamic modulation of memory features injected per time step and integrates with residual skip connections.
- Dense connections expand receptive fields and improve gradient propagation without excessive parameterization.
- SI-SNR is the sole training objective, and gradient clipping prevents divergence without additional regularization.
The hybrid architecture yields systematic improvements: MossFormer2 achieves +1.3 dB SI-SNRi over MossFormer and surpasses prior models including SepFormer, DPRNN, and QDPN on speech separation tasks. The increase in parameter count (+13.7 M) and real-time factor (+0.015) is moderate relative to the performance gain.
6. Practical Considerations and Recommendations
For deployment and further model development, selection of the recurrent bottleneck size and the number of FSMN layers allows for cost-quality tradeoffs. Dynamic mixing is beneficial for limited datasets. SI-SNR should be used as the objective, with gradient norms clipped to improve training stability.
This suggests that future improvements could be realized by refining dilation schedules, deepening dense connectivity, or embedding FSMN-based recurrent modules into alternative architectures such as Conformer variants. Employing only linear projections in place of Conv-U, or omitting the dense connections or the GCU, leads to measurable performance degradation, underscoring the necessity of these components in the MossFormer2 framework.