MossFormer2-Separated Speech

Updated 16 November 2025
  • The paper introduces a novel hybrid architecture combining self-attention and FSMN modules to model both global and local temporal dependencies in monaural speech separation.
  • Key results show an SI-SNRi improvement of +1.3 dB over MossFormer, achieving state-of-the-art performance on benchmarks such as WSJ0-2mix/3mix and Libri2Mix.
  • Methodologically, dense connections and gated convolutional units enable robust gradient flow and efficient computation while balancing model complexity with performance.

MossFormer2-Separated Speech refers to monaural speech separation using the MossFormer2 architecture, a hybrid neural network model designed for time-domain source separation. MossFormer2 enhances the MossFormer framework by integrating a self-attention-based module with a feedforward sequential memory network (FSMN)–based recurrent module. This combination targets modeling both long-range (coarse) and fine-scale (recurrent) temporal dependencies in speech, yielding state-of-the-art results on several benchmark datasets.

1. Architectural Overview

MossFormer2 processes a monaural speech mixture $x \in \mathbb{R}^{1 \times T}$ through the following pipeline (a minimal sketch follows the list):

  1. Encoder: A 1-D convolutional layer (kernel size 16, stride 8) followed by ReLU produces an embedding $X \in \mathbb{R}^{N \times S}$.
  2. Separator: An R-layered stack of hybrid blocks, each comprising:
    • MossFormer (self-attention module): Implements joint local-global self-attention. Local heads restrict attention to context windows; global heads utilize linearized attention, achieving $O(S \cdot N)$ complexity.
    • Recurrent Module: An RNN-free block based on dilated FSMN, employing gated convolutional units (GCUs) and dense inter-layer connections for modeling fine-scale patterns.
  3. Mask Estimator: A $1 \times 1$ convolution produces $C$ masks (one per source).
  4. Masked Embedding Application: The masks are applied to $X$ to yield source-specific embeddings.
  5. Decoder: A transposed 1-D convolutional layer (mirror of the encoder) reconstructs the time-domain separated signals $\widehat{s}_i$.
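
The sketch below, assuming PyTorch, illustrates this encode–separate–mask–decode flow. The separator is reduced to a placeholder stack of pointwise convolutions standing in for the hybrid blocks, the ReLU mask nonlinearity is an assumption, and all dimensions are illustrative rather than taken from a reference implementation.

```python
import torch
import torch.nn as nn


class SeparationPipeline(nn.Module):
    """Minimal sketch of the MossFormer2-style encode/separate/mask/decode flow.

    The separator here is a placeholder; in the actual model it is the
    R-layered stack of hybrid attention + FSMN blocks.
    """

    def __init__(self, n_src=2, embed_dim=512, kernel_size=16, stride=8):
        super().__init__()
        self.n_src = n_src
        # 1. Encoder: 1-D conv (kernel 16, stride 8) + ReLU -> X in R^{N x S}
        self.encoder = nn.Sequential(
            nn.Conv1d(1, embed_dim, kernel_size, stride=stride, bias=False),
            nn.ReLU(),
        )
        # 2. Separator: stand-in for the hybrid (attention + recurrent) blocks
        self.separator = nn.Sequential(
            nn.Conv1d(embed_dim, embed_dim, 1), nn.PReLU(),
            nn.Conv1d(embed_dim, embed_dim, 1), nn.PReLU(),
        )
        # 3. Mask estimator: 1x1 conv producing C masks (ReLU is an assumption)
        self.mask_estimator = nn.Sequential(
            nn.Conv1d(embed_dim, n_src * embed_dim, 1), nn.ReLU(),
        )
        # 5. Decoder: transposed 1-D conv mirroring the encoder
        self.decoder = nn.ConvTranspose1d(
            embed_dim, 1, kernel_size, stride=stride, bias=False
        )

    def forward(self, mixture):              # mixture: (B, 1, T)
        x = self.encoder(mixture)            # (B, N, S)
        feats = self.separator(x)            # (B, N, S)
        masks = self.mask_estimator(feats)   # (B, C*N, S)
        masks = masks.view(-1, self.n_src, x.size(1), x.size(2))
        # 4. Apply each mask to the shared embedding X
        masked = masks * x.unsqueeze(1)      # (B, C, N, S)
        # Decode each source back to the time domain
        return torch.stack(
            [self.decoder(masked[:, c]) for c in range(self.n_src)], dim=1
        ).squeeze(2)                         # (B, C, T')


mix = torch.randn(2, 1, 16000)               # batch of two 1 s mixtures at 16 kHz
print(SeparationPipeline()(mix).shape)       # -> torch.Size([2, 2, 16000])
```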

Hybrid Block Structure:

  • The attention block models long-range dependencies.
  • The FSMN block, organized as bottleneck → GCU → output layers, models short-range, local recurrences. The recurrent module applies convolutions and linear projections for fully parallel sequence processing.
  • The GCU comprises two Conv-U branches yielding $U$ and $V$, with $V$ passed through the dilated FSMN to produce $Y$. The output is $O = X + U \odot Y$, with residual skip connections integrated.
  • Dense connections within FSMN connect all intermediate activations within each block, facilitating gradient flow and broadening receptive fields.

2. Mathematical Formulation

2.1 Gated Convolutional Unit (GCU)

Given $X \in \mathbb{R}^{N' \times S}$:

  • $U = \text{Conv}_U(X)$
  • $V = \text{Conv}_U(X)$ (a second Conv-U branch with separate parameters)
  • $Y = \text{Dilated\_FSMN}(V)$
  • $O = X + U \odot Y$

Conv-U is defined by:

$$\begin{aligned} Z_1 &= \text{LayerNorm}(X) \\ Z_2 &= W_1 Z_1 + b_1 \\ Z_3 &= \text{SiLU}(Z_2) \\ Z_4 &= \text{DConv}_{1d}(Z_3) \\ \text{Conv}_U(X) &= W_2 Z_4 + b_2 + X \end{aligned}$$
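
A sketch of the Conv-U branch and the GCU gating in PyTorch, transcribed from the equations above. The depthwise-convolution kernel size and the (batch, time, channels) tensor layout are assumptions, and the dilated FSMN is passed in as a callable so the gating logic stands on its own.

```python
import torch
import torch.nn as nn


class ConvU(nn.Module):
    """Conv-U branch: LayerNorm -> Linear -> SiLU -> depthwise Conv1d -> Linear, with a residual."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, dim)      # W1, b1
        self.act = nn.SiLU()
        self.dconv = nn.Conv1d(                 # DConv_1d: depthwise conv over time
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        self.proj_out = nn.Linear(dim, dim)     # W2, b2

    def forward(self, x):                       # x: (B, S, N')
        z = self.act(self.proj_in(self.norm(x)))           # Z3 = SiLU(W1 Z1 + b1)
        z = self.dconv(z.transpose(1, 2)).transpose(1, 2)  # Z4 = DConv_1d(Z3)
        return self.proj_out(z) + x                        # Conv_U(X) = W2 Z4 + b2 + X


class GCU(nn.Module):
    """Gated convolutional unit: O = X + U ⊙ FSMN(V), with U, V from two Conv-U branches."""

    def __init__(self, dim, fsmn):
        super().__init__()
        self.conv_u = ConvU(dim)
        self.conv_v = ConvU(dim)
        self.fsmn = fsmn                        # dilated FSMN applied to the V branch

    def forward(self, x):                       # x: (B, S, N')
        u = self.conv_u(x)
        y = self.fsmn(self.conv_v(x))
        return x + u * y                        # element-wise gating plus residual skip


# Usage with an identity stand-in for the dilated FSMN
gcu = GCU(dim=256, fsmn=nn.Identity())
print(gcu(torch.randn(2, 100, 256)).shape)      # -> torch.Size([2, 100, 256])
```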

2.2 Dilated FSMN Memory Taps

After an FFN: $Z^0[t] = W_f V[t] + b_f$.

Memory output at time $t$: $M[t] = \sum_{\ell=1}^{L} h_\ell * Z^0[t - d_\ell]$, where $d_\ell = 2^{\ell-1}$ is the dilation and $*$ denotes 2-D convolution over grouped channels.

Dense connections are used: $X_\ell = H_\ell(\text{concat}(X_0, \ldots, X_{\ell-1}))$, where $H_\ell$ = padding → 2-D convolution with dilation $d_\ell$ → InstanceNorm → PReLU, followed by concatenation.
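
A sketch of the dilated memory stage with dense connections, assuming PyTorch. The grouped 2-D convolution is simplified here to a grouped, dilated 1-D convolution over time, the group count is an assumption, and adding the accumulated memory back onto $Z^0$ is an assumed combination rule.

```python
import torch
import torch.nn as nn


class DilatedFSMN(nn.Module):
    """Sketch of the dilated FSMN memory stage with dense connections.

    Layer l applies padding -> dilated grouped convolution (dilation 2^(l-1))
    -> InstanceNorm -> PReLU to the concatenation of all earlier activations.
    """

    def __init__(self, dim=256, num_layers=2, kernel_size=3, groups=8):
        super().__init__()
        self.ffn = nn.Linear(dim, dim)                    # Z^0[t] = W_f V[t] + b_f
        self.layers = nn.ModuleList()
        for l in range(num_layers):
            dilation = 2 ** l                             # d_l = 2^(l-1) for l = 1..L
            pad = dilation * (kernel_size - 1) // 2
            self.layers.append(nn.Sequential(
                nn.Conv1d(dim * (l + 1), dim, kernel_size,
                          padding=pad, dilation=dilation, groups=groups),
                nn.InstanceNorm1d(dim),
                nn.PReLU(),
            ))

    def forward(self, v):                                 # v: (B, S, N')
        z0 = self.ffn(v).transpose(1, 2)                  # (B, N', S)
        feats = [z0]                                      # dense connections: X_0 = Z^0
        memory = torch.zeros_like(z0)
        for layer in self.layers:
            x_l = layer(torch.cat(feats, dim=1))          # H_l(concat(X_0, ..., X_{l-1}))
            feats.append(x_l)
            memory = memory + x_l                         # accumulate memory taps
        return (z0 + memory).transpose(1, 2)              # (B, S, N')


fsmn = DilatedFSMN(dim=256, num_layers=2)
print(fsmn(torch.randn(2, 100, 256)).shape)               # -> torch.Size([2, 100, 256])
```

This module can be dropped into the GCU sketch above in place of the `nn.Identity()` stand-in.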

2.3 Bottleneck and Output Layers

$$\begin{aligned} X_{bn} &= \text{PReLU}(\text{Conv}_{1\times1}(X_{in})) \\ O &= \text{GCU}(X_{bn}) \\ Y_{out} &= \text{Conv}_{1\times1}(\text{LayerNorm}(O)) \end{aligned}$$
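
Combining the pieces, a sketch of the full recurrent module (bottleneck → GCU → output layer), reusing the GCU and DilatedFSMN sketches above; the dimensions follow the values reported in Section 4.

```python
import torch
import torch.nn as nn

# Assumes the GCU and DilatedFSMN sketches defined above are in scope.


class RecurrentModule(nn.Module):
    """Sketch of the FSMN block: bottleneck -> GCU -> output layer."""

    def __init__(self, embed_dim=512, bottleneck_dim=256):
        super().__init__()
        # X_bn = PReLU(Conv_1x1(X_in))
        self.bottleneck = nn.Sequential(nn.Conv1d(embed_dim, bottleneck_dim, 1), nn.PReLU())
        self.gcu = GCU(bottleneck_dim, fsmn=DilatedFSMN(bottleneck_dim))
        # Y_out = Conv_1x1(LayerNorm(O))
        self.norm = nn.LayerNorm(bottleneck_dim)
        self.out = nn.Conv1d(bottleneck_dim, embed_dim, 1)

    def forward(self, x):                          # x: (B, N, S)
        x_bn = self.bottleneck(x).transpose(1, 2)  # (B, S, N')
        o = self.gcu(x_bn)                         # O via GCU
        return self.out(self.norm(o).transpose(1, 2))  # (B, N, S)


print(RecurrentModule()(torch.randn(2, 512, 100)).shape)   # -> torch.Size([2, 512, 100])
```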

3. Objective Function and Training Paradigm

The loss function is based solely on scale-invariant SNR (SI-SNR):

Let $\alpha = \langle \widehat{s}, s \rangle / \|s\|^2$ and $\widehat{s}_{\text{target}} = \alpha s$. Then

$$\text{SI-SNR}(\widehat{s}, s) = 10 \log_{10} \left( \frac{ \|\widehat{s}_{\text{target}}\|^2 }{ \|\widehat{s} - \widehat{s}_{\text{target}}\|^2 } \right)$$

$$\mathcal{L}_{\text{SI-SNR}} = -\text{SI-SNR}$$

No auxiliary losses or regularization terms are used beyond gradient clipping ($\|g\|_2 \leq 5$).
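
A direct implementation of this objective as a sketch; zero-meaning the signals and the permutation-invariant assignment of estimates to sources, both standard in separation training, are assumptions not spelled out above.

```python
import torch


def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR, following the formulation above.

    est, ref: (..., T) time-domain estimated and reference signals.
    """
    # Zero-mean both signals (a standard preprocessing step; assumed here)
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # alpha = <est, ref> / ||ref||^2 ; s_target = alpha * ref
    alpha = (est * ref).sum(dim=-1, keepdim=True) / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    s_target = alpha * ref
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -si_snr.mean()


# Example: batch of 4 single-source estimates, 1 s at 8 kHz
print(si_snr_loss(torch.randn(4, 8000), torch.randn(4, 8000)))
```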

4. Empirical Evaluation and Results

Datasets:

  • WSJ0-2mix/3mix: Clean mixtures; 30 h train, 10 h dev, 5 h test.
  • Libri2Mix: Clean mixtures from LibriSpeech; 106 h train, 5.5 h dev/test.
  • WHAM!: WSJ0-2mix with added realistic noise (DEMAND).
  • WHAMR!: Reverberant version of WSJ0-2mix.

Training Procedure:

  • Optimizer: Adam, initial learning rate $1.5 \times 10^{-4}$ (constant for 85 epochs, then halved, up to 200 epochs); batch size 1.
  • Dynamic mixing is applied to all datasets except Libri2Mix.
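
A sketch of this optimization recipe with a placeholder model, placeholder data, and a stand-in loss; whether the learning rate is halved once or repeatedly after epoch 85 is not specified above, so a single halving is assumed here.

```python
import torch

# Placeholder model and data; in practice the model is MossFormer2 and the
# batches are mixture/source pairs drawn from the datasets listed above.
model = torch.nn.Conv1d(1, 2, 16, stride=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4)


def lr_for_epoch(epoch, base_lr=1.5e-4):
    """Constant for the first 85 epochs, then halved (a single halving is assumed)."""
    return base_lr if epoch < 85 else base_lr / 2


for epoch in range(200):
    for group in optimizer.param_groups:
        group["lr"] = lr_for_epoch(epoch)

    # One placeholder step per epoch; real training iterates a DataLoader with batch size 1.
    mixture = torch.randn(1, 1, 8000)
    target = torch.randn(1, 2, 999)            # matches the placeholder model's output shape
    est = model(mixture)
    loss = torch.nn.functional.mse_loss(est, target)  # stand-in; the actual objective is negative SI-SNR
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # ||g||_2 <= 5
    optimizer.step()
```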

Model and Hyperparameters:

  • Encoder kernel: 16, stride: 8.
  • MossFormer layers $R$: large = 24, small = 25.
  • Embedding dimension $N$: large = 512, small = 384.
  • FSMN bottleneck dimension $N'$: 256.
  • FSMN blocks $L$: 2; dilations $\{1, 2\}$.
  • Parameters: large $\approx 55.7$ M, small $\approx 37.8$ M.
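
For reference, these settings can be collected into a small configuration object; a minimal sketch with assumed field names:

```python
from dataclasses import dataclass


@dataclass
class MossFormer2Config:
    """Hyperparameters as reported above (field names are assumptions)."""
    encoder_kernel: int = 16
    encoder_stride: int = 8
    num_layers: int = 24          # R: 24 (large) / 25 (small)
    embed_dim: int = 512          # N: 512 (large) / 384 (small)
    bottleneck_dim: int = 256     # N'
    fsmn_blocks: int = 2          # L
    dilations: tuple = (1, 2)


large = MossFormer2Config()
small = MossFormer2Config(num_layers=25, embed_dim=384)
print(large, small, sep="\n")
```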

SI-SNRi (dB) Results:

| Model | WSJ0-2mix | WSJ0-3mix | Libri2Mix | WHAM! / WHAMR! | Params (M) | RTF (V100) |
|---|---|---|---|---|---|---|
| Conv-TasNet | 15.3 | --- | --- | --- | 5.1 | --- |
| DPRNN | 18.8 | --- | --- | --- | 2.6 | --- |
| SepFormer | 22.3 | 19.5 | 19.2 | 16.4 / 14.0 | 25.7 | --- |
| QDPN | 23.6 | --- | --- | --- / 14.4 | 200 | --- |
| SFSRNet | 24.0 | --- | 20.4 | --- | 59.0 | --- |
| MossFormer | 22.8 | 21.2 | 19.7 | 17.3 / 16.3 | 42.0 | 0.038 |
| MossFormer2 | 24.1 | 22.2 | 21.7 | 18.1 / 17.0 | 55.7 | 0.053 |

Ablation studies indicate that dilations, dense connections, and GCU are each critical to peak performance.

5. Analysis and Technical Insights

Self-attention mechanisms, as used in MossFormer, are effective at modeling global, long-range context but insufficient for representing local and recurrent speech features, such as phonemic and prosodic cycles. The RNN-free FSMN module, employing dilated, grouped convolutions with memory taps and dense connections, addresses this limitation by explicitly capturing local recurrence in a fully parallelizable fashion at $O(S)$ cost per layer.

  • GCU enables dynamic modulation of memory features injected per time step and integrates with residual skip connections.
  • Dense connections expand receptive fields and improve gradient propagation without excessive parameterization.
  • SI-SNR is the sole training objective, and gradient clipping prevents divergence without additional regularization.

The hybrid architecture yields systematic improvements: MossFormer2 achieves +1.3 dB SI-SNRi over MossFormer and surpasses prior models including SepFormer, DPRNN, and QDPN on speech separation tasks. The increase in parameter count (+13M) and real-time factor (+0.015) is moderate relative to the performance gain.

6. Practical Considerations and Recommendations

For deployment and further model development, selection of the recurrent bottleneck size $N'$ and the number of FSMN layers $L$ allows for cost-quality tradeoffs. Dynamic mixing is beneficial for limited datasets. SI-SNR should be used as the objective, with gradient norms clipped to improve training stability.

This suggests that future improvements could be realized by elaborating dilation schedules, deepening dense connections, or embedding FSMN-based recurrent modules into alternative architectures such as Conformer variants. Employing only linear projections in place of Conv-U, or omitting dense connections or GCU, leads to measurable performance degradation, emphasizing the necessity of these components in the MossFormer2 framework.
