MossFormer2-Separated Speech
- The paper introduces a novel hybrid architecture combining self-attention and FSMN modules to model both global and local temporal dependencies in monaural speech separation.
- Key results show an SI-SNRi improvement of +1.3 dB over MossFormer, with state-of-the-art performance on benchmarks including WSJ0-2mix/3mix, Libri2Mix, WHAM!, and WHAMR!.
- Methodologically, dense connections and gated convolutional units enable robust gradient flow and efficient computation while balancing model complexity with performance.
MossFormer2-Separated Speech refers to monaural speech separation using the MossFormer2 architecture, a hybrid neural network model designed for time-domain source separation. MossFormer2 enhances the MossFormer framework by integrating a self-attention-based module with a feedforward sequential memory network (FSMN)–based recurrent module. This combination models both long-range (coarse) and fine-scale (recurrent) temporal dependencies in speech, yielding state-of-the-art results on several benchmark datasets.
1. Architectural Overview
MossFormer2 processes a monaural speech mixture through the following pipeline:
- Encoder: A 1-D convolutional layer (kernel size 16, stride 8) followed by ReLU produces an embedding $X \in \mathbb{R}^{D \times S}$ with $D$ channels over $S$ frames.
- Separator: An $R$-layer stack of hybrid blocks, each comprising:
  - MossFormer (self-attention module): Implements joint local-global self-attention. Local heads restrict attention to context windows; global heads utilize linearized attention, achieving $O(S)$ complexity in the sequence length.
  - Recurrent Module: An RNN-free block based on dilated FSMN, employing gated convolutional units (GCUs) and dense inter-layer connections for modeling fine-scale patterns.
- Mask Estimator: A convolutional layer produces $C$ masks $M_1, \dots, M_C$ (one per source).
- Masked Embedding Application: Each mask is applied element-wise to $X$ to yield source-specific embeddings $X_c = M_c \odot X$.
- Decoder: A transposed 1-D convolutional layer (mirroring the encoder) reconstructs the time-domain separated signals $\hat{s}_1, \dots, \hat{s}_C$; a minimal end-to-end sketch of this pipeline follows the list.
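Below is a minimal PyTorch sketch of this encoder-separator-mask-decoder pipeline. It assumes two sources, an identity placeholder for the hybrid-block separator, pointwise-convolution mask estimation, and a ReLU mask nonlinearity; these choices and all module names are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class MossFormer2Pipeline(nn.Module):
    """Schematic encoder -> separator -> mask -> decoder pipeline (illustrative only)."""

    def __init__(self, emb_dim=512, kernel_size=16, stride=8, num_sources=2):
        super().__init__()
        self.num_sources = num_sources
        # Encoder: 1-D conv (kernel 16, stride 8) + ReLU produces the embedding X.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, emb_dim, kernel_size, stride=stride, bias=False),
            nn.ReLU(),
        )
        # Separator: R-layer stack of hybrid (attention + recurrent) blocks; stubbed here.
        self.separator = nn.Identity()
        # Mask estimator: pointwise conv producing one mask per source (activation assumed).
        self.mask_estimator = nn.Sequential(
            nn.Conv1d(emb_dim, emb_dim * num_sources, kernel_size=1),
            nn.ReLU(),
        )
        # Decoder: transposed 1-D conv mirroring the encoder.
        self.decoder = nn.ConvTranspose1d(emb_dim, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mixture):                        # mixture: (batch, samples)
        x = self.encoder(mixture.unsqueeze(1))         # (batch, emb_dim, frames)
        feats = self.separator(x)                      # hybrid-block stack
        masks = self.mask_estimator(feats)             # (batch, emb_dim * C, frames)
        masks = masks.view(masks.size(0), self.num_sources, -1, masks.size(-1))
        sources = [self.decoder(masks[:, c] * x) for c in range(self.num_sources)]
        return torch.stack(sources, dim=1).squeeze(2)  # (batch, C, samples)

# Usage: separate a 1-second, 8 kHz mixture into two estimated sources.
model = MossFormer2Pipeline()
print(model(torch.randn(1, 8000)).shape)  # torch.Size([1, 2, 8000])
```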
Hybrid Block Structure:
- The attention block models long-range dependencies (a simplified sketch of the joint local-global attention follows this list).
- The FSMN block, organized as bottleneck → GCU → output layers, models short-range, local recurrences. The recurrent module uses only convolutions and linear projections, enabling fully parallel sequence processing.
- The GCU comprises two Conv-U branches yielding $U$ and $V$; $U$ is passed through the dilated FSMN to produce $U'$. The output is the gated product $U' \odot V$, integrated with residual skip connections.
- Dense connections within FSMN connect all intermediate activations within each block, facilitating gradient flow and broadening receptive fields.
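The joint local-global attention referenced above can be illustrated with a single-head sketch: exact softmax attention inside fixed, non-overlapping chunks (local) plus a linearized branch over the full sequence (global). The chunk size, the ReLU feature map, and the omission of MossFormer's gating and convolutional machinery are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLocalGlobalAttention(nn.Module):
    """Single-head sketch: chunked full attention (local) + linearized attention (global)."""

    def __init__(self, dim, chunk_size=256):
        super().__init__()
        self.chunk_size = chunk_size
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                              # x: (batch, seq, dim)
        b, s, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        # Local branch: exact attention restricted to non-overlapping chunks.
        pad = (-s) % self.chunk_size                   # zero-pad the tail to a full chunk
        qp, kp, vp = (F.pad(t, (0, 0, 0, pad)) for t in (q, k, v))
        n = qp.size(1) // self.chunk_size
        qc, kc, vc = (t.view(b, n, self.chunk_size, d) for t in (qp, kp, vp))
        local = F.softmax(qc @ kc.transpose(-1, -2) / d ** 0.5, dim=-1) @ vc
        local = local.view(b, n * self.chunk_size, d)[:, :s]

        # Global branch: linearized attention, O(S) in the sequence length S.
        qg, kg = F.relu(q), F.relu(k)                  # simple positive feature map
        kv = torch.einsum("bsd,bse->bde", kg, v)       # (dim x dim) summary of the sequence
        z = 1.0 / (torch.einsum("bsd,bd->bs", qg, kg.sum(dim=1)) + 1e-6)
        global_ = torch.einsum("bsd,bde,bs->bse", qg, kv, z)

        return self.out(local + global_)

# Usage: attend over a 999-frame, 512-dim embedding sequence.
attn = JointLocalGlobalAttention(dim=512)
print(attn(torch.randn(1, 999, 512)).shape)  # torch.Size([1, 999, 512])
```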
2. Mathematical Formulation
2.1 Gated Convolutional Unit (GCU)
Given the bottleneck output $X$ (a sequence of $D'$-dimensional embeddings), the GCU computes

$$U = \mathrm{ConvU}(X), \qquad V = \mathrm{ConvU}(X), \qquad U' = \mathrm{FSMN}(U), \qquad Y = X + U' \odot V,$$

where $\odot$ denotes element-wise multiplication.
Conv-U is defined by a linear projection followed by a nonlinear activation $\sigma$ and a depthwise 1-D convolution:

$$\mathrm{ConvU}(X) = \mathrm{DWConv1D}\big(\sigma(\mathrm{Linear}(X))\big).$$
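A minimal sketch of the GCU under these definitions: the internal ordering of Conv-U (linear projection, SiLU activation, depthwise convolution) and its kernel size are assumptions of this sketch, and the dilated FSMN is stubbed (a sketch of it follows in Section 2.2).

```python
import torch
import torch.nn as nn

class ConvU(nn.Module):
    """Conv-U branch sketch: linear projection, activation, depthwise 1-D convolution."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.act = nn.SiLU()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        x = self.act(self.proj(x))
        return self.dwconv(x.transpose(1, 2)).transpose(1, 2)

class GCU(nn.Module):
    """Gated convolutional unit: U' = FSMN(ConvU(x)), V = ConvU(x), Y = x + U' * V."""

    def __init__(self, dim, fsmn=None):
        super().__init__()
        self.branch_u = ConvU(dim)
        self.branch_v = ConvU(dim)
        self.fsmn = fsmn if fsmn is not None else nn.Identity()  # dilated FSMN (stubbed)

    def forward(self, x):
        u = self.fsmn(self.branch_u(x))          # memory-enriched branch U'
        v = self.branch_v(x)                     # gating branch V
        return x + u * v                         # gated product with residual connection

# Usage at the bottleneck width D' = 256.
gcu = GCU(dim=256)
print(gcu(torch.randn(1, 999, 256)).shape)  # torch.Size([1, 999, 256])
```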
2.2 Dilated FSMN Memory Taps
After a feedforward layer producing hidden activations $h_t = \mathrm{FFN}(u_t)$, the memory output at time $t$ is

$$m_t = h_t + \sum_{i=1}^{N} a_i \odot h_{t - d \cdot i},$$

where $d$ is the dilation factor, $N$ is the number of memory taps, and the taps are implemented as a 2-D convolution over grouped channels.
Dense connections link the memory layers: the $\ell$-th layer operates on the concatenation of all earlier outputs,

$$h^{(\ell)} = \mathcal{F}^{(\ell)}\big([\, h^{(0)}; h^{(1)}; \dots; h^{(\ell-1)} \,]\big),$$

where each $\mathcal{F}^{(\ell)}$ is a padding → 2-D conv → InstanceNorm → PReLU stack and $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation.
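A sketch of the dilated, densely connected memory under this formulation: the memory taps are realized here with a dilated depthwise 1-D convolution rather than a grouped 2-D convolution, and the number of layers, kernel size, and power-of-two dilation schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseDilatedFSMN(nn.Module):
    """Dilated FSMN memory sketch with dense inter-layer connections."""

    def __init__(self, dim, num_layers=4, kernel_size=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            dilation = 2 ** i                     # illustrative power-of-two schedule
            in_dim = dim * (i + 1)                # dense: concat of input and all earlier outputs
            self.layers.append(nn.Sequential(
                nn.Conv1d(in_dim, dim, kernel_size=1),                         # mix concatenated features
                nn.Conv1d(dim, dim, kernel_size, dilation=dilation,
                          padding=dilation * (kernel_size // 2), groups=dim),  # dilated memory taps
                nn.InstanceNorm1d(dim, affine=True),
                nn.PReLU(),
            ))

    def forward(self, h):                         # h: (batch, seq, dim), the FFN output h_t
        h = h.transpose(1, 2)                     # -> (batch, dim, seq) for Conv1d
        outputs = [h]
        for layer in self.layers:
            outputs.append(layer(torch.cat(outputs, dim=1)))
        return (h + outputs[-1]).transpose(1, 2)  # memory output m_t = h_t + taps

# Usage: drop-in for the `fsmn` argument of the GCU sketch in Section 2.1.
fsmn = DenseDilatedFSMN(dim=256)
print(fsmn(torch.randn(1, 999, 256)).shape)  # torch.Size([1, 999, 256])
```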
2.3 Bottleneck and Output Layers
A bottleneck layer projects the embedding from the model dimension $D$ down to the bottleneck width (256) ahead of the GCU blocks, and an output layer projects the result back to $D$.
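A compact sketch of how these projections wrap the GCU stack; the use of linear (pointwise) projections, the identity stub for the inner blocks, and the module-level residual connection are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class RecurrentModule(nn.Module):
    """Bottleneck -> inner GCU stack -> output projection (residual is assumed)."""

    def __init__(self, dim=512, bottleneck=256, inner=None):
        super().__init__()
        self.bottleneck = nn.Linear(dim, bottleneck)                  # project D -> 256
        self.inner = inner if inner is not None else nn.Identity()    # GCU / dilated-FSMN stack
        self.output = nn.Linear(bottleneck, dim)                      # project 256 -> D

    def forward(self, x):                          # x: (batch, seq, dim)
        return x + self.output(self.inner(self.bottleneck(x)))        # assumed residual connection

# Usage: `inner` can be the GCU of Section 2.1 wrapping the dilated FSMN of Section 2.2.
module = RecurrentModule()
print(module(torch.randn(1, 999, 512)).shape)  # torch.Size([1, 999, 512])
```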
3. Objective Function and Training Paradigm
The loss function is based solely on scale-invariant SNR (SI-SNR).
Let $s_{\mathrm{target}} = \dfrac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}\, s$ and $e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}$, then

$$\mathrm{SI\text{-}SNR}(\hat{s}, s) = 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^2}{\lVert e_{\mathrm{noise}} \rVert^2},$$

and training minimizes the negative SI-SNR of each estimated source $\hat{s}$ against its reference $s$.
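A direct PyTorch implementation of this objective as a loss; the zero-mean normalization and the omission of any permutation handling across sources are choices of this sketch.

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """SI-SNR in dB for signals of shape (..., samples)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)   # zero-mean normalization
    target = target - target.mean(dim=-1, keepdim=True)
    # s_target: scaled projection of the estimate onto the reference signal.
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

# Training minimizes the negative SI-SNR, averaged over sources and the batch.
est, ref = torch.randn(2, 8000), torch.randn(2, 8000)
loss = -si_snr(est, ref).mean()
print(loss.item())
```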
No auxiliary losses or regularization terms are used beyond clipping of the gradient norm.
4. Empirical Evaluation and Results
Datasets:
- WSJ0-2mix/3mix: Clean mixtures; 30 h train, 10 h dev, 5 h test.
- Libri2Mix: Clean mixtures from LibriSpeech; 106 h train, 5.5 h dev/test.
- WHAM!: WSJ0-2mix mixed with real ambient noise recordings.
- WHAMR!: Reverberant version of WSJ0-2mix.
Training Procedure:
- Optimizer: Adam; the initial learning rate is held constant for the first 85 epochs and then halved, with training running for up to 200 epochs at batch size 1 (see the scheduler sketch after this list).
- Dynamic mixing applied for all but Libri2Mix.
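A small torch.optim sketch of the described schedule; INITIAL_LR is a placeholder assumption (the value is not specified above), and the schedule is interpreted as a single halving at epoch 85.

```python
import torch

INITIAL_LR = 1e-4  # placeholder assumption; the actual initial learning rate is not given here

model = torch.nn.Linear(512, 512)  # stand-in for the separation model
optimizer = torch.optim.Adam(model.parameters(), lr=INITIAL_LR)
# Constant learning rate for the first 85 epochs, then halved.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[85], gamma=0.5)

for epoch in range(200):  # up to 200 epochs, batch size 1
    optimizer.zero_grad()
    dummy_loss = model(torch.randn(1, 512)).pow(2).mean()  # stands in for the negative SI-SNR
    dummy_loss.backward()
    optimizer.step()
    scheduler.step()
```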
Model and Hyperparameters:
- Encoder kernel: 16, stride: 8.
- MossFormer layers $R$: large = 24, small = 25.
- Embedding dimension $D$: large = 512, small = 384.
- FSMN bottleneck width: 256.
- FSMN blocks per recurrent module: 2, each with its own dilation rate.
- Parameters: large ≈ 55.7 M (as in the results table below); the small variant is correspondingly lighter. These settings are collected in the configuration sketch after this list.
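For reference, the listed settings can be collected into a small configuration object; the dilation rates and learning rate are omitted because they are not specified here, and the field names are this sketch's own.

```python
from dataclasses import dataclass

@dataclass
class MossFormer2Config:
    """Hyperparameters listed above (dilations and learning rate intentionally omitted)."""
    encoder_kernel: int = 16
    encoder_stride: int = 8
    num_layers: int = 24      # R: 24 (large) / 25 (small)
    embed_dim: int = 512      # D: 512 (large) / 384 (small)
    fsmn_bottleneck: int = 256
    fsmn_blocks: int = 2

LARGE = MossFormer2Config()
SMALL = MossFormer2Config(num_layers=25, embed_dim=384)
```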
SI-SNRi (dB) Results:
| Model | WSJ0-2mix | WSJ0-3mix | Libri2Mix | WHAM! / WHAMR! | Params (M) | RTF (V100) |
|---|---|---|---|---|---|---|
| Conv-TasNet | 15.3 | --- | --- | --- | 5.1 | --- |
| DPRNN | 18.8 | --- | --- | --- | 2.6 | --- |
| SepFormer | 22.3 | 19.5 | 19.2 | 16.4 / 14.0 | 25.7 | --- |
| QDPN | 23.6 | --- | --- | --- / 14.4 | 200 | --- |
| SFSRNet | 24.0 | --- | 20.4 | --- | 59.0 | --- |
| MossFormer | 22.8 | 21.2 | 19.7 | 17.3 / 16.3 | 42.0 | 0.038 |
| MossFormer2 | 24.1 | 22.2 | 21.7 | 18.1 / 17.0 | 55.7 | 0.053 |
Ablation studies indicate that dilations, dense connections, and GCU are each critical to peak performance.
5. Analysis and Technical Insights
Self-attention mechanisms, as used in MossFormer, are effective at modeling global, long-range context but insufficient for representing local and recurrent speech features, such as phonemic and prosodic patterns. The RNN-free FSMN module, employing dilated, grouped convolutions with memory taps and dense connections, addresses this limitation by explicitly capturing local recurrence in a fully parallelizable fashion at $O(S)$ cost per layer.
- GCU enables dynamic modulation of memory features injected per time step and integrates with residual skip connections.
- Dense connections expand receptive fields and improve gradient propagation without excessive parameterization.
- SI-SNR is the sole training objective, and gradient clipping prevents divergence without additional regularization.
The hybrid architecture yields systematic improvements: MossFormer2 achieves +1.3 dB SI-SNRi over MossFormer and surpasses prior models including SepFormer, DPRNN, and QDPN on speech separation tasks. The increase in parameter count (+13.7 M) and real-time factor (+0.015) is moderate relative to the performance gain.
6. Practical Considerations and Recommendations
For deployment and further model development, selection of the recurrent bottleneck size and the number of FSMN layers allows for cost-quality tradeoffs. Dynamic mixing is beneficial for limited datasets. SI-SNR should be used as the objective, with gradient norms clipped to improve training stability.
This suggests that future improvements could be realized by refining dilation schedules, deepening dense connectivity, or embedding FSMN-based recurrent modules into alternative architectures such as Conformer variants. Employing only linear projections in place of Conv-U, or omitting the dense connections or the GCU, leads to measurable performance degradation, underscoring the necessity of these components in the MossFormer2 framework.