MSSFNet: Multi-Scale Fusion Neural Architecture
- The paper introduces MSSFNet, which uses an "msmsfblock" combining four parallel asymmetric convolutional streams to capture diverse spatial features for edge detection.
- It employs a two-stage fusion strategy that integrates fine and coarse details, achieving competitive metrics on datasets like BSDS500 without relying on pretraining.
- MSSFNet's design extends to stereo image super-resolution and remote sensing, demonstrating flexibility and robustness across various imaging applications.
MSSFNet refers to a family of neural architectures emphasizing multi-scale and multi-stream/multi-branch fusion, predominantly for edge detection and stereo image super-resolution. The term "MSSFNet" is most notably defined in "Msmsfnet: a multi-stream and multi-scale fusion net for edge detection" (Liu et al., 2024), but also appears (sometimes as "MSSF-Net") in the stereo super-resolution (Gao et al., 2024) and building interpretation (Huo et al., 2025) literature. This article reviews the core formulation, mathematical structure, training regime, and empirical impact of MSSFNet for edge detection, with brief comparative notes on related usages.
1. Network Architecture Overview
MSSFNet, in the edge detection setting (Liu et al., 2024), is a fully convolutional encoder of depth 74, constructed by recursive stacking of a “msmsfblock” that embodies parallel convolutional streams with diverse receptive fields. Five max-pooling layers (3×3, stride 2) are interleaved to yield multi-resolution feature hierarchies.
The main block, msmsfblock, takes an input feature map $X$ and processes it along four parallel convolutional streams:
- Stream 1 (receptive field 1×1): $F_1 = \mathrm{Conv}_{1\times 1}(X)$
- Stream 2 (3×3): $F_2 = \mathrm{Conv}_{3\times 1}(\mathrm{Conv}_{1\times 3}(X))$
- Stream 3 (5×5): $F_3 = \mathrm{Conv}_{5\times 1}(\mathrm{Conv}_{1\times 5}(X))$
- Stream 4 (7×7): $F_4 = \mathrm{Conv}_{7\times 1}(\mathrm{Conv}_{1\times 7}(X))$
Following the multi-stream convolution, the features are fused in a two-stage scheme:
- First fusion: $F_{12} = \mathrm{Conv}_{1\times 1}([F_1, F_2])$, $F_{34} = \mathrm{Conv}_{1\times 1}([F_3, F_4])$
- Final fusion: $F_{\mathrm{out}} = \mathrm{Conv}_{1\times 1}([F_{12}, F_{34}])$

where $[\cdot\,,\cdot]$ denotes channel-wise concatenation.
No batch normalization is used; only the final convolution within each block is followed by a ReLU. Deep supervision is applied via four side outputs after selected blocks, mimicking the HED paradigm.
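The block structure above can be sketched in PyTorch. This is a minimal illustration, assuming concatenation followed by 1×1 convolutions as the fusion operators and a fine/coarse pairwise grouping; the paper's exact channel widths and fusion details may differ, and the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class MSMSFBlock(nn.Module):
    """Sketch of a four-stream msmsfblock-style unit (illustrative, not the paper's exact code)."""

    def __init__(self, in_ch: int, c: int):
        super().__init__()
        # Stream 1: plain 1x1 convolution (finest receptive field).
        self.s1 = nn.Conv2d(in_ch, c, kernel_size=1)
        # Streams 2-4: k x k receptive fields factorized as 1xk then kx1.
        self.s2 = self._factorized(in_ch, c, 3)
        self.s3 = self._factorized(in_ch, c, 5)
        self.s4 = self._factorized(in_ch, c, 7)
        # Two-stage fusion: fine pair, coarse pair, then a final merge.
        self.fuse_fine = nn.Conv2d(2 * c, c, kernel_size=1)
        self.fuse_coarse = nn.Conv2d(2 * c, c, kernel_size=1)
        self.fuse_final = nn.Conv2d(2 * c, c, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)  # only the final conv is followed by a ReLU

    @staticmethod
    def _factorized(in_ch: int, out_ch: int, k: int) -> nn.Sequential:
        p = k // 2  # "same" padding so spatial size is preserved
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(k, 1), padding=(p, 0)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1, f2, f3, f4 = self.s1(x), self.s2(x), self.s3(x), self.s4(x)
        fine = self.fuse_fine(torch.cat([f1, f2], dim=1))
        coarse = self.fuse_coarse(torch.cat([f3, f4], dim=1))
        return self.relu(self.fuse_final(torch.cat([fine, coarse], dim=1)))
```

Stacking such blocks with interleaved 3×3/stride-2 max-pooling layers yields the multi-resolution hierarchy described above.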
2. Multi-Scale Fusion Principle
Central to MSSFNet is the msmsfblock, which simultaneously encodes fine and coarse spatial features across the scales 1×1, 3×3, 5×5, and 7×7. All larger convolutions (3×3, 5×5, 7×7) are factorized into $1\times k$ and $k\times 1$ layers to control parameter growth. Table 1 below summarizes the stream structure:
| Stream | Receptive Field | Parameterization |
|---|---|---|
| 1 | 1×1 | $1\times 1$ conv |
| 2 | 3×3 | $1\times 3$ conv, $3\times 1$ conv |
| 3 | 5×5 | $1\times 5$ conv, $5\times 1$ conv |
| 4 | 7×7 | $1\times 7$ conv, $7\times 1$ conv |
This design ensures that both local boundaries and broader context are encoded, addressing the trade-off between localization and context aggregation in edge detection.
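The parameter savings of the factorization are easy to quantify: a full $k\times k$ convolution with $C$ input and $C$ output channels costs $k^2 C^2$ weights, while the $1\times k$ plus $k\times 1$ pair costs $2kC^2$. A back-of-envelope check (bias terms ignored, channel count illustrative):

```python
def full_kxk_params(k: int, c: int) -> int:
    """Weights in one k x k convolution with c input and c output channels."""
    return k * k * c * c

def factorized_params(k: int, c: int) -> int:
    """Weights in a 1 x k convolution followed by a k x 1 convolution."""
    return 2 * k * c * c

c = 64  # illustrative channel width
for k in (3, 5, 7):
    ratio = full_kxk_params(k, c) / factorized_params(k, c)
    print(f"k={k}: full={full_kxk_params(k, c)}, "
          f"factorized={factorized_params(k, c)}, savings={ratio:.2f}x")
```

For the 7×7 stream the factorization is 3.5× cheaper ($49C^2$ vs. $14C^2$ weights), which is what keeps a 74-layer from-scratch encoder tractable.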
3. Objective Functions and Optimization
Let $X$ denote an input image, with pixelwise binary ground-truth edges $y_i \in \{0, 1\}$. Defining the class-balance weight $\alpha = |Y^-|/|Y|$ (the fraction of non-edge pixels), the side-output loss at pixel $i$ and block $k$ is

$$\ell^{(k)}(x_i) = -\alpha\, y_i \log \sigma\big(a_i^{(k)}\big) \;-\; \lambda\,(1-\alpha)\,(1-y_i) \log\big(1 - \sigma\big(a_i^{(k)}\big)\big),$$

where $a_i^{(k)}$ are the pre-sigmoid activations and the negative class is upweighted by $\lambda = 1.1$ (as in RCF). The total side loss, plus a fused-output loss after a final 3×3 convolution, is minimized over all weights.
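A minimal sketch of this RCF-style class-balanced cross-entropy, assuming the weighting convention written above (the function name and exact tensor layout are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def balanced_edge_loss(logits: torch.Tensor, targets: torch.Tensor,
                       lam: float = 1.1) -> torch.Tensor:
    """Class-balanced BCE for one side output.

    logits:  pre-sigmoid activations a_i, shape (N, 1, H, W)
    targets: binary ground-truth edge map y_i, same shape
    lam:     upweighting of the negative class (1.1, as in RCF)
    """
    num_pos = targets.sum()
    alpha = (targets.numel() - num_pos) / targets.numel()  # |Y-| / |Y|
    # Positives weighted by alpha; negatives by lam * (1 - alpha).
    weights = torch.where(targets > 0.5, alpha, lam * (1.0 - alpha))
    return F.binary_cross_entropy_with_logits(
        logits, targets, weight=weights, reduction="sum")
```

Under deep supervision, this loss is summed over the four side outputs and the fused output.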
No batch normalization or pre-trained initialization is used: all models are optimized from scratch with Adam (with weight decay), mini-batch size 6, and a learning rate decayed after a set number of epochs depending on the dataset.
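The optimization regime can be set up as follows. Note that the learning rate, weight decay, and decay schedule values below are placeholders, since the paper's exact numbers are not reproduced here; only Adam and batch size 6 are stated in the text.

```python
import torch

# Stand-in module; in practice this would be the full MSSFNet encoder.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Placeholder hyperparameters (lr, weight_decay, step_size are NOT the
# paper's values): from-scratch Adam with weight decay and step decay.
opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=2e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)
```

Each training step would then be a standard `opt.step()` on mini-batches of 6 images, with `sched.step()` once per epoch.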
4. Empirical Performance and Generalization
When trained from scratch—circumventing all reliance on ImageNet pretraining—MSSFNet surpasses canonical edge detectors in standard metrics on BIPEDv2, BSDS500, and NYUDv2. Results for the main metrics (ODS, OIS, AP) are as follows:
| Dataset | Model | ODS | OIS | AP |
|---|---|---|---|---|
| BIPEDv2 | MSSFNet | 0.897 | 0.901 | 0.936 |
| BSDS500 | MSSFNet | 0.816 | 0.835 | 0.859 |
| NYUDv2 RGB | MSSFNet | 0.732 | 0.747 | 0.744 |
On BSDS500, MSSFNet's ODS ($0.816$) exceeds the reported human benchmark in this regime. The network's inference speed and complexity are comparable to RCF or BDCN, owing to heavy reuse of asymmetric convolutions and minimal channel expansion.
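For readers unfamiliar with the metrics, ODS fixes one binarization threshold for the whole dataset while OIS picks the best threshold per image; both report the F-measure $F = 2PR/(P+R)$. A simplified illustration, assuming per-image precision/recall pairs are already available per threshold (the real BSDS evaluation additionally performs boundary matching with a distance tolerance, which is omitted here):

```python
def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def ods_ois(pr_per_image):
    """pr_per_image: list of dicts mapping threshold -> (precision, recall)."""
    thresholds = pr_per_image[0].keys()
    # ODS: one threshold fixed across the whole dataset, best average F.
    ods = max(
        sum(f_measure(*img[t]) for img in pr_per_image) / len(pr_per_image)
        for t in thresholds)
    # OIS: the best threshold is chosen independently for each image.
    ois = sum(max(f_measure(*img[t]) for t in thresholds)
              for img in pr_per_image) / len(pr_per_image)
    return ods, ois
```

By construction OIS is never below ODS, which matches the ordering in the table above.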
Generalization to different modalities is an explicit design goal: MSSFNet abandons pretraining to facilitate application to non-optical data such as SAR imagery, though quantitative SAR benchmarks remain for future work.
5. Related Multi-Scale/Multi-Stream Fusion Networks in Vision
The MSSFNet moniker has also been adopted for architectures targeting:
- Stereo Image Super-Resolution: “Mixed-Scale Selective Fusion Network (MSSFNet)” incorporates mixed-scale blocks, selective fusion attention (SFAM), and fast Fourier convolution (FFCB) (Gao et al., 2024). It achieves state-of-the-art PSNR/SSIM on standard datasets (e.g., MSSFNet-S: 35.77 dB/0.9555 on Middlebury for 2× upscaling), with ablations confirming the necessity of each component.
- Remote Sensing Building Interpretation: “MSSFC-Net” (sometimes “MSSFNet”) fuses a dual-branch multi-scale contextual encoder with spatial-spectral attention and a differential temporal fusion module (Huo et al., 2025), delivering gains over previous methods on both the WHU and LEVIR-CD benchmarks.
While architectural details diverge across domains, the unifying theme remains explicit multi-scale convolutional encoding, channel/stream fusion, and intentional avoidance of pre-trained weights to unlock broad generalizability.
6. Practical Recommendations and Prospects
In edge detection, MSSFNet demonstrates that systematic multi-scale fusion via parallel asymmetric convolutions can match, and even surpass, the performance of pre-trained deep backbones when paired with carefully balanced loss terms and well-tuned optimization from random initialization.
A plausible implication is strong suitability for imaging domains lacking large annotation corpora (e.g., SAR or medical), where pretraining is unavailable. Future research directions include empirical validation of MSSFNet on such non-optical datasets and further analysis of multi-stream versus transformer-based fusion paradigms in low-data or cross-modal settings.
7. Summary Table: Distinct MSSFNet Variants
| Domain & Paper | Purpose | Key Modules/Fusions |
|---|---|---|
| Edge detection (Liu et al., 2024) | Edge detection from scratch | msmsfblock: 4-stream parallel asymmetric conv fusion |
| StereoSR (Gao et al., 2024) | Stereo image SR | Mixed-scale block, SFAM, FFCB |
| Building extraction/change detection (Huo et al., 2025) | Remote sensing dual tasks | DMFE (dual branch), SSFC, MDFM |
Each MSSFNet instantiation is defined by the use of explicit multi-scale fusion, either within blocks or across parallel streams, often combined with specialized attention or inter-feature weighting. Empirical results support the conclusion that this architectural philosophy affords strong performance on precision, recall, and intersection-over-union in highly challenging pixel-wise vision tasks.