
MSSFNet: Multi-Scale Fusion Neural Architecture

Updated 25 February 2026
  • The paper introduces MSSFNet, which uses a msmsfblock combining four parallel asymmetric convolutional streams to capture diverse spatial features for edge detection.
  • It employs a two-stage fusion strategy that integrates fine and coarse details, achieving competitive metrics on datasets like BSDS500 without relying on pretraining.
  • MSSFNet's design extends to stereo image super-resolution and remote sensing, demonstrating flexibility and robustness across various imaging applications.

MSSFNet refers to a family of neural architectures emphasizing multi-scale and multi-stream/multi-branch fusion, predominantly for edge detection and stereo image super-resolution. The term "MSSFNet" is most notably defined in "Msmsfnet: a multi-stream and multi-scale fusion net for edge detection" (Liu et al., 2024), but also appears (sometimes as "MSSF-Net") in the stereo super-resolution (Gao et al., 2024) and building interpretation (Huo et al., 1 Apr 2025) literature. This article reviews the core formulation, mathematical structure, training regime, and empirical impact of MSSFNet for edge detection, with brief comparative notes on related usages.

1. Network Architecture Overview

MSSFNet, in the edge detection setting (Liu et al., 2024), is a fully convolutional encoder of depth 74, constructed by recursive stacking of a “msmsfblock” that embodies parallel convolutional streams with diverse receptive fields. Five max-pooling layers (3×3, stride 2) are interleaved to yield multi-resolution feature hierarchies.
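The five stride-2 pooling stages halve spatial resolution each time, producing a feature hierarchy down to roughly 1/32 of the input size. A minimal sketch of this progression, assuming a padding of 1 on each 3×3 pooling window (the padding scheme is an assumption, not stated above):

```python
def pooled_size(n, k=3, s=2, p=1):
    """Output length of one k x k, stride-s max-pool with assumed padding p."""
    return (n + 2 * p - k) // s + 1

# Resolution of one spatial dimension after each of the five pooling stages,
# for a hypothetical 321-pixel input side.
sizes = [321]
for _ in range(5):
    sizes.append(pooled_size(sizes[-1]))
print(sizes)  # five successive halvings
```

Each stage between poolings operates at a fixed resolution, which is where the side outputs for deep supervision are tapped.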

The main block, msmsfblock, takes as input $X \in \mathbb{R}^{C\times H\times W}$ and processes it along four parallel convolutional streams:

  • Stream 1 (receptive field 1×1): $T_1 = W_1^{1\times1} * X$
  • Stream 2 (3×3): $T_2 = W_2^{3\times1} * (W_2^{1\times3} * X)$
  • Stream 3 (5×5): $T_3 = W_3^{5\times1} * (W_3^{1\times5} * X)$
  • Stream 4 (7×7): $T_4 = W_4^{7\times1} * (W_4^{1\times7} * X)$

Following the multi-stream convolution, the features are fused in a two-stage system:

  • First fusion: $S_\ell = T_1 + T_2$, $S_h = T_3 + T_4$
  • Final fusion: $Y = \mathrm{ReLU}(W_f^{3\times1} * (S_\ell + S_h))$

No batch normalization is used; only the final convolution within each block is followed by a ReLU. Deep supervision is applied via four side outputs after select blocks, mimicking the HED paradigm.
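The stream-and-fusion computation above can be sketched end to end. The following is an illustrative single-channel NumPy version with random weights; real blocks operate on multi-channel tensors with learned filters, and the helper names (`conv2d`, `msmsf_block`) are hypothetical:

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2-D cross-correlation of a single-channel map x with kernel w."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w)
    return out

def msmsf_block(x, rng):
    """Single-channel sketch of the four-stream block with two-stage fusion."""
    k = lambda h, w: rng.standard_normal((h, w)) * 0.1  # random stand-in weights
    t1 = conv2d(x, k(1, 1))                   # stream 1: 1x1
    t2 = conv2d(conv2d(x, k(1, 3)), k(3, 1))  # stream 2: 1x3 then 3x1
    t3 = conv2d(conv2d(x, k(1, 5)), k(5, 1))  # stream 3: 1x5 then 5x1
    t4 = conv2d(conv2d(x, k(1, 7)), k(7, 1))  # stream 4: 1x7 then 7x1
    s_low, s_high = t1 + t2, t3 + t4          # first fusion stage
    fused = conv2d(s_low + s_high, k(3, 1))   # final fusion conv
    return np.maximum(fused, 0.0)             # ReLU only at the block output
```

Note the placement of the single ReLU at the block output, matching the no-batch-norm design described above.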

2. Multi-Scale Fusion Principle

Central to MSSFNet is the msmsfblock, which simultaneously encodes fine and coarse spatial features across scales $\{1, 3, 5, 7\}$. All larger convolutions (3×3, 5×5, 7×7) are factorized into $1\times n$ and $n\times 1$ layers to control parameter growth. Table 1 below summarizes stream structure:

| Stream | Receptive Field | Parameterization |
|--------|-----------------|------------------|
| 1 | 1×1 | $1\times1$ conv |
| 2 | 3×3 | $1\times3$, $3\times1$ |
| 3 | 5×5 | $1\times5$, $5\times1$ |
| 4 | 7×7 | $1\times7$, $7\times1$ |

This design ensures that both local boundaries and broader context are encoded, addressing the trade-off between localization and context aggregation in edge detection.
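The parameter savings from this factorization are easy to verify by counting weights: a full $k\times k$ convolution with $C$ input and output channels costs $k^2 C^2$ weights, while the $1\times k$ plus $k\times 1$ pair costs $2kC^2$, a $k/2$-fold reduction (3.5× for the 7×7 stream). A quick check (the channel width of 32 here is an arbitrary illustrative choice):

```python
def conv_params(kh, kw, c_in, c_out, bias=True):
    """Weight (and optional bias) count of one 2-D convolution layer."""
    return kh * kw * c_in * c_out + (c_out if bias else 0)

def factorized_params(k, c, bias=True):
    """A k x k receptive field built as 1 x k followed by k x 1."""
    return conv_params(1, k, c, c, bias) + conv_params(k, 1, c, c, bias)

c = 32
for k in (3, 5, 7):
    full = conv_params(k, k, c, c, bias=False)
    fact = factorized_params(k, c, bias=False)
    print(f"{k}x{k}: full={full}, factorized={fact}, ratio={full / fact:.2f}")
```

The ratio grows with kernel size, which is why the savings matter most for the 5×5 and 7×7 streams.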

3. Objective Functions and Optimization

Let $u$ denote an input image with pixelwise binary ground-truth edge labels $\mathbb{G}$, partitioned into edge pixels $\mathbb{G}_+$ and non-edge pixels $\mathbb{G}_-$. Defining $\lambda = |\mathbb{G}_-|/|\mathbb{G}|$ as the class-balance weight, the side-output loss at block $m\in\{1,2,3\}$, summed over pixels $j$, is

$$\ell_{\mathrm{side}}^{(m)} = -\lambda \sum_{j\in\mathbb{G}_+} \log \sigma\big(a_j^{(m)}\big) \,-\, 1.1\,(1-\lambda) \sum_{j\in\mathbb{G}_-} \log\big(1-\sigma(a_j^{(m)})\big)$$

where aj(m)a_j^{(m)} are the pre-sigmoid activations, and the negative class is upweighted by 1.1 (as in RCF). The total side loss, plus a fused output loss after a final 3×3 conv, is minimized over all weights.
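The class-balanced loss above can be written directly in NumPy. This sketch assumes a clean binary ground truth (benchmark protocols often additionally ignore weakly annotated pixels, which is omitted here; the function name is hypothetical):

```python
import numpy as np

def side_loss(logits, gt, neg_weight=1.1):
    """Class-balanced BCE over pre-sigmoid activations.

    lambda = |G_-| / |G| weights the positive term; the negative term is
    scaled by (1 - lambda) and additionally up-weighted by 1.1, as in RCF.
    """
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid of pre-activations
    pos, neg = gt == 1, gt == 0
    lam = neg.sum() / gt.size          # fraction of non-edge pixels
    loss_pos = -lam * np.log(p[pos]).sum()
    loss_neg = -neg_weight * (1.0 - lam) * np.log(1.0 - p[neg]).sum()
    return loss_pos + loss_neg
```

Because edge pixels are rare, $\lambda$ is close to 1, so the sparse positive class receives the larger per-pixel weight.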

No batch normalization or pre-trained initialization is used: all models are optimized from scratch with Adam (weight decay $1\times10^{-12}$), mini-batch size 6, and learning rate $1\times10^{-4}$, decayed after a set number of epochs depending on the dataset.

4. Empirical Performance and Generalization

When trained from scratch—circumventing all reliance on ImageNet pretraining—MSSFNet surpasses canonical edge detectors in standard metrics on BIPEDv2, BSDS500, and NYUDv2. Results for the main metrics (ODS, OIS, AP) are as follows:

| Dataset | Model | ODS | OIS | AP |
|---------|-------|-----|-----|-----|
| BIPEDv2 | MSSFNet | 0.897 | 0.901 | 0.936 |
| BSDS500 | MSSFNet | 0.816 | 0.835 | 0.859 |
| NYUDv2 (RGB) | MSSFNet | 0.732 | 0.747 | 0.744 |

On BSDS500, MSSFNet's ODS (0.816) exceeds the reported human benchmark ($\approx 0.803$) in this regime. The network demonstrates comparable inference speed and complexity to RCF or BDCN, due to heavy reuse of asymmetric convolutions and minimal channel expansion.
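For reference, ODS and OIS are both derived from the pixel-wise F-measure: ODS fixes one threshold across the whole dataset, while OIS picks the best threshold per image. A simplified sketch (real BSDS evaluation additionally matches predicted boundaries to ground truth within a spatial tolerance, which is omitted here):

```python
import numpy as np

def f_measure(pred, gt, thr):
    """Pixel-wise F1 of a binarized edge map against binary ground truth."""
    b = pred >= thr
    tp = np.logical_and(b, gt).sum()
    prec = tp / max(b.sum(), 1)
    rec = tp / max(gt.sum(), 1)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def ods_ois(preds, gts, thresholds):
    """ODS: one dataset-wide best threshold; OIS: best threshold per image."""
    ods = max(np.mean([f_measure(p, g, t) for p, g in zip(preds, gts)])
              for t in thresholds)
    ois = np.mean([max(f_measure(p, g, t) for t in thresholds)
                   for p, g in zip(preds, gts)])
    return ods, ois
```

By construction OIS is never below ODS, since per-image threshold selection can only improve on a single shared threshold.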

Generalization to different modalities is an explicit design goal: MSSFNet abandons pretraining to facilitate application to non-optical data such as SAR imagery, though quantitative SAR benchmarks remain for future work.

5. Cross-Domain Variants

The MSSFNet moniker has also been adopted for architectures targeting:

  • Stereo Image Super-Resolution: “Mixed-Scale Selective Fusion Network (MSSFNet)” incorporates mixed-scale blocks, selective fusion attention (SFAM), and fast Fourier convolution (FFCB) (Gao et al., 2024). It achieves state-of-the-art PSNR/SSIM on standard datasets (e.g., MSSFNet-S: 35.77 dB/0.9555 on Middlebury for 2× upscaling), with ablations confirming the necessity of each component.
  • Remote Sensing Building Interpretation: "MSSFC-Net" (sometimes "MSSFNet") fuses a dual-branch multi-scale contextual encoder with spatial-spectral attention and a differential temporal fusion module (Huo et al., 1 Apr 2025), delivering IoU gains (+1.07% on WHU, +0.50% on LEVIR-CD) over previous methods.

While architectural details diverge across domains, the unifying theme remains explicit multi-scale convolutional encoding, channel/stream fusion, and intentional avoidance of pre-trained weights to unlock broad generalizability.

6. Practical Recommendations and Prospects

In edge detection, MSSFNet demonstrates that systematic multi-scale fusion via parallel asymmetric convolutions can achieve, and even surpass, the performance of pre-trained deep backbones—when paired with carefully balanced loss terms and well-tuned optimization from random initialization.

A plausible implication is strong suitability for imaging domains lacking large annotation corpora (e.g., SAR or medical), where pretraining is unavailable. Future research directions include empirical validation of MSSFNet on such non-optical datasets and further analysis of multi-stream versus transformer-based fusion paradigms in low-data or cross-modal settings.

7. Summary Table: Distinct MSSFNet Variants

| Domain & Paper | Purpose | Key Modules/Fusions |
|----------------|---------|---------------------|
| Edge detection (Liu et al., 2024) | Edge detection from scratch | msmsfblock: 4-stream parallel asymmetric conv fusion |
| StereoSR (Gao et al., 2024) | Stereo image SR | Mixed-scale block, SFAM, FFCB |
| Building extraction/change detection (Huo et al., 1 Apr 2025) | Remote sensing dual tasks | DMFE (dual branch), SSFC, MDFM |

Each MSSFNet instantiation is defined by the use of explicit multi-scale fusion, either within blocks or across parallel streams, often combined with specialized attention or inter-feature weighting. Empirical results support the conclusion that this architectural philosophy affords strong performance on metrics such as precision, recall, and intersection-over-union in highly challenging pixel-wise vision tasks.
