FSDENet: Detail Enhancement in Remote Sensing
- FSDENet is a dual-domain neural network architecture that exploits spatial, frequency, and wavelet features for precise remote sensing segmentation.
- It employs a UNet-style encoder–decoder with modules like MASF, FFDP, and HWDE to fuse multi-scale spatial and spectral information efficiently.
- Experimental results on benchmarks such as LoveDA and iSAID show state-of-the-art performance in handling ambiguous boundaries, shadows, and low-contrast regions.
FSDENet (Frequency and Spatial Domains-based Detail Enhancement Network) is a neural network architecture for high-resolution remote sensing semantic segmentation that jointly exploits multi-scale spatial features, global frequency cues, and local edge/texture details to improve pixel-wise label accuracy, especially in the presence of grayscale transition zones, ambiguous boundaries, shadows, and low-contrast regions. The design integrates ConvNeXt-Small as a backbone, a multi-attention spatial fusion pathway, explicit Fast Fourier domain processing, and a Haar wavelet-based edge enhancement chain to provide dual-domain synergy between feature granularity and boundary precision (Fu et al., 29 Sep 2025).
1. Architectural Overview
FSDENet adopts a UNet-style encoder–decoder topology, where the encoder is built upon a ConvNeXt-Small backbone, extracting four levels of spatial feature maps. The architecture is augmented with three principal functional modules:
- Multi-Attention Select Fusion (MASF): Merges spatial features from multiple resolution scales.
- Fast Fourier Detail Perception (FFDP): Integrates non-local/global spectral cues via frequency-domain transformations and modulations.
- Haar Wavelet Detail Enhancement (HWDE): Refines segmentation near semantic boundaries by decomposing features into localized high-frequency and low-frequency subbands for targeted processing.
The central aim is precise semantic parsing and robust delineation of objects in challenging imagery marked by shadows, grayscale ambiguity, or fine object edges, without imposing the prohibitive computational cost of full self-attention mechanisms.
2. Core Components and Processing Pathways
2.1 Spatial Processing Branch and MASF
The ConvNeXt-Small backbone outputs four feature maps, x₁ through x₄, at progressively coarser spatial scales. The MASF block fuses the two shallow maps (x₁, x₂) into one shallow feature set and the two deep maps (x₃, x₄) into one deep feature set, each remapped to a prescribed shape. Channel- and spatial-attention mechanisms adaptively recalibrate the information content through horizontal/vertical/global pooling and convolutional gating.
The fusion combines three ingredients: a convolution, the learned attention, and gating coefficients.
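The pooling-and-gating idea can be illustrated with a minimal sketch. This is a generic directional-attention gate, not the MASF fusion formula: the function name `directional_gate`, the single-channel input, and the way the three pooled statistics are combined are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def directional_gate(feature):
    """Reweight a single-channel H x W map using row, column, and global means.

    Hypothetical illustration only; the combination rule below is an assumption.
    """
    h, w = len(feature), len(feature[0])
    # Pool along the width: one statistic per row.
    row_mean = [sum(row) / w for row in feature]
    # Pool along the height: one statistic per column.
    col_mean = [sum(feature[i][j] for i in range(h)) / h for j in range(w)]
    # Global pooling: one statistic for the whole map.
    global_mean = sum(row_mean) / h
    gated = []
    for i in range(h):
        gated.append([
            feature[i][j] * sigmoid(row_mean[i] + col_mean[j] - global_mean)
            for j in range(w)
        ])
    return gated
```

Each output position is scaled by a gate in (0, 1) that depends on statistics of its whole row and column, which is how pooled attention lets distant pixels influence a local response at low cost.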
2.2 Frequency-Domain Global Feature Mapping (FFDP)
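The FFDP module applies a 2D Discrete Fourier Transform (DFT), computed via the FFT, to the spatial feature map $X$ of height $H$ and width $W$:
$$\mathcal{F}(X)(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X(h, w)\, e^{-j 2\pi \left( \frac{u h}{H} + \frac{v w}{W} \right)}$$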
The resulting spectral representation is modulated by a lightweight “partial convolution,” followed by batch normalization and rectification, and then projected back via inverse FFT. This enables efficient capture of non-local contextual information essential for modeling broad grayscale transitions and suppressing spurious correlations due to shadowing.
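The transform, modulate, and invert sequence can be sketched as follows. This is a simplified stand-in, assuming a fixed per-frequency gain in place of the learned partial convolution, batch normalization, and rectification; `dft2` and `frequency_modulate` are hypothetical helper names, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath


def dft2(x, inverse=False):
    """Naive 2D discrete Fourier transform of an H x W list of numbers."""
    h, w = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for m in range(h):
                for n in range(w):
                    angle = 2 * cmath.pi * (u * m / h + v * n / w)
                    total += x[m][n] * cmath.exp(sign * 1j * angle)
            # The inverse transform carries the 1 / (H * W) normalisation.
            out[u][v] = total / (h * w) if inverse else total
    return out


def frequency_modulate(x, gain):
    """Scale each frequency bin by gain[u][v], then return to the spatial domain.

    Assumption: a fixed real gain stands in for the learned spectral modulation.
    Only the real part is kept, a simplification for this sketch.
    """
    spectrum = dft2(x)
    scaled = [
        [spectrum[u][v] * gain[u][v] for v in range(len(x[0]))]
        for u in range(len(x))
    ]
    return [[value.real for value in row] for row in dft2(scaled, inverse=True)]
```

Keeping only the zero-frequency bin replaces every pixel with the global mean, which shows why a single spectral coefficient carries non-local context: editing one bin changes every spatial position at once.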
Output enhancement combines the frequency-modulated feature (the FFT-processed map) with the outputs of two orthogonal large-kernel depth-wise convolutions and the original input.
2.3 Haar Wavelet Detail Enhancement (HWDE)
HWDE decomposes an input feature using the Haar wavelet transform into four subbands: LL (low frequency) and LH, HL, HH (high frequency). The low-frequency path performs batch normalization, rectification, and a cascade of convolutions and poolings to create multi-scale edge maps, each of which is further enhanced.
The high-frequency path aggregates the LH, HL, and HH subbands, processes them with depth-wise convolutions and reverse bottleneck blocks, then fuses the result with the low-frequency enhancement under a learned channel attention (CA-layer) mechanism gated by a sigmoid activation $\sigma$.
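The subband split that HWDE starts from can be sketched as a single-level 2D Haar transform. This is a minimal illustration, assuming an even-sized single-channel map and orthonormal scaling; `haar_dwt2` is a hypothetical helper, and the LH/HL naming follows one common convention that differs between libraries.

```python
def haar_dwt2(x):
    """Single-level 2D Haar transform of an H x W map with even H and W.

    Returns (LL, LH, HL, HH), each of size H/2 x W/2.
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(x), 2):
        rows = ([], [], [], [])
        for j in range(0, len(x[0]), 2):
            # The four pixels of one non-overlapping 2 x 2 block.
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2)  # LL: local average
            rows[1].append((a - b + c - d) / 2)  # LH: change across columns
            rows[2].append((a + b - c - d) / 2)  # HL: change across rows
            rows[3].append((a - b - c + d) / 2)  # HH: diagonal change
        for band, row in zip((ll, lh, hl, hh), rows):
            band.append(row)
    return ll, lh, hl, hh
```

A flat region produces zeros in all three high-frequency subbands, while an intensity step between adjacent columns shows up only in the column-difference band. That separation is what lets the two paths treat smooth content and edges differently.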
3. Dual-Domain Feature Fusion and Decoder Path
After MASF, the shallow and deep features are cross-interacted via two passes of channel-agent guided fusion (CAGF) and FFDP. The newly fused features are then re-integrated, and the four resulting maps are concatenated and dimensionally compressed. Finally, the compressed feature and the HWDE output are concatenated to form the input to the decoder path.
4. Loss Function, Training Regime, and Data Augmentation
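FSDENet is trained using pixel-wise cross-entropy loss:
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$
where $N$ is the total number of pixels, $C$ the number of classes, $y_{i,c}$ the one-hot ground truth, and $p_{i,c}$ the predicted probability of class $c$ at pixel $i$.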
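A small worked example of the pixel-wise cross-entropy described above may help. It is a sketch, assuming the network output has already been converted to per-pixel class probabilities; `pixel_cross_entropy` is a hypothetical name. Because the ground truth is one-hot, the inner sum over classes reduces to the log-probability of the true class.

```python
import math


def pixel_cross_entropy(probs, labels):
    """Mean cross-entropy over pixels.

    probs:  list of per-pixel class-probability lists (each sums to 1).
    labels: list of true class indices, one per pixel.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # One-hot targets keep only the true-class term of the class sum.
        total -= math.log(p[y])
    return total / len(labels)


# Two pixels, two classes.
loss = pixel_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Here the loss is the average of -log(0.5) and -log(0.75), so confident correct predictions drive it toward zero.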
Training uses the AdamW optimizer with a cosine annealing learning-rate schedule. Data augmentation includes random scaling, horizontal/vertical flipping, and rotation. No additional Dice or specialized boundary losses are used.
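The shape of a cosine annealing schedule can be sketched as below. This is the standard cosine decay formula, not the authors' training code; the base learning rate of `1e-3`, the step count, and the floor `min_lr` are placeholder assumptions rather than values from the source.

```python
import math


def cosine_annealed_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate after `step` of `total_steps` under cosine annealing."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
schedule = [cosine_annealed_lr(s, 100, base_lr=1e-3) for s in range(101)]
```

The rate starts at the base value, passes through half of it at the midpoint, and decays smoothly to the floor at the final step.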
5. Experimental Results and Performance Benchmarks
FSDENet demonstrates state-of-the-art segmentation on multiple high-resolution benchmarks with a ConvNeXt-Small backbone:
- LoveDA: higher mIoU than SFFNet
- iSAID: mIoU on par with SegNeXt-L at lower cost
- Vaihingen: state-of-the-art mF1, OA, and mIoU
- Potsdam: state-of-the-art mF1, OA, and mIoU
Ablation on Vaihingen shows that MASF, CAGF, FFDP, and HWDE each contribute an mIoU increase, and that their cumulative effect raises mIoU from the plain baseline to the full FSDENet. FSDENet is reported as particularly effective around blurred boundaries and in scenes with significant grayscale or illumination variance.
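The three reported metrics (mIoU, mF1, OA) can all be derived from a class confusion matrix. The sketch below uses their standard definitions; `segmentation_metrics` is a hypothetical helper, the example matrix is made up, and the code assumes every class appears at least once so no denominator is zero.

```python
def segmentation_metrics(conf):
    """Compute mIoU, mF1, and OA from a confusion matrix.

    conf[t][p] counts pixels whose true class is t and predicted class is p.
    """
    k = len(conf)
    total = sum(sum(row) for row in conf)
    ious, f1s = [], []
    for c in range(k):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # true c, predicted otherwise
        fp = sum(conf[t][c] for t in range(k)) - tp  # predicted c, true otherwise
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    oa = sum(conf[c][c] for c in range(k)) / total
    return {"mIoU": sum(ious) / k, "mF1": sum(f1s) / k, "OA": oa}


# Made-up two-class confusion matrix, for illustration only.
metrics = segmentation_metrics([[3, 1], [1, 5]])
```

For any class with at least one error, IoU is strictly lower than F1, so mIoU is the stricter of the two class-averaged scores, while OA is dominated by the most frequent classes.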
6. Implementation Considerations
- Backbone: ConvNeXt-Small.
- Modules: MASF for spatial attention/fusion; CAGF for cross-agent interaction; FFDP for frequency-global mapping; HWDE for multi-scale edge refinement.
- Efficiency: parameter count, FLOPs, and FPS are measured on a Tesla V100.
- Training: Epochs per dataset: LoveDA (30), iSAID (60), Vaihingen/Potsdam (105). Batch size is set separately for LoveDA/Potsdam/Vaihingen and for iSAID.
7. Context, Applications, and Comparison
FSDENet addresses a central challenge in high-resolution remote sensing: accurately localizing semantic classes in pixels near ambiguous, shadowed, or low-texture boundaries. By fusing dual-domain cues it achieves a performance level that exceeds or matches state-of-the-art systems, including SFFNet and SegNeXt-L, often at lower computational cost (Fu et al., 29 Sep 2025). The approach of explicit Fourier and wavelet domain integration, as opposed to purely spatial channel/attention processing, allows more robust discrimination of object borders—a critical factor for practical deployment in urban monitoring, agricultural assessment, and infrastructure inspection under varied illumination and imaging conditions.