MSCloudCAM: Multispectral Cloud Segmentation

Updated 19 October 2025

MSCloudCAM is a multispectral cloud segmentation framework that uses a transformer backbone and attention mechanisms to classify different cloud types from satellite imagery.
The framework integrates a Swin Transformer with ASPP and PSP modules, enabling effective handling of spectral variability and ambiguous cloud boundaries.
Its high segmentation accuracy on datasets like CloudSEN12 and L8Biome demonstrates practical benefits for environmental monitoring and large-scale Earth observation.

MSCloudCAM is a multispectral cloud segmentation framework designed for robust semantic categorization of clouds in satellite imagery from diverse sensors, notably Sentinel-2 and Landsat-8. The model addresses challenges in cloud detection and classification—such as handling spectral variability, ambiguous cloud boundaries, and differentiation of cloud types including thin cloud and cloud shadow—by combining transformer-based hierarchical feature extraction, multi-scale context encoding, and advanced attention mechanisms. MSCloudCAM demonstrates high segmentation accuracy, computational efficiency, and effective generalization across sensors and spectral bands, making it applicable for large-scale environmental monitoring and Earth observation.

1. Model Architecture and Feature Extraction

The architecture of MSCloudCAM is hybrid, built around a Swin Transformer backbone and enriched by multi-scale context and attention modules. The key pipeline is:

Swin Transformer Backbone: Processes multispectral images (e.g., Sentinel-2 with 13 bands, Landsat-8 with 11 bands) through a hierarchy of shifted-window self-attention blocks. This yields a series of feature maps $\{f_1, f_2, f_3, f_4\}$ , where $f_4$ encodes global context and $f_1$ – $f_3$ preserve spatial detail.

$\{f_1, f_2, f_3, f_4\} = E(X), \quad f_i \in \mathbb{R}^{B \times C_i \times H_i \times W_i}$

Multi-Scale Context Encoding: Leveraging two complementary modules:
- ASPP (Atrous Spatial Pyramid Pooling): Applied to $f_4$ , it performs dilated convolutions with rates $\{1, 6, 12, 18\}$ and introduces global average pooling for large receptive fields:
$x_{ASPP} = \bigoplus_{r \in \{1,6,12,18\}} \sigma(\mathrm{Conv}^{(r)}_{3\times3}(f_4)) \oplus \sigma(\mathrm{GAP}(f_4))$ - PSP (Pyramid Scene Parsing): Aggregates context from $f_3$ at grid scales $\{1,2,3,6\}$ , followed by upsampling and nonlinear activations:

$x_{PSP} = \bigoplus_{s \in \{1,2,3,6\}} \sigma(\mathrm{Up}(\mathrm{Pool}_s(f_3)))$

2. Cross-Attention, Channel, and Spatial Modules

A distinguishing innovation in MSCloudCAM is its feature fusion and refinement mechanism:

Cross-Attention Fusion Block: Concatenates $x_{ASPP}$ and $x_{PSP}$ to form $x_{cat}$ , utilizing convolutional multi-head cross-attention wherein PSP features guide the alignment:

$Q = W_Q \cdot x_{cat}, \quad K = W_K \cdot x_{PSP}, \quad V = W_V \cdot x_{PSP}$

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}(QK^T/\sqrt{d_k})V$

$\tilde{x} = W_O \cdot \mathrm{Attn}(Q, K, V) + x_{cat}$

Bottleneck Projection: Transforms $\tilde{x}$ via three convolutional layers into a compact 512-channel embedding.
Combined Attention Refinement:
- Efficient Channel Attention Block (ECAB): Recalibrates channel-wise features to emphasize spectral cues critical for cloud discrimination.
- Spatial Attention Module: Highlights discriminative regions in the spatial domain.
- Combined Attention Operator:
$z' = \phi_{CA}(z) = \phi_{ECA}(z) \odot \phi_{SA}(z) + z$

3. Semantic Categorization and Datasets

MSCloudCAM targets four semantic categories:

Label	Meaning
0	Clear sky
1	Thick cloud
2	Thin cloud
3	Cloud shadow

Evaluation uses two major datasets:

CloudSEN12 (Sentinel-2): 8,500 training, 500 validation, and 1,000 test samples in both L1C and L2A levels, spanning 13 bands.
L8Biome (Landsat-8 CCA): 11 bands, with semantic categories mapped to the CloudSEN12 taxonomy.

All inputs are normalized by dividing top-of-atmosphere reflectance by 3,000.

4. Training and Deep Supervision

The decoder module progressively upsamples $z'$ using transposed convolutions. Deep supervision is applied through auxiliary outputs at intermediate stages, with the training loss:

$\mathcal{L} = \lambda_{\text{final}}\mathcal{L}(\hat{Y}, Y) + \lambda_1\mathcal{L}(\hat{Y}_1, Y) + \lambda_2\mathcal{L}(\hat{Y}_2, Y)$

Weights are chosen to prioritize the final prediction and stabilize intermediate outputs.

5. Experimental Performance and Complexity

Experimental results demonstrate:

State-of-the-art metrics: On CloudSEN12 and L8Biome, MSCloudCAM surpasses DBNet, CDNetv2, HRCloudNet, and multiple U-Net baselines in mean Intersection-over-Union (mIoU), F1 score, and overall accuracy. For example, MSCloudCAM achieves an mIoU of 75.52% (CloudSEN12 L1C).
Qualitative segmentation: Figures in the referenced paper show improved delineation of thin clouds and shadows in challenging multispectral scenes compared to previous architectures.
Efficiency: MSCloudCAM operates at 38.98 GFLOPs and 47.44 million parameters, balancing accuracy and computational load to support inference on large-scale and possibly onboard satellite platforms.

6. Practical Applications in Earth Observation

MSCloudCAM’s architecture supports generalization across sensors and spectral domains. Notable applications include:

Environmental Monitoring: Reliable identification of cloud cover variability in multispectral Earth observation campaigns.
Land Cover Mapping: Robustly masking clouds and shadows, therefore improving downstream analysis of land surfaces.
Climate Research: Enhanced delineation of cloud structures enables more accurate radiative transfer modeling and climate parameterization.

This suggests that MSCloudCAM is positioned as a reference framework for multispectral cloud segmentation tasks in both research and operational satellite data processing pipelines.

7. Comparative Significance and Future Prospects

By synthesizing hierarchical transformer features, multi-scale context aggregation, and both cross- and channel-spatial attention mechanisms, MSCloudCAM achieves superior results on multispectral cloud segmentation benchmarks. The modularity and parameter efficiency make it practical for scalable deployment. A plausible implication is that such attention-augmented architectures will continue to underpin advances in dense semantic segmentation across a variety of remote sensing modalities. There are no explicit controversies noted regarding its adoption, although future work may address domain adaptation to additional sensors or integration with active sensing data.

PDF Markdown Chat (Pro)

Follow Topic

Get notified by email when new papers are published related to MSCloudCAM.