
AttentionUNet-OBIA: Hybrid Forest Mapping

Updated 5 January 2026
  • The paper presents a hybrid method combining AttentionUNet with OBIA to deliver both pixel-level discrimination and object-level interpretability for forest/non-forest classification.
  • The methodology employs a UNet-style encoder-decoder with attention gates alongside mean-shift segmentation, enhancing feature focus and spatial coherence in high-resolution remote sensing data.
  • The approach achieves state-of-the-art performance (OA 95.64%, IoU 0.9064) and outperforms traditional OBIA and other deep learning variants in identifying forest cover.

AttentionUNet-OBIA is a hybrid forest cover mapping methodology that integrates a deep learning model—AttentionUNet—with Object-Based Image Analysis (OBIA) for high-resolution multispectral remote sensing image analysis. Developed within the "ForCM" pipeline for Sentinel-2 imagery, it achieves state-of-the-art accuracy for forest/non-forest classification in the Amazon Rainforest, providing both pixel-wise discrimination and object-level interpretability with open-source tools (Haque et al., 29 Dec 2025).

1. Architecture and Attention Mechanism

The core of AttentionUNet-OBIA is a UNet-style encoder–decoder architecture augmented with attention gates (AG). The input consists of $512 \times 512 \times C$ images ($C = 3$ or $4$ bands). The model comprises four encoding stages, a bottleneck, and four decoding stages that symmetrically mirror the encoder. Each encoder level $\ell$ applies two consecutive $3 \times 3$ convolutions with ReLU activations (optionally batch-normalized), doubling the feature channels at each downsampling ($64 \to 128 \to 256 \to 512$). Spatial resolution is reduced by $2 \times 2$ max-pooling (stride 2). The bottleneck contains two $3 \times 3$ convolutions at 1024 channels.
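The encoder described above can be sketched as follows in PyTorch. This is a minimal illustration of the stated design (two $3 \times 3$ convolutions per level, optional batch normalization, $2 \times 2$ max-pooling, channels $64 \to 128 \to 256 \to 512$), not the authors' released code:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Two 3x3 conv + ReLU layers followed by 2x2 max-pooling (stride 2)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),   # optional batch norm, per the paper
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.convs(x)          # feature map kept for the skip connection
        return self.pool(skip), skip

# Channel progression 64 -> 128 -> 256 -> 512 for a 4-band input
x = torch.randn(1, 4, 512, 512)
for in_ch, out_ch in [(4, 64), (64, 128), (128, 256), (256, 512)]:
    x, skip = EncoderBlock(in_ch, out_ch)(x)
print(x.shape)  # torch.Size([1, 512, 32, 32])
```

After four poolings the $512 \times 512$ input is reduced to $32 \times 32$ before entering the bottleneck.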

Decoding consists of $2 \times 2$ transposed convolution (up-convolution) to increase spatial resolution and halve the channel count. At each decoder level $\ell$, an attention gate receives the upsampled decoder signal $g^\ell$ and the encoder skip feature $x^\ell$. The AG formulation follows Oktay et al. (2018):

\begin{align*}
f_x &= W_x\,x^\ell, \qquad f_g = W_g\,g^\ell \\
\Psi_\text{int} &= \operatorname{ReLU}(f_x + f_g + b) \\
\alpha^\ell &= \sigma(\psi^T \Psi_\text{int} + b_\psi) \\
x^{\ell\,\prime} &= \alpha^\ell \odot x^\ell
\end{align*}

where $W_x$, $W_g$, and $\psi$ are learned convolutions, $\sigma$ is the sigmoid, and $\odot$ denotes element-wise multiplication. The AG output $x^{\ell\,\prime}$ emphasizes salient features and suppresses irrelevant regions before skip-connection concatenation. The final feature map passes through a $1 \times 1$ convolution, followed by sigmoid activation, yielding a pixel-wise probability map in $[0, 1]$.
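The attention gate above maps directly to a few lines of PyTorch. The sketch below implements the Oktay et al. (2018) formulation with $1 \times 1$ convolutions for $W_x$, $W_g$, and $\psi$; channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate (Oktay et al., 2018): gates the encoder skip
    feature x with the coarser, upsampled decoder gating signal g."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.W_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)  # f_x = W_x x
        self.W_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)  # f_g = W_g g
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)     # psi^T . + b_psi
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, g):
        # Psi_int = ReLU(f_x + f_g + b); alpha = sigma(psi(Psi_int))
        alpha = self.sigmoid(self.psi(self.relu(self.W_x(x) + self.W_g(g))))
        return alpha * x                                     # x' = alpha ⊙ x

gate = AttentionGate(x_ch=64, g_ch=64, inter_ch=32)
x = torch.randn(1, 64, 128, 128)   # encoder skip feature
g = torch.randn(1, 64, 128, 128)   # upsampled decoder signal (same spatial size)
out = gate(x, g)
print(out.shape)  # torch.Size([1, 64, 128, 128])
```

The attention coefficients $\alpha^\ell$ are a single-channel map broadcast across the skip feature's channels, so gating preserves the feature map's shape.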

2. Data Preprocessing and Input Modalities

Input data are Sentinel-2 Level-2A images, pre-corrected for atmospheric effects using ESA Sen2Cor. Band selection includes both three-band (RGB) and four-band (RGB plus NIR, all at 10 m) sets. Input normalization policies are:

  • Three-band images: divide by 255 and cast to float32, yielding values in $[0, 1]$.
  • Four-band images: cast to float32 and divide each band by its maximum reflectance, rescaling to $[0, 1]$.

No additional spectral indices such as NDVI are computed; only raw bands are provided to the model.
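A minimal NumPy sketch of the two normalization policies, assuming a `(H, W, C)` band stack (the per-band maximum here is taken from the image itself, one plausible reading of the paper's description):

```python
import numpy as np

def normalize_bands(img):
    """Scale a (H, W, C) band stack to float32 in [0, 1].
    3-band RGB: divide by 255; 4-band (RGB+NIR): per-band max scaling."""
    img = img.astype(np.float32)
    if img.shape[-1] == 3:
        return img / 255.0
    # Per-band maximum over the spatial axes, guarding against division by zero
    band_max = img.max(axis=(0, 1), keepdims=True)
    return img / np.maximum(band_max, 1e-6)

rgb = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint16)
rgbn = np.random.randint(0, 10000, (512, 512, 4), dtype=np.uint16)
r, n = normalize_bands(rgb), normalize_bands(rgbn)
print(r.dtype, r.max() <= 1.0, n.max() <= 1.0)
```

Since no spectral indices are computed, this normalization is the only transformation applied to the raw bands before training.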

3. OBIA: Segmentation and Feature Extraction

Post-prediction, OBIA is performed in QGIS (v3.34.5, Orfeo Toolbox v8.1.2) using mean-shift segmentation, chosen for robust unsupervised object delineation. The mean-shift spatial radius, range radius, and minimum object size (in pixels) are selected by trial and error. Each resulting image object $O_i$ yields a feature vector $f_i = [\mu_1, \ldots, \mu_B, \bar{p}_i]$, where $\mu_1, \ldots, \mu_B$ are the mean band reflectances and $\bar{p}_i$ is the mean AttentionUNet pixel-wise probability within $O_i$. Optional features (not used in this work) include area, perimeter, compactness, and texture.
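Given a segmentation label raster, the per-object feature vectors can be assembled as below. This is a plain NumPy sketch of the feature-extraction step (the paper performs it in QGIS/OTB); shapes and segment layout are toy values:

```python
import numpy as np

def object_features(bands, prob, labels):
    """For each segment id in `labels`, build f_i = [mean band reflectances,
    mean AttentionUNet probability] -- the object-level feature vector."""
    ids = np.unique(labels)
    feats = np.zeros((ids.size, bands.shape[-1] + 1))
    for row, oid in enumerate(ids):
        mask = labels == oid
        feats[row, :-1] = bands[mask].mean(axis=0)  # mean per band within object
        feats[row, -1] = prob[mask].mean()          # mean DL probability
    return ids, feats

# Toy 64x64 scene: 4 bands, a probability map, and 4 segments
bands = np.random.rand(64, 64, 4)
prob = np.random.rand(64, 64)
labels = np.repeat(np.arange(4), 64 * 64 // 4).reshape(64, 64)
ids, feats = object_features(bands, prob, labels)
print(feats.shape)  # (4, 5)
```

Each row of `feats` is one object's $f_i$: $B$ spectral means plus the mean probability, matching the feature vector used for SVM classification.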

4. Fusion, Classification, and Post-processing

The classification stage fuses AttentionUNet-derived and OBIA-obtained features at the object level. From the segmented object set, a subset of objects is randomly sampled (with visually checked stratification) for manual ground-truth labeling. A linear-kernel Support Vector Machine (SVM) is trained to map object feature vectors to forest/non-forest labels. Labels are assigned by thresholding the SVM decision score at 0 (or at probability 0.5, if calibrated). Post-processing removes objects below a minimum pixel count by morphological opening, followed by boundary smoothing via a majority filter to reduce salt-and-pepper noise.
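A scikit-learn sketch of the object-level SVM step follows. The feature layout (four band means plus the DL probability) matches the paper; the synthetic data, sample counts, and labeling rule are illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical object-level features: [mean band values..., mean DL probability]
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, -1] > 0.5).astype(int)      # toy labels driven by the DL probability

svm = SVC(kernel="linear")            # linear-kernel SVM, as in the paper
svm.fit(X[:150], y[:150])             # the "manually labeled" subset of objects
preds = svm.predict(X)                # classify every object
scores = svm.decision_function(X)     # label assigned where the score exceeds 0
print(preds.shape, scores.shape)      # (200,) (200,)
```

The decision score threshold at 0 corresponds to the SVM's separating hyperplane; a calibrated probability threshold of 0.5 would require `SVC(probability=True)` or an external calibrator.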

5. Training Protocol and Hyperparameters

Each 3-band dataset (V1, V2, V3) and the 4-band set is partitioned into train/validation/test subsets. Training applies binary cross-entropy loss with the Adam optimizer, ReduceLROnPlateau learning-rate scheduling, and no class weighting (class balance is assumed). Data augmentation is limited to random horizontal/vertical flips performed on the fly. Training runs for 20 epochs on V1 and 10 epochs each on V2, V3, and the 4-band set. Hardware: Intel i7-class CPU, 32 GB RAM, NVIDIA GeForce GTX TITAN X 12 GB GPU.
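The training protocol can be sketched as a minimal PyTorch loop. The stand-in model, batch, and hyperparameter values below are illustrative, not the paper's exact settings; what is shown is the stated combination of BCE loss, Adam, ReduceLROnPlateau, and on-the-fly flips:

```python
import torch
import torch.nn as nn

# Stand-in model: the real network is the AttentionUNet described above
model = nn.Sequential(nn.Conv2d(4, 1, 3, padding=1), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)       # lr is illustrative
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min")
bce = nn.BCELoss()

x = torch.rand(2, 4, 64, 64)                              # image batch
ymask = torch.randint(0, 2, (2, 1, 64, 64)).float()       # forest/non-forest mask

for epoch in range(2):
    # On-the-fly augmentation: random horizontal flip (vertical is analogous)
    if torch.rand(1) < 0.5:
        xb, yb = x.flip(-1), ymask.flip(-1)
    else:
        xb, yb = x, ymask
    loss = bce(model(xb), yb)
    opt.zero_grad(); loss.backward(); opt.step()
    sched.step(loss.item())                               # plateau-based LR decay
```

`ReduceLROnPlateau` lowers the learning rate when the monitored loss stops improving, matching the scheduling named in the paper.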

6. Evaluation Metrics and Comparative Results

Performance is assessed using mean Intersection over Union (IoU), overall accuracy (OA), precision, recall, and F1-score, computed on randomly selected test images. AttentionUNet-OBIA achieves:

| Metric | Value |
|---|---|
| OA | 95.64 % |
| IoU | 0.9064 |
| Precision | 93.32 % |
| Recall | 96.84 % |
| F1-score | 0.9504 |

Comparative results show that AttentionUNet-OBIA surpasses traditional OBIA (OA 92.91 %, IoU 0.8992, F1 0.9365) and other DL-OBIA variants such as ResUNet-OBIA (OA 94.54 %, IoU 0.9101, F1 0.9525). Standalone AttentionUNet (no OBIA) attains OA 95.93 % and IoU 0.9168 on the 4-band test set. An example confusion matrix for 1000 test pixels:

|  | Pred Forest | Pred Non-Forest |
|---|---|---|
| True Forest | 581 | 19 |
| True Non-Forest | 27 | 373 |

This evaluates to an overall accuracy of $(581 + 373)/1000 = 95.4\,\%$.
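The headline metrics can be recomputed directly from this example confusion matrix (the values differ slightly from the full-test-set figures in the table above, since this matrix covers only 1000 sampled pixels):

```python
import numpy as np

# Rows: true forest / non-forest; columns: predicted forest / non-forest
cm = np.array([[581, 19],
               [27, 373]])
tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]

oa = (tp + tn) / cm.sum()                  # overall accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
iou = tp / (tp + fp + fn)                  # IoU of the forest class

print(round(oa, 4), round(recall, 4))      # 0.954 0.9683
```

Note the recall of 96.83 % closely matches the reported 96.84 %, which suggests the sample is representative of the full test set.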

7. Workflow Schematic

The pipeline is summarized as follows:

  1. Load images and ground-truth masks.
  2. Normalize bands to $[0, 1]$ (float32).
  3. Partition datasets into train/val/test.
  4. Construct AttentionUNet:
    • Encoder (levels 1–4): two $3 \times 3$ convolutions + ReLU per level, $2 \times 2$ max-pooling.
    • Bottleneck: two $3 \times 3$ convolutions at 1024 channels.
    • Decoder (levels 4–1): $2 \times 2$ up-convolution, attention gate, skip concatenation, two $3 \times 3$ convolutions.
    • Output: $1 \times 1$ convolution with sigmoid.
  5. Train with Adam optimizer and BCE loss, 10–20 epochs.
  6. Inference: output a pixel-wise probability map in $[0, 1]$.
  7. Segment objects with mean-shift (QGIS/OTB).
  8. For each object $O_i$, compute the mean band reflectances and mean probability; assemble the feature vector $f_i$.
  9. Train a linear SVM on the labeled subset of objects.
  10. Classify all objects as forest/non-forest.
  11. Assign the SVM label to all pixels within each object for the final raster.
  12. Post-process with small-object removal and majority filter.

A plausible implication is that the approach leverages spatial coherence, spectral consistency, and pixel-level DL inference for highly accurate, interpretable mapping, facilitated by accessible open-source software (Haque et al., 29 Dec 2025).
