AttentionUNet-OBIA: Hybrid Forest Mapping
- The paper presents a hybrid method combining AttentionUNet with OBIA to deliver both pixel-level discrimination and object-level interpretability for forest/non-forest classification.
- The methodology employs a UNet-style encoder-decoder with attention gates alongside mean-shift segmentation, enhancing feature focus and spatial coherence in high-resolution remote sensing data.
- The approach achieves state-of-the-art performance (OA 95.64%, IoU 0.9064) and outperforms traditional OBIA and other deep learning variants in identifying forest cover.
AttentionUNet-OBIA is a hybrid forest cover mapping methodology that integrates a deep learning model—AttentionUNet—with Object-Based Image Analysis (OBIA) for high-resolution multispectral remote sensing image analysis. Developed within the "ForCM" pipeline for Sentinel-2 imagery, it achieves state-of-the-art accuracy for forest/non-forest classification in the Amazon Rainforest, providing both pixel-wise discrimination and object-level interpretability with open-source tools (Haque et al., 29 Dec 2025).
1. Architecture and Attention Mechanism
The core of AttentionUNet-OBIA is a UNet-style encoder–decoder architecture augmented with attention gates (AG). The input consists of multispectral image patches with 3 or 4 bands. The model comprises four encoding stages, a bottleneck, and four decoding stages that symmetrically mirror the encoder. Each encoder level applies two consecutive convolutions with ReLU activations (optionally batch-normalized), doubling the feature channels at each downsampling (64 → 128 → 256 → 512). Spatial resolution is reduced by max-pooling (stride 2). The bottleneck contains two convolutions at 1024 channels.
Decoding consists of transposed convolutions (up-convolutions) that increase spatial resolution and halve the channel count. At each decoder level $\ell$, an attention gate receives the upsampled decoder (gating) signal $g_\ell$ and the encoder skip feature $x_\ell$. The AG formula follows Oktay et al. (2018):
\begin{align*}
f_x &= W_x\,x_\ell, \quad f_g = W_g\,g_\ell \\
\Psi_\text{int} &= \operatorname{ReLU}(f_x + f_g + b) \\
\alpha_\ell &= \sigma(\psi^{T}\,\Psi_\text{int} + b_\psi) \\
x_\ell' &= \alpha_\ell \odot x_\ell
\end{align*}
where $W_x$, $W_g$, and $\psi$ are learned $1 \times 1$ convolutions, $\sigma$ is the sigmoid, and $\odot$ denotes element-wise multiplication. The AG output $x_\ell'$ emphasizes salient features and suppresses irrelevant regions before skip-connection concatenation. The final feature map passes through a $1 \times 1$ convolution, followed by sigmoid activation, yielding a pixel-wise forest-probability map.
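The gating computation above can be sketched numerically. The following is a minimal NumPy illustration of the additive attention gate on flattened features; the function name, tensor layout (channels × pixels), and scalar biases are assumptions for demonstration, not the paper's implementation.

```python
import numpy as np

def attention_gate(x, g, Wx, Wg, psi, b=0.0, b_psi=0.0):
    """Additive attention gate (Oktay et al., 2018) on flattened features.
    x: encoder skip feature, shape (C, N pixels); g: gating signal, (C, N).
    Wx, Wg: (C_int, C) projection matrices; psi: (C_int,) vector.
    Returns the gated skip feature alpha * x."""
    f_x = Wx @ x                                   # project skip feature
    f_g = Wg @ g                                   # project gating signal
    psi_int = np.maximum(f_x + f_g + b, 0.0)       # ReLU
    # per-pixel attention coefficient alpha in (0, 1) via sigmoid
    alpha = 1.0 / (1.0 + np.exp(-(psi @ psi_int + b_psi)))
    return alpha * x                               # broadcast over channels
```

The coefficient `alpha` rescales each pixel of the skip feature before concatenation, which is how the gate suppresses regions irrelevant to the forest class.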
2. Data Preprocessing and Input Modalities
Input data are Sentinel-2 Level-2A images, pre-corrected for atmospheric effects using ESA Sen2Cor. Band selection includes both three-band (RGB) and four-band (RGB plus NIR, all at 10 m) sets. Input normalization policies are:
- Three-band images: divide by 255 and cast to float32, yielding values in [0, 1].
- Four-band images: cast to float32 and divide each band by its maximum reflectance, rescaling values to [0, 1].
No additional spectral indices such as NDVI are computed; only raw bands are provided to the model.
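The two normalization policies can be sketched as follows; the function name and the per-band maximum fallback for empty bands are illustrative assumptions.

```python
import numpy as np

def normalize_bands(img, mode="rgb"):
    """Normalize an (H, W, B) image to float32 values in [0, 1].
    'rgb': divide by 255 (3-band policy);
    'rgbnir': divide each band by its maximum reflectance (4-band policy)."""
    img = img.astype(np.float32)
    if mode == "rgb":
        return img / 255.0
    # per-band maximum; guard against division by zero for empty bands
    band_max = img.max(axis=(0, 1), keepdims=True)
    return img / np.where(band_max > 0, band_max, 1.0)
```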
3. OBIA: Segmentation and Feature Extraction
Post-prediction, OBIA is performed in QGIS (v3.34.5, Orfeo Toolbox v8.1.2) using mean-shift segmentation, chosen for robust unsupervised object delineation. The mean-shift parameters (spatial radius, range radius, and minimum object size in pixels) were selected by trial and error. Each resulting image object yields a feature vector comprising the mean reflectance of each band together with the mean AttentionUNet pixel-wise probability within the object. Optional features (not used in this work) include area, perimeter, compactness, and texture.
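Given a mean-shift label map, the per-object feature extraction reduces to masked means. A minimal sketch, assuming the segmentation is available as an integer label raster (function and variable names are illustrative):

```python
import numpy as np

def object_features(bands, prob, segments):
    """For each segment id, compute mean band reflectances plus the mean
    predicted forest probability. bands: (H, W, B); prob: (H, W);
    segments: (H, W) integer label map from mean-shift.
    Returns {segment_id: feature vector of length B + 1}."""
    feats = {}
    for sid in np.unique(segments):
        mask = segments == sid
        mean_bands = bands[mask].mean(axis=0)   # (B,) mean reflectances
        mean_prob = prob[mask].mean()           # mean DL probability
        feats[sid] = np.concatenate([mean_bands, [mean_prob]])
    return feats
```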
4. Fusion, Classification, and Post-processing
The classification stage fuses AttentionUNet-derived and OBIA-obtained features at the object level. From the segmented object set, a subset of objects is randomly sampled (with visually checked stratification) for manual ground-truth labeling. A linear-kernel Support Vector Machine (SVM) is trained to map object feature vectors to forest/non-forest labels. Labels are assigned by thresholding the SVM decision score at 0 (or the calibrated probability at 0.5). Post-processing removes small objects by morphological opening, followed by boundary smoothing via a majority filter to reduce salt-and-pepper noise.
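The decision rule and label rasterization can be sketched in a few lines. Here the SVM weights `w`, `b` are assumed already fitted (e.g. with scikit-learn's linear-kernel SVC); the function names are illustrative.

```python
import numpy as np

def classify_objects(feats, w, b):
    """Assign forest (1) / non-forest (0) labels by thresholding the
    linear SVM decision score w·f + b at 0, as in the fusion stage.
    feats: {segment_id: feature vector}; w, b: learned weights/bias."""
    return {sid: int(np.dot(w, f) + b > 0) for sid, f in feats.items()}

def labels_to_raster(labels, segments):
    """Paint each object's label onto all of its pixels for the final map."""
    out = np.zeros_like(segments)
    for sid, lab in labels.items():
        out[segments == sid] = lab
    return out
```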
5. Training Protocol and Hyperparameters
Each dataset (three 3-band variants V1, V2, V3 and one 4-band set) is partitioned into train, validation, and test subsets. Training applies binary cross-entropy loss with the Adam optimizer, ReduceLROnPlateau learning-rate scheduling, and no class weighting (class balance assumed). Data augmentation is limited to random horizontal/vertical flips performed on the fly. Training runs for 20 epochs on V1 and 10 epochs on V2, V3, and the 4-band set. Hardware employed: Intel i7-class CPU, 32 GB RAM, NVIDIA GeForce GTX TITAN X 12 GB GPU.
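The ReduceLROnPlateau behavior referenced above can be expressed in plain Python. This is an illustrative re-implementation of the scheduling logic; the `factor` and `patience` defaults here are assumptions, as the paper's exact values are not preserved in this summary.

```python
def reduce_lr_on_plateau(val_losses, lr0, factor=0.5, patience=2):
    """Multiply the learning rate by `factor` whenever validation loss
    fails to improve for `patience` consecutive epochs.
    Returns the learning rate used after each epoch."""
    lr, best, wait = lr0, float("inf"), 0
    history = []
    for loss in val_losses:
        if loss < best:          # improvement: reset the patience counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:  # plateau detected: decay and reset
                lr *= factor
                wait = 0
        history.append(lr)
    return history
```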
6. Evaluation Metrics and Comparative Results
Performance is assessed using mean Intersection over Union (IoU), overall accuracy (OA), precision, recall, and F1-score, computed on randomly selected test images. AttentionUNet-OBIA achieves:
| Metric | Value |
|---|---|
| OA | 95.64 % |
| IoU | 0.9064 |
| Precision | 93.32 % |
| Recall | 96.84 % |
| F1-score | 0.9504 |
Comparative results show that AttentionUNet-OBIA surpasses traditional OBIA (OA 92.91 %, IoU 0.8992, F1 0.9365) and exceeds other DL-OBIA variants such as ResUNet-OBIA (OA 94.54 %, IoU 0.9101, F1 0.9525) in overall accuracy, though ResUNet-OBIA posts slightly higher IoU and F1. Standalone AttentionUNet (no OBIA) attains OA 95.93 % and IoU 0.9168 on the 4-band test set. An example confusion matrix for 1000 test pixels:
| | Pred Forest | Pred Non-Forest |
|---|---|---|
| True Forest | 581 | 19 |
| True Non-Forest | 27 | 373 |
This evaluates to an overall accuracy of (581 + 373)/1000 = 95.4 %.
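The metric definitions can be checked directly against the confusion-matrix counts above (forest taken as the positive class); the function name is illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute OA, precision, recall, F1, and IoU from binary
    confusion-matrix counts (positive class = forest)."""
    oa = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)      # intersection over union for the positive class
    return oa, precision, recall, f1, iou
```

Applied to the example matrix (tp = 581, fn = 19, fp = 27, tn = 373), this yields OA = 0.954.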
7. Workflow Schematic
The pipeline is summarized as follows:
- Load images and ground-truth masks.
- Normalize bands to [0, 1] (float32).
- Partition datasets into train/val/test.
- Construct AttentionUNet:
- Encoder (four levels): two convolutions + ReLU per level, max-pooling downsampling.
- Bottleneck: two convolutions at 1024 channels.
- Decoder (four levels): up-convolution, attention-gated skip concatenation, two convolutions.
- Output: 1 × 1 convolution with sigmoid.
- Train with Adam optimizer and BCE loss, 10–20 epochs.
- Inference: output pixel-wise forest-probability map.
- Segment objects with mean-shift (QGIS/OTB).
- For each object, compute mean band reflectances and mean predicted probability; assemble the feature vector.
- Train a linear SVM on the labeled objects.
- Classify all objects as forest/non-forest.
- Assign each object's SVM label to all of its pixels for the final raster.
- Post-process with small-object removal and majority filter.
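The final majority-filter step can be sketched as a simple 3 × 3 vote over the binary raster. The window size and edge-replication padding here are assumptions for illustration; the paper's filter parameters are not preserved in this summary.

```python
import numpy as np

def majority_filter(mask):
    """3x3 majority vote over a binary mask to suppress salt-and-pepper
    noise; edges are padded by replication."""
    padded = np.pad(mask, 1, mode="edge")
    H, W = mask.shape
    out = np.zeros_like(mask)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + 3, j:j + 3]
            out[i, j] = 1 if window.sum() >= 5 else 0  # majority of 9 cells
    return out
```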
A plausible implication is that the approach owes its accuracy to combining spatial coherence and spectral consistency at the object level with pixel-level DL inference, yielding interpretable maps built entirely with accessible open-source software (Haque et al., 29 Dec 2025).