
Building Disaster Damage Assessment in Satellite Imagery with Multi-Temporal Fusion (2004.05525v1)

Published 12 Apr 2020 in cs.CV

Abstract: Automatic change detection and disaster damage assessment are currently procedures requiring a huge amount of labor and manual work by satellite imagery analysts. In the occurrences of natural disasters, timely change detection can save lives. In this work, we report findings on problem framing, data processing and training procedures which are specifically helpful for the task of building damage assessment using the newly released xBD dataset. Our insights lead to substantial improvement over the xBD baseline models, and we score among top results on the xView2 challenge leaderboard. We release our code used for the competition.

Citations (83)

Summary

  • The paper's main contribution is a multi-temporal fusion CNN that combines pre- and post-disaster imagery to accurately predict building damage levels.
  • It leverages a shared-weight ResNet50 within a Mask R-CNN/FPN framework, achieving an overall F1 score of 0.738 on the xView2 test set.
  • The study employs practical techniques like 512x512 image cropping and class-specific loss weighting to overcome data imbalance and enhance model performance.

The paper "Building Disaster Damage Assessment in Satellite Imagery with Multi-Temporal Fusion" (2004.05525) addresses the critical need for rapid and accurate assessment of building damage following natural disasters using satellite imagery. Traditionally, this task is manual and time-consuming, creating a bottleneck in emergency response. The research focuses on leveraging deep learning techniques, specifically Convolutional Neural Networks (CNNs), to automate and accelerate this process, building upon the newly released xBD dataset.

The core problem is framed as a multi-class pixel classification task: given pre-disaster and post-disaster satellite images of an area, predict the damage level for every pixel belonging to a building. The damage levels are typically categorized (e.g., undamaged, minor damage, major damage, destroyed). The paper's key practical contribution lies in identifying specific architectural choices, data processing techniques, and training procedures that significantly improve performance over baseline methods on this task.

A central insight is the importance of leveraging both pre- and post-disaster imagery effectively. Instead of feeding a combined image (e.g., concatenated channels or difference images) or using separate networks for localization and damage, the authors propose a multi-temporal fusion approach. The pre-disaster and post-disaster images are fed independently through a shared-weight CNN backbone (specifically, a ResNet50 backbone within a Mask R-CNN with FPN architecture). The feature maps extracted from both images are then concatenated before being passed to the final semantic segmentation head, which predicts the damage class for each pixel. This allows the network to learn representations from both temporal states and combine them for a final decision.

For implementation, the authors utilized a Mask R-CNN backbone augmented with a Feature Pyramid Network (FPN) and a semantic segmentation head, pretrained on ImageNet. They found that using a single network for both building localization (identifying pixels belonging to a building) and damage assessment (classifying the damage level) was more effective than the baseline's approach of using separate models. The network outputs a 5-class prediction for each pixel: 'no building' (0), 'undamaged' (1), 'minor damage' (2), 'major damage' (3), 'destroyed' (4).
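
The following is a minimal PyTorch sketch of this shared-weight fusion pattern. It substitutes a plain torchvision ResNet50 feature extractor and a small convolutional head for the paper's Mask R-CNN/FPN components, so the class name, channel sizes, and head design are illustrative assumptions rather than the authors' exact Detectron2 model:

```python
import torch
import torch.nn as nn
import torchvision


class MultiTemporalFusionNet(nn.Module):
    """Shared-weight backbone over pre/post images, fused by feature concatenation."""

    def __init__(self, num_classes: int = 5):
        super().__init__()
        # One ResNet50 backbone applied to both temporal inputs (shared weights).
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        # Segmentation head over concatenated pre/post features (2 * 2048 channels).
        self.head = nn.Sequential(
            nn.Conv2d(4096, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, pre_img: torch.Tensor, post_img: torch.Tensor) -> torch.Tensor:
        f_pre = self.backbone(pre_img)    # features from the pre-disaster image
        f_post = self.backbone(post_img)  # same weights, post-disaster image
        fused = torch.cat([f_pre, f_post], dim=1)
        logits = self.head(fused)
        # Upsample back to input resolution for per-pixel damage prediction.
        return nn.functional.interpolate(
            logits, size=pre_img.shape[-2:], mode="bilinear", align_corners=False
        )


# Classes: 0 no building, 1 undamaged, 2 minor, 3 major, 4 destroyed.
model = MultiTemporalFusionNet(num_classes=5)
pre = torch.randn(1, 3, 512, 512)
post = torch.randn(1, 3, 512, 512)
damage_logits = model(pre, post)  # shape (1, 5, 512, 512)
```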

Practical data processing steps are also crucial for achieving high performance:

  1. Image Cropping: The original xBD images are 1024x1024 pixels, but buildings are often small at this resolution. The authors found that training and inference on 512x512 crops (specifically, the four quadrants of the original image) improved results, particularly for building localization; a cropping sketch follows this list.
  2. Multi-Temporal Input: As described above, feeding pre and post images separately through a shared backbone and concatenating features significantly outperformed simpler methods like channel concatenation or image subtraction.
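
A simple illustration of the quadrant cropping described in item 1; the function name and exact tiling are ours, assuming channel-last 1024x1024 arrays:

```python
import numpy as np


def quadrant_crops(image: np.ndarray, crop: int = 512) -> list[np.ndarray]:
    """Split a 1024x1024 (H, W, C) image into its four 512x512 quadrants."""
    h, w = image.shape[:2]
    assert h >= 2 * crop and w >= 2 * crop, "expects at least a 1024x1024 tile"
    return [
        image[:crop, :crop],                  # top-left
        image[:crop, crop:2 * crop],          # top-right
        image[crop:2 * crop, :crop],          # bottom-left
        image[crop:2 * crop, crop:2 * crop],  # bottom-right
    ]
```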

The loss function used is standard cross-entropy, but with a critical modification: class-specific weighting. Since the "no building" class dominates the image area, and the "undamaged" class dominates the building instances in the dataset, standard cross-entropy would heavily favor these majority classes. To mitigate this extreme class imbalance and improve performance on the less frequent damage classes (minor, major, destroyed), the authors weighted the loss for each class inversely proportional to its occurrence frequency in the training data. This encourages the model to pay more attention to correctly classifying the rarer but more critical damage levels.
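
A sketch of this inverse-frequency weighting with PyTorch's cross-entropy loss; the pixel counts below are made-up placeholders, since the summary does not give the actual class frequencies or weight values:

```python
import torch
import torch.nn as nn

# Per-class pixel counts over the training set (illustrative values only).
# Order: no building, undamaged, minor, major, destroyed.
class_pixel_counts = torch.tensor([9.0e8, 5.0e7, 6.0e6, 4.0e6, 3.0e6])

# Weight each class inversely proportional to its frequency, then rescale
# so the weights average to 1.
freq = class_pixel_counts / class_pixel_counts.sum()
weights = 1.0 / freq
weights = weights / weights.sum() * len(weights)

criterion = nn.CrossEntropyLoss(weight=weights)

# logits: (N, 5, H, W) from the fusion model; targets: (N, H, W) with labels 0-4.
logits = torch.randn(2, 5, 64, 64)
targets = torch.randint(0, 5, (2, 64, 64))
loss = criterion(logits, targets)
```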

The implementation was based on the Detectron2 library, which provides PyTorch implementations of various detection and segmentation models such as Mask R-CNN. Training the model requires significant computational resources: the authors reported using a machine with four NVIDIA GTX 1080 Ti GPUs, with training taking around 6 hours to converge. A practical challenge encountered was the tendency of the network to collapse and predict only the 'no building' label if trained for too long, even with class weighting, underscoring the difficulty posed by the data imbalance.
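
For orientation, a hedged sketch of the kind of Detectron2 boilerplate such a setup involves; the dataset names are hypothetical, the solver values are illustrative, and the actual multi-temporal fusion model is defined in the authors' released repository rather than by this stock config:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

# Start from a standard ResNet50 Mask R-CNN + FPN config.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))

# Hypothetical dataset names: assumes the xBD crops were registered beforehand
# via detectron2.data.DatasetCatalog / MetadataCatalog.
cfg.DATASETS.TRAIN = ("xbd_train_crops",)
cfg.DATASETS.TEST = ("xbd_val_crops",)

cfg.SOLVER.IMS_PER_BATCH = 8   # illustrative; spread across the 4 GPUs reported
cfg.SOLVER.MAX_ITER = 90000    # illustrative stopping point (see the collapse caveat above)
cfg.OUTPUT_DIR = "./output_xview2"

# With the custom fusion meta-architecture registered (as in the released code),
# training then follows the usual Detectron2 loop.
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```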

The ablation study presented in the paper demonstrates the impact of these choices:

  • Switching from instance segmentation (Mask R-CNN with instance head) to semantic segmentation improved localization F1, as instance bounding boxes were poor for small buildings.
  • Using 512x512 crops further boosted localization F1.
  • The multi-temporal fusion approach (feeding pre/post through shared backbone, concatenating features, joint prediction) yielded substantial gains across overall, localization, and damage F1 scores.
  • Adding class-specific loss weighting provided a further improvement in damage and overall F1, despite a slight drop in localization F1 (a trade-off deemed acceptable due to the weighting prioritizing damage).

The final model achieved an overall F1 score of 0.738 on the xView2 competition test set, significantly outperforming the reported xBD baseline (0.265 overall F1). The gain is especially pronounced in damage F1: 0.697, versus 0.414 for the xBD semantic-segmentation baseline (and 0.265 for the overall baseline across all damage categories).

For real-world application, this research suggests a practical pipeline (a code sketch of the crop-predict-stitch core, steps 2-6, follows the list):

  1. Acquire co-registered pre-disaster and post-disaster satellite images of the affected area.
  2. Pre-process images: Potentially split into smaller crops if necessary, ensure consistent resolution and alignment.
  3. Load the trained multi-temporal fusion model (e.g., based on Mask R-CNN/FPN with shared backbone and semantic head).
  4. Feed the pre and post image crops through the model.
  5. Obtain pixel-wise damage predictions.
  6. Post-process results: Reconstruct full-image damage maps from crops, potentially filter small or spurious predictions, overlay predictions on original imagery or GIS data.
  7. Integrate with visualization and reporting tools for use by disaster response teams.
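
A rough sketch of that crop-predict-stitch core, assuming a fusion model with the `model(pre, post)` interface sketched earlier and co-registered 1024x1024 channel-last images; the helper name is ours:

```python
import numpy as np
import torch


def predict_damage_map(model, pre_img: np.ndarray, post_img: np.ndarray,
                       crop: int = 512) -> np.ndarray:
    """Run the fusion model on 512x512 quadrants and stitch a full damage map."""
    # Assumes pre_img and post_img are aligned (1024, 1024, 3) uint8 arrays.
    damage_map = np.zeros(pre_img.shape[:2], dtype=np.uint8)
    offsets = [(0, 0), (0, crop), (crop, 0), (crop, crop)]
    model.eval()
    with torch.no_grad():
        for y, x in offsets:
            pre = torch.from_numpy(
                pre_img[y:y + crop, x:x + crop].transpose(2, 0, 1)
            ).float().unsqueeze(0) / 255.0
            post = torch.from_numpy(
                post_img[y:y + crop, x:x + crop].transpose(2, 0, 1)
            ).float().unsqueeze(0) / 255.0
            logits = model(pre, post)                       # (1, 5, crop, crop)
            pred = logits.argmax(dim=1).squeeze(0).numpy()  # per-pixel labels 0-4
            damage_map[y:y + crop, x:x + crop] = pred.astype(np.uint8)
    return damage_map  # 0 no building, 1-4 increasing damage severity
```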

The paper provides the code used for the competition at https://github.com/ethanweber/xview2, allowing practitioners to replicate and build upon their successful implementation strategies. Future work could explore alternative fusion methods, integrating disaster type information, or using more advanced loss functions like ordinal cross-entropy.