M3PD Dataset: Multi-Hazard Vision–Language Benchmark
- The M3PD dataset is a multi-hazard, multi-sensor, and multi-task corpus that benchmarks vision–language models for disaster damage assessment.
- It pairs bi-temporal optical and SAR imagery with careful geometric and radiometric alignment, covering diverse disaster scenarios.
- Annotation across nine tasks yields quantifiable metrics (accuracy, mIoU, captioning scores) critical for advancing AI-assisted disaster response.
The M3PD (DisasterM3) dataset is a multi-hazard, multi-sensor, and multi-task remote sensing vision–language corpus, meticulously curated to advance the benchmarking and fine-tuning of large vision–language models (VLMs) for disaster damage assessment and response. Comprising 26,988 bi-temporal satellite image pairs and 123,010 instruction–response pairs spanning five continents, M3PD establishes a comprehensive resource for evaluating AI capabilities in recognition, estimation, localization, reasoning, and report generation across diverse disaster scenarios (Wang et al., 27 May 2025).
1. Dataset Scope and Format
M3PD includes bi-temporal pre- and post-disaster satellite scenes, systematically collected from 36 historically impactful events distributed across Asia, Africa, Europe, North America, and South America. Each scene features aligned image pairs, enabling comparison of disaster impacts. Hazard types are categorized into 10 representative classes: landslide, earthquake, hurricane, flood, tsunami, volcanic eruption, wildfire, tornado, explosion, and conflict. The dataset sources 26 events from xBD/BRIGHT and 10 novel events via Maxar Open Data.
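For illustration, one bi-temporal scene pair might be represented as follows. This is a minimal sketch only: the field names, event identifier, and file paths are hypothetical and do not reflect the dataset's published schema.

```python
from dataclasses import dataclass
from typing import Literal

# The 10 hazard classes enumerated above.
HAZARD_TYPES = [
    "landslide", "earthquake", "hurricane", "flood", "tsunami",
    "volcanic eruption", "wildfire", "tornado", "explosion", "conflict",
]

@dataclass
class ScenePair:
    """Hypothetical record layout for one aligned pre/post-disaster scene pair."""
    event_id: str                           # one of the 36 historical events (illustrative id)
    hazard: str                             # one of the 10 hazard classes above
    pre_image_path: str                     # pre-disaster optical scene
    post_image_path: str                    # post-disaster scene (optical or SAR)
    post_sensor: Literal["optical", "sar"]  # modality of the post-event image
    continent: str                          # Asia, Africa, Europe, North or South America

pair = ScenePair(
    event_id="example_earthquake_event",
    hazard="earthquake",
    pre_image_path="scenes/pre/000123.tif",
    post_image_path="scenes/post/000123.tif",
    post_sensor="sar",
    continent="Asia",
)
assert pair.hazard in HAZARD_TYPES
```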
The instruction–response corpus is organized into nine tasks mapped to five high-level capability groups:
- Recognition (Disaster Scene Recognition, Disaster Type Recognition, Bearing-Body Recognition)
- Counting & Estimation (Damaged Building Counting, Damaged Road Estimation)
- Localization (Referring Segmentation)
- Reasoning (Object Relational Reasoning)
- Report Generation (Disaster Captioning, Restoration Advice)
Table 1 summarizes composition and coverage:
| Aspect | Value | Notes |
|---|---|---|
| Total bi-temporal image pairs | 26,988 | Pre/post-disaster, 5 continents |
| Historical disaster events | 36 | 10 hazard categories |
| Instruction–response pairs | 123,010 | 9 tasks, multi-capability |
| Training (Instruct) split | 17,190 optical / 3,798 SAR scenes | 92,968 Q–A pairs |
| Benchmark (Bench) split | 5,024 optical / 976 SAR scenes | 30,042 Q–A pairs |
2. Multi-Sensor Integration and Image Processing
A core feature of M3PD is its multi-sensor design, pairing high-resolution optical imagery (WorldView series, 0.8 m GSD) with synthetic aperture radar (SAR) amplitude data (Capella and Umbra, VV/HH dual-polarization). Optical sensors provide detailed land-cover information but are highly susceptible to occlusion by clouds or smoke; SAR penetrates such obstructions, preserving access to disaster scenes under extreme conditions.
All images are preprocessed for geometric and radiometric alignment (see the code sketch after this list):
- Optical: Resampled to 0.8 m, ortho-corrected.
- SAR: Terrain-corrected, intensity range stretched to [0,255], resampled to 0.8 m.
- Spatial alignment by georeferencing achieves sub-pixel co-registration in both optical–optical and optical–SAR pairings.
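The paper does not detail the GIS tooling used for ortho- and terrain-correction; the sketch below illustrates only the SAR intensity stretch to [0, 255] and the resampling to 0.8 m GSD. The percentile-clip stretch is an assumption (the source states only that intensities are stretched to [0, 255]), and NumPy/SciPy stand in for a full geospatial stack.

```python
import numpy as np
from scipy.ndimage import zoom

def stretch_sar_intensity(amp: np.ndarray, low_pct: float = 2.0, high_pct: float = 98.0) -> np.ndarray:
    """Clip SAR amplitude to a percentile window (assumed) and stretch to [0, 255]."""
    lo, hi = np.percentile(amp, [low_pct, high_pct])
    scaled = np.clip((amp - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    return (scaled * 255.0).astype(np.uint8)

def resample_to_gsd(img: np.ndarray, src_gsd: float, dst_gsd: float = 0.8) -> np.ndarray:
    """Resample a 2D image by the scale factor src_gsd / dst_gsd (bilinear, order=1)."""
    factor = src_gsd / dst_gsd
    return zoom(img, factor, order=1)

# Toy amplitude image standing in for a terrain-corrected SAR tile at 0.5 m GSD.
sar = np.random.gamma(shape=1.0, scale=50.0, size=(512, 512)).astype(np.float32)
sar_u8 = stretch_sar_intensity(sar)
sar_resampled = resample_to_gsd(sar_u8.astype(np.float32), src_gsd=0.5)  # 0.5 m -> 0.8 m
```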
This cross-sensor pairing is instrumental for evaluating VLMs' ability to bridge semantic and distributional differences between sensor modalities, a recognized obstacle evidenced by the 30–40% performance drop observed on optical–SAR tasks relative to optical–optical pairs (Wang et al., 27 May 2025).
3. Task Taxonomy and Instruction–Response Generation
M3PD supports rigorous benchmarking over nine tasks, described as follows:
- Disaster Scene Recognition (DSR): Multiple-choice identification of disaster type in a scene.
- Disaster Type Recognition (DTR): Classification according to official disaster taxonomy.
- Bearing-Body Recognition (BBR): Identification of key objects (12 man-made/natural classes).
- Damaged Building Counting (DBC): Enumeration of intact, damaged, and destroyed buildings.
- Damaged Road Estimation (DRE): Segmentation and measurement of flooded vs. debris-covered road segments.
- Referring Segmentation (RS): Generation of pixel-level masks for user-specified objects.
- Object Relational Reasoning (ORR): Free-text spatial relation description among disaster-impacted entities.
- Disaster Captioning (DC): Long-form scene summaries integrating observed damage.
- Restoration Advice (RA): Generation of actionable, prioritized recommendations for immediate and longer-term response.
The instruction–response annotation process involves a hybrid LLM–expert pipeline:
- LLMs (principally GPT-4o) produce instruction variants and plausible distractors for Q–A and reasoning tasks.
- Expert annotators curate all ground-truth labels, count-based responses, and refine report/advice texts with reference to international protocols (e.g., UNOSAT, FEMA).
- For SAR imagery, annotation is anchored using co-registered optical data.
Instruction–response pairs span the complexity spectrum, from recognition (e.g., “Select the primary disaster type...”) to composite, scenario-driven prompts demanding multi-step reasoning and report synthesis.
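To make the task formats concrete, the snippet below sketches two hypothetical instruction–response records, one recognition and one counting. The JSON field names and wording are illustrative and mirror the task definitions above, not the released annotation schema.

```python
import json

# Illustrative records for two of the nine tasks (field names are hypothetical).
records = [
    {
        "task": "DSR",  # Disaster Scene Recognition: multiple-choice
        "scene_id": "000123",
        "sensors": ["optical_pre", "sar_post"],
        "instruction": "Select the primary disaster type visible in the post-event scene.",
        "options": ["flood", "wildfire", "earthquake", "no disaster"],
        "response": "earthquake",
    },
    {
        "task": "DBC",  # Damaged Building Counting: numeric enumeration
        "scene_id": "000123",
        "sensors": ["optical_pre", "optical_post"],
        "instruction": "Count the destroyed buildings in the post-disaster image.",
        "response": "17",
    },
]
print(json.dumps(records, indent=2))
```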
4. Evaluation Protocols and Metrics
Performance evaluation is tightly aligned with each task’s objective:
- Multiple-choice and Counting Tasks (DSR, DTR, BBR, DBC, DRE, ORR): Accuracy (%)
- Segmentation (RS): Class-averaged and aggregated mIoU (mean Intersection over Union)
- Captioning (DC): LLM-assisted ratings of Damage Assessment Precision (DAP), Damage Detail Recall (DDR), and Factual Correctness (FC), each scored on a 5-point scale
- Restoration Advice (RA): LLM-assisted multi-aspect scores for Recovery Necessity (RN), Strategic Completeness (SC), and Action Priority Precision (APP)
Sample formulae include:

$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}, \qquad \text{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$

where $P_c$ and $G_c$ denote the predicted and ground-truth pixel sets for class $c$. Standard cross-entropy loss is used for classification; Mask2Former losses support mask prediction in segmentation pipelines.
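The score-based metrics can be reproduced directly from predictions. The minimal sketch below (NumPy only, with hypothetical function names) computes exact-match accuracy and class-averaged mIoU as defined above; the LLM-assisted rubric scores (DAP, DDR, FC, RN, SC, APP) depend on an external judge model and are not reproduced here.

```python
import numpy as np

def accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of exactly matching answers (multiple-choice and counting tasks)."""
    return float((preds == labels).mean())

def mean_iou(pred_mask: np.ndarray, gt_mask: np.ndarray, num_classes: int) -> float:
    """Class-averaged intersection-over-union for referring-segmentation masks."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred_mask == c, gt_mask == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent in both prediction and ground truth
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy check on random masks.
rng = np.random.default_rng(0)
pred = rng.integers(0, 3, size=(64, 64))
gt = rng.integers(0, 3, size=(64, 64))
print(accuracy(pred.ravel(), gt.ravel()), mean_iou(pred, gt, num_classes=3))
```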
5. Experimental Benchmarks and Model Performance
The M3PD benchmark encompasses 14 advanced VLMs, including LLaVA, InternVL3, Qwen2.5-VL, GeoChat, TeoChat, EarthDial, HyperSeg, PSALM, LISA, GeoPixel, GPT-4o, and GPT-4.1. Fine-tuning experiments are performed on four representative architectures: Qwen2.5-VL-7B, InternVL3-8B, LISA, and PSALM.
Key findings reveal a substantial gap between generalist VLMs and disaster-trained models:
- Baseline (Qwen2.5-VL-7B): Avg Acc = 31.2%, DBC = 34.2%, DRE = 29.3%, ORR = 23.9%
- Fine-tuned (Qwen2.5-VL-7B): Avg Acc = 40.4% (+9.2), DBC = 34.3% (+0.1), DRE = 29.4% (+0.1), ORR = 36.2% (+12.3)
- Segmentation (PSALM-1.3B): mIoU base = 9.7%, fine-tuned = 50.5% (+40.8)
Captioning and Restoration Advice tasks exhibit notable improvements after fine-tuning (average captioning score +2.1; average advice score +1.3 points). Cross-sensor generalization, a critical capability for real-world applicability, improves by 8–12% in QA accuracy and 15–28% mIoU for referring segmentation.
A pronounced performance drop is observed on complex urban events (earthquake: ~25% QA accuracy) relative to rural, single-object scenarios (landslide: ~50% QA accuracy). These results point to persistent challenges in reasoning and object counting under cross-sensor and complex-scene conditions.
6. Data Access, Quality Control, and Use Cases
M3PD enforces quality control by combining LLM-based generation with multi-round expert verification. The annotation pipeline, which integrates prompt variability, expert labels, and output verification, is systematically partitioned into Instruct (fine-tuning) and Bench (held-out evaluation) splits.
Use cases for the dataset span multiple lines of disaster informatics research:
- Evaluation and improvement of VLM performance in adverse, data-scarce settings
- Cross-hazard and cross-sensor robustness testing
- Benchmarking composite visual–language reasoning in damage assessment, object counting, and structured report generation
- Advancing semi- and fully-automated AI systems for rapid disaster response and situational reporting
M3PD’s publicly available structure and comprehensive benchmarks directly address the lack of disaster-specific, multi-modal vision–language corpora.
7. Significance and Outlook
By establishing a demanding, multi-faceted benchmark rooted in real-world disaster response requirements, M3PD enables systematic evaluation and domain adaptation of VLMs. The dataset’s rich annotation supports granular assessment of model abilities, with fine-tuned VLMs achieving up to +10.4% QA accuracy, +2.1 in report quality, and +40.8% segmentation mIoU improvements. Robust cross-disaster and cross-sensor generalization emerges from targeted fine-tuning, offering concrete pathways to improved decision support in humanitarian and emergency management contexts. The scientific community can leverage M3PD as a foundational resource for developing next-generation AI systems capable of autonomous, scalable disaster assessment at the global level (Wang et al., 27 May 2025).