
UMD Dataset: Underground Mine Disaster Insights

Updated 16 December 2025
  • The Underground Mine Disaster (UMD) Dataset is a vision-language corpus containing 300 RGB images and 1,500 expert-authored captions detailing post-disaster underground mining scenes.
  • It provides balanced annotations across six incident types, enabling advanced hazard detection, emergency response planning, and anomaly identification under challenging visual conditions.
  • Pre-processed with fixed dimensions, color normalization, and uniform captioning protocols, the dataset supports rigorous benchmarking using metrics like CIDEr and SPICE.

The Underground Mine Disaster (UMD) Dataset is a domain-specific vision-language corpus constructed to enable multimodal reasoning and robust situational awareness in extreme underground mining disaster conditions. It is the first curated image-caption dataset focusing explicitly on the complexities of post-disaster subterranean environments, targeting the development and benchmarking of captioning models and downstream applications such as hazard localization, automated response planning, and anomaly detection. UMD’s suite of annotated RGB images, reflecting real-world and simulated underground disasters, addresses the domain’s acute visual degradation, occlusion, and illumination challenges, supporting novel approaches for vision-language alignment and caption generation under operational constraints (Jewel et al., 9 Dec 2025).

1. Dataset Composition and Structure

UMD comprises 300 de-duplicated RGB images, each depicting a post-disaster scenario in an underground mining context, accompanied by five expert-authored captions per image on average, for a total of 1,500 captions. Scene coverage is balanced across six major incident types: structural collapse, flooding, equipment damage, gas leaks, fire/smoke, and rescue operations, with approximately 50 images per category. Caption counts are tightly controlled at 5.0 ± 0.3 per image, and every image carries between four and six captions, satisfying the uniformity required for supervised training.
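
To illustrate these composition statistics, the following minimal sketch checks caption counts and class balance over a hypothetical annotation structure (a Python dict mapping image IDs to an incident type and caption list); the field names and file layout are assumptions for illustration, not the dataset's actual schema.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical annotation structure; keys and field names are illustrative,
# not the dataset's released format.
annotations = {
    "img_0001": {
        "incident": "flooding",
        "captions": [
            "Rising floodwater submerges a haul truck near a support pillar.",
            "Muddy water covers the tunnel floor beside damaged equipment.",
            "Standing floodwater reaches the axles of an abandoned loader.",
            "A flooded drift with dim lamp light reflecting off the water.",
            "Water inflow blocks the main access tunnel near roof bolts.",
        ],
    },
    # ... remaining 299 images ...
}

counts = [len(v["captions"]) for v in annotations.values()]
per_class = Counter(v["incident"] for v in annotations.values())

print(f"images: {len(annotations)}")  # expected: 300 for the full dataset
print(f"captions/image: {mean(counts):.1f} ± {stdev(counts) if len(counts) > 1 else 0:.1f}")  # expected: 5.0 ± 0.3
print(f"caption range ok: {all(4 <= c <= 6 for c in counts)}")  # expected: True
print(f"per-class counts: {dict(per_class)}")  # expected: roughly 50 per incident type
```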

Each image is processed to fixed dimensions (224 × 224 pixels) with color normalization (histogram-matched to a reference underground scan), border trimming, and conversion to standard RGB arrays. Source imagery includes JPEG and PNG files that are unified during post-processing for compatibility with CLIP-ViT-based encoders. Environmental fidelity is maintained, with images capturing extremes ranging from near-total darkness (with limited lamp illumination) through smoke- and dust-laden air to standing floodwaters and mixed-illumination regimes (Jewel et al., 9 Dec 2025).
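
As a quick illustration of that encoder compatibility, the sketch below loads a publicly available CLIP ViT checkpoint via Hugging Face transformers and encodes a single pre-processed RGB image; the checkpoint name and file path are assumptions, not choices specified by the dataset authors.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any CLIP-ViT checkpoint works here; this particular one is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical path to a pre-processed 224 x 224 UMD image.
image = Image.open("umd/images/img_0001.png").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
features = model.get_image_features(**inputs)  # shape: (1, 512) for this checkpoint
print(features.shape)
```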

2. Data Acquisition and Pre-processing

Imagery for UMD is aggregated from three principal channels:

  • Real disaster photographs acquired through targeted web searches on underground mining events.
  • Cinematic recreations with high photorealism simulating underground catastrophes (notably, disaster films).
  • Extracts from public disaster datasets (Image-Mine, DNICC19k, Incidents1M), filtered for relevant subterranean content.

Pre-processing encompasses:

  1. Near-duplicate removal.
  2. Edge trimming to remove extraneous border elements.
  3. Histogram-based color normalization.
  4. Resolution down-sampling to 224 × 224 pixels.

This pipeline eliminates trivial domain shifts between real and synthetic or cinematic imagery, providing a consistent vision model input space even under severe environmental degradation, such as airborne dust occlusion, non-uniform lighting, and cluttered, confined geometries (Jewel et al., 9 Dec 2025).
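
A minimal sketch of such a pipeline is shown below, using Pillow, imagehash, and scikit-image; the hash threshold, trim margin, and reference-scan path are assumptions for illustration rather than the authors' published settings.

```python
import numpy as np
from PIL import Image
import imagehash
from skimage.exposure import match_histograms

# Assumed reference scan used as the histogram-matching target.
REFERENCE = np.asarray(Image.open("reference_underground_scan.png").convert("RGB"))
seen_hashes, processed = [], []

def near_duplicate(img, threshold=6):
    """Flag images whose perceptual hash is within `threshold` bits of one already kept."""
    h = imagehash.phash(img)
    if any(h - prev <= threshold for prev in seen_hashes):
        return True
    seen_hashes.append(h)
    return False

def preprocess(path, trim=8):
    img = Image.open(path).convert("RGB")                 # unify JPEG/PNG into RGB
    if near_duplicate(img):
        return None                                       # 1. near-duplicate removal
    w, h = img.size
    img = img.crop((trim, trim, w - trim, h - trim))      # 2. edge trimming
    arr = match_histograms(np.asarray(img), REFERENCE, channel_axis=-1)  # 3. color normalization
    img = Image.fromarray(arr.astype(np.uint8)).resize((224, 224))       # 4. down-sample to 224 x 224
    return np.asarray(img)

for path in ["raw/collapse_001.jpg", "raw/flood_014.png"]:  # hypothetical inputs
    out = preprocess(path)
    if out is not None:
        processed.append(out)
```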

3. Caption Annotation, Quality Assurance, and Protocols

Captions are authored by annotators trained in mining safety and computer vision. Annotation protocols mandate:

  • Explicit mention of primary hazards (“collapsed beam”, “rising floodwater”, etc.).
  • Identification of key structures or objects (e.g., pillars, equipment, lighting arrays).
  • Noting observable indicators of structural failure or human presence.
  • Consistent sentence lengths (8–15 words) and accurate, domain-specific terminology.

A two-tiered QA process is implemented: all captions undergo initial verification by a lead annotator for semantic richness and redundancy filtering, followed by random spot-checks (a 20% sample) for guideline adherence. Estimated inter-reviewer consistency (Cohen’s κ ≈ 0.78 on a 50-image pilot) suggests substantial agreement on key hazard/object cues, though explicit inter-annotator calculations are not reported (Jewel et al., 9 Dec 2025).
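
For reference, the κ statistic cited above is computed from paired reviewer judgments as κ = (p_o − p_e)/(1 − p_e). The sketch below, using scikit-learn, shows the calculation on a hypothetical set of per-image "guideline met / not met" labels from two reviewers; the labels are illustrative, not data from the paper.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical spot-check labels from two reviewers over ten images:
# 1 = caption set meets guidelines, 0 = does not.
reviewer_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
reviewer_b = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values in roughly 0.61-0.80 indicate substantial agreement
```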

4. Benchmark Splits, Accessibility, and Comparative Performance

The dataset is partitioned into stratified splits: 180 images for training (900 captions), 60 for validation (300 captions), and 60 for testing (300 captions). Each disaster type is proportionally represented in every split to maintain balanced evaluation. UMD is released under a CC-BY-NC-4.0 license, which restricts commercial use, and is hosted at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer (Jewel et al., 9 Dec 2025).
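
One way to reproduce such a 180/60/60 stratified partition, assuming a list of image IDs and their incident labels, is with scikit-learn; the ID and label lists below are placeholders, not the authors' released split files.

```python
from sklearn.model_selection import train_test_split

# Placeholder lists: 300 image IDs and their incident-type labels (6 classes x ~50).
incidents = ["collapse", "flooding", "equipment", "gas", "fire", "rescue"]
image_ids = [f"img_{i:04d}" for i in range(300)]
labels = [incidents[i % 6] for i in range(300)]

# Carve out the 180-image training set, then split the remainder 60/60 into
# validation and test, stratifying on incident type at each step.
train_ids, rest_ids, train_y, rest_y = train_test_split(
    image_ids, labels, train_size=180, stratify=labels, random_state=0)
val_ids, test_ids = train_test_split(
    rest_ids, test_size=60, stratify=rest_y, random_state=0)

print(len(train_ids), len(val_ids), len(test_ids))  # 180 60 60
```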

Captioning performance is benchmarked primarily via CIDEr and SPICE:

  • CIDEr quantifies human-reference consensus using TF-IDF weighted n-gram similarity:

$$\mathrm{CIDEr}(c,\{r_j\}) = \frac{1}{m}\sum_{j=1}^{m} \frac{\sum_{\omega\in\Omega} g_c(\omega)\, g_{r_j}(\omega)}{\|g_c\|\,\|g_{r_j}\|},$$

where $g_c(\omega)$ is the TF-IDF weight of n-gram $\omega$ in candidate caption $c$, $g_{r_j}(\omega)$ is the corresponding weight in reference $r_j$, and $\Omega$ is the n-gram set (a toy unigram implementation of this formula appears after this list).

  • SPICE analyses scene-graph tuples (objects, attributes, relationships), calculating F-score over matching sets.
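
The following is a minimal, self-contained sketch of the displayed CIDEr formula restricted to unigrams, with IDF estimated over the reference captions; full CIDEr additionally averages over n-gram orders 1-4 and (in CIDEr-D) applies a length penalty, so this illustrates the cosine-consensus idea rather than reproducing the official scorer.

```python
import math
from collections import Counter

def tfidf(tokens, idf):
    """TF-IDF vector (as a dict) over unigrams for one caption."""
    tf = Counter(tokens)
    n = len(tokens)
    return {w: (c / n) * idf[w] for w, c in tf.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_unigram(candidate, references):
    """Average TF-IDF cosine similarity between a candidate and each reference."""
    docs = [r.lower().split() for r in references]
    # Document frequency over the reference pool (a simplification; in practice
    # IDF is estimated over the whole corpus of reference captions).
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(len(docs) / df[w]) for w in df}
    idf_default = math.log(len(docs))  # weight for candidate words unseen in the references
    cand = candidate.lower().split()
    g_c = tfidf(cand, {**idf, **{w: idf_default for w in cand if w not in idf}})
    return sum(cosine(g_c, tfidf(d, idf)) for d in docs) / len(docs)

refs = ["a collapsed beam blocks the flooded tunnel near mining equipment",
        "rising floodwater surrounds a damaged loader beneath a collapsed roof beam"]
print(f"{cider_unigram('a collapsed beam above standing floodwater in the tunnel', refs):.3f}")
```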

Baseline results:

Model         CIDEr   SPICE
BLIP-2        0.62    0.50
Flamingo      0.53    0.42
Florence      0.58    0.47
LLaVA         0.59    0.47
Qwen-2.5      0.66    0.51
MDSE (Ours)   0.70    0.53

MDSE achieves superior performance by integrating segmentation-aware dual-pathway visual encoding and context-aware cross-attention, surpassing both generalist and disaster-specific captioning models (Jewel et al., 9 Dec 2025).

5. Task-Specific Challenges and Domain Relevance

UMD directly addresses three domain-specific challenges:

  • Extreme low-light and variable illumination regimes (spot lighting vs. deep shadow).
  • Heavy particulate occlusion (dust, smoke) that degrades edge and texture visibility.
  • Spatial clutter and scene overlap (multiple, concurrent hazards such as collapsed beams and standing water).

The dataset’s comprehensive, expert-driven annotations enable development and evaluation of models for:

  • Automatic hazard detection and localization.
  • Emergency response route optimization.
  • Post-disaster anomaly surveying.
  • Real-time support for robotics and teleoperation under visually degraded conditions.

Its relatively small scale reflects the intrinsic difficulty of acquiring authentic underground disaster imagery while remaining sufficient for benchmarking vision-language architectures subjected to severe perceptual noise (Jewel et al., 9 Dec 2025).

6. Relationship to Complementary Datasets and Multi-Modal Extensions

UMD is complemented by the Thermal Underground Human Detection (Thermal UHD) dataset, which provides 7,049 annotated thermal images of miners, supporting miner detection and localization in low-light, emergency contexts using bounding-box annotations and a broad pose taxonomy (standing, bending, sitting, squatting, lying). While UMD is RGB-caption-centric and targets vision-language generation, Thermal UHD enables robust object detection under conditions where RGB vision is impaired, such as heavy smoke or darkness. Notably, Thermal UHD benchmarking employs YOLO and RT-DETR architectures, providing mAP@50 and F1 metrics for various network sizes and advocating multi-modal fusion (thermal + RGB + LiDAR) to further enhance UMD model robustness. Future development directions include synchronized multi-modal data acquisition and fine-grained annotations (pixel-wise segmentation, key-point pose) to bridge scene understanding with actionable human detection in integrated emergency response workflows (Addy et al., 26 Jun 2025).
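
For context on that benchmarking style, the sketch below runs a detection evaluation with the Ultralytics YOLO API; the checkpoint and dataset YAML are placeholders (Thermal UHD's actual release format is not described here), so treat this as an illustrative pattern rather than the authors' evaluation script.

```python
from ultralytics import YOLO

# Placeholder checkpoint and dataset config; the data YAML is assumed to follow
# the standard Ultralytics layout (image/label paths plus class names).
model = YOLO("yolov8n.pt")
metrics = model.val(data="thermal_uhd.yaml")

print(f"mAP@50:    {metrics.box.map50:.3f}")
print(f"mAP@50-95: {metrics.box.map:.3f}")
```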

7. Limitations and Future Directions

UMD’s limited size (300 images) reflects the operational difficulty of obtaining real post-disaster visual data from underground mines. Its fixed image resolution and reliance on RGB imaging are potential limitations for tasks demanding fine-grained localization under total visual occlusion. A plausible implication is that supplementing UMD with synthetic data, thermal imagery (as in Thermal UHD), LiDAR, and multi-modal ground truth could broaden model applicability. Further, integrating richer annotation schemes (segmentation masks, action labels, and temporal event sequences) will facilitate advanced hazard detection, rapid route planning, and robotics deployment in diverse underground environments (Jewel et al., 9 Dec 2025, Addy et al., 26 Jun 2025).
