Emergency Landing Site Selection Benchmark
- The benchmark defines clear geometric and semantic criteria for identifying safe emergency landing sites using multi-source imagery and expert annotations.
- It integrates classical geometry-based approaches with modern deep learning methods, employing metrics like IoU, Dice coefficient, and precision to quantify performance.
- The framework ensures reproducible research by offering open datasets, standardized evaluation protocols, and transparent baseline methods across diverse aviation scenarios.
The Emergency Landing Site Selection (ELSS) Benchmark refers to a standardized suite of datasets, evaluation protocols, and baseline methods for scientifically assessing algorithms that identify, rank, and justify potential landing sites for crewed or uncrewed aircraft following a loss-of-thrust or other emergency event. ELSS benchmarks systematically measure algorithmic performance in both geometric (flatness/obstacle-based) and semantically aware (contextual/crowd-aware) settings, supporting reproducibility, ablation, and fair comparison across research communities spanning computer vision, robotics, aviation safety, and large-model interpretability.
1. Fundamental Principles of ELSS Benchmarking
The ELSS benchmark concept encompasses a multi-modal assessment pipeline combining spatial sensing, semantic reasoning, and formalized ground-truth annotation. Core requirements include:
- Multi-source imagery: orthophotos (RGB/NIR), high-resolution DSM/DTM, or top-down RS data
- Geometric suitability: flatness, local slope, roughness, and obstacle-free verification, critically specified using tile-based or gradient-derivative criteria
- Semantic hazard awareness: labeling and exclusion of unsafe sites due to transient or contextual risk (e.g., crowds, schools, dynamic hazards)
- Standardized, open datasets: labeled candidate regions with "safe"/"unsafe" annotation based on expert or regulatory guidelines (e.g., JARUS SORA standards)
- Transparent, formalized metrics: precision, recall, intersection over union (IoU), Dice coefficient, right/wrong-rate for ranking, and computational resource accounting
This multi-faceted approach ensures rigorous, domain-relevant evaluation across UAV, fixed-wing, and infrastructure-free scenarios (Klos et al., 2020, Hua et al., 1 Feb 2026).
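The overlap metrics named above can be computed directly from binary safe/unsafe masks. A minimal NumPy sketch (the function name and the convention of scoring empty masks as 1.0 are illustrative choices, not taken from any benchmark release):

```python
import numpy as np

def overlap_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Patchwise IoU, Dice, precision, and recall from binary masks.
    Empty denominators are scored as 1.0 (an illustrative convention)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # true positives
    fp = np.logical_and(pred, ~gt).sum()   # false positives
    fn = np.logical_and(~pred, gt).sum()   # false negatives
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"IoU": iou, "Dice": dice, "precision": precision, "recall": recall}
```

The same counts feed every patchwise metric reported in Section 4, so a single confusion-count pass per tile suffices.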
2. Benchmark Dataset Construction and Annotation
Rigorous dataset construction is the foundation of ELSS benchmarks.
- Input Layers: Datasets typically integrate digital orthophotos (including RGB and NIR channels at 0.2 m GSD), interpolated DSMs (point clouds, rasterized to 1 m/pixel), and derived layers (slope, roughness, NDVI via PostGIS spatial functions). Semantic classes are extracted from public RS contests or large urban imagery (Klos et al., 2020, Hua et al., 1 Feb 2026).
- Sample Definition: Candidate landing sites are defined as manually labeled polygons of variable area (e.g., 32–256 m²). For each polygon, matched tiles are generated across all data layers and further split into square search windows (SW = 8, 16, or 32 m², stride SW/2), maintaining class balance and geographic coverage integrity.
- Annotation Protocols: Ground-truth "safe"/"unsafe" binary labels are assigned, with additional expert labels specifying hazards (dynamic: crowds/vehicles; static: schools, gas stations, transient structures). Inter-annotator agreement is cross-validated to exceed 95% (Hua et al., 1 Feb 2026).
- Geographic Diversity: Datasets span both high-density urban (e.g., ISPRS Potsdam, 0.05 m/px) and peri-urban or rural contexts (e.g., Nanjing, 0.3 m/px), including real and synthetic DSMs (Klos et al., 2020, Hua et al., 1 Feb 2026, Kakaletsis et al., 2021).
This systematic annotation supports robust model evaluation and transfer learning.
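The half-overlap (stride SW/2) search-window scheme above can be sketched as follows; this is a simplified illustration in pixel units, not the released tiling code:

```python
import numpy as np

def sliding_windows(layer: np.ndarray, sw: int):
    """Yield square search windows of side `sw` pixels with stride sw // 2,
    mirroring the SW/2 half-overlap scheme (a sketch, not dataset code)."""
    h, w = layer.shape[:2]
    stride = sw // 2
    for y in range(0, h - sw + 1, stride):
        for x in range(0, w - sw + 1, stride):
            # top-left corner plus the cropped patch for this window
            yield (y, x), layer[y:y + sw, x:x + sw]
```

Each window would then be cut identically from every co-registered layer (RGB, NDVI, slope, roughness) so that one label applies across the whole stack.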
3. Core Algorithmic Workflows and Baseline Methods
Benchmark protocols specify both classical and deep-learning–based workflows:
- Geometric Baselines: Methods apply local-slope estimation (via Sobel gradient, Eqn. 1 in (Kakaletsis et al., 2021)), global thresholding on slope (T_slope~5–10°), connected-components analysis to isolate contiguous, flat candidate regions, and height-difference masks for obstacle exclusion. Minimum area constraints (A_min, e.g., ≥25 m²) enforce candidate suitability.
- Transfer Learning & Ensemble Models: Deep architectures (e.g., AlexNet, Wide-ResNet-50-2) are fine-tuned with multi-layer raster inputs. Full fine-tuning ("feature_extraction = False") consistently outperforms partial adaptation (Klos et al., 2020). Hyperparameters are optimized via Bayesian (GP/Matérn kernel) and Bandit/Thompson Sampling search strategies targeting validation accuracy/loss. Hierarchical ensemble voting is used for multi-scale fusion, with confidence quantification via the normalized softmax margin.
| Model | Layers Used | Tile Size (SW) | Key Hyperparameters | Test Accuracy / Precision |
|---|---|---|---|---|
| AlexNet | RGB + NDVI | 8 m² | lr=1.32e-2, wd=7.66e-8, Adadelta | 99.958% / 99.984% |
| Wide-ResNet-50-2 | all layers | 16 m² | lr=8.18e-6, wd=3.14e-4, AdamW | 99.959% / 99.976% |
| AlexNet | all layers | 32 m² | lr=3.09e-5, wd=2.12e-2, AdamW | 99.940% / 99.940% |
- Semantic and Multimodal Approaches: Recent benchmarks integrate a two-stage pipeline: dense semantic segmentation (e.g., DeepLabV3+ with ASPP), followed by patch-wise visual verification with Multimodal LLMs (MLLMs). These models fuse vision, Point-of-Interest (POI) metadata (proximity to schools, gas stations, dynamic hazards), and contextual cues (time-of-day, events), producing both safety scores and natural-language justifications. Tabu suppression and context-aware scoring ensure interpretable, ranked candidate lists. Incorporating POI features demonstrably improves ranking right-rate by >20% compared to vision-only input (Hua et al., 1 Feb 2026).
- 2D/3D Fusion Baselines: Patch-level safety evaluation combines semantic segmentation (e.g., BiSeNetV2) with stereo-derived 3D geometric filters (RANSAC for plane fitting, slope/roughness extraction), culminating in sliding-window site selection employing tunable distance-to-drone and obstacle-clearance cost functions (Secchiero et al., 2023).
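The geometric baseline steps listed above (local slope via Sobel gradients, global slope thresholding, connected-components analysis, and a minimum-area constraint) can be sketched with SciPy. The threshold values and Sobel normalization here are illustrative choices within the quoted T_slope and A_min ranges, not the reference implementation:

```python
import numpy as np
from scipy import ndimage

def flat_candidate_mask(dem: np.ndarray, gsd: float = 1.0,
                        t_slope_deg: float = 7.0, a_min_px: int = 25) -> np.ndarray:
    """Geometric baseline sketch: slope threshold + connected components.
    gsd: ground sampling distance in m/px; t_slope_deg within T_slope ~ 5-10 deg;
    a_min_px approximates A_min >= 25 m^2 at 1 m/px (illustrative values)."""
    # Sobel kernel sums to 8x the central difference, hence the 8*gsd divisor
    gx = ndimage.sobel(dem, axis=1) / (8.0 * gsd)   # d(height)/dx
    gy = ndimage.sobel(dem, axis=0) / (8.0 * gsd)   # d(height)/dy
    slope_deg = np.degrees(np.arctan(np.hypot(gx, gy)))
    flat = slope_deg <= t_slope_deg                  # global slope threshold
    labels, n = ndimage.label(flat)                  # connected components
    keep = np.zeros_like(flat)
    for i in range(1, n + 1):                        # minimum-area filter
        region = labels == i
        if region.sum() >= a_min_px:
            keep |= region
    return keep
```

Obstacle exclusion via height-difference masks would be applied as an additional AND over this mask before candidate extraction.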
4. Evaluation Protocols and Performance Metrics
Benchmarking in ELSS employs diverse, scenario-specific metrics:
- Patchwise Segmentation: Standard true/false positive/negative counts per tile, with Intersection over Union (IoU), Dice coefficient, precision, and recall summarizing classifier efficacy. For example, SW 8 m² ensemble achieves IoU >0.99, indicating high spatial congruence (Klos et al., 2020).
- Ranking & Suitability: In semantically-aware setups, candidate sets are ranked using Right Rate (correct identification of both safest and riskiest sites per query), False Rate (swapping errors), and passing rate (% of candidates marked "Safe" by the system) (Hua et al., 1 Feb 2026).
- People-Avoidance: For Safe Landing Zone (SLZ) problems, metrics include number of head-occupancy violations (warning and danger regions), mean SLZ area, IoU with ground-truth exclusion masks, and pipeline throughput (e.g., 0.11 s/frame for density-map-based detection; Tovanche-Picon et al., 2022).
- Geometric Baseline Metrics: Precision and recall against manually digitized or simulation-based ground-truth, analyzed per dataset (urban/rural/synthetic). For example, DEM-gradient approaches achieve up to 0.871 precision (rural) and 0.929 (synthetic) (Kakaletsis et al., 2021).
- Computational Load: Epoch/training times and hardware resource usage are reported (e.g., 17 s/epoch for hyperopt trial, full ensemble training on moderate Kubernetes clusters, or real-time inference at ~1.1 Hz on Jetson platforms) (Klos et al., 2020, Secchiero et al., 2023).
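A sketch of the ranking metrics named above; the exact counting rules follow Hua et al., so the scheme below (checking the safest and riskiest candidate per query, counting a full swap of the two as an error) is an assumption for illustration only:

```python
def ranking_metrics(queries):
    """Illustrative Right Rate / False Rate over ranking queries.
    Each query is (predicted_ranking, gt_ranking): candidate ids ordered
    safest -> riskiest. Counting rules are an assumption, not the paper's."""
    right = wrong = 0
    for pred, gt in queries:
        if pred[0] == gt[0] and pred[-1] == gt[-1]:
            right += 1                       # both extremes identified
        elif pred[0] == gt[-1] and pred[-1] == gt[0]:
            wrong += 1                       # safest/riskiest swapped
    n = len(queries)
    return {"right_rate": right / n, "false_rate": wrong / n}
```

Passing rate would be computed separately as the fraction of all candidates the system marks "Safe".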
5. Interpretability, Limitations, and Failure Analysis
Interpretability is explicitly integrated through methods such as Layer-wise Relevance Propagation (LRP), which diagnoses network focus (e.g., shadow sensitivity vs. slope attention) (Klos et al., 2020). Multimodal frameworks supplement numerical ratings with natural-language rationales, supporting trust and regulatory alignment (Hua et al., 1 Feb 2026).
Common limitations include:
- Geometric-only pipelines cannot handle dynamic or semantic hazards (moving people, temporary structures), and may misclassify low vegetation or occluded obstacles. Coarser DSM resolution increases risk of small-object misses (Kakaletsis et al., 2021).
- Discrete thresholds for slope/roughness may generate false positives/negatives at threshold boundaries or when grid resolution is suboptimal (Secchiero et al., 2023).
- Transfer learning approaches may degrade under domain shift; explicit POI and semantic context are necessary for urban-scale safe recommendations (Hua et al., 1 Feb 2026).
- 2D vision-only pipelines lack topographic awareness relevant to fixed-wing aircraft glideslope constraints, motivating integration with flight-physics/fixed-wing reachability models (as in the MASC framework) (Gu et al., 2022).
6. Benchmark Releases and Reproducibility
Benchmark datasets and codebases are openly released to enable standardized comparison and ablation.
- ELSS datasets include high-resolution RS imagery, POI maps, expert binary labels, and candidate bounding boxes in GeoTIFF, CSV, and JSON formats (Hua et al., 1 Feb 2026).
- Reference implementations for both semantic segmentation and MLLM evaluation, plus scoring scripts for key metrics, are provided (e.g., https://anonymous.4open.science/r/ELSS-dataset-43D7, https://github.com/IDEA-UAV/ELSS) (Hua et al., 1 Feb 2026).
- Kubernetes-based training set-ups, model hyperparameter configurations, and interpretability toolchains (LRP) are specified for replication (Klos et al., 2020).
- Real-world and synthetic indoor datasets (including segmentation, 3D maps, and grid accuracy/success rates) are available to the research community for apples-to-apples evaluation (Secchiero et al., 2023, Tovanche-Picon et al., 2022).
- Experimental protocols encourage reporting of landing success, IoU/precision/recall, and computational constraints.
7. Representative Applications and Future Directions
ELSS benchmark frameworks are instrumental in:
- Autonomous UAV and eVTOL emergency response, especially over mixed urban, peri-urban, or rural environments
- Vision-based risk mitigation in scenarios with transient human presence, leveraging density maps and context fusion for crowd avoidance
- Enhanced path planning, trajectory optimization, and real-time feasibility analysis exploiting lightweight flight-dynamic models and candidate pre-filtering (Gu et al., 2022).
- Integration with GIS/POI databases and regulatory compliance workflows (JARUS SORA, buffer enforcement)
The field is converging toward fully multimodal, context-fused, interpretable assessment pipelines, with strong evidence for the superiority of semantic and large-model approaches in real-world, semantically complex environments. Ongoing ELSS benchmark development is expected to emphasize richer context labels, fine-grained ablation, broader cross-region transfer, and unified support for both rotary- and fixed-wing aircraft (Hua et al., 1 Feb 2026, Klos et al., 2020, Secchiero et al., 2023, Gu et al., 2022, Kakaletsis et al., 2021).