Emergency Landing Site Selection Benchmark
- The benchmark defines clear geometric and semantic criteria for identifying safe emergency landing sites using multi-source imagery and expert annotations.
- It integrates classical geometry-based approaches with modern deep learning methods, employing metrics like IoU, Dice coefficient, and precision to quantify performance.
- The framework ensures reproducible research by offering open datasets, standardized evaluation protocols, and transparent baseline methods across diverse aviation scenarios.
The Emergency Landing Site Selection (ELSS) Benchmark refers to a standardized suite of datasets, evaluation protocols, and baseline methods for scientifically assessing algorithms that identify, rank, and justify potential landing sites for crewed or uncrewed aircraft following a loss-of-thrust or other emergency event. ELSS benchmarks systematically measure algorithmic performance in both geometric (flatness/obstacle-based) and semantically aware (contextual/crowd-aware) settings, supporting reproducibility, ablation, and fair comparison across research communities spanning computer vision, robotics, aviation safety, and large-model interpretability.
1. Fundamental Principles of ELSS Benchmarking
The ELSS benchmark concept encompasses a multi-modal assessment pipeline combining spatial sensing, semantic reasoning, and formalized ground-truth annotation. Core requirements include:
- Multi-source imagery: orthophotos (RGB/NIR), high-resolution DSM/DTM, or top-down RS data
- Geometric suitability: flatness, local slope, roughness, and obstacle-free verification, critically specified using tile-based or gradient-derivative criteria
- Semantic hazard awareness: labeling and exclusion of unsafe sites due to transient or contextual risk (e.g., crowds, schools, dynamic hazards)
- Standardized, open datasets: labeled candidate regions with "safe"/"unsafe" annotation based on expert or regulatory guidelines (e.g., JARUS SORA standards)
- Transparent, formalized metrics: precision, recall, intersection over union (IoU), Dice coefficient, right/wrong-rate for ranking, and computational resource accounting
This multi-faceted approach ensures rigorous, domain-relevant evaluation across UAV, fixed-wing, and infrastructure-free scenarios (Klos et al., 2020, Hua et al., 1 Feb 2026).
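The overlap metrics named above can be computed directly from binary safe/unsafe masks. A minimal NumPy sketch (the function name and the convention of scoring empty masks as 1.0 are illustrative choices, not taken from any benchmark release):

```python
import numpy as np

def overlap_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Patchwise IoU, Dice, precision, and recall from binary masks.
    Empty denominators are scored as 1.0 (an illustrative convention)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # true positives
    fp = np.logical_and(pred, ~gt).sum()   # false positives
    fn = np.logical_and(~pred, gt).sum()   # false negatives
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"IoU": iou, "Dice": dice, "precision": precision, "recall": recall}
```

The same counts feed every patchwise metric reported in Section 4, so a single confusion-count pass per tile suffices.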
2. Benchmark Dataset Construction and Annotation
Rigorous dataset construction is the foundation of ELSS benchmarks.
- Input Layers: Datasets typically integrate digital orthophotos (including RGB and NIR channels at 0.2 m GSD), interpolated DSMs (point clouds, rasterized to 1 m/pixel), and derived layers (slope, roughness, NDVI via PostGIS spatial functions). Semantic classes are extracted from public RS contests or large urban imagery (Klos et al., 2020, Hua et al., 1 Feb 2026).
- Sample Definition: Candidate landing sites are defined as manually labeled polygons of variable area (e.g., 32–256 m²). For each polygon, matched tiles are generated across all data layers and further split into square search windows (SW = 8, 16, or 32 m², stride SW/2), maintaining class balance and geographic coverage integrity.
- Annotation Protocols: Ground-truth "safe"/"unsafe" binary labels are assigned, with additional expert labels specifying hazards (dynamic: crowds/vehicles; static: schools, gas stations, transient structures). Inter-annotator agreement is cross-validated to exceed 95% (Hua et al., 1 Feb 2026).
- Geographic Diversity: Datasets span both high-density urban (e.g., ISPRS Potsdam, 0.05 m/px) and peri-urban or rural contexts (e.g., Nanjing, 0.3 m/px), including real and synthetic DSMs (Klos et al., 2020, Hua et al., 1 Feb 2026, Kakaletsis et al., 2021).
This systematic annotation supports robust model evaluation and transfer learning.
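The half-overlap (stride SW/2) search-window scheme above can be sketched as follows; this is a simplified illustration in pixel units, not the released tiling code:

```python
import numpy as np

def sliding_windows(layer: np.ndarray, sw: int):
    """Yield square search windows of side `sw` pixels with stride sw // 2,
    mirroring the SW/2 half-overlap scheme (a sketch, not dataset code)."""
    h, w = layer.shape[:2]
    stride = sw // 2
    for y in range(0, h - sw + 1, stride):
        for x in range(0, w - sw + 1, stride):
            # top-left corner plus the cropped patch for this window
            yield (y, x), layer[y:y + sw, x:x + sw]
```

Each window would then be cut identically from every co-registered layer (RGB, NDVI, slope, roughness) so that one label applies across the whole stack.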
3. Core Algorithmic Workflows and Baseline Methods
Benchmark protocols specify both classical and deep-learning–based workflows:
- Geometric Baselines: Methods apply local-slope estimation (via Sobel gradient, Eqn. 1 in (Kakaletsis et al., 2021)), global thresholding on slope (T_slope~5–10°), connected-components analysis to isolate contiguous, flat candidate regions, and height-difference masks for obstacle exclusion. Minimum area constraints (A_min, e.g., ≥25 m²) enforce candidate suitability.
- Transfer Learning & Ensemble Models: Deep architectures (e.g., AlexNet, Wide-ResNet-50-2) are fine-tuned with multi-layer raster inputs. Full fine-tuning ("feature_extraction = False") consistently outperforms partial adaptation (Klos et al., 2020). Hyperparameters are optimized via Bayesian (GP/Matérn kernel) and Bandit/Thompson Sampling search strategies targeting validation accuracy/loss. Hierarchical ensemble voting is used for multi-scale fusion, with confidence quantification via the normalized softmax margin.
| Model | Layers Used | Tile Size (SW) | Key Hyperparameters | Test Accuracy / Precision |
|---|---|---|---|---|
| AlexNet | RGB + NDVI | 8 m² | lr=1.32e-2, wd=7.66e-8, Adadelta | 99.958% / 99.984% |
| Wide-ResNet-50-2 | all layers | 16 m² | lr=8.18e-6, wd=3.14e-4, AdamW | 99.959% / 99.976% |
| AlexNet | all layers | 32 m² | lr=3.09e-5, wd=2.12e-2, AdamW | 99.940% / 99.940% |
- Semantic and Multimodal Approaches: Recent benchmarks integrate a two-stage pipeline: dense semantic segmentation (e.g., DeepLabV3+ with ASPP), followed by patch-wise visual verification with Multimodal LLMs (MLLMs). These models fuse vision, Point-of-Interest (POI) metadata (proximity to schools, gas stations, dynamic hazards), and contextual cues (time-of-day, events), producing both safety scores and natural-language justifications. Tabu suppression and context-aware scoring ensure interpretable, ranked candidate lists. Incorporating POI features demonstrably improves ranking right-rate by >20% compared to vision-only input (Hua et al., 1 Feb 2026).
- 2D/3D Fusion Baselines: Patch-level safety evaluation combines semantic segmentation (e.g., BiSeNetV2) with stereo-derived 3D geometric filters (RANSAC for plane fitting, slope/roughness extraction), culminating in sliding-window site selection employing tunable distance-to-drone and obstacle-clearance cost functions (Secchiero et al., 2023).
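The geometric baseline steps listed above (local slope via Sobel gradients, global slope thresholding, connected-components analysis, and a minimum-area constraint) can be sketched with SciPy. The threshold values and Sobel normalization here are illustrative choices within the quoted T_slope and A_min ranges, not the reference implementation:

```python
import numpy as np
from scipy import ndimage

def flat_candidate_mask(dem: np.ndarray, gsd: float = 1.0,
                        t_slope_deg: float = 7.0, a_min_px: int = 25) -> np.ndarray:
    """Geometric baseline sketch: slope threshold + connected components.
    gsd: ground sampling distance in m/px; t_slope_deg within T_slope ~ 5-10 deg;
    a_min_px approximates A_min >= 25 m^2 at 1 m/px (illustrative values)."""
    # Sobel kernel sums to 8x the central difference, hence the 8*gsd divisor
    gx = ndimage.sobel(dem, axis=1) / (8.0 * gsd)   # d(height)/dx
    gy = ndimage.sobel(dem, axis=0) / (8.0 * gsd)   # d(height)/dy
    slope_deg = np.degrees(np.arctan(np.hypot(gx, gy)))
    flat = slope_deg <= t_slope_deg                  # global slope threshold
    labels, n = ndimage.label(flat)                  # connected components
    keep = np.zeros_like(flat)
    for i in range(1, n + 1):                        # minimum-area filter
        region = labels == i
        if region.sum() >= a_min_px:
            keep |= region
    return keep
```

Obstacle exclusion via height-difference masks would be applied as an additional AND over this mask before candidate extraction.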
4. Evaluation Protocols and Performance Metrics
Benchmarking in ELSS employs diverse, scenario-specific metrics:
- Patchwise Segmentation: Standard true/false positive/negative counts per tile, with Intersection over Union (IoU), Dice coefficient, precision, and recall summarizing classifier efficacy. For example, SW 8 m² ensemble achieves IoU >0.99, indicating high spatial congruence (Klos et al., 2020).
- Ranking & Suitability: In semantically-aware setups, candidate sets are ranked using Right Rate (correct identification of both safest and riskiest sites per query), False Rate (swapping errors), and passing rate (% of candidates marked "Safe" by the system) (Hua et al., 1 Feb 2026).
- People-Avoidance: For Safe Landing Zone (SLZ) problems, metrics include number of head-occupancy violations (warning and danger regions), mean SLZ area, IoU with ground-truth exclusion masks, and pipeline throughput (e.g., 0.11 s/frame for density-map-based detection; Tovanche-Picon et al., 2022).
- Geometric Baseline Metrics: Precision and recall against manually digitized or simulation-based ground-truth, analyzed per dataset (urban/rural/synthetic). For example, DEM-gradient approaches achieve up to 0.871 precision (rural) and 0.929 (synthetic) (Kakaletsis et al., 2021).
- Computational Load: Epoch/training times and hardware resource usage are reported (e.g., 17 s/epoch for hyperopt trial, full ensemble training on moderate Kubernetes clusters, or real-time inference at ~1.1 Hz on Jetson platforms) (Klos et al., 2020, Secchiero et al., 2023).
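A sketch of the ranking metrics named above; the exact counting rules follow Hua et al., so the scheme below (checking the safest and riskiest candidate per query, counting a full swap of the two as an error) is an assumption for illustration only:

```python
def ranking_metrics(queries):
    """Illustrative Right Rate / False Rate over ranking queries.
    Each query is (predicted_ranking, gt_ranking): candidate ids ordered
    safest -> riskiest. Counting rules are an assumption, not the paper's."""
    right = wrong = 0
    for pred, gt in queries:
        if pred[0] == gt[0] and pred[-1] == gt[-1]:
            right += 1                       # both extremes identified
        elif pred[0] == gt[-1] and pred[-1] == gt[0]:
            wrong += 1                       # safest/riskiest swapped
    n = len(queries)
    return {"right_rate": right / n, "false_rate": wrong / n}
```

Passing rate would be computed separately as the fraction of all candidates the system marks "Safe".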
5. Interpretability, Limitations, and Failure Analysis
Interpretability is explicitly integrated through methods such as Layer-wise Relevance Propagation (LRP), which diagnoses network focus (e.g., shadow sensitivity vs. slope attention) (Klos et al., 2020). Multimodal frameworks supplement numerical ratings with natural-language rationales, supporting trust and regulatory alignment (Hua et al., 1 Feb 2026).
Common limitations include:
- Geometric-only pipelines cannot handle dynamic or semantic hazards (moving people, temporary structures), and may misclassify low vegetation or occluded obstacles. Coarser DSM resolution increases risk of small-object misses (Kakaletsis et al., 2021).
- Discrete thresholds for slope/roughness may generate false positives/negatives at threshold boundaries or when grid resolution is suboptimal (Secchiero et al., 2023).
- Transfer learning approaches may degrade under domain shift; explicit POI and semantic context are necessary for urban-scale safe recommendations (Hua et al., 1 Feb 2026).
- 2D vision-only pipelines lack topographic awareness relevant to fixed-wing aircraft glideslope constraints, motivating integration with flight-physics/fixed-wing reachability models (as in the MASC framework) (Gu et al., 2022).
6. Benchmark Releases and Reproducibility
Benchmark datasets and codebases are openly released to enable standardized comparison and ablation.
- ELSS datasets include high-resolution RS imagery, POI maps, expert binary labels, and candidate bounding boxes in GeoTIFF, CSV, and JSON formats (Hua et al., 1 Feb 2026).
- Reference implementations for both semantic segmentation and MLLM evaluation, plus scoring scripts for key metrics, are provided (e.g., https://anonymous.4open.science/r/ELSS-dataset-43D7, https://github.com/IDEA-UAV/ELSS) (Hua et al., 1 Feb 2026).
- Kubernetes-based training set-ups, model hyperparameter configurations, and interpretability toolchains (LRP) are specified for replication (Klos et al., 2020).
- Real-world and synthetic indoor datasets (including segmentation, 3D maps, and grid accuracy/success rates) are available to the research community for apples-to-apples evaluation (Secchiero et al., 2023, Tovanche-Picon et al., 2022).
- Experimental protocols encourage reporting of landing success, IoU/precision/recall, and computational constraints.
7. Representative Applications and Future Directions
ELSS benchmark frameworks are instrumental in:
- Autonomous UAV and eVTOL emergency response, especially over mixed urban, peri-urban, or rural environments
- Vision-based risk mitigation in scenarios with transient human presence, leveraging density maps and context fusion for crowd avoidance
- Enhanced path planning, trajectory optimization, and real-time feasibility analysis exploiting lightweight flight-dynamic models and candidate pre-filtering (Gu et al., 2022).
- Integration with GIS/POI databases and regulatory compliance workflows (JARUS SORA, buffer enforcement)
The field is converging toward fully multimodal, context-fused, interpretable assessment pipelines, with strong evidence for the superiority of semantic and large-model approaches in real-world, semantically complex environments. Ongoing ELSS benchmark development is expected to emphasize richer context labels, fine-grained ablation, broader cross-region transfer, and unified support for both rotary- and fixed-wing aircraft (Hua et al., 1 Feb 2026, Klos et al., 2020, Secchiero et al., 2023, Gu et al., 2022, Kakaletsis et al., 2021).