
Instance Segmentation in Aerial Images

Updated 28 February 2026
  • Instance segmentation in aerial images is the process of delineating each object with pixel-accurate masks, overcoming challenges like extreme scale variation, high density, and irregular geometries.
  • Key methodological advances include boundary-aware networks, keypoint-to-polygon reconstruction, and implicit mask parameterization to improve accuracy and efficiency in object detection.
  • These techniques facilitate detailed urban analysis and environmental monitoring, with performance evaluated using metrics such as AP and IoU on specialized benchmarks.

Instance segmentation in aerial images refers to the delineation and identification of each object instance—such as individual buildings, vehicles, ships, or trees—within high-resolution overhead imagery. This problem involves producing a pixel-accurate mask for every object instance in the scene, often under challenging conditions such as dense object distributions, fine-grained or irregular geometry, variable lighting, and strong inter-class visual similarity. In contrast to semantic segmentation, which provides a class label per pixel, instance segmentation must differentiate among all physical instances of the same class, making it a core enabling technology for detailed urban analysis, infrastructure monitoring, environmental studies, and autonomous systems.

1. Distinctive Challenges of Instance Segmentation in Aerial Images

Aerial and satellite imagery poses operational demands not encountered in typical ground-view computer vision tasks. The iSAID benchmark highlights several domain-specific challenges (Zamir et al., 2019):

  • Extreme object-scale variation: Within a single frame, object areas can span more than four orders of magnitude, with over half (52%) of all instances classified as “small” (<144 px). The largest-to-smallest object area ratio may exceed 20,000.
  • High object density: Certain images contain upwards of 8,000 instances, far exceeding the 10–20 objects typical of natural-scene datasets such as COCO.
  • Arbitrary orientations and aspect ratios: Objects, including elongated structures such as bridges, often appear at any in-plane rotation, with aspect ratios up to 90:1.
  • Abundant tiny and overlapping objects: Small vehicles, ships, and tanks are frequently less than a few dozen pixels in extent, and close proximity or physical contact between instances is common.
  • Complex backgrounds and high intra/inter-class similarity: Visual confusion between classes (e.g., storage tanks, small vehicles) and with background elements limits the efficacy of generic segmentation architectures.

These factors necessitate algorithmic adaptations and dedicated datasets beyond those tailored for ground-based imagery.

2. Dataset Architectures and Evaluation Metrics

Large-scale, fine-grained annotated benchmarks are essential for algorithm development and evaluation. The iSAID dataset (Zamir et al., 2019) defines the current standard: 655,451 instance annotations across 2,806 high-resolution images (800–13,000 pixels wide) spanning 15 categories.

Annotations use closed polygon masks at the per-pixel level, processed through a rigorous, multi-stage quality-control pipeline. Evaluation employs average precision (AP) under the COCO metric, integrating precision across intersection-over-union (IoU) thresholds from 0.50 to 0.95, as well as AP_S/M/L stratified by object size.
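The evaluation protocol above can be made concrete with a short sketch: mask IoU between two binary masks, and COCO-style averaging of AP over the ten IoU thresholds from 0.50 to 0.95. The helper names and the toy masks are illustrative, not taken from any cited implementation.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks: intersection area over union area."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

# COCO-style: AP is averaged over IoU thresholds 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS = np.arange(0.50, 1.00, 0.05)

def coco_ap(ap_at_iou):
    """Average a dict {iou_threshold: AP} over the ten COCO thresholds."""
    return float(np.mean([ap_at_iou[round(t, 2)] for t in COCO_THRESHOLDS]))

# toy example: two partially overlapping square masks
a = np.zeros((10, 10), bool); a[0:6, 0:6] = True   # 36 px
b = np.zeros((10, 10), bool); b[3:9, 3:9] = True   # 36 px, 9 px overlap
iou = mask_iou(a, b)  # 9 / (36 + 36 - 9) = 9/63 ≈ 0.143
```

A prediction counted as correct at IoU 0.50 may be rejected at 0.90, which is why averaging across thresholds rewards tight boundary alignment, not just coarse localization.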

Key results emphasize the limitations of generic approaches. Mask R-CNN achieves only 14.5 AP_S on small objects, while state-of-the-art PANet++ reaches 42.5 for this regime. Significant failure modes remain—fragmentation or missing detections for tightly packed or tiny objects, poor boundary alignment on elongated shapes, and class confusion—signaling a need for aerial-specific algorithmic advances.

Metric              Mask R-CNN   PANet++
Mask AP (overall)      25.7        40.0
Mask AP (small)        14.5        42.5
Mask AP (large)        37.7        43.2

3. Methodological Advances: Model Architectures

Several architectural paradigms have been proposed to address the aerial instance segmentation problem. Selected highlights:

  • Two-stage mask-based detectors remain foundational. Backbone+FPN+ROIAlign mask architectures, such as Mask R-CNN and HTC (Hybrid Task Cascade), are widely adopted. Extensions such as HTC with residual mask connections further improve small-object separation (Garg et al., 2021).
  • Boundary-aware multitask networks: B-ResFCN introduces a secondary semantic-boundary detection branch on top of ResNet backbones, substantially improving instance separation (e.g., +4.3 pp F1, Busy Parking Lot UAV dataset) (Mou et al., 2018).
  • Keypoint-to-polygon reconstruction: Building segmentation via keypoint detection enables sharp boundary recovery and strong geometric fidelity, outperforming per-pixel masks in boundary F-measure and structural similarity (AIRS dataset, 11.29% boundary F1) (Li et al., 2020).
  • Implicit mask parameterization: Vec2Instance regresses a compact coordinate-MLP parameter vector per centroid, trading a slight recall drop for dramatically reduced model complexity (0.35M vs. 44M parameters, ~4.3h vs 8.6h train time compared to Mask R-CNN on SpaceNet) (Deshapriya et al., 2020).
  • Fully-convolutional instance heads with “touching-boundary” channels: TernausNetV2 outputs semantic and boundary masks, with post-processing via marker-based watershed, delivering state-of-the-art IoU for building footprint extraction (0.74 IoU on DeepGlobe) (Iglovikov et al., 2018).
  • Real-time polygon regression: Insta-YOLO proposes direct N-point polygon regression, omitting mask upsampling and orientation angle estimation, achieving 56 FPS and AP₅₀ of 78.16 on aerial ship detection (Mohamed et al., 2021).
Model             Core Innovation              SOTA Task/Metric                Ref
HTC+Enhancement   Retinex-style front-end      +0.9 AP_S in low light          (Garg et al., 2021)
B-ResFCN          Multi-task boundary branch   +4.3 pp F1, dense vehicles      (Mou et al., 2018)
Keypoint-Polygon  Geometric mask from peaks    +1.66 pp boundary F1            (Li et al., 2020)
Vec2Instance      MLP mask decoder             89% pixel accuracy (SpaceNet)   (Deshapriya et al., 2020)
Insta-YOLO        Polygon regression head      56 FPS, AP₅₀ 78.16 (Airbus)     (Mohamed et al., 2021)
TernausNetV2      Watershed + boundary mask    0.74 IoU (DeepGlobe)            (Iglovikov et al., 2018)
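The implicit-mask idea behind Vec2Instance can be illustrated with a minimal sketch: each detection carries a compact parameter vector that is decoded into a mask by evaluating a tiny coordinate MLP over a pixel grid. The 65-parameter 2-16-1 MLP and the 28×28 grid below are illustrative choices, not the paper's actual architecture.

```python
import numpy as np

def decode_mask(theta, size=28):
    """Decode a binary mask from a flat parameter vector `theta`.
    Illustrative layout: theta packs the weights of a 2-16-1 coordinate MLP."""
    w1 = theta[:32].reshape(2, 16)      # 2 inputs (x, y) -> 16 hidden units
    b1 = theta[32:48]
    w2 = theta[48:64].reshape(16, 1)    # 16 hidden units -> 1 logit
    b2 = theta[64]
    ys, xs = np.mgrid[0:size, 0:size] / (size - 1)       # normalized coords in [0, 1]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (size*size, 2)
    h = np.tanh(coords @ w1 + b1)
    logits = (h @ w2).ravel() + b2
    return (logits > 0).reshape(size, size)              # threshold to a mask

rng = np.random.default_rng(0)
theta = rng.normal(size=65)     # 65 parameters per instance vs. a full mask head
mask = decode_mask(theta)       # (28, 28) boolean mask
```

The efficiency argument is visible in the shapes: the per-instance representation is a short vector rather than a dense mask tensor, which is where the reported parameter and training-time savings come from.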
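The touching-boundary strategy used by TernausNetV2 can be sketched as follows: subtract the predicted boundary channel from the semantic mask to obtain seed regions, then grow the seeds back over the full mask with marker-based watershed. This is a minimal illustration using scikit-image; the thresholds and the toy inputs are assumptions, not the paper's pipeline.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def instances_from_masks(semantic, boundary, boundary_thresh=0.5):
    """Split a semantic mask into instances using a predicted
    'touching-boundary' channel and marker-based watershed."""
    semantic = semantic > 0.5
    # seeds: interior regions, i.e. the semantic mask minus predicted boundaries
    seeds = semantic & (boundary < boundary_thresh)
    markers, n = ndi.label(seeds)
    # flood outward from the seeds, constrained to the semantic mask
    labels = watershed(boundary, markers, mask=semantic)
    return labels, n

# toy example: two touching blocks separated by a boundary ridge
sem = np.zeros((10, 10)); sem[2:8, 1:9] = 1.0
bnd = np.zeros((10, 10)); bnd[2:8, 4:6] = 1.0   # predicted ridge between them
labels, n = instances_from_masks(sem, bnd)       # two separate instances
```

Without the boundary channel the two blocks would form one connected component; the ridge is what lets connected-component labeling find two seeds before the watershed restores full extent.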

4. Training Supervision and Weak/Box-Supervised Methods

Full-pixel mask annotation is cost-prohibitive at scale. Weakly-supervised and box-supervised regimes are thus vital:

  • Deep Level Set (DeepLS): Box-supervised instance segmentation via differentiable curve evolution. The segmentation branch minimizes a level set energy functional driven by box annotations, reaching AP = 23.1 on iSAID versus AP = 29.5 for fully supervised Mask R-CNN. DeepLS outperforms affinity-based box-supervised methods (e.g., BoxInst, BBTP) by +3.4 to +5.3 AP, especially on small and thin objects (Li et al., 2021).
  • Semantic priors for contour evolution: Bayesian fusion frameworks evolve active contours in a MAP setting, fusing CNN appearance priors and GAN-based deep shape models (Polewski et al., 2024). This yields fine contour delineation for irregular objects such as dead tree crowns, surpassing Mask R-CNN and K-net in weighted IoU (+8 pp).
  • Multiple active contour/rectangle models for elongated objects: For fallen tree stem segmentation, the multi-contour energy combines data fit, shape prior (learned kernel density on [length, width]), overlap penalties, and explicit collinearity to capture elongated geometries (Polewski et al., 2021).
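The box-driven level-set idea can be grounded with a toy Chan–Vese-style energy: the data terms measure how well the current inside/outside partition of the level set explains the image, and curve evolution (not shown) descends this energy. The function and example below are an illustrative sketch, not the DeepLS formulation, which adds box constraints and a length regularizer.

```python
import numpy as np

def chan_vese_energy(image, phi):
    """Chan-Vese-style region energy for a level set `phi` (phi > 0 = inside).
    The two data terms pull the zero level set toward a partition whose
    inside/outside mean intensities best explain the image (the usual
    curve-length regularizer is omitted here for brevity)."""
    inside = phi > 0
    c1 = image[inside].mean() if inside.any() else 0.0
    c2 = image[~inside].mean() if (~inside).any() else 0.0
    return ((image - c1) ** 2)[inside].sum() + ((image - c2) ** 2)[~inside].sum()

# toy example: a bright square on a dark background
img = np.zeros((16, 16)); img[4:12, 4:12] = 1.0
good_phi = np.where(img > 0.5, 1.0, -1.0)    # level set aligned with the object
bad_phi = np.ones((16, 16))                  # everything labeled "inside"
e_good = chan_vese_energy(img, good_phi)     # perfect fit: zero energy
e_bad = chan_vese_energy(img, bad_phi)       # poor fit: large energy
```

The aligned level set attains zero energy because each region is constant, while the degenerate all-inside partition pays for every pixel that deviates from the single global mean; gradient descent on this energy is what drives the contour toward the object.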

5. Adaptation to Adverse Conditions and Domain Shift

Robust aerial instance segmentation requires adaptation to degraded image conditions and domain mismatch:

  • Low-light enhancement: Self-supervised Retinex-style enhancement, trained jointly with Mask Cascade segmentation heads, yields a ~2x AP boost for models trained on synthetic low-light data (GAN-generated via CycleGAN), further improved by end-to-end enhancement (+0.9 AP over HTC baseline for box and mask AP) (Garg et al., 2021).
  • Zero-shot adaptation: The ZoRI framework leverages a Mask2Former+CLIP backbone with a discrimination-enhanced text classifier, knowledge-maintained partial adaptation, and a prior-injected visual cache. Without any mask annotations for unseen classes, it achieves state-of-the-art harmonic mean mAP (e.g., 15.5% vs. 8.8% for previous FC-CLIP, iSAID zero-shot task) (Huang et al., 2024).
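The Retinex-style enhancement referenced above can be illustrated with the classical single-scale Retinex operator: reflectance is recovered as log(image) minus the log of a Gaussian-blurred illumination estimate, which makes the output largely invariant to the global illumination level. This is a textbook sketch, not the self-supervised network of Garg et al.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def single_scale_retinex(image, sigma=15.0, eps=1e-6):
    """Single-scale Retinex: reflectance = log(I) - log(illumination),
    with illumination estimated by a Gaussian blur of the image."""
    image = image.astype(np.float64) + eps
    illumination = gaussian_filter(image, sigma) + eps
    return np.log(image) - np.log(illumination)

# toy example: the same step edge under bright and dim illumination
x = np.linspace(0.0, 1.0, 64)
scene = np.tile((x > 0.5).astype(float) * 0.5 + 0.25, (64, 1))
bright, dim = scene, scene * 0.1                 # dim view is 10x darker
r_bright = single_scale_retinex(bright)
r_dim = single_scale_retinex(dim)
# the reflectance estimates are nearly identical despite the 10x exposure gap
```

Because a global scale factor cancels in the log-ratio, a downstream segmentation head sees nearly the same input from both exposures, which is the intuition behind pairing such a front-end with the detection and mask heads.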

6. 3D Lifting, Multi-View Consistency, and Scene Reconstruction

Multi-view and multi-scale inconsistencies in 2D mask predictions can degrade 3D scene understanding:

  • Aerial Lifting: Novel radiance-field (NeRF) approaches integrate multiple aerial images and noisy 2D labels, lifting semantic and instance segmentation into a 3D volumetric field (Zhang et al., 2024). Scale-adaptive label fusion with synthetic far-views (for large buildings), plus cross-view grouping of SAM masks, reduces label entropy by >30% and improves 3D panoptic quality by 10–20 points compared to previous methods.
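The label-entropy reduction reported for Aerial Lifting can be made tangible with a toy measure: stack the 2D labels from several views, compute the per-pixel entropy of the vote distribution, and compare it before and after a simple majority-vote fusion. This numpy sketch is illustrative only; the paper's scale-adaptive fusion is considerably more involved.

```python
import numpy as np

def label_entropy(votes, n_classes):
    """Mean per-pixel entropy (in bits) of label votes stacked over views.
    `votes` has shape (n_views, H, W) with integer class ids."""
    n_views = votes.shape[0]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    p = counts / n_views                                 # (n_classes, H, W)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.where(p > 0, p * np.log2(p), 0.0).sum(axis=0)
    return float(ent.mean())

# toy example: three views that disagree on one half of the image
v = np.zeros((3, 4, 4), int)
v[0, :, 2:] = 1                        # one dissenting view on the right half
before = label_entropy(v, n_classes=2)
majority = np.argmax([(v == c).sum(0) for c in [0, 1]], axis=0)
fused = np.tile(majority, (3, 1, 1))   # replace every view with the majority label
after = label_entropy(fused, n_classes=2)   # fusion drives the entropy to zero
```

Entropy here is simply a proxy for cross-view disagreement: pixels where all views vote alike contribute zero, so any fusion that reconciles the views lowers the mean.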

7. Methodological Insights, Limitations, and Future Directions

Several methodological themes emerge:

  • Boundary and geometric priors are essential for high-precision delineation, especially for small, elongated, or irregular instances prevalent in aerial data (Li et al., 2020, Mou et al., 2018).
  • Anchors and proposal mechanisms benefit from adaptation (multi-scale, orientation-aware proposals) given arbitrary object orientation/distribution (Zamir et al., 2019, Mohamed et al., 2021).
  • Watershed and contour-based post-processing remain competitive for resolving touching/overlapping instances in high-density domains (Iglovikov et al., 2018, Polewski et al., 2021).
  • Multi-modal/transfer learning for multispectral data is enabled via encoder adaptation and staged unfreezing (Iglovikov et al., 2018).
  • Box and weakly-supervised regimes approach mask-supervised fidelity via direct curve evolution and constraint-driven energy terms (Li et al., 2021).
  • Novel radiance-field approaches integrating 2D and 3D cues are establishing benchmarks for 3D-aware aerial instance segmentation (Zhang et al., 2024).

Persistent limitations include small object detection (especially under occlusion), model adaptation to non-uniform lighting and local tone, computational overhead for shape model evolution, and annotation cost for boundary-accurate masks in new domains.

Emerging directions identified include open-vocabulary, zero-shot, and 3D-aware methods—leveraging foundation models (e.g., CLIP, Mask2Former+SAM), generalized energy-based shape regularization, and real-time or memory-constrained architectures—as well as further integration of spatial and contextual information for robust scene understanding in high-density, high-variation aerial imagery.
