
iSAID: Aerial Instance Segmentation Dataset

Updated 28 February 2026
  • iSAID is a large-scale benchmark for aerial instance segmentation that addresses unique challenges such as scale variability, dense object clusters, and class imbalance.
  • It comprises 2,806 high-resolution images with 655,451 annotated objects using precise polygonal masks and multi-tier quality controls in COCO format.
  • Evaluation with COCO-style metrics reveals that two-stage and transformer-based models, assisted by tiling and loss reweighting, significantly outperform one-stage baselines.

The Instance Segmentation in Aerial Images Dataset (iSAID) is a large-scale benchmark designed for per-instance segmentation and object detection in high-resolution overhead imagery. It addresses the distinctive challenges of aerial scenes, including extreme scale variability, dense object distributions, severe class imbalance, and complex object orientations. iSAID is widely used in the remote sensing community to evaluate novel architectures and methodologies for object detection and for semantic and instance segmentation.

1. Dataset Composition and Annotation

iSAID comprises 2,806 high-resolution images, with image sizes ranging from approximately 800×800 pixels up to roughly 13,000 pixels on the long edge. These images are sourced from multiple platforms and sensors, ensuring wide heterogeneity in scene content, altitude, and angle (Zamir et al., 2019). The dataset contains 655,451 annotated object instances across 15 semantically labeled foreground categories:

  • Plane
  • Ship
  • Storage tank
  • Baseball diamond
  • Tennis court
  • Basketball court
  • Ground track field
  • Harbor
  • Bridge
  • Large vehicle
  • Small vehicle
  • Helicopter
  • Roundabout
  • Swimming pool
  • Soccer-ball field

Annotations were produced through a multi-tiered process: expert-designed guidelines, trained annotators with rigorous assessment, and five stages of quality control (self, peer, supervisory, and expert review, plus statistical outlier checks). Each object is specified by a polygonal mask, rasterized at the original resolution. Annotation guidelines demand precise boundary tracing, even for extremely small (≈10 px) or very large (≫1,000 px) objects (Zamir et al., 2019). The annotation files adopt the COCO JSON format, including segmentation polygons, bounding boxes, class IDs (0–14), and image IDs (Demidov et al., 2022).
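Because the annotations follow the COCO JSON convention, they can be loaded with standard COCO tooling. The minimal sketch below uses pycocotools to read an annotation file and rasterize the polygon masks of one image; the file path is an illustrative placeholder, not the official layout.

```python
# Minimal sketch: reading iSAID-style COCO annotations with pycocotools.
# The annotation path below is a hypothetical placeholder.
from pycocotools.coco import COCO

coco = COCO("iSAID/train/instances_train.json")

# List the 15 foreground categories (IDs 0-14 per the text above).
for cat in coco.loadCats(coco.getCatIds()):
    print(cat["id"], cat["name"])

# Fetch all annotations for the first image and rasterize the polygons.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
masks = [coco.annToMask(a) for a in anns]  # binary HxW arrays
print(f"image {img_id}: {len(masks)} instance masks")
```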

Dataset splits (exact image counts vary slightly by version and protocol) are approximately:

| Split      | Images       |
|------------|--------------|
| Training   | ≈1,408–1,411 |
| Validation | ≈450–458     |
| Testing    | ≈935–948     |

Test annotations are withheld for blind benchmarking. Directory structure follows COCO convention, with images and annotation JSONs partitioned by split (Demidov et al., 2022).
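As a rough illustration of such a layout (folder and file names here are assumptions, not the official release structure):

```
iSAID/
├── train/
│   ├── images/                # high-resolution scenes or tiles
│   └── instances_train.json   # COCO-format annotations
├── val/
│   ├── images/
│   └── instances_val.json
└── test/
    └── images/                # annotations withheld for benchmarking
```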

2. Unique Challenges of Aerial Instance Segmentation

iSAID formalizes four principal challenges distinct from natural scene benchmarks:

  1. Extreme instance density: On average, each image contains ≈239 annotated objects (maximum ≈8,000), compared to 7.1 for MSCOCO and 2.6 for Cityscapes (Zamir et al., 2019). This demands architectures robust to severe object overlap and clustering.
  2. Scale variation and size distribution: Object sizes range from roughly 10×10 pixels to many thousands of pixels on a side. The global area ratio satisfies $A_{\max}/A_{\min} \lesssim 2\times10^{4}$, with 52% of instances "small" (area $10 \le a < 144$), 33.7% "medium" ($144 \le a < 1{,}024$), and 9.7% "large" ($a \ge 1{,}024$). Processing requires tiling strategies and multi-scale feature integration (Zamir et al., 2019, Demidov et al., 2022).
  3. Severe class imbalance: Some classes (e.g., basketball court, ground track field) have under 1,000 instances, while "small vehicle" alone accounts for several hundred thousand labels. The observed class imbalance ratio approaches 4,680:1 between the most and least frequent categories (Demidov et al., 2022).
  4. Arbitrary orientation and large aspect ratios: Aspect ratios reach up to 90:1. Oriented object representation or rotation-equivariant models are required to handle variably oriented and elongated objects (Zamir et al., 2019).

These factors conspire to make conventional approaches from natural-scene instance segmentation suboptimal on iSAID.

3. Evaluation Protocols and Metrics

iSAID employs COCO-style metrics for both detection and segmentation. Core evaluation protocols include (Zamir et al., 2019, Demidov et al., 2022, Dahal et al., 2024):

$$\text{IoU} = \frac{\mathrm{Area}(B_{\mathrm{pred}} \cap B_{\mathrm{gt}})}{\mathrm{Area}(B_{\mathrm{pred}} \cup B_{\mathrm{gt}})}$$

  • Average Precision (AP): Computed as the mean precision over recall at IoU thresholds in $[0.50, 0.95]$ (in steps of 0.05), with AP@0.50 and AP@0.75 reported.
  • Mean Average Precision (mAP): Average of per-class AP over all IoU thresholds.
  • Instance-size specific AP: AP_S, AP_M, and AP_L for small, medium, and large masks.
  • Semantic Segmentation: Mean Intersection-over-Union (mIoU), Dice score, and entropy-based losses.

    • mIoU, averaged over the $C$ evaluated classes:

    $$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}$$

    • Dice score, for predicted mask $P$ and ground-truth mask $G$:

    $$\mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|}$$
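As a minimal sketch of these two segmentation metrics (assuming integer label maps with background encoded as 255 and ignored, which is a common but not universal convention):

```python
import numpy as np

def miou_and_dice(pred, gt, num_classes=15, ignore_index=255):
    """Per-image mIoU and mean Dice over foreground classes.

    pred, gt: integer label maps of shape (H, W).
    Classes absent from both prediction and ground truth are skipped.
    """
    valid = gt != ignore_index
    ious, dices = [], []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        if union == 0:  # class absent everywhere: skip
            continue
        ious.append(inter / union)
        dices.append(2 * inter / (p.sum() + g.sum()))
    return float(np.mean(ious)), float(np.mean(dices))
```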

Objects smaller than 10 pixels in any dimension are excluded from official detection and segmentation evaluation (Demidov et al., 2022).
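Since annotations and metrics follow COCO conventions, the scores above can be reproduced with the pycocotools evaluator; the file names in this sketch are placeholders.

```python
# Sketch: COCO-style evaluation of segmentation results on iSAID.
# "instances_val.json" and "results.json" are illustrative file names.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val.json")        # ground-truth annotations
coco_dt = coco_gt.loadRes("results.json")   # model predictions

# iouType="segm" scores masks; use "bbox" for detection boxes.
evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.50:.95], AP50, AP75, AP_S/M/L
```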

4. Baseline Methods and Model Performance

Performance baselines span both detection and segmentation, evaluated on fixed-size image crops rather than full scenes due to hardware constraints (Demidov et al., 2022, Zamir et al., 2019). Representative models and results include:

Object Detection and Instance Segmentation

| Model (Backbone)        | AP@[.50:.95] | AP@.50 | AP@.75 | AP_S   | AP_M   | AP_L   |
|-------------------------|--------------|--------|--------|--------|--------|--------|
| Mask R-CNN (ResNet-50)  | 32.2%        | 54.6%  | 34.5%  | —      | —      | —      |
| PANet (ResNet-101)      | 34.17%       | 56.57% | 35.84% | 19.56% | 42.27% | 46.62% |
| PANet++ (ResNet-152)    | 40.0%        | 64.54% | 42.5%  | 42.46% | 54.74% | 43.16% |
| Faster R-CNN + FPN      | 29.7%        | 52.3%  | 31.8%  | —      | —      | —      |
| YOLOv3                  | 14.4%        | 36.2%  | 11.8%  | —      | —      | —      |

Semantic Segmentation (mIoU, Dice):

| Model                     | mIoU   | Dice | Params        | Inference (s/img) |
|---------------------------|--------|------|---------------|-------------------|
| UNet (CNN, fused loss)    | 73.4%  | 74%  | 42.9M         | 0.19              |
| MaskFormer (ViT, Swin-L)  | 82.48% | 82%  | 200M (frozen) | 0.29              |
| AerialFormer-B (ViT, SOTA)| 69.3%  | —    | 113.8M        | —                 |

Two-stage and attention-based models consistently outperform one-stage baselines by ≥10 mAP. Incorporation of attention (CBAM, SENet, etc.) and weighted sampling/focal losses yields measurable gains in the presence of class imbalance and small-object prevalence (Demidov et al., 2022, Dahal et al., 2024).
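For concreteness, the squeeze-and-excitation (SE) channel attention referenced above can be sketched in a few lines of PyTorch; this is the generic SENet block, not the exact module used by the cited baselines.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic squeeze-and-excitation channel attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: global average pool to a per-channel descriptor.
        w = x.mean(dim=(2, 3))                      # (N, C)
        # Excite: per-channel gates in (0, 1), then rescale the features.
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        return x * w
```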

5. Model Modifications, Losses, and Practical Recommendations

Modeling adaptations for iSAID:

  • Tiling/sliding-window inference: Essential for retaining small-object fidelity. Full scenes are cropped into fixed-size overlapping tiles (stride 512–800 in the cited setup), and per-tile predictions are merged back in scene coordinates (Demidov et al., 2022); a minimal tiling sketch follows this list.
  • Feature Pyramid Networks (FPN): Enable multi-scale feature aggregation.
  • Weighted attention FPN: Learnable per-scale fusion, with per-level weights $w_i$ derived from globally average-pooled (GAP) feature statistics.
  • Class-balanced sampling: The sampling probability for each class is set inversely related to its instance frequency, counteracting the long-tailed class distribution.
  • Weighted focal loss: Per-class weights $\alpha_c$ combined with the standard focal term,

$$\mathcal{L}_{\mathrm{focal}} = -\alpha_c\,(1 - p_t)^{\gamma}\,\log p_t,$$

    which down-weights easy examples and emphasizes rare classes.

  • Density prediction head: An auxiliary head predicts object-density maps, trained with an $\ell_2$ penalty against Gaussian-smoothed ground-truth density maps (Demidov et al., 2022).
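As referenced in the first bullet above, a minimal sliding-window tiling sketch (the 800-px tile and 600-px stride are illustrative defaults, not the cited configuration):

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 800, stride: int = 600):
    """Yield (x, y, crop) tiles covering an (H, W, C) image.

    Overlapping tiles (stride < tile) keep small objects intact near
    tile borders; predictions are later merged back in scene coordinates.
    """
    h, w = image.shape[:2]
    xs = list(range(0, max(w - tile, 0) + 1, stride))
    ys = list(range(0, max(h - tile, 0) + 1, stride))
    # Ensure the right and bottom edges are covered.
    if xs[-1] + tile < w:
        xs.append(w - tile)
    if ys[-1] + tile < h:
        ys.append(h - tile)
    for y in ys:
        for x in xs:
            yield x, y, image[y:y + tile, x:x + tile]
```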

For semantic segmentation:

  • Fused loss function: Cross-entropy combined with Dice and IoU terms,

$$\mathcal{L}_{\mathrm{fused}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{Dice}} + \mathcal{L}_{\mathrm{IoU}},$$

    with the Dice and IoU terms as standard per-pixel set similarities, computed excluding the background class (a PyTorch sketch follows this list).

  • Augmentation: Random cropping, flipping, full 360° rotation, and normalization to ImageNet statistics are effective (Dahal et al., 2024).
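A minimal PyTorch sketch of such a fused loss, assuming soft (probabilistic) Dice and IoU terms and equal weighting of the three components; the cited work may weight or define the terms differently.

```python
import torch
import torch.nn.functional as F

def fused_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Cross-entropy + soft Dice + soft IoU, excluding background (class 0).

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer labels,
    where class 0 is background and classes 1..C-1 are foreground.
    """
    num_classes = logits.shape[1]
    ce = F.cross_entropy(logits, target)

    probs = logits.softmax(dim=1)[:, 1:]  # foreground probabilities
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()[:, 1:]

    inter = (probs * onehot).sum(dim=(0, 2, 3))
    p_sum = probs.sum(dim=(0, 2, 3))
    g_sum = onehot.sum(dim=(0, 2, 3))

    dice = 1 - ((2 * inter + eps) / (p_sum + g_sum + eps)).mean()
    iou = 1 - ((inter + eps) / (p_sum + g_sum - inter + eps)).mean()
    return ce + dice + iou
```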

6. Empirical Insights and Best Practices

  • Two-stage detectors augmented with mask (instance) and attention heads yield mAP in the 30–37 range, while one-stage methods trail by ≈10–15 points (Demidov et al., 2022).
  • Semantic segmentation models based on transformer backbones (MaskFormer, AerialFormer) exhibit superior mIoU and Dice, but require greater resources (Dahal et al., 2024).
  • Class imbalance mitigation (sampling, loss reweighting) improves mAP by 3–5 points and significantly boosts rare-class recall (Demidov et al., 2022).
  • Sliding window and crop-based inference preserve both context and spatial resolution, enabling models to detect extremely small objects while maintaining global structure (Zamir et al., 2019).
  • Integration of a fused Dice/IoU/cross-entropy loss leads to notable absolute mIoU increases (≈+15%) over standard cross-entropy in CNNs (Dahal et al., 2024).

7. Research Impact and Directions

iSAID has established itself as the primary benchmark for dense, high-resolution aerial imagery segmentation and detection (Zamir et al., 2019). Its use catalyzed new model developments, including scale-robust backbones, rotation-equivariant networks, and loss/augmentation strategies tailored to the non-i.i.d. scale and orientation of aerial scenes. The challenge of small-object segmentation, dense clustering, and class imbalance evident in benchmark results continues to drive research in specialized NMS, feature/exemplar re-weighting, and domain adaptation. Over 80% of state-of-the-art results on iSAID now utilize attention mechanisms or transformer architectures (Dahal et al., 2024). The dataset remains actively maintained, with public code and evaluation servers supporting reproducible research.

For further reference, the dataset and benchmark details, including download links and evaluation protocols, are available at https://captain-whu.github.io/iSAID/index.html (Zamir et al., 2019).
