iSAID: Aerial Instance Segmentation Dataset
- iSAID is a large-scale benchmark for aerial instance segmentation that addresses unique challenges such as scale variability, dense object clusters, and class imbalance.
- It comprises 2,806 high-resolution images with 655,451 objects annotated with precise polygonal masks, released in COCO format after multi-tier quality control.
- Evaluation with COCO-style metrics reveals that two-stage and transformer-based models, assisted by tiling and loss reweighting, significantly outperform one-stage baselines.
The Instance Segmentation in Aerial Images Dataset (iSAID) is a large-scale benchmark designed for detailed per-instance segmentation and object detection tasks in high-resolution overhead imagery. It addresses the unique challenges found in aerial scenes, including extreme scale variability, dense object distributions, severe class imbalance, and complex object orientations. iSAID is widely used to evaluate novel architectures and methodologies in both object detection and (semantic/instance) segmentation within the remote sensing community.
1. Dataset Composition and Annotation
iSAID comprises 2,806 high-resolution images with spatial dimensions ranging from roughly 800 up to 13,000 pixels on the long edge. These images are sourced from multiple platforms and sensors, ensuring wide heterogeneity in scene content, altitude, and angle (Zamir et al., 2019). The dataset contains 655,451 annotated object instances across 15 semantically labeled foreground categories:
- Plane
- Ship
- Storage tank
- Baseball diamond
- Tennis court
- Basketball court
- Ground track field
- Harbor
- Bridge
- Large vehicle
- Small vehicle
- Helicopter
- Roundabout
- Swimming pool
- Soccer-ball field
Annotations were produced through a multi-tiered process: expert-designed guidelines, trained annotators with rigorous assessment, and five stages of quality control (self, peer, supervisory and expert review, and statistical outlier checks). Each object is specified by a polygonal mask, rasterized at the original resolution. Annotation guidelines demand precise boundary tracing, even for extremely small (10 px) or large (1,000 px) objects (Zamir et al., 2019). The annotation files adopt the COCO JSON format, including segmentation polygons, bounding boxes, class IDs (0–14), and image IDs (Demidov et al., 2022).
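Because the annotations follow the COCO JSON schema, they can be parsed with the standard `json` module alone. The sketch below uses a minimal toy annotation (the file name, IDs, and coordinates are illustrative, not real iSAID entries):

```python
import json

# A minimal COCO-style annotation string mirroring the iSAID schema:
# segmentation polygons, bounding boxes, class IDs (0-14), and image IDs.
coco_json = """
{
  "images": [{"id": 1, "file_name": "P0001.png", "width": 800, "height": 800}],
  "categories": [{"id": 0, "name": "ship"}],
  "annotations": [{
    "id": 7, "image_id": 1, "category_id": 0,
    "segmentation": [[10, 10, 60, 10, 60, 40, 10, 40]],
    "bbox": [10, 10, 50, 30],
    "area": 1500
  }]
}
"""
coco = json.loads(coco_json)

def instances_per_image(coco_dict):
    """Group annotation records by their image_id."""
    by_image = {img["id"]: [] for img in coco_dict["images"]}
    for ann in coco_dict["annotations"]:
        by_image[ann["image_id"]].append(ann)
    return by_image

grouped = instances_per_image(coco)
print(len(grouped[1]))  # -> 1 instance annotated for image 1
```

In practice one would load the split's annotation JSON from disk with `json.load`; the grouping step above is the same.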
Dataset splits (exact image counts vary slightly by version and protocol) are approximately:
| Split | Images |
|---|---|
| Training | ≈1408–1411 |
| Validation | ≈450–458 |
| Testing | ≈935–948 |
Test annotations are withheld for blind benchmarking. Directory structure follows COCO convention, with images and annotation JSONs partitioned by split (Demidov et al., 2022).
2. Unique Challenges of Aerial Instance Segmentation
iSAID formalizes four principal challenges distinct from natural scene benchmarks:
- Extreme instance density: On average, each image contains ≈239 annotated objects (maximum ≈8,000), compared to 7.1 for MSCOCO and 2.6 for Cityscapes (Zamir et al., 2019). This demands architectures robust to severe object overlap and clustering.
- Scale variation and size distribution: Object sizes range from 10×10 pixels to several thousand pixels on a side, with 52% of instances classified as small, 33.7% as medium, and 9.7% as large under COCO-style area thresholds. Processing requires tiling strategies and multi-scale feature integration (Zamir et al., 2019, Demidov et al., 2022).
- Severe class imbalance: Some classes (e.g., basketball court, ground track field) have under 1,000 instances, while "small vehicle" alone accounts for by far the largest share of labels. The observed class imbalance ratio approaches 4,680:1 between the most and least frequent categories (Demidov et al., 2022).
- Arbitrary orientation and large aspect ratios: Aspect ratios reach up to 90:1. Oriented object representation or rotation-equivariant models are required to handle variably oriented and elongated objects (Zamir et al., 2019).
These factors conspire to make conventional approaches from natural-scene instance segmentation suboptimal on iSAID.
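Statistics like the size distribution above can be reproduced directly from the polygon annotations. As a minimal sketch (toy polygons; COCO-style area thresholds of 32² and 96² pixels assumed), the shoelace formula gives each instance's mask area:

```python
def polygon_area(coords):
    """Shoelace formula; coords is a flat [x1, y1, x2, y2, ...] list."""
    xs, ys = coords[0::2], coords[1::2]
    n = len(xs)
    s = sum(xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n))
    return abs(s) / 2.0

def size_bucket(area):
    # COCO-style size buckets: small < 32^2, medium < 96^2, else large.
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

# Toy polygons in the flat coordinate format used by COCO "segmentation" fields
polys = [
    [0, 0, 10, 0, 10, 10, 0, 10],      # 100 px^2    -> small
    [0, 0, 80, 0, 80, 80, 0, 80],      # 6,400 px^2  -> medium
    [0, 0, 200, 0, 200, 200, 0, 200],  # 40,000 px^2 -> large
]
buckets = [size_bucket(polygon_area(p)) for p in polys]
print(buckets)  # -> ['small', 'medium', 'large']
```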
3. Evaluation Protocols and Metrics
iSAID employs COCO-style metrics for both detection and segmentation. Core evaluation protocols include (Zamir et al., 2019, Demidov et al., 2022, Dahal et al., 2024):
- Average Precision (AP): Computed by averaging precision over recall at IoU thresholds in $[0.5, 0.95]$ (in steps of $0.05$), with $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$ reported separately.
- Mean Average Precision (mAP): Average of per-class AP over all IoU thresholds.
- Instance-size specific AP: $\mathrm{AP}_S$, $\mathrm{AP}_M$, and $\mathrm{AP}_L$ for small, medium, and large masks.
- Semantic segmentation: Mean Intersection-over-Union (mIoU) and Dice score; models are typically trained with cross-entropy-based losses.
- mIoU:
$$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c}$$
- Dice score:
$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$
Objects with less than 10 pixels in any dimension are excluded from official detection and segmentation evaluation (Demidov et al., 2022).
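As a pure-Python sketch of the two segmentation metrics (flat integer label lists stand in for full-resolution label maps):

```python
def miou_and_dice(pred, gt, num_classes):
    """Compute mean IoU and mean Dice from flat integer label lists."""
    ious, dices = [], []
    for c in range(num_classes):
        tp = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gt) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gt) if p != c and g == c)
        if tp + fp + fn == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / (tp + fp + fn))
        dices.append(2 * tp / (2 * tp + fp + fn))
    return sum(ious) / len(ious), sum(dices) / len(dices)

pred = [0, 0, 1, 1]
gt   = [0, 1, 1, 1]
miou, dice = miou_and_dice(pred, gt, num_classes=2)
print(round(miou, 3), round(dice, 3))  # -> 0.583 0.733
```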
4. Baseline Methods and Model Performance
Performance baselines span both detection and segmentation, evaluated on fixed-size crops (commonly 800×800 pixels) rather than full scenes due to hardware constraints (Demidov et al., 2022, Zamir et al., 2019). Representative models and results include:
Object Detection and Instance Segmentation
| Model (Backbone) | AP@0.5:0.95 | AP@0.5 | AP@0.75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|
| Mask R-CNN (ResNet-50) | 32.2% | 54.6% | 34.5% | — | — | — |
| PANet (ResNet-101) | 34.17% | 56.57% | 35.84% | 19.56% | 42.27% | 46.62% |
| PANet++ (ResNet-152) | 40.0% | 64.54% | 42.5% | 42.46% | 54.74% | 43.16% |
| Faster R-CNN + FPN | 29.7% | 52.3% | 31.8% | — | — | — |
| YOLOv3 | 14.4% | 36.2% | 11.8% | — | — | — |
Semantic Segmentation (mIoU, Dice):
| Model | mIoU | Dice | Params | Inference (s/img) |
|---|---|---|---|---|
| UNet (CNN, fused loss) | 73.4% | 74% | 42.9M | 0.19 |
| MaskFormer (ViT, Swin-L) | 82.48% | 82% | 200M (frozen) | 0.29 |
| AerialFormer-B (ViT, SOTA) | 69.3% | — | 113.8M | — |
Two-stage and attention-based models consistently outperform one-stage baselines by ≥10 mAP. Incorporation of attention (CBAM, SENet, etc.) and weighted sampling/focal losses yields measurable gains in the presence of class imbalance and small-object prevalence (Demidov et al., 2022, Dahal et al., 2024).
5. Model Modifications, Losses, and Practical Recommendations
Modeling adaptations for iSAID:
- Tiling/Sliding-window inference: Essential for retaining small-object fidelity. Standard crop sizes are around 800×800 pixels, with strides of 512‒800 (Demidov et al., 2022).
- Feature Pyramid Networks (FPN): Enable multi-scale aggregation.
- Weighted attention FPN: Learnable per-scale fusion; the weights $w_i$ are computed from globally pooled features, e.g.
$$w_i = \sigma\!\left(\mathbf{W}\,\mathrm{GAP}(F_i)\right),$$
where $\mathrm{GAP}$ denotes global average pooling, $F_i$ the feature map at scale $i$, and $\sigma$ a sigmoid gating function.
- Class-balanced sampling: The sampling probability for class $c$ is inversely proportional to its frequency,
$$p_c = \frac{1/N_c}{\sum_{k} 1/N_k},$$
where $N_c$ is the instance count of class $c$, counteracting the long-tailed class distribution.
- Weighted focal loss:
$$\mathcal{L}_{\mathrm{focal}} = -\,\alpha_c\,(1 - p_t)^{\gamma}\,\log(p_t),$$
where $p_t$ is the predicted probability of the true class and $\alpha_c$ a per-class weight.
- Density prediction head: An auxiliary head predicts density maps $\hat{D}$, trained with an $\ell_2$ penalty against Gaussian-smoothed ground-truth density maps (Demidov et al., 2022).
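The class-balancing pieces above can be sketched in a few lines (the counts, the choice of inverse-frequency α, and γ = 2 are illustrative; the exact coupling of sampling weights and focal weights is an implementation choice):

```python
import math

def weighted_focal_loss(p_t, alpha_c, gamma=2.0):
    """FL = -alpha_c * (1 - p_t)^gamma * log(p_t), with p_t the true-class prob."""
    return -alpha_c * (1.0 - p_t) ** gamma * math.log(p_t)

def class_probs(counts):
    """Inverse-frequency sampling probabilities from per-class instance counts."""
    inv = [1.0 / n for n in counts]
    z = sum(inv)
    return [v / z for v in inv]

counts = [700, 7, 70]          # illustrative long-tailed class counts
probs = class_probs(counts)    # rare classes get higher sampling probability

loss_easy = weighted_focal_loss(p_t=0.9, alpha_c=probs[0])
loss_hard = weighted_focal_loss(p_t=0.1, alpha_c=probs[0])
print(loss_hard > loss_easy)   # -> True: hard examples are up-weighted
```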
For semantic segmentation:
- Fused loss function:
$$\mathcal{L}_{\mathrm{fused}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{Dice}} + \mathcal{L}_{\mathrm{IoU}},$$
with the Dice and IoU terms computed as standard per-pixel set similarities, excluding the background class.
- Augmentation: Random cropping, flipping, full 360° rotation, and normalization to ImageNet statistics are effective (Dahal et al., 2024).
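A minimal sketch of such a fused loss for a binary mask, using soft (probabilistic) Dice and IoU terms with equal weighting (the exact term weights are an implementation choice):

```python
import math

def fused_loss(probs, targets, eps=1e-7):
    """Cross-entropy + soft Dice + soft IoU over a flat binary mask."""
    ce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
              for p, t in zip(probs, targets)) / len(probs)
    inter = sum(p * t for p, t in zip(probs, targets))
    psum, tsum = sum(probs), sum(targets)
    dice = 1 - (2 * inter + eps) / (psum + tsum + eps)       # soft Dice loss
    iou = 1 - (inter + eps) / (psum + tsum - inter + eps)    # soft IoU loss
    return ce + dice + iou

good = fused_loss([0.9, 0.1, 0.9], [1, 0, 1])
bad  = fused_loss([0.1, 0.9, 0.1], [1, 0, 1])
print(good < bad)  # -> True: better predictions yield a lower fused loss
```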
6. Empirical Insights and Best Practices
- Two-stage detectors augmented with mask (instance) and attention heads yield mAP in the 30–37 range, while one-stage methods trail by ≈10–15 points (Demidov et al., 2022).
- Semantic segmentation models based on transformer backbones (MaskFormer, AerialFormer) exhibit superior mIoU and Dice, but require greater resources (Dahal et al., 2024).
- Class imbalance mitigation (sampling, loss reweighting) improves mAP by 3–5 points and significantly boosts rare-class recall (Demidov et al., 2022).
- Sliding window and crop-based inference preserve both context and spatial resolution, enabling models to detect extremely small objects while maintaining global structure (Zamir et al., 2019).
- Integration of the fused Dice/IoU/cross-entropy loss leads to notable absolute mIoU gains over standard cross-entropy in CNNs (Dahal et al., 2024).
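The sliding-window scheme can be sketched as overlap-aware window generation (the crop and stride values are illustrative, and the sketch assumes the image is at least one crop wide and tall):

```python
def sliding_windows(width, height, crop=800, stride=600):
    """Yield (x0, y0, x1, y1) crop boxes covering the image with overlap."""
    xs = list(range(0, max(width - crop, 0) + 1, stride))
    ys = list(range(0, max(height - crop, 0) + 1, stride))
    # Add extra windows so the right/bottom borders are fully covered
    if xs[-1] + crop < width:
        xs.append(width - crop)
    if ys[-1] + crop < height:
        ys.append(height - crop)
    return [(x, y, x + crop, y + crop) for y in ys for x in xs]

# A 4,000 x 13,000 scene, the upper end of iSAID image sizes
boxes = sliding_windows(4000, 13000)
print(len(boxes))
```

Predictions from the crops are then mapped back to full-scene coordinates and merged (e.g., with NMS on overlapping regions).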
7. Research Impact and Directions
iSAID has established itself as the primary benchmark for dense, high-resolution aerial imagery segmentation and detection (Zamir et al., 2019). It has catalyzed new model developments, including scale-robust backbones, rotation-equivariant networks, and loss/augmentation strategies tailored to the scale and orientation statistics of aerial scenes. The challenges of small-object segmentation, dense clustering, and class imbalance evident in benchmark results continue to drive research in specialized NMS, feature/exemplar re-weighting, and domain adaptation. Over 80% of state-of-the-art results on iSAID now utilize attention mechanisms or transformer architectures (Dahal et al., 2024). The dataset remains actively maintained, with public code and evaluation servers supporting reproducible research.
For further reference, the dataset and benchmark details, including download links and evaluation protocols, are available at https://captain-whu.github.io/iSAID/index.html (Zamir et al., 2019).