- The paper demonstrates that single-stage models like RetinaNet deliver faster detection for large objects such as fracking wells.
- The study reveals that two-stage detectors, notably Grid R-CNN, achieve higher precision for densely packed small cars.
- The comparison highlights key trade-offs between detection speed and accuracy across different model architectures.
A Comparison of Deep Learning Object Detection Models for Satellite Imagery
Introduction
The paper focuses on evaluating state-of-the-art object detection models for the specific task of identifying oil and gas fracking wells and small cars within commercial electro-optical (EO) satellite imagery. It explores several models from the single-stage, two-stage, and multi-stage object detection families, analyzing their detection accuracy and processing speed. These models are crucial for automating the analysis of satellite imagery, a task of increasing importance due to the vast amounts of data collected by modern sensors.
Figure 1: Examples of fracking wells in various phases in New Mexico, USA. Images: DigitalGlobe, Inc.
Problem and Dataset
The detection of objects in satellite images, such as fracking wells and small cars, presents unique challenges including large image sizes, varying object scales, and occlusions. The dataset involves manually curated fracking well images from WorldView-3 imagery and small car images from the xView dataset, which is publicly available and consists of a wide range of contexts and environments.
Figure 2: Examples of densely packed small cars in an urban environment from the xView dataset. Each colored box represents one object our algorithms aim to detect.
Methods and Models
Several models were selected using the MMDetection library, which standardizes various architectures into functional elements like backbones for feature extraction. The models include:
- Single-stage detectors: RetinaNet and GHM, which aim for speedy detection by bypassing intermediate proposal stages.
- Two-stage detectors: Faster R-CNN, Grid R-CNN, and Double-Head R-CNN, which use an initial stage for proposing regions followed by a refinement stage.
- Multi-stage detector: Cascade R-CNN, which uses a sequence of detectors with increasing localization precision constraints.
The models were trained and evaluated on both object types, with benchmarking against the sliding window approach for performance comparison.
Figure 3: Schematic of the sliding window object detection technique. Blue box represents a notional ground truth bounding box. Image: DigitalGlobe, Inc.
Results
The study found that, for detecting fracking wells, single-stage models like RetinaNet and GHM achieved superior performance metrics and processing speeds. Meanwhile, the detection of small cars favored two-stage and multi-stage detectors, achieving better precision at a cost to speed. Specifically, Grid R-CNN exhibited the highest accuracy for car detection.
The evaluation metrics included average precision (AP), max $\mathrm{F_{1}$ score}, and area coverage rate, which were meticulously analyzed across models and backbone architectures.
Figure 4: Example prediction output from Grid R-CNN (with ResNet-101 backbone) on an xView validation image containing 3929 small cars. Yellow boxes are ground truth bounding boxes and red are predicted bounding boxes.
Conclusion
The research underscores the importance of model selection based on specific task requirements in satellite imagery processing. While single-stage models provide a significant advantage in speed for larger objects like fracking wells, two-stage models yield better results for smaller, densely packed objects such as cars. The findings have implications for deploying satellite image analysis systems that optimize for either speed or accuracy based on the operational need.
Future work aims to evaluate additional models and explore multi-class object detection scenarios to broaden the applicability of these techniques in diverse environmental contexts and imaging needs.