DUT Anti-UAV Dataset
- DUT Anti-UAV dataset is a curated benchmark for vision-based UAV detection and tracking with detailed manual annotations and varied scene conditions.
- It is structured into dedicated detection and tracking subsets, enabling rigorous evaluation using standard metrics like IoU, precision, and mAP.
- The dataset additionally specifies a detection-fused tracking scheme and is released under the open MIT license, fostering advances in anti-UAV surveillance research.
The DUT Anti-UAV dataset is a curated benchmark designed for computer vision-based UAV detection and tracking, addressing the growing demand for effective anti-UAV surveillance strategies in complex visual environments. Developed by Dalian University of Technology, the dataset comprises visible-light imagery and video sequences with precise manual annotations, facilitating rigorous evaluation of state-of-the-art detection and tracking algorithms. It is structured into dedicated detection and tracking subsets and is released under the MIT license, enabling broad academic and commercial usage (Zhao et al., 2022).
1. Dataset Structure and Annotation Protocol
Detection Subset
The detection set contains 10,000 images partitioned into training (5,200 images; 5,243 UAV annotations), validation (2,600; 2,621), and test (2,200; 2,245) splits. Each instance of the sole “UAV” class is labeled with an axis-aligned bounding box $(x_{\min}, y_{\min}, w, h)$ (units in pixels), typically recorded as:
```
000123.jpg, x_min=345, y_min=120, w=48, h=30, class="uav"
```
Over 35 distinct UAV types are represented, spanning rotorcraft and fixed-wing models. Scene diversity includes urban (high-rises, parking lots), rural (fields, forests), varied weather (sunny, cloudy, snowy), and broad temporal coverage (dawn/dusk, day, night).
Quantitative image statistics:
- UAV bounding box area ratio: up to $0.70$ of the image area at maximum, with most targets occupying a far smaller fraction (per-box statistics can be computed as in the sketch after this list)
- Aspect ratio ($w/h$, taken $\geq 1$): $1.0$ to $6.67$
- UAV spatial distribution: box centers are denser near the image centroid and otherwise roughly uniform in $x$/$y$
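The following is a minimal sketch of how such per-box statistics can be derived from the $(x_{\min}, y_{\min}, w, h)$ annotations; the helper function and the example frame size are illustrative assumptions, not part of the dataset's tooling.

```python
# Illustrative sketch: per-box area ratio and aspect ratio from an
# axis-aligned (x_min, y_min, w, h) annotation. Frame size is an example.

def box_stats(w_box, h_box, w_img, h_img):
    """Return (area ratio, aspect ratio >= 1) for one UAV bounding box."""
    area_ratio = (w_box * h_box) / (w_img * h_img)    # fraction of image covered
    aspect = max(w_box, h_box) / min(w_box, h_box)    # taken >= 1.0
    return area_ratio, aspect

# The 48x30 box from the example record above, in a hypothetical 1920x1080 frame:
ar, asp = box_stats(48, 30, 1920, 1080)
print(f"area ratio {ar:.5f}, aspect ratio {asp:.2f}")  # ~0.00069, 1.60
```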
Tracking Subset
Composed of 20 fully annotated video sequences, the tracking subset spans both short-term (roughly 500 frames) and long-term (roughly 2,000 frames) scenarios. Each frame is annotated with a single axis-aligned bounding box, and sequences are provided at one of two fixed video resolutions.
Key dynamics:
- UAV scale (box area / image area): the per-frame ratio varies widely, with many extremely small targets
- Aspect ratio per frame: $1.0$ to $4.0$; notably, Video 10 exhibits especially large variation
- Presented challenges: extremely small targets, rapid motion, camera shake, illumination variation, occlusion, and out-of-view events
2. Evaluation Protocols and Metrics
Detection Evaluation
Standard object detection metrics are adopted:
- Intersection over Union (IoU): $\mathrm{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$
- True/False Positives (TP/FP), False Negatives (FN): a prediction counts as a TP if $\mathrm{IoU} \geq \tau$ and the class matches; an unmatched ground-truth box counts as an FN
- Precision and Recall: $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$
- F₁-score: $F_1 = \frac{2PR}{P + R}$
- Average Precision (AP): area under the precision–recall curve at a fixed IoU threshold $\tau$, $\mathrm{AP} = \int_0^1 P(R)\,dR$
- mean AP (mAP): AP averaged over classes; with the single “UAV” class, mAP reduces to AP at the IoU threshold(s) used in the experiments (a minimal implementation sketch follows this list)
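As a concrete reference, the sketch below renders the IoU and counting-based definitions above in Python; it is a minimal illustration, not code shipped with the dataset.

```python
# Minimal implementations of the detection metrics defined above.

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(tp, fp, fn):
    """P, R, and F1 from TP/FP/FN counts at a fixed IoU threshold."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(iou((345, 120, 48, 30), (350, 125, 48, 30)))  # overlap of two nearby boxes
```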
Tracking Evaluation (One-Pass Evaluation, OPE)
Trackers are initialized with the ground-truth box in the first frame and run to the end of each sequence with no re-initialization:
- Success plot (overlap): given the per-frame overlap $o_t = \mathrm{IoU}(B_t^{pr}, B_t^{gt})$ at frame $t$, the success rate at threshold $\theta$ is $S(\theta) = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}[o_t > \theta]$
- Overall success score: area under $S(\theta)$ over $\theta \in [0, 1]$: $\mathrm{AUC} = \int_0^1 S(\theta)\,d\theta$
- Precision plot (center-error): Euclidean center distance $d_t = \lVert c_t^{pr} - c_t^{gt} \rVert_2$; precision at error threshold $\epsilon$: $P(\epsilon) = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}[d_t < \epsilon]$, conventionally reported at $\epsilon = 20$ px
- Normalized precision: AUC of $P(\epsilon)$ for $\epsilon \in [0, 50]$ px, divided by $50$ (see the sketch after this list)
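A minimal sketch of these OPE scores, reusing `iou()` from the detection sketch above; the threshold grids are one reasonable discretization of the integrals, not the benchmark's official toolkit.

```python
import numpy as np

def ope_scores(pred_boxes, gt_boxes):
    """Success AUC, Pre@20px, and normalized precision from per-frame
    (x_min, y_min, w, h) boxes; uses iou() from the detection sketch."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    cp = np.array([(x + w / 2, y + h / 2) for x, y, w, h in pred_boxes])
    cg = np.array([(x + w / 2, y + h / 2) for x, y, w, h in gt_boxes])
    dists = np.linalg.norm(cp - cg, axis=1)                # center errors (px)

    thetas = np.linspace(0.0, 1.0, 101)                    # overlap thresholds
    success_auc = np.mean([(overlaps > t).mean() for t in thetas])

    pre_20 = (dists < 20).mean()                           # precision at 20 px
    eps = np.linspace(0.0, 50.0, 101)                      # error thresholds
    norm_pre = np.mean([(dists < e).mean() for e in eps])  # approximates AUC / 50
    return success_auc, pre_20, norm_pre
```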
3. Benchmark Algorithm Results
Detection Baselines
| Method | Backbone | mAP | FPS |
|---|---|---|---|
| Faster R-CNN | ResNet-50 | 0.653 | 12.8 |
| Cascade R-CNN | ResNet-50 | 0.683 | 10.7 |
| ATSS | ResNet-50 | 0.642 | 13.3 |
| SSD | VGG16 | 0.632 | 33.2 |
| YOLOX | ResNet-18 | 0.400 | 53.7 |
- Cascade R-CNN delivers the highest mAP and recall across all precision levels.
- YOLOX achieves maximal inference speed at the expense of detection accuracy.
- Precision–Recall curves at IoU=0.5 and IoU=0.75 further substantiate method-specific trade-offs.
Tracking Baselines
| Tracker | Success (AUC) | Norm. Precision | Precision@20 px |
|---|---|---|---|
| SiamFC | 0.381 | 0.526 | 0.623 |
| ECO | 0.404 | 0.643 | 0.717 |
| SPLT | 0.405 | 0.585 | 0.651 |
| ATOM | 0.574 | 0.758 | 0.830 |
| SiamRPN++ | 0.545 | 0.709 | 0.780 |
| DiMP | 0.578 | 0.756 | 0.831 |
| TransT | 0.586 | 0.765 | 0.832 |
| LTMU | 0.608 | 0.783 | 0.858 |
LTMU demonstrates the highest success and precision scores among tested tracking approaches.
Detection-Fused Tracking
A detection–tracking fusion algorithm is specified: when the tracker's confidence falls below a threshold, the detector is re-run on the current frame, and the bounding box with the higher confidence is selected (provided the detector score itself exceeds a threshold). Fusing with Faster R-CNN (VGG16) yields the largest per-tracker improvements, e.g., SiamFC: 0.381 → 0.617 (+0.236 success), with similar gains in precision.
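A sketch of this fusion loop is shown below. The `tracker`/`detector` interfaces and the threshold values `tau_track`/`tau_det` are illustrative assumptions, not the paper's implementation.

```python
# Sketch of detection-fused tracking. The tracker/detector interfaces and
# the confidence thresholds tau_track / tau_det are illustrative assumptions.

def fused_track(frames, tracker, detector, init_box,
                tau_track=0.5, tau_det=0.5):
    tracker.init(frames[0], init_box)            # ground-truth box, frame 1
    boxes = [init_box]
    for frame in frames[1:]:
        box, conf = tracker.update(frame)
        if conf < tau_track:                     # tracker is unreliable
            det_box, det_conf = detector.best_detection(frame)
            if det_box is not None and det_conf >= max(tau_det, conf):
                box = det_box                    # adopt the detector's box
                tracker.init(frame, box)         # re-initialize the tracker
        boxes.append(box)
    return boxes
```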
4. Comparison to Anti-UAV Multi-Modal Benchmark
The DUT Anti-UAV dataset is distinct from the larger Anti-UAV dataset (Jiang et al., 2021), which provides 318 unaligned RGB–IR video pairs and over 585,900 bounding boxes for multi-modal tracking evaluation. While DUT focuses exclusively on visible-light imagery with single-class annotations, Anti-UAV emphasizes modality fusion and attribute-rich annotation (seven binary attributes per IR clip: Out-of-View, Occlusion, Fast Motion, Scale Variation, Low Illumination, Thermal Cross-Over, Low Resolution). Evaluation protocols, error definitions, and baseline results differ accordingly.
Both datasets share object-centered bounding box annotation. However, Anti-UAV expands the research landscape toward multi-modal (RGB/IR), long-term, and attribute-intensive UAV tracking. DUT’s smaller, more focused structure facilitates benchmarking fine-grained detection and tracking in visually challenging, single-modality scenes.
5. Dataset Distribution and Licensing
The DUT Anti-UAV dataset is accessible via https://github.com/wangdongdut/DUT-Anti-UAV, with the following structure:
- “detection” folder: images plus an “Annotations” subfolder containing one TXT/CSV file per image
- “tracking” folder: 20 subfolders, each with sequential frames (.jpg) and groundtruth.txt
- Example code for training/inference is provided under “scripts/” (a minimal annotation-loading sketch follows this list)
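A minimal loading sketch under two assumptions: per-image detection records follow the key=value layout shown in Section 1, and `groundtruth.txt` stores one comma-separated $(x, y, w, h)$ box per frame.

```python
from pathlib import Path

def parse_detection_record(line):
    """Parse a record like:
    000123.jpg, x_min=345, y_min=120, w=48, h=30, class="uav"."""
    fields = [f.strip() for f in line.split(",")]
    kv = dict(f.split("=", 1) for f in fields[1:])
    box = tuple(int(kv[k]) for k in ("x_min", "y_min", "w", "h"))
    return fields[0], box, kv["class"].strip('"')

def load_tracking_groundtruth(seq_dir):
    """Read per-frame (x, y, w, h) boxes from a sequence's groundtruth.txt
    (the comma-separated format is an assumption)."""
    gt = Path(seq_dir) / "groundtruth.txt"
    return [tuple(float(v) for v in line.split(","))
            for line in gt.read_text().splitlines() if line.strip()]
```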
The dataset is released under the MIT license, permitting unrestricted academic and commercial usage provided attribution is maintained (Zhao et al., 2022).
6. Implications and Future Research Directions
A plausible implication is that the DUT Anti-UAV dataset enables systematic study of visual anti-UAV detection and tracking under small targets, broad environmental variation, and significant target appearance diversity. Its focused annotation protocol and detection-tracking fusion methodology support high-precision tracking. Future research may leverage DUT for benchmarking detection-fused tracking, exploring multi-modal fusion (as in Anti-UAV (Jiang et al., 2021)), and addressing occlusion, scale variation, and out-of-view events through robust data augmentation or improved model design.
The dataset’s open licensing and comprehensive structure support its use as a reference testbed for computer vision-based UAV surveillance systems, evaluation of detection/tracking architecture performance, and development of adaptive anti-UAV solutions.