DUT Anti-UAV Benchmark
- The DUT Anti-UAV Benchmark is a comprehensive suite of datasets and evaluation protocols that assess UAV detection and tracking under diverse, challenging conditions.
- It includes extensive static images and video sequences with precise annotations for small-object detection, rapid motion, occlusions, and multi-scale dynamics.
- Evaluation protocols leverage metrics such as IoU, mAP, and success AUC to benchmark both detection and tracking, fostering advances in fused detection-tracking pipelines.
The DUT Anti-UAV Benchmark is a suite of datasets and evaluation protocols created to advance research in computer-vision-based unmanned aerial vehicle (UAV) detection and tracking under challenging real-world scenarios. It originated at Dalian University of Technology and has evolved through multiple international workshops and community challenges; in the literature it is often referenced interchangeably with the "Anti-UAV" or "UAV-Anti-UAV" benchmarks. The benchmark enables precise quantitative evaluation of both detection and tracking algorithms across RGB and thermal modalities, concentrating on the small objects, rapid motion, and multi-scale dynamics typical of civil drone flights.
1. Dataset Structure and Collection Protocols
The DUT Anti-UAV Benchmark comprises distinct subsets targeting both detection and tracking tasks, annotated under rigorous protocols. The canonical version contains:
- Detection Subset: 10,000 static images, split into 5,200 training, 2,600 validation, and 2,200 test images. Each image is manually labeled for UAV presence via axis-aligned bounding boxes. Scene diversity spans urban, natural, and sky backgrounds, varied weather and lighting, and extreme scale variation (object area ratio from 1.9×10⁻⁶ to 0.7).
- Tracking Subset: 20 RGB video sequences in the canonical release (other versions report 232 clips, scaling to more than 1,810 clips in the UAV-Anti-UAV extension), with per-frame bounding box annotations. Sequences range from several dozen to several thousand frames, encompassing short- and long-term tracking, occlusion events, and rapid maneuvers.
All frames and images are annotated by human experts. Bounding-box values use the (x, y, w, h) format or center-based encodings, supporting conversion to YOLO or COCO standards for model training and evaluation, as sketched below. A presence/absence flag is attached per frame for tracking, enabling explicit handling of missed or occluded targets (Zhao et al., 2022, Elshaar et al., 2024, Ren et al., 2 Dec 2025).
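As an illustration of these annotation formats, the following is a minimal sketch (assuming a top-left (x, y, w, h) pixel encoding) of converting one box to YOLO-normalized and COCO-style records; the function names are illustrative, not part of any official toolkit.

```python
def xywh_to_yolo(x, y, w, h, img_w, img_h):
    """Convert a top-left (x, y, w, h) box to YOLO's normalized
    (cx, cy, w, h) encoding, all values in [0, 1]."""
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return cx, cy, w / img_w, h / img_h

def xywh_to_coco(x, y, w, h, image_id, category_id=0, ann_id=0):
    """Wrap the same box in a COCO-style annotation record.
    COCO keeps absolute pixel coordinates with a top-left origin."""
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],
        "area": w * h,
        "iscrowd": 0,
    }
```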
2. Benchmark Task Definitions and Annotation Challenges
- Detection: Detect all UAVs present within each static image, producing a list of bounding boxes per image.
- Single UAV Tracking: Track the visible target throughout each video, with predictions required for every frame. Handling of out-of-view frames and intermittent disappearance is strictly defined: trackers must output an empty box whenever the UAV is not visible (see the sketch after this list).
- Dual-Dynamic Tracking (UAV-Anti-UAV): The latest extension captures UAV-on-UAV pursuit, where both the imaging platform and the target exhibit strong motion, sharp viewpoint shifts, extensive occlusions, and rapid background changes (Zhang et al., 8 Dec 2025).
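The per-frame output contract for single UAV tracking can be sketched as follows; the `tracker` object, its `init`/`update` interface, and the confidence cutoff are hypothetical stand-ins for illustration, not an official API.

```python
VISIBILITY_THRESHOLD = 0.5  # assumed confidence cutoff, not an official value

def run_sequence(tracker, frames, init_box):
    """One prediction per frame: a [x, y, w, h] box, or an explicit
    empty box when the target is predicted absent."""
    tracker.init(frames[0], init_box)
    results = []
    for frame in frames[1:]:
        box, confidence = tracker.update(frame)
        if confidence < VISIBILITY_THRESHOLD:
            results.append([])          # empty box: target predicted absent
        else:
            results.append(list(box))   # [x, y, w, h] in pixels
    return results
```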
Challenges encountered in annotation include:
- Tiny targets (down to less than 0.01% of the frame area).
- Large variations in aspect ratio both across and within sequences.
- Frequent occlusions (partial or full), severe motion blur, low illumination (night, haze), thermal crossover in IR data, and background distractors (birds, balloons).
- Sequences tagged with multi-attribute metadata for diagnostic purposes (e.g., fast motion, occlusion, scale, lighting variation, small object, rotation) (Zhang et al., 8 Dec 2025, Jiang et al., 2021).
3. Evaluation Metrics and Protocols
Detection and tracking are evaluated under strict protocols, using metrics standard in object detection and single-target tracking benchmarks:
- Intersection over Union (IoU): $\mathrm{IoU} = \frac{|B_{\mathrm{pred}} \cap B_{\mathrm{gt}}|}{|B_{\mathrm{pred}} \cup B_{\mathrm{gt}}|}$. Detection typically uses an IoU threshold of 0.5.
- mean Average Precision (mAP): Area under the precision-recall curve, usually averaged over several IoU thresholds (mAP50–95). Used extensively in detection evaluation (Elshaar et al., 2024).
- Success (AUC): Fraction of frames where $\mathrm{IoU}_t > \tau$, plotted over $\tau \in [0, 1]$ and integrated to give the area under the curve.
- Precision (Center Error): Proportion of frames where the Euclidean center error is below a pixel threshold (20 px commonly).
- Normalized Precision: Center error normalized by the frame dimensions.
- Custom Tracking Accuracy: For workshop challenge scoring, a one-pass frame-wise accuracy of the form
$$\mathrm{acc} = \frac{1}{T}\sum_{t=1}^{T}\Big[\mathrm{IoU}_t \cdot \delta(v_t > 0) + p_t \cdot \big(1 - \delta(v_t > 0)\big)\Big],$$
where $v_t$ is the ground-truth visibility flag and $p_t$ indicates predicted absence (an empty box) at frame $t$. In the 3rd workshop, an additional penalty term handles over-prediction of absence when the UAV is present (Zhao et al., 2023).
Multi-object tracking metrics (MOTA, MOTP, ID switches, fragmentation) are applied as appropriate, especially in multi-UAV or generalized settings.
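The core tracking metrics above can be computed as in the following sketch, assuming per-frame (x, y, w, h) boxes with an empty list denoting predicted absence; thresholds follow the definitions above, and the helper names are illustrative.

```python
import numpy as np

def iou(b1, b2):
    """IoU of two (x, y, w, h) boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(ious, n_thresholds=21):
    """Success plot: fraction of frames with IoU > tau, integrated over tau."""
    taus = np.linspace(0.0, 1.0, n_thresholds)
    curve = [(np.asarray(ious) > t).mean() for t in taus]
    return float(np.mean(curve))

def precision_at(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center error is below `threshold` pixels."""
    errs = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return float((errs < threshold).mean())

def challenge_accuracy(pred_boxes, gt_boxes, visible):
    """One-pass accuracy: IoU on visible frames, credit for a predicted
    empty box on invisible frames (penalty term of the 3rd workshop omitted)."""
    scores = []
    for p, g, v in zip(pred_boxes, gt_boxes, visible):
        if v:
            scores.append(iou(p, g) if p else 0.0)
        else:
            scores.append(1.0 if not p else 0.0)
    return float(np.mean(scores))
```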
4. Baseline Methods and Model Performance
Detection baselines include both two-stage and one-stage methods:
- Cascade R-CNN, Faster R-CNN (ResNet18/ResNet50/VGG16), ATSS, YOLOX, SSD.
- YOLOv5x achieved the highest mAP (0.976 at IoU 0.5), with YOLOv8 variants exhibiting better recall and resilience in low-contrast or motion-blurred scenes (Zhao et al., 2022, Elshaar et al., 2024).
Tracking baselines span Siamese networks (SiamFC, SiamRPN++, SuperDiMP), discriminative models (DiMP, ATOM), long-term meta-updaters (LTMU), and transformer-based trackers (TransT, DropTrack, UAUTrack). Detection-fused tracking approaches (e.g., LTMU + Faster R-CNN) showed a 9% relative gain in success AUC over standalone trackers (Zhao et al., 2022, Ren et al., 2 Dec 2025).
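A schematic of this detection-fused pattern, in the spirit of LTMU + Faster R-CNN, is sketched below; the `tracker`/`detector` interfaces and confidence thresholds are assumptions for illustration, not the published implementation.

```python
CONF_LOW = 0.3   # assumed trigger for global re-detection
CONF_OK = 0.5    # assumed visibility cutoff

def fused_track(tracker, detector, frames, init_box):
    """Long-term tracker runs frame to frame; a detector re-acquires
    the target when tracking confidence collapses."""
    tracker.init(frames[0], init_box)
    outputs = []
    for frame in frames[1:]:
        box, conf = tracker.update(frame)
        if conf < CONF_LOW:
            # Global re-detection: take the detector's most confident
            # box, if any, and re-seed the tracker with it.
            detections = detector.detect(frame)  # [(box, score), ...]
            if detections:
                box, conf = max(detections, key=lambda d: d[1])
                tracker.init(frame, box)
        outputs.append(list(box) if conf >= CONF_OK else [])
    return outputs
```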
Recent methods have achieved:
- UAUTrack: AUC 65.8%, Precision (20 px) 87.3%, Normalized Precision 91.8% at 45 fps (Ren et al., 2 Dec 2025).
- MambaSTS (UAV-Anti-UAV): AUC 0.437, Precision@20 px 0.602, normalized precision 0.480 (Zhang et al., 8 Dec 2025).
For MOT, BoT-SORT yields higher stability and IoU than ByteTrack, especially under occlusion and rapid maneuvers (Elshaar et al., 2024).
5. Challenge Results and Winning Strategies
The Anti-UAV Workshop & Challenge has become a main venue for benchmarking tracking methods under multi-scale, infrared, and complex background conditions (Zhao et al., 2021, Zhao et al., 2023).
Selected top methods and strategies:
- Siam R-CNN with spatio-temporal attention, optical-flow switching, and frame-difference change detection (Zhao et al., 2021); a toy version of the change-detection ingredient is sketched after this list.
- Ensemble meta-tracker with multi-scale search, motion enhancement, and voting-based box fusion.
- Unified Transformers (e.g., UTTracker) leveraging background alignment, global detection, multi-region local tracking, and dynamic detection modules.
- Detection-by-tracking pipelines with strong detector ensembles, temporal CNN classifiers, and explicit motion modeling (Zhao et al., 2023).
- Motion-guided small-object detection, retinal motion maps, and spatial/coordinate attention in the backbone (YOLOv5s/l) (Zhao et al., 2023).
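As a concrete illustration of one ingredient named above, the following is a toy frame-difference change detector in OpenCV; the threshold and morphology settings are illustrative defaults, not tuned values from any challenge entry.

```python
import cv2
import numpy as np

def frame_difference_candidates(prev_frame, frame, thresh=25, min_area=4):
    """Return bounding boxes of regions that changed between two
    consecutive frames, as candidate locations for a small moving UAV."""
    g0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g0, g1)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    # Dilate to merge fragmented responses from a single small target.
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8), iterations=1)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]
```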
Ablation studies across methods routinely show incremental gains from optical flow, global search, multi-scale fusion, motion-background modeling, and re-detection.
Leaderboard results indicate top methods reach tracking accuracy scores (acc, as defined above) in the range of 0.644–0.700, with Track 2 (detection and tracking without a prior initialization box) scoring lower (0.570–0.611) due to the initialization challenge.
6. Impact, Open Problems, and Future Directions
The DUT Anti-UAV Benchmark highlights persistent challenges:
- Tiny, textureless targets in noisy backgrounds.
- Dynamic scenes with platform and target motion, frequent occlusion, illumination changes, and distractor clutter.
- Failure modes: target drift during disappearance, false positives from visual distractors, inadequate handling of severe motion blur and scale variation.
Recommendations for future work:
- Dataset augmentation (contrast, occlusion, blur), attribute-rich diagnostics (occlusion, scale, motion, multi-point pose tags).
- Lightweight detection models for edge deployment.
- End-to-end joint detection-tracking architectures, transformer-based networks with memory modules.
- Multi-modal fusion (RGB+T/IR), robust spatio-temporal attention, and language-guided (semantic token) tracking (Ren et al., 2 Dec 2025, Zhang et al., 8 Dec 2025).
- Standardization of metrics (MOTA, MOTP, ID switches, fragments) and release of official baselines.
The benchmark's evolution toward million-scale, multi-modal, aerial-to-aerial tracking underlines significant performance gaps in current deep trackers and the ongoing need for robust algorithms capable of reliable real-world Anti-UAV deployment.
7. Data Access and Reproducibility
- Official datasets, annotations, and leaderboards: https://anti-uav.github.io/ and https://github.com/wangdongdut/DUT-Anti-UAV.
- Model training protocols typically employ MMDetection and support COCO-style evaluation (a minimal example is sketched after this list). For tracking, public implementations and wrapper scripts are provided, along with demo code for detection-tracking fusion.
- Workshop leaderboards are periodically reopened, facilitating ongoing evaluation and comparison as new methods are developed.
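For COCO-style detection evaluation, a minimal example with pycocotools looks like the following; the file paths are placeholders, not the benchmark's official layout.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: a COCO-format ground-truth file and a JSON list
# of detections produced by the model under evaluation.
coco_gt = COCO("annotations/dut_antiuav_test.json")
coco_dt = coco_gt.loadRes("results/detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[0.50:0.95], AP@0.50, etc.
```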
Researchers are encouraged to use the benchmark for both detection and tracking algorithm development, with particular attention to the explicit per-frame absence labeling, rapid dynamics, and diversity of backgrounds and weather conditions (Zhao et al., 2021, Zhao et al., 2022, Zhao et al., 2023, Elshaar et al., 2024, Ren et al., 2 Dec 2025, Zhang et al., 8 Dec 2025, Jiang et al., 2021).