Anti-UAV-RGBT Dataset Benchmark
- The Anti-UAV-RGBT Dataset is a large-scale multi-modal benchmark comprising 318 paired RGB and thermal video streams designed for robust UAV tracking.
- It features a rigorous three-stage annotation process yielding over 585,900 well-aligned bounding boxes, with inter-annotator agreement above 0.95 IoU.
- The Dual-Flow Semantic Consistency strategy improves tracking performance under challenges like occlusion and scale variation while maintaining zero inference overhead.
The Anti-UAV-RGBT Dataset is a large-scale multi-modal benchmark specifically designed for Unmanned Aerial Vehicle (UAV) tracking in RGB and thermal-infrared (T) video domains. It enables robust evaluation of tracking algorithms under diverse visibility and environmental conditions, with an emphasis on real-world surveillance scenarios. The dataset is notable for its dual-modality, comprehensive manual annotation, rigorously defined evaluation protocols, and the introduction of the Dual-Flow Semantic Consistency (DFSC) tracking strategy (Jiang et al., 2021).
1. Dataset Composition and Modalities
The dataset comprises 318 synchronized video pairs, each containing one visible-light (RGB) stream and one thermal-infrared (T) stream; the tracked targets are consumer UAVs from manufacturers such as DJI and Parrot. Both sensors record at 25 frames per second, with RGB at full-HD resolution (1920×1080) and thermal at 640×512. Synchronization is achieved via shared timestamps on corresponding frames, ensuring temporal alignment; the spatial axes and fields of view, however, remain unaligned, which makes pixel-level multi-modal fusion challenging. No geometric registration between the RGB and T streams is performed after capture.
Table: Anti-UAV-RGBT Dataset Composition
| Stream Type | Resolution | Frame Rate | Number of Pairs | Spatial Alignment |
|---|---|---|---|---|
| RGB | 1920×1080 | 25 FPS | 318 | Unaligned |
| Thermal-IR | 640×512 | 25 FPS | 318 | Unaligned |
Each frame contains a timestamp, guaranteeing frame-level temporal correspondence between modalities.
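Because alignment is temporal only, frames are typically paired across modalities by timestamp. A minimal sketch of nearest-timestamp matching (the timestamp representation and the matching tolerance are assumptions for illustration, not the dataset's actual file format):

```python
def pair_frames_by_timestamp(rgb_ts, thermal_ts, tol=0.02):
    """Match each RGB timestamp to the nearest thermal timestamp.

    rgb_ts, thermal_ts: sorted lists of per-frame timestamps in seconds.
    tol: maximum allowed timestamp gap (here half of a 25 FPS period).
    Returns a list of (rgb_index, thermal_index) pairs.
    """
    pairs, j = [], 0
    for i, t in enumerate(rgb_ts):
        # Advance the thermal pointer while the next thermal frame is closer.
        while j + 1 < len(thermal_ts) and abs(thermal_ts[j + 1] - t) <= abs(thermal_ts[j] - t):
            j += 1
        if abs(thermal_ts[j] - t) <= tol:
            pairs.append((i, j))
    return pairs

# At 25 FPS both streams tick every 0.04 s, so matching is one-to-one.
rgb = [i * 0.04 for i in range(5)]
thermal = [i * 0.04 + 0.001 for i in range(5)]  # slight sensor offset
print(pair_frames_by_timestamp(rgb, thermal))  # → [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```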
2. Annotation Methodology and Data Statistics
Annotations exceed 585,900 bounding boxes, with every frame in all 318 video pairs labeled. The annotation workflow adopts a three-stage, coarse-to-fine pipeline:
- Coarse Pass: Every 25th frame is flagged for target existence and loosely bounded.
- Fine Pass: The 30 most challenging pairs per scene are exhaustively annotated at 25 FPS.
- Inspection & Correction: An independent annotator reviews all frames, correcting box placements and existence flags.
Inter-annotator agreement achieves an intersection-over-union (IoU) score above 0.95 on a random 5% subsample. Annotation utilizes proprietary tools for efficient dual-modality frame navigation and visual flagging.
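The agreement check relies on standard bounding-box IoU; a self-contained computation for axis-aligned boxes in (x1, y1, x2, y2) form (a minimal sketch, not the dataset's actual tooling):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two nearly identical annotations of the same UAV clear the 0.95 bar:
print(box_iou((100, 100, 140, 130), (101, 100, 141, 130)))  # ≈ 0.951
```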
Split statistics:
- Training: 160 pairs (294,400 bboxes)
- Validation: 67 pairs (122,900 bboxes)
- Test: 91 pairs (168,400 bboxes)
- Total annotation span: Over 23,000 seconds
Test set scene breakdown: ~70% daytime, 30% low-light. Each thermal sequence in the test set receives binary attribute tags—Out-of-View (OV), Occlusion (OC), Fast Motion (FM), Scale Variation (SV), Low Illumination (LI), Thermal Crossover (TC; subdivided into easy/medium/hard by algorithmic difficulty), and Low Resolution (LR). Notably, TC_hard comprises ~20% of test sequences.
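These attribute tags make it straightforward to slice the test set for per-challenge evaluation. A hypothetical helper (the per-sequence record format here is an assumption; the real metadata layout may differ):

```python
# Hypothetical per-sequence attribute records for illustration.
sequences = [
    {"name": "seq_001", "attrs": {"OC", "SV"}},
    {"name": "seq_002", "attrs": {"TC_hard", "LI"}},
    {"name": "seq_003", "attrs": {"FM", "OV", "TC_easy"}},
]

def filter_by_attr(seqs, attr):
    """Return the names of sequences tagged with the given attribute."""
    return [s["name"] for s in seqs if attr in s["attrs"]]

print(filter_by_attr(sequences, "TC_hard"))  # → ['seq_002']
```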
3. Evaluation Protocols and Metrics
Three evaluation protocols are defined:
- Protocol I: No Anti-UAV-RGBT data may be used during training; trackers can be trained on any external datasets. Testing occurs separately on the RGB and T streams.
- Protocol II: Permits fine-tuning or full training on the Anti-UAV training split; performance is measured on validation and test subsets.
- Protocol III: Provides both streams during testing (still spatially unaligned), facilitating exploration of multi-modal fusion.
Key metrics:
- State Accuracy (SA):
$$\mathrm{SA} = \frac{1}{T}\sum_{t=1}^{T}\Big(\mathrm{IoU}_t\,\delta(v_t>0) + p_t\,\big(1-\delta(v_t>0)\big)\Big)$$
where $\delta(v_t>0)$ flags the presence of a target in frame $t$, $\mathrm{IoU}_t$ denotes the overlap between the predicted and ground-truth boxes at time $t$, and $p_t\in\{0,1\}$ is the tracker's absent/present prediction, scored as 1 only when the tracker correctly declares the target absent.
- Precision (center error): the fraction of frames whose predicted center lies within $\tau$ pixels of the ground-truth center, typically with $\tau = 20$ pixels.
- Success: the fraction of frames whose overlap $\mathrm{IoU}_t$ exceeds a threshold $\theta$. The success curve over $\theta\in[0,1]$ is plotted and its area under the curve (AUC) is reported.
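These metrics follow directly from their definitions; a plain-Python sketch (the input representation is an assumption for illustration, not the official evaluation toolkit's API):

```python
def state_accuracy(ious, visible, pred_present):
    """SA: mean of per-frame overlap on visible frames and of
    correct-absence indicators on absent frames.

    ious[t]: overlap between predicted and ground-truth boxes at frame t.
    visible[t]: True if the target is present in the ground truth.
    pred_present[t]: tracker's present (True) / absent (False) prediction.
    """
    total = 0.0
    for iou, v, p in zip(ious, visible, pred_present):
        if v:
            total += iou                # overlap counts when the target is visible
        else:
            total += 0.0 if p else 1.0  # reward correctly declared absence
    return total / len(ious)

def success_auc(ious, steps=101):
    """Approximate the AUC of the success curve over thresholds in [0, 1]."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(iou > th for iou in ious) / len(ious) for th in thresholds]
    return sum(rates) / steps

ious = [0.8, 0.6, 0.0]
print(state_accuracy(ious, [True, True, False], [True, True, False]))  # → 0.8
```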
4. Dual-Flow Semantic Consistency (DFSC) Tracking Strategy
DFSC is a training strategy that leverages cross-sequence semantic consistency, exploiting the fact that the dataset contains only one object class, the UAV. The method consists of two stages:
- Class-level Semantic Modulation (CSM): Generates cross-modulated ($f_{\text{cross}}$) and intra-sequence ($f_{\text{intra}}$) features for query–search pairs $(q, s)$. The region proposal network is trained to minimize
$$\mathcal{L}_{\mathrm{CSM}} = \mathcal{L}_{\mathrm{rpn}}(f_{\text{intra}}) + \lambda\,\mathcal{L}_{\mathrm{rpn}}(f_{\text{cross}})$$
where $\mathcal{L}_{\mathrm{rpn}}$ combines classification and regression losses, balanced by $\lambda$.
- Instance-level Semantic Modulation (ISM): The top-$k$ region proposals for each search frame are re-modulated with $f_{\text{cross}}$, and the R-CNN head trains
$$\mathcal{L}_{\mathrm{ISM}} = \mathcal{L}_{\mathrm{rcnn}}(f_{\text{intra}}) + \lambda\,\mathcal{L}_{\mathrm{rcnn}}(f_{\text{cross}})$$
The total loss is $\mathcal{L} = \mathcal{L}_{\mathrm{CSM}} + \mathcal{L}_{\mathrm{ISM}}$. Because the modulation operations are applied only during training, DFSC incurs zero inference overhead.
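Schematically, the modulation can be pictured as channel-wise re-weighting of search features by a semantic vector, with the total loss summing an intra-sequence term and a λ-weighted cross-modulated term per stage. The sketch below uses plain Python lists in place of CNN feature maps and a single stand-in loss for both heads; all names are illustrative, not the paper's implementation:

```python
def modulate(features, semantic):
    """Channel-wise modulation: scale each feature channel by a semantic weight."""
    return [f * s for f, s in zip(features, semantic)]

def head_loss(features):
    # Stand-in for a detection head's combined classification + regression
    # loss (RPN for CSM, R-CNN for ISM); real losses act on predictions.
    return sum(abs(f) for f in features) / len(features)

def dfsc_loss(query_semantic, search_feats, class_semantic, lam=1.0):
    """Total DFSC training loss: for each stage, an intra-sequence term plus
    a lambda-weighted cross-modulated (class-level) term, summed over stages."""
    f_intra = modulate(search_feats, query_semantic)  # within-sequence modulation
    f_cross = modulate(search_feats, class_semantic)  # dataset-wide class modulation
    l_csm = head_loss(f_intra) + lam * head_loss(f_cross)  # CSM (RPN stage)
    l_ism = head_loss(f_intra) + lam * head_loss(f_cross)  # ISM (R-CNN stage)
    return l_csm + l_ism

print(dfsc_loss([1.0, 1.0], [2.0, 3.0], [0.5, 0.5]))  # → 7.5
```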
Training details for Protocol II:
- Visible branch: GlobalTrack pre-training, 12 epochs, learning rate annealed from 0.02 to 0.0002.
- Thermal branch: Faster-RCNN (ImageNet) pre-training, 18 epochs, similar learning rate schedule.
- Batch size: 2 per GPU; regression uses a smooth $L_1$ loss, classification uses cross-entropy.
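The exact decay rule behind the annealing is not specified here; a geometric (log-linear) anneal is one plausible realization covering the two orders of magnitude from 0.02 to 0.0002:

```python
def annealed_lr(epoch, total_epochs, lr_start=0.02, lr_end=0.0002):
    """Geometric anneal: log-linear interpolation from lr_start to lr_end.

    One plausible schedule shape; the paper may use step decay instead.
    """
    frac = epoch / (total_epochs - 1)
    return lr_start * (lr_end / lr_start) ** frac

# 12-epoch visible-branch schedule: starts at 0.02, ends at 0.0002.
lrs = [annealed_lr(e, 12) for e in range(12)]
print(round(lrs[0], 6), round(lrs[-1], 6))  # → 0.02 0.0002
```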
5. Experimental Performance and Benchmark Analysis
Under Protocol I on the infrared test set, state-of-the-art deep trackers such as SiamRCNN and GlobalTrack reach approximately 66% SA, 87% precision, and 63% success; correlation-filter trackers lag at around 40% SA. Under Protocol II, fine-tuning GlobalTrack on the training split helps: a vanilla fine-tune yields 63.86% SA, while DFSC reaches 66.24% SA (+2.38 pts) on the IR test set. On the visible (RGB) test set, DFSC adds roughly 0.6 AUC and 0.6 precision points; on the validation split, SA rises from 72.00% to 80.09% (+8.09 pts).
Attribute-wise, DFSC brings the most pronounced improvements for:
- Occlusion (OC): +1.28 pts (validation)
- Scale Variation (SV): +0.97 pts (validation)
- Thermal Crossover (TC_all): +1.07 pts (test)
Qualitatively, DFSC demonstrates robustness against heavy occlusion and thermal distractors. No additional inference-time cost is incurred due to the offline nature of semantic modulation.
6. Research Significance and Future Directions
Anti-UAV-RGBT is the first benchmark to provide large-scale, unaligned RGB and thermal video for UAV tracking. It enables critical progress in developing robust anti-UAV surveillance and tracking systems, especially in scenarios with varying illumination, occlusion, and modal transitions. DFSC introduces a training paradigm that leverages dataset-wide semantic consistency, producing stronger, UAV-specific representations; observed empirical gains validate its effectiveness for long-term tracking under challenging conditions.
A plausible implication is that methodologies exploiting cross-sequence or class-level semantics may generalize well to other domains with uniform object categories. Expanding into modalities beyond RGB/T, such as LiDAR or radar, and refining multi-modal fusion are identified as crucial areas for future research (Jiang et al., 2021).