Anti-UAV-RGBT Dataset Benchmark
- The Anti-UAV-RGBT Dataset is a large-scale multi-modal benchmark comprising 318 paired RGB and thermal video streams designed for robust UAV tracking.
- It features a rigorous three-stage annotation process yielding over 585,900 well-aligned bounding boxes, with inter-annotator agreement above 0.95 IoU.
- The Dual-Flow Semantic Consistency strategy improves tracking performance under challenges like occlusion and scale variation while maintaining zero inference overhead.
The Anti-UAV-RGBT Dataset is a large-scale multi-modal benchmark specifically designed for Unmanned Aerial Vehicle (UAV) tracking in RGB and thermal-infrared (T) video domains. It enables robust evaluation of tracking algorithms under diverse visibility and environmental conditions, with an emphasis on real-world surveillance scenarios. The dataset is notable for its dual-modality, comprehensive manual annotation, rigorously defined evaluation protocols, and the introduction of the Dual-Flow Semantic Consistency (DFSC) tracking strategy (Jiang et al., 2021).
1. Dataset Composition and Modalities
The dataset comprises 318 synchronized video pairs, each containing one visible-light (RGB) stream and one thermal-infrared (T) stream; the tracked targets are consumer UAVs from manufacturers such as DJI and Parrot. Both sensors record at 25 frames per second, with RGB at full-HD resolution (1920×1080) and thermal at 640×512. Synchronization is achieved via shared timestamps on corresponding frames, ensuring temporal alignment; the spatial axes and fields of view, however, remain unaligned, which makes pixel-level multi-modal fusion challenging. No geometric registration between the RGB and T streams is performed after capture.
Table: Anti-UAV-RGBT Dataset Composition
| Stream Type | Resolution | Frame Rate | Number of Pairs | Spatial Alignment |
|---|---|---|---|---|
| RGB | 1920×1080 | 25 FPS | 318 | Unaligned |
| Thermal-IR | 640×512 | 25 FPS | 318 | Unaligned |
Each frame contains a timestamp, guaranteeing frame-level temporal correspondence between modalities.
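Because alignment is temporal only, frames are typically paired across modalities by timestamp. A minimal sketch of nearest-timestamp matching (the timestamp representation and the matching tolerance are assumptions for illustration, not the dataset's actual file format):

```python
def pair_frames_by_timestamp(rgb_ts, thermal_ts, tol=0.02):
    """Match each RGB timestamp to the nearest thermal timestamp.

    rgb_ts, thermal_ts: sorted lists of per-frame timestamps in seconds.
    tol: maximum allowed timestamp gap (here half of a 25 FPS period).
    Returns a list of (rgb_index, thermal_index) pairs.
    """
    pairs, j = [], 0
    for i, t in enumerate(rgb_ts):
        # Advance the thermal pointer while the next thermal frame is closer.
        while j + 1 < len(thermal_ts) and abs(thermal_ts[j + 1] - t) <= abs(thermal_ts[j] - t):
            j += 1
        if abs(thermal_ts[j] - t) <= tol:
            pairs.append((i, j))
    return pairs

# At 25 FPS both streams tick every 0.04 s, so matching is one-to-one.
rgb = [i * 0.04 for i in range(5)]
thermal = [i * 0.04 + 0.001 for i in range(5)]  # slight sensor offset
print(pair_frames_by_timestamp(rgb, thermal))  # → [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```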
2. Annotation Methodology and Data Statistics
Annotations exceed 585,900 bounding boxes, with every frame in all 318 video pairs labeled. The annotation workflow adopts a three-stage, coarse-to-fine pipeline:
- Coarse Pass: Every 25th frame is flagged for target existence and loosely bounded.
- Fine Pass: The 30 most challenging pairs per scene are exhaustively annotated at 25 FPS.
- Inspection & Correction: An independent annotator reviews all frames, correcting box placements and existence flags.
Inter-annotator agreement achieves an intersection-over-union (IoU) score above 0.95 on a random 5% subsample. Annotation utilizes proprietary tools for efficient dual-modality frame navigation and visual flagging.
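The agreement check relies on standard bounding-box IoU; a self-contained computation for axis-aligned boxes in (x1, y1, x2, y2) form (a minimal sketch, not the dataset's actual tooling):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two nearly identical annotations of the same UAV clear the 0.95 bar:
print(box_iou((100, 100, 140, 130), (101, 100, 141, 130)))  # ≈ 0.951
```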
Split statistics:
- Training: 160 pairs (294,400 bboxes)
- Validation: 67 pairs (122,900 bboxes)
- Test: 91 pairs (168,400 bboxes)
- Total annotation span: Over 23,000 seconds
Test set scene breakdown: ~70% daytime, 30% low-light. Each thermal sequence in the test set receives binary attribute tags—Out-of-View (OV), Occlusion (OC), Fast Motion (FM), Scale Variation (SV), Low Illumination (LI), Thermal Crossover (TC; subdivided into easy/medium/hard by algorithmic difficulty), and Low Resolution (LR). Notably, TC_hard comprises ~20% of test sequences.
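These attribute tags make it straightforward to slice the test set for per-challenge evaluation. A hypothetical helper (the per-sequence record format here is an assumption; the real metadata layout may differ):

```python
# Hypothetical per-sequence attribute records for illustration.
sequences = [
    {"name": "seq_001", "attrs": {"OC", "SV"}},
    {"name": "seq_002", "attrs": {"TC_hard", "LI"}},
    {"name": "seq_003", "attrs": {"FM", "OV", "TC_easy"}},
]

def filter_by_attr(seqs, attr):
    """Return the names of sequences tagged with the given attribute."""
    return [s["name"] for s in seqs if attr in s["attrs"]]

print(filter_by_attr(sequences, "TC_hard"))  # → ['seq_002']
```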
3. Evaluation Protocols and Metrics
Three evaluation protocols are defined:
- Protocol I: No Anti-UAV-RGBT data may be used during training; trackers can be trained on any external datasets. Testing occurs separately on the RGB and T streams.
- Protocol II: Permits fine-tuning or full training on the Anti-UAV training split; performance is measured on validation and test subsets.
- Protocol III: Provides both streams during testing (still spatially unaligned), facilitating exploration of multi-modal fusion.
Key metrics:
- State Accuracy (SA):
$$\mathrm{SA} = \frac{1}{T}\sum_{t=1}^{T}\Big(\mathrm{IoU}_t\,\delta(v_t>0) + p_t\,\big(1-\delta(v_t>0)\big)\Big)$$
where $\delta(v_t>0)$ flags the presence of a target in frame $t$, $\mathrm{IoU}_t$ denotes the overlap between the predicted and ground-truth boxes at time $t$, and $p_t\in\{0,1\}$ is the tracker's absent/present prediction, scored as 1 only when the tracker correctly declares the target absent.
- Precision (center error): the fraction of frames whose predicted center lies within $\tau$ pixels of the ground-truth center, typically with $\tau = 20$ pixels.
- Success: the fraction of frames whose overlap $\mathrm{IoU}_t$ exceeds a threshold $\theta$. The success curve over $\theta\in[0,1]$ is plotted and its area under the curve (AUC) is reported.
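These metrics follow directly from their definitions; a plain-Python sketch (the input representation is an assumption for illustration, not the official evaluation toolkit's API):

```python
def state_accuracy(ious, visible, pred_present):
    """SA: mean of per-frame overlap on visible frames and of
    correct-absence indicators on absent frames.

    ious[t]: overlap between predicted and ground-truth boxes at frame t.
    visible[t]: True if the target is present in the ground truth.
    pred_present[t]: tracker's present (True) / absent (False) prediction.
    """
    total = 0.0
    for iou, v, p in zip(ious, visible, pred_present):
        if v:
            total += iou                # overlap counts when the target is visible
        else:
            total += 0.0 if p else 1.0  # reward correctly declared absence
    return total / len(ious)

def success_auc(ious, steps=101):
    """Approximate the AUC of the success curve over thresholds in [0, 1]."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(iou > th for iou in ious) / len(ious) for th in thresholds]
    return sum(rates) / steps

ious = [0.8, 0.6, 0.0]
print(state_accuracy(ious, [True, True, False], [True, True, False]))  # → 0.8
```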
4. Dual-Flow Semantic Consistency (DFSC) Tracking Strategy
DFSC is a training strategy that leverages cross-sequence semantic consistency, exploiting the fact that the dataset contains only one object class, the UAV. The method consists of two stages:
- Class-level Semantic Modulation (CSM): Generates cross-modulated ($f_{\text{cross}}$) and intra-sequence ($f_{\text{intra}}$) features for query–search pairs $(q, s)$. The region proposal network is trained to minimize
$$\mathcal{L}_{\mathrm{CSM}} = \mathcal{L}_{\mathrm{rpn}}(f_{\text{intra}}) + \lambda\,\mathcal{L}_{\mathrm{rpn}}(f_{\text{cross}})$$
where $\mathcal{L}_{\mathrm{rpn}}$ combines classification and regression losses, balanced by $\lambda$.
- Instance-level Semantic Modulation (ISM): The top-$k$ region proposals for each search frame are re-modulated with $f_{\text{cross}}$, and the R-CNN head trains
$$\mathcal{L}_{\mathrm{ISM}} = \mathcal{L}_{\mathrm{rcnn}}(f_{\text{intra}}) + \lambda\,\mathcal{L}_{\mathrm{rcnn}}(f_{\text{cross}})$$
The total loss is $\mathcal{L} = \mathcal{L}_{\mathrm{CSM}} + \mathcal{L}_{\mathrm{ISM}}$. Because the modulation operations are applied only during training, DFSC incurs zero inference overhead.
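Schematically, the modulation can be pictured as channel-wise re-weighting of search features by a semantic vector, with the total loss summing an intra-sequence term and a λ-weighted cross-modulated term per stage. The sketch below uses plain Python lists in place of CNN feature maps and a single stand-in loss for both heads; all names are illustrative, not the paper's implementation:

```python
def modulate(features, semantic):
    """Channel-wise modulation: scale each feature channel by a semantic weight."""
    return [f * s for f, s in zip(features, semantic)]

def head_loss(features):
    # Stand-in for a detection head's combined classification + regression
    # loss (RPN for CSM, R-CNN for ISM); real losses act on predictions.
    return sum(abs(f) for f in features) / len(features)

def dfsc_loss(query_semantic, search_feats, class_semantic, lam=1.0):
    """Total DFSC training loss: for each stage, an intra-sequence term plus
    a lambda-weighted cross-modulated (class-level) term, summed over stages."""
    f_intra = modulate(search_feats, query_semantic)  # within-sequence modulation
    f_cross = modulate(search_feats, class_semantic)  # dataset-wide class modulation
    l_csm = head_loss(f_intra) + lam * head_loss(f_cross)  # CSM (RPN stage)
    l_ism = head_loss(f_intra) + lam * head_loss(f_cross)  # ISM (R-CNN stage)
    return l_csm + l_ism

print(dfsc_loss([1.0, 1.0], [2.0, 3.0], [0.5, 0.5]))  # → 7.5
```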
Training details for Protocol II:
- Visible branch: GlobalTrack pre-training, 12 epochs, learning rate annealed from 0.02 to 0.0002.
- Thermal branch: Faster-RCNN (ImageNet) pre-training, 18 epochs, similar learning rate schedule.
- Batch size: 2 per GPU; regression uses a smooth $L_1$ loss, classification uses cross-entropy.
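The exact decay rule behind the annealing is not specified here; a geometric (log-linear) anneal is one plausible realization covering the two orders of magnitude from 0.02 to 0.0002:

```python
def annealed_lr(epoch, total_epochs, lr_start=0.02, lr_end=0.0002):
    """Geometric anneal: log-linear interpolation from lr_start to lr_end.

    One plausible schedule shape; the paper may use step decay instead.
    """
    frac = epoch / (total_epochs - 1)
    return lr_start * (lr_end / lr_start) ** frac

# 12-epoch visible-branch schedule: starts at 0.02, ends at 0.0002.
lrs = [annealed_lr(e, 12) for e in range(12)]
print(round(lrs[0], 6), round(lrs[-1], 6))  # → 0.02 0.0002
```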
5. Experimental Performance and Benchmark Analysis
Under Protocol I on the infrared test set, state-of-the-art deep trackers such as SiamRCNN and GlobalTrack reach approximately 66% SA, 87% precision, and 63% success; correlation-filter trackers lag at around 40% SA. Under Protocol II, fine-tuning GlobalTrack on the training split helps: a vanilla fine-tune yields 63.86% SA, while DFSC reaches 66.24% SA (+2.38 pts) on the IR test set. On the visible (RGB) test set, DFSC adds roughly 0.6 AUC and 0.6 precision points; on the validation split, SA rises from 72.00% to 80.09% (+8.09 pts).
Attribute-wise, DFSC brings the most pronounced improvements for:
- Occlusion (OC): +1.28 pts (validation)
- Scale Variation (SV): +0.97 pts (validation)
- Thermal Crossover (TC_all): +1.07 pts (test)
Qualitatively, DFSC demonstrates robustness against heavy occlusion and thermal distractors. No additional inference-time cost is incurred due to the offline nature of semantic modulation.
6. Research Significance and Future Directions
Anti-UAV-RGBT is the first benchmark to provide large-scale, unaligned RGB and thermal video for UAV tracking. It enables critical progress in developing robust anti-UAV surveillance and tracking systems, especially in scenarios with varying illumination, occlusion, and modal transitions. DFSC introduces a training paradigm that leverages dataset-wide semantic consistency, producing stronger, UAV-specific representations; observed empirical gains validate its effectiveness for long-term tracking under challenging conditions.
A plausible implication is that methodologies exploiting cross-sequence or class-level semantics may generalize well to other domains with uniform object categories. Expanding into modalities beyond RGB/T, such as LiDAR or radar, and refining multi-modal fusion are identified as crucial areas for future research (Jiang et al., 2021).