
Anti-UAV-RGBT Dataset Benchmark

Updated 2 February 2026
  • The Anti-UAV-RGBT Dataset is a large-scale multi-modal benchmark comprising 318 paired RGB and thermal video streams designed for robust UAV tracking.
  • It features a rigorous three-stage annotation process yielding over 585,900 bounding boxes, validated by inter-annotator agreement exceeding 0.95 IoU.
  • The Dual-Flow Semantic Consistency strategy improves tracking performance under challenges like occlusion and scale variation while maintaining zero inference overhead.

The Anti-UAV-RGBT Dataset is a large-scale multi-modal benchmark specifically designed for Unmanned Aerial Vehicle (UAV) tracking in RGB and thermal-infrared (T) video domains. It enables robust evaluation of tracking algorithms under diverse visibility and environmental conditions, with an emphasis on real-world surveillance scenarios. The dataset is notable for its dual-modality, comprehensive manual annotation, rigorously defined evaluation protocols, and the introduction of the Dual-Flow Semantic Consistency (DFSC) tracking strategy (Jiang et al., 2021).

1. Dataset Composition and Modalities

The dataset comprises 318 synchronized video pairs, each containing one visible-light (RGB) stream and one thermal-infrared (T) stream. The tracked targets are consumer UAVs from DJI and Parrot. Both sensors record at 25 frames per second: RGB at full HD resolution (1920×1080) and thermal at 640×512. Synchronization is achieved via shared timestamps for corresponding frames, ensuring temporal alignment; the spatial axes and fields of view remain unaligned, which poses a challenge for pixel-level multi-modal fusion. No geometric registration between the RGB and T streams is performed after capture.

Table: Anti-UAV-RGBT Dataset Composition

Stream Type    Resolution    Frame Rate    Number of Pairs    Spatial Alignment
RGB            1920×1080     25 FPS        318                Unaligned
Thermal-IR     640×512       25 FPS        318                Unaligned

Each frame contains a timestamp, guaranteeing frame-level temporal correspondence between modalities.
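As an illustration, pairing the two streams by nearest timestamp might look like the following sketch (the frame-record format and tolerance value are hypothetical, not part of the dataset specification):

```python
def pair_frames(rgb_frames, thermal_frames, tol=0.02):
    """Match each RGB frame to the thermal frame with the nearest timestamp.

    Both inputs are lists of (timestamp_seconds, frame_id) tuples, assumed
    sorted by timestamp. Returns a list of (rgb_id, thermal_id) pairs.
    """
    pairs = []
    j = 0
    for ts, rgb_id in rgb_frames:
        # Advance while the next thermal timestamp is at least as close.
        while (j + 1 < len(thermal_frames)
               and abs(thermal_frames[j + 1][0] - ts) <= abs(thermal_frames[j][0] - ts)):
            j += 1
        if abs(thermal_frames[j][0] - ts) <= tol:
            pairs.append((rgb_id, thermal_frames[j][1]))
    return pairs
```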

2. Annotation Methodology and Data Statistics

The dataset contains more than 585,900 annotated bounding boxes, with every frame in all 318 video pairs labeled. The annotation workflow adopts a three-stage, coarse-to-fine pipeline:

  1. Coarse Pass: Every 25th frame is flagged for target existence and loosely bounded.
  2. Fine Pass: The 30 most challenging pairs per scene are exhaustively annotated at 25 FPS.
  3. Inspection & Correction: An independent annotator reviews all frames, correcting box placements and existence flags.

Inter-annotator agreement achieves an intersection-over-union (IoU) score above 0.95 on a random 5% subsample. Annotation utilizes proprietary tools for efficient dual-modality frame navigation and visual flagging.
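The IoU used for this agreement check is the standard axis-aligned box overlap; a minimal reference implementation:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```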

Split statistics:

  • Training: 160 pairs (294,400 bboxes)
  • Validation: 67 pairs (122,900 bboxes)
  • Test: 91 pairs (168,400 bboxes)
  • Total annotation span: Over 23,000 seconds

Test set scene breakdown: ~70% daytime, 30% low-light. Each thermal sequence in the test set receives binary attribute tags—Out-of-View (OV), Occlusion (OC), Fast Motion (FM), Scale Variation (SV), Low Illumination (LI), Thermal Crossover (TC; subdivided into easy/medium/hard by algorithmic difficulty), and Low Resolution (LR). Notably, TC_hard comprises ~20% of test sequences.

3. Evaluation Protocols and Metrics

Three evaluation protocols are defined:

  • Protocol I: Excludes Anti-UAV data during training—trackers can utilize any non-UAV datasets. Testing occurs separately on RGB and T streams.
  • Protocol II: Permits fine-tuning or full training on the Anti-UAV training split; performance is measured on validation and test subsets.
  • Protocol III: Provides both streams during testing (still spatially unaligned), facilitating exploration of multi-modal fusion.

Key metrics:

  1. State Accuracy (SA)

SA = \frac{1}{T}\sum_{t=1}^{T}\big[\,\mathrm{IoU}_t \cdot \mathbf{1}(v_t = 1) + p_t \cdot \mathbf{1}(v_t = 0)\,\big]

where v_t flags the presence of a target at frame t, \mathrm{IoU}_t denotes the overlap between predicted and ground-truth boxes, and p_t is the tracker's prediction that the target is absent.
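A direct per-frame computation of SA (variable names are illustrative):

```python
def state_accuracy(ious, present_flags, absence_preds):
    """State Accuracy over a sequence of T frames.

    ious[t]          -- IoU between predicted and ground-truth box at frame t
    present_flags[t] -- v_t: 1 if the target is present, 0 otherwise
    absence_preds[t] -- p_t: tracker's score for correctly reporting absence
    """
    total = 0.0
    for iou_t, v_t, p_t in zip(ious, present_flags, absence_preds):
        # IoU counts when the target is present; the absence score otherwise.
        total += iou_t if v_t == 1 else p_t
    return total / len(ious)
```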

  2. Precision (center error)

P(\delta) = \frac{\#\{\|\hat{c}_t - c_t\| < \delta\}}{T}

typically with \delta = 20 pixels, where \hat{c}_t and c_t are the predicted and ground-truth box centers.
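A sketch of the precision computation under this definition:

```python
import math

def precision(pred_centers, gt_centers, delta=20.0):
    """Fraction of frames whose predicted center lies within delta pixels
    of the ground-truth center (Euclidean distance)."""
    hits = sum(
        1 for (px, py), (gx, gy) in zip(pred_centers, gt_centers)
        if math.hypot(px - gx, py - gy) < delta
    )
    return hits / len(gt_centers)
```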

  3. Success (IoU)

S(\tau) = \frac{1}{T}\sum_{t=1}^{T}\mathbf{1}(\mathrm{IoU}_t > \tau)

The success curve S(\tau) is plotted over thresholds \tau, and the area under the curve (AUC) is reported.
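A sketch of the success curve and its AUC, assuming the common convention of 21 evenly spaced thresholds in [0, 1] (the text does not specify the threshold grid):

```python
def success_auc(ious, thresholds=None):
    """Success curve S(tau) over IoU thresholds and its area under the curve.

    AUC is taken as the mean of S(tau) over the threshold grid, as in
    standard one-pass tracking evaluation.
    """
    if thresholds is None:
        thresholds = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0
    T = len(ious)
    curve = [sum(1 for v in ious if v > tau) / T for tau in thresholds]
    auc = sum(curve) / len(curve)
    return curve, auc
```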

4. Dual-Flow Semantic Consistency (DFSC) Tracking Strategy

DFSC leverages cross-sequence semantic consistency, exploiting the fact that the dataset contains a single object class (UAV). The method consists of two stages:

  • Class-level Semantic Modulation (CSM): Generates cross-modulated (i \neq j) and intra-sequence (i = j) features for query-search pairs (z_i, x_i). The region proposal network is trained to minimize:

L_{CSM} = \sum_{i=1}^{n} L_{rpn}(\hat{t}_{ii}) + \alpha \sum_{i \neq j} L_{rpn}(\hat{t}_{ij})

where L_{rpn} combines classification and regression losses, balanced by \alpha \approx 0.25.

  • Instance-level Semantic Modulation (ISM): The top K region proposals for each x_j are re-modulated with z_j, and the R-CNN head is trained to minimize:

L_{ISM} = \frac{1}{K}\sum_{k=1}^{K}\big[L_{cls}'(s_k, s_k^*) + \beta L_{reg}'(p_k, p_k^*)\big]

The total loss is L_{total} = L_{CSM} + L_{ISM}. DFSC's modulation operations are confined to training, yielding zero inference overhead.
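The loss aggregation above can be sketched schematically. The per-term losses here are stand-ins for the detector's actual RPN and R-CNN losses, and since the text gives no value for β, beta=1.0 is an assumption:

```python
def dfsc_total_loss(rpn_losses_intra, rpn_losses_cross,
                    cls_losses, reg_losses, alpha=0.25, beta=1.0):
    """Schematic combination L_total = L_CSM + L_ISM.

    rpn_losses_intra -- L_rpn(t_hat_ii) values for intra-sequence pairs
    rpn_losses_cross -- L_rpn(t_hat_ij) values for cross-modulated pairs
    cls_losses, reg_losses -- per-proposal R-CNN head losses (length K)
    """
    # CSM: intra-sequence terms plus alpha-weighted cross terms.
    l_csm = sum(rpn_losses_intra) + alpha * sum(rpn_losses_cross)
    # ISM: classification + beta-weighted regression, averaged over top-K.
    K = len(cls_losses)
    l_ism = sum(c + beta * r for c, r in zip(cls_losses, reg_losses)) / K
    return l_csm + l_ism
```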

Training details for Protocol II:

  • Visible branch: GlobalTrack pre-training, 12 epochs, learning rate annealed from 0.02 to 0.0002.
  • Thermal branch: Faster-RCNN (ImageNet) pre-training, 18 epochs, similar learning rate schedule.
  • Batch size: 2 per GPU; regression uses a smooth L_1 loss, classification uses cross-entropy.
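The annealing schedule is described only by its endpoints; one plausible reading is a log-linear decay over the 12 visible-branch epochs (the exact shape of the schedule is an assumption):

```python
def lr_at_epoch(epoch, total_epochs=12, lr_start=0.02, lr_end=0.0002):
    """Log-linearly interpolate the learning rate from lr_start to lr_end."""
    frac = epoch / (total_epochs - 1)  # 0 at first epoch, 1 at last
    return lr_start * (lr_end / lr_start) ** frac
```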

5. Experimental Performance and Benchmark Analysis

Under Protocol I on the infrared test set, state-of-the-art deep trackers such as SiamRCNN and GlobalTrack reach approximately 66% mSA, 87% precision, and 63% success, while correlation-filter trackers lag at around 40% mSA. Under Protocol II, a vanilla fine-tune of GlobalTrack yields 63.86% mSA on the IR test set, and DFSC raises this to 66.24% mSA (+2.38 pts). On the visible (RGB) test set, DFSC improves AUC and precision by roughly 0.6 points each; on the validation split, mSA increases from 72.00% to 80.09% (+8 pts).

Attribute-wise, DFSC brings the most pronounced improvements for:

  • Occlusion (OC): +1.28 pts (validation)
  • Scale Variation (SV): +0.97 pts (validation)
  • Thermal Crossover (TC_all): +1.07 pts (test)

Qualitatively, DFSC demonstrates robustness against heavy occlusion and thermal distractors. No additional inference-time cost is incurred due to the offline nature of semantic modulation.

6. Research Significance and Future Directions

Anti-UAV-RGBT is the first benchmark to provide large-scale, unaligned RGB and thermal video for UAV tracking. It enables critical progress in developing robust anti-UAV surveillance and tracking systems, especially in scenarios with varying illumination, occlusion, and modal transitions. DFSC introduces a training paradigm that leverages dataset-wide semantic consistency, producing stronger, UAV-specific representations; observed empirical gains validate its effectiveness for long-term tracking under challenging conditions.

A plausible implication is that methodologies exploiting cross-sequence or class-level semantics may generalize well to other domains with uniform object categories. Expanding into modalities beyond RGB/T, such as LiDAR or radar, and refining multi-modal fusion are identified as crucial areas for future research (Jiang et al., 2021).

References

Jiang, N., et al. (2021). Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking. arXiv:2101.08466.
