UAV-Anti-UAV: Detection, Tracking & Defense

Updated 4 July 2026

UAV-Anti-UAV is a domain focused on securing airspace by classifying, detecting, and tracking small, fast, and unpredictable drones using multi-sensor data.
The field leverages continuous open-world perception, addressing challenges like tiny targets, occlusions, clutter, and sensor uncertainty through collaborative methods.
Interdisciplinary countermeasures—from adaptive chip design to communication jamming and cooperative tracking—enhance resilience against unauthorized UAV intrusions.

UAV-Anti-UAV denotes the problem of securing airspace against unauthorized or hostile drones by building systems that can classify, detect, and track UAVs robustly in realistic conditions (Dong et al., 16 Apr 2025). In operational terms, once a drone is detected, a tracker is needed to continuously estimate its position and trajectory for interception, monitoring, or other countermeasures, but real anti-UAV scenarios are dominated by tiny targets, out-of-view events, clutter, thermal crossover, and multi-sensor uncertainty rather than clean laboratory settings (Xie et al., 31 Jul 2025). Within this broader space, recent work also uses “UAV-Anti-UAV” to denote an air-to-air visual tracking task in which a pursuer UAV tracks a target adversarial UAV from its onboard camera under severe dual-dynamic disturbances (Zhang et al., 8 Dec 2025).

1. Problem scope and task formulations

Early anti-UAV benchmarks largely formalized the problem as single-object tracking with a given first-frame template. In that setting, the UAV is assumed to be visible in the first frame, and the tracker is initialized with that target crop. AntiUAV600 explicitly identifies this as a weak formulation for real anti-UAV scenarios, because UAVs may appear anywhere in the video, not exist in some frames, reappear after being absent, and be very small, fast, and embedded in complex clutter. The shift is summarized by the move from

$\bm{x}_t^\ast=\arg\max_{\bm{x}_t} f(\bm{X}_t;\bm{Z},\bm{w})$

to

$\bm{x}_t^\ast=\arg\max_{\bm{x}_t} f(\bm{X}_t;\bm{w}),$

which removes the first-frame template $\bm{Z}$ from the objective and turns anti-UAV into continuous open-world perception (Zhu et al., 2023).

This reformulation also changed benchmark design. The 3rd Anti-UAV Workshop & Challenge introduced two tracks. In Track 1, the target’s bounding box in the first frame is given, and the system must track that target in every frame by predicting its bounding box; when the target disappears, the tracker must output an invisible mark, namely “no bounding box.” In Track 2, whether a drone exists in the first frame is unknown, and the method must detect and track the drone when it appears; if the target does not exist or disappears, the output should again be invisible or “no bounding box” (Zhao et al., 2023).
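The output convention shared by both tracks can be sketched as follows. This is a minimal illustration only: the function name and the `[x, y, w, h]` box layout are assumptions, not the challenge's official submission format.

```python
from typing import List, Optional, Sequence

# Assumed box layout for illustration: [x, y, w, h].
Box = Sequence[float]


def format_track_output(per_frame_boxes: List[Optional[Box]]) -> List[List[float]]:
    """Return one entry per frame.

    A frame where the tracker believes the target is absent (None) is
    written as an empty list, i.e. the "no bounding box" invisible mark.
    """
    return [list(box) if box is not None else [] for box in per_frame_boxes]


if __name__ == "__main__":
    # Track 2 style sequence: the drone is absent, appears, then disappears.
    predictions = [None, [10.0, 12.0, 8.0, 6.0], None]
    print(format_track_output(predictions))
```

In Track 1 the first entry would come from the given first-frame box, whereas in Track 2 the first entries may legitimately be empty until the drone appears.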

A closely related but distinct formulation is air-to-air anti-UAV tracking. The UAV-Anti-UAV benchmark defines a pursuer UAV tracking a target adversarial UAV in the video stream. This differs from standard UAV tracking, where a UAV camera tracks ground targets, and from standard Anti-UAV tracking, where a ground-based camera tracks an airborne UAV. The defining difficulty is the dual-dynamic regime: the tracker UAV is moving rapidly, and the target UAV is also moving rapidly and unpredictably (Zhang et al., 8 Dec 2025).

2. Benchmarks, modalities, and annotated difficulty

The benchmark landscape now spans visible light, thermal infrared, RGB-T, and broader multi-modal sensing. Anti-UAV established a large multi-modal benchmark for UAV tracking with paired RGB and thermal infrared video; later datasets increased scene complexity, target smallness, and absence/reappearance realism; MMAUD added stereo vision, LiDAR, radar, and audio arrays with Leica-generated ground truth; and UAV-Anti-UAV introduced language prompts and air-to-air pursuit video (Jiang et al., 2021, Yuan et al., 2024, Zhang et al., 8 Dec 2025).

Benchmark	Modality	Distinguishing properties
Anti-UAV	RGB + thermal infrared	318 RGB-T video pairs; over 580k manually annotated bounding boxes; attributes and existence flags
DUT Anti-UAV	visible light	detection dataset with a total of 10,000 images; tracking dataset with 20 videos
AntiUAV600	thermal infrared	600 video sequences; 723K total thermal infrared frames; bounding boxes and “Not Exist” labels
MMAUD	stereo vision, LiDAR, radar, audio arrays	50 sequences; Leica-generated ground truth; supports detection, tracking, classification, and trajectory estimation
CST Anti-UAV	thermal infrared	220 video sequences; over 240k high-quality bounding box annotations; complete manual frame-level attribute annotations
UAV-Anti-UAV	RGB video + language prompts	1,810 videos; 1.05 million frames; bounding boxes, a language prompt, and 15 tracking attributes

The benchmark series also reveals a strong thermal-infrared tradition. The 2nd Anti-UAV Workshop & Challenge used 280 high-quality, full HD thermal infrared video sequences split into test-dev and test-challenge, while the 3rd challenge released a training set for the first time and organized 200 thermal infrared video sequences for training, 200 for Track 1 test, and 200 for Track 2 test (Zhao et al., 2021, Zhao et al., 2023).

CST Anti-UAV sharpened benchmark emphasis on tiny UAVs in complex scenes. It uses thermal infrared imagery collected with infrared cameras at 25 fps and resolution $640\times512$, with 220 video sequences, over 240k high-quality bounding box annotations, and a split of 120 training / 40 validation / 60 test sequences. It classifies object size by bounding-box diagonal length into tiny $(0,10]$, small $(10,30]$, normal $(30,50]$, and large $(50,\infty)$, and includes 78,224 tiny objects, about 4.5× larger than the tiny-object counts in comparable datasets. It is also the first anti-UAV tracking dataset with complete manual frame-level attribute annotations, including OC, OV, SV, TC, DBC, and CDB (Xie et al., 31 Jul 2025).
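The size binning above can be written as a short helper. This is a sketch that assumes the diagonal is measured in pixels from the box width and height; the function name is hypothetical.

```python
import math


def size_class(width: float, height: float) -> str:
    """Bin a bounding box by diagonal length using the CST Anti-UAV intervals."""
    diagonal = math.hypot(width, height)  # assumed to be in pixels
    if diagonal <= 0:
        raise ValueError("bounding box must have a positive diagonal")
    if diagonal <= 10:
        return "tiny"
    if diagonal <= 30:
        return "small"
    if diagonal <= 50:
        return "normal"
    return "large"


if __name__ == "__main__":
    for w, h in [(3, 4), (12, 16), (24, 32), (60, 80)]:
        print((w, h), size_class(w, h))
```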

3. Detection, tracking, and collaboration paradigms

Vision-based anti-UAV detection has been dominated by small-object-oriented detector design, especially YOLO-style one-stage detectors with architectural modifications for weak targets and cluttered backgrounds. Reported examples include DotD-YOLOv9-C with 92.6% mAP on the Anti2 dataset, ISTD-DETR with 96.2% mAP50 on Anti-UAV410 and 133 FPS, and RTM-UAVDet with AP = 66.1% and 72.4 FPS. The same review groups tracking methods into Siamese-based, self-attention-based, and vision fusion-based families, and reports representative infrared tracking results such as SiamDT with SA = 68.19% on Anti-UAV410 and CAMTracker with 88.56% precision, 66.68% AUC, and 67.10% SA (Ding et al., 14 Jul 2025).

Challenge results show that top anti-UAV trackers are rarely plain frame-to-frame regressors. In the 2nd Anti-UAV Workshop & Challenge, BIT_OITS built SiamSTA on Siam R-CNN and added size/aspect-ratio constraints, spatio-temporal attention, and change detection; COLA Try embedded ECO, KYS, SuperDiMP, and Stark in the LTMU framework with multi-tracker voting and motion enhancement; and JNU combined SuperDiMP, SiamRPN++, and TransT with multi-scale search and sliding-window re-detection. The paper’s summary is explicit: robust Anti-UAV performance depends heavily on re-acquisition, ensemble or multi-scale reasoning, and motion-informed tracking rather than naive frame-to-frame regression (Zhao et al., 2021).

Detector-tracker collaboration is a recurrent design pattern. On DUT Anti-UAV, a simple fusion strategy invokes detection when tracker confidence falls below $\tau_t=0.9$ and allows the detector to override the tracker when detector confidence exceeds both $\tau_d=0.9$ and the tracker score. All eight tested trackers improve significantly after fusion with detection. The paper gives a concrete example: SiamFC + Faster R-CNN(VGG16) improves Success by 23.4% over SiamFC alone, while the best fused result is LTMU + Faster R-CNN(VGG16) with Success 0.664, Norm Pre 0.865, and Precision 0.961 (Zhao et al., 2022).
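The fusion rule described for DUT Anti-UAV can be sketched as below. Only the two thresholds come from the text; the function name, the tuple types, and the lazy `run_detector` callback are illustrative assumptions, and the sketch does not cover how the original method re-initializes the tracker after an override.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]   # assumed layout: (x, y, w, h)
Scored = Tuple[Box, float]                # (box, confidence)

TAU_T = 0.9  # tracker-confidence threshold from the text
TAU_D = 0.9  # detector-confidence threshold from the text


def fuse(tracker_out: Scored,
         run_detector: Callable[[], Optional[Scored]]) -> Tuple[Box, float, str]:
    """Return (box, score, source) for one frame."""
    box, t_score = tracker_out
    if t_score >= TAU_T:
        # Tracker is confident: the detector is not invoked at all.
        return box, t_score, "tracker"
    detection = run_detector()
    if detection is not None:
        d_box, d_score = detection
        # Override only if the detector beats both its threshold and the tracker.
        if d_score > TAU_D and d_score > t_score:
            return d_box, d_score, "detector"
    return box, t_score, "tracker"


if __name__ == "__main__":
    weak_track = ((0.0, 0.0, 4.0, 4.0), 0.5)
    print(fuse(weak_track, lambda: ((5.0, 5.0, 4.0, 4.0), 0.95)))
```

Passing the detector as a callback keeps the expensive global detection off the per-frame path whenever the tracker is confident, which is the point of the strategy.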

A more explicit collaboration model appears in EDTC, which alternates between global UAV detection and local UAV tracking and uses an evidential head as a binary target/background switcher. On AntiUAV600, YOLO detection alone reaches 0.392 Acc, a simple combination of detection and tracking reaches 0.352, and full EDTC with the evidential head reaches 0.486. The same ablation shows that uncertainty-aware switching is central: Tracker1 + SC gives 0.263, whereas Tracker1 + EC gives 0.439, and Tracker3 + EC reaches 0.486 (Zhu et al., 2023).
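The alternation between global detection and local tracking can be pictured as a two-mode loop. The sketch below is not the EDTC implementation: the evidential head is replaced by a generic boolean callback `is_target`, and all three callables are hypothetical placeholders supplied by the caller.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Box = Tuple[float, float, float, float]  # assumed layout: (x, y, w, h)


def run_alternating(
    frames: Sequence[Frame],
    detect_global: Callable[[Frame], Optional[Box]],
    track_local: Callable[[Frame, Box], Optional[Box]],
    is_target: Callable[[Frame, Box], bool],
) -> List[Optional[Box]]:
    """Track locally while the switcher accepts the box; otherwise detect globally."""
    outputs: List[Optional[Box]] = []
    last_box: Optional[Box] = None
    for frame in frames:
        if last_box is None:
            candidate = detect_global(frame)          # global search mode
        else:
            candidate = track_local(frame, last_box)  # local tracking mode
        # The switcher decides target vs. background for the candidate.
        if candidate is not None and is_target(frame, candidate):
            last_box = candidate
        else:
            last_box = None  # report absence and fall back to detection
        outputs.append(last_box)
    return outputs
```

A rejected candidate yields `None` for that frame, so absence reporting and re-detection fall out of the same switch decision.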

4. Metrics, attributes, and the UAV-Anti-UAV air-to-air benchmark

Anti-UAV evaluation is state-aware because the target may be absent. Anti-UAV and CST Anti-UAV use state accuracy,

$SA = \sum_{t} \frac{IOU_{t} \times \delta\left(v_{t}>0\right) + p_{t} \times \left(1 - \delta\left(v_{t}>0\right)\right)}{T},$

where $IOU_{t}$ is the intersection-over-union between predicted and ground-truth boxes at frame $t$, $v_{t}$ is the visibility flag of the ground truth, $p_{t}$ is the predicted location/state used when the target is not visible, $\delta(\cdot)$ is the indicator function, and $T$ is the number of frames. The average over all sequences is denoted mSA (Jiang et al., 2021, Xie et al., 31 Jul 2025).
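A per-sequence computation of the state accuracy formula can be sketched as follows. The sketch makes two assumptions that are not stated in the text: boxes are `[x, y, w, h]`, and on frames where the target is absent the prediction term is 1 if the tracker also reports absence (`None`) and 0 otherwise.

```python
from typing import List, Optional, Sequence

Box = Sequence[float]  # assumed layout: [x, y, w, h]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def state_accuracy(preds: List[Optional[Box]], gts: List[Optional[Box]]) -> float:
    """SA for one sequence; a ground truth of None means the target is not visible."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("need equally long, non-empty prediction and label lists")
    total = 0.0
    for pred, gt in zip(preds, gts):
        if gt is not None:
            # Visible frame: score by IoU (a missing prediction scores 0).
            total += iou(pred, gt) if pred is not None else 0.0
        else:
            # Absent frame: reward only a correct absence prediction (assumed p_t).
            total += 1.0 if pred is None else 0.0
    return total / len(gts)


if __name__ == "__main__":
    box = [0.0, 0.0, 2.0, 2.0]
    print(state_accuracy([box, None, box, None], [box, None, None, box]))
```

Averaging `state_accuracy` over all sequences would give the mSA figure mentioned above.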

The workshop benchmarks and AntiUAV600 make the same point in a different way: absence prediction is part of the task. In the Anti-UAV challenge protocol, an empty bounding-box list denotes a “not exist” flag, and the benchmark accuracy rewards IoU when the target exists and rewards predicting an empty box when it does not. AntiUAV600 extends this accuracy measure so that failed localization on visible frames is also penalized (Zhao et al., 2021, Zhu et al., 2023).

Attribute annotation has become progressively finer. Anti-UAV labels OV, OC, FM, SV, LI, TC, and LR. AntiUAV600 labels OV, OC, FM, SV, IC, DBC, and TS. CST Anti-UAV introduces complete manual frame-level attribute annotations and adds CDB to explicitly capture the presence of multiple dynamic distractors in the background. The CST analysis reports especially poor performance for DBC, TC, CDB, OV, and tiny objects, with out-of-view producing a bimodal performance distribution because the tracker must search and re-localize the target after it disappears (Jiang et al., 2021, Zhu et al., 2023, Xie et al., 31 Jul 2025).

Within this broader literature, UAV-Anti-UAV defines the air-to-air variant most directly associated with the term in recent benchmarking. The benchmark contains 1,810 videos, 1.05 million frames, 9.85 hours total duration, densely annotated bounding boxes, language prompts, and 15 fine-grained attributes. Its MambaSTS baseline integrates spatial, temporal, and semantic learning and is reported as the top performer across all metrics (AUC, Pre, cAUC, nPre, and mACC). The same paper emphasizes that its mACC is 6.6 points higher than that of the second-best tracker, SUTrack-B224, and nearly 17 points above the mean of all 50 trackers, which indicates that there is still significant room for improvement in the UAV-Anti-UAV domain (Zhang et al., 8 Dec 2025).

5. Countermeasures beyond visual tracking

Anti-UAV research also includes communication denial and interference-aware coordination. In a multi-UAV telemetry network, cooperative anti-jamming has been formulated as a Stackelberg game with the jammer as leader and the UAVs as followers, combined with a local altruistic game and a distributed stochastic learning automata algorithm. The UAV subgame is proved to be an exact potential game, and a stationary Stackelberg equilibrium is shown to exist overall. Compared with existing non-cooperative methods, the proposed method is reported to improve anti-jamming performance by about 64% when the number of channels is 6 and by about 32% when it is 2 (Su et al., 2021).
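To give a sense of what a stochastic learning automaton does for channel selection, the sketch below uses the textbook linear reward-inaction update. This is a generic stand-in, not the update rule of the cited paper; the function name, the step size, and the normalization of the reward to [0, 1] are assumptions.

```python
from typing import List


def lri_update(probs: List[float], chosen: int, reward: float,
               step: float = 0.1) -> List[float]:
    """Linear reward-inaction update of a channel-selection distribution.

    probs  : current selection probabilities over channels (sums to 1)
    chosen : index of the channel used in this round
    reward : normalized utility in [0, 1] observed for that channel
    """
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must be normalized to [0, 1]")
    # Shrink every probability in proportion to the reward ...
    updated = [p - step * reward * p for p in probs]
    # ... and move the freed mass to the chosen channel.
    updated[chosen] = probs[chosen] + step * reward * (1.0 - probs[chosen])
    return updated


if __name__ == "__main__":
    probs = [0.25, 0.25, 0.25, 0.25]
    # A jammed channel would yield reward 0 and leave the distribution unchanged.
    print(lri_update(probs, chosen=2, reward=1.0))
```

Because a zero reward leaves the distribution untouched, each UAV drifts toward channels that have paid off, using only its own observed utility.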

At infrastructure scale, joint communications and jamming has been proposed for dual-functional MIMO cellular systems. The formulation minimizes total transmit power while ensuring that each legitimate UE achieves a target rate and each unauthorized UAV is forced below an SINR threshold. A central structural result is that dedicated jamming streams are not needed: the communication signals themselves can be shaped to provide the necessary jamming effect. Simulations report a 1–8 dB transmit-power gain over channel inversion as thresholds vary, and gains that can expand from about 2 dB to 13 dB when the number of unauthorized UAVs grows from 2 to 8 (Li et al., 2024).
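The verbal formulation has the generic shape of a power-minimization problem with two families of constraints. The notation below is assumed for illustration and does not come from the cited paper: $\bm{w}_k$ is the precoder for legitimate UE $k$, $\mathrm{SINR}_k$ is the SINR at that UE, $R_k$ is its target rate, $\mathrm{SINR}^{\mathrm{U}}_j$ is the SINR experienced by unauthorized UAV $j$ on the link to be denied, and $\gamma_j$ is the corresponding threshold.

$\min_{\{\bm{w}_k\}} \sum_{k=1}^{K}\|\bm{w}_k\|^2 \quad \text{s.t.}\quad \log_2\left(1+\mathrm{SINR}_k\right)\ge R_k\ \ \forall k,\qquad \mathrm{SINR}^{\mathrm{U}}_j\le\gamma_j\ \ \forall j.$

Only communication precoders appear as variables here, in line with the result that dedicated jamming streams are unnecessary; the paper's exact problem statement may differ in detail.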

Active aerial countermeasures have also been studied. One line uses two guardian UAVs, two noisy real-time distances, anti-synchronization, and an X-Y circular motion combined with vertical jitter to estimate, encircle, monitor, and, if necessary, intercept a hostile UAV. In simulation, once the hostile drone enters the warning zone, the guardians switch to Target 2, the target estimation error drops to about 0.05 m, and the encirclement tracking errors also settle to about 0.05 m; the real-world experiment lasted about 50 seconds and recorded about 6,000 samples with effective update rates around 20–25 Hz (Liu et al., 16 Jun 2025).

A separate swarm-defense line emphasizes capture and escort rather than collision or jamming. The proposed defense UAV swarm operates through four phases (clustering, formation, chase, and escort) while maintaining a balancing invariant. Its goal is to form a 3D cluster or hemispherical interception/capture formation around the malicious UAV so that the intruder has only a minimum set of safe movement options. Simulation studies show that communication range is a major performance factor, wobbling degrades escort time, and the approach is resilient against communication losses through local clustering and rebalancing (Brust et al., 2018).

6. System implementations and research directions

Algorithmic work is now accompanied by dedicated anti-UAV systems. A 40 nm CMOS hybrid frame/event anti-UAV chip reconstructs binary event frames using run-length encoding, generates region proposals, and adaptively switches between frame mode and event mode based on object size and velocity. The chip achieves 96 pJ/(frame·pixel) in frame mode and 61 pJ/event in event mode at 0.8 V, reaches 98.2% average recognition accuracy on UAVs flying at 50–400 m and moving at 5–80 pixels/s, and reduces NPU computational load by 98.3% on the drone dataset and 97% on the vehicle dataset (Lu et al., 12 Dec 2025).
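Run-length encoding of a binary event row, and the reconstruction step that inverts it, can be illustrated in software as follows. This is a generic sketch of the technique only; the chip's actual hardware encoding format is not described here, and the function names are hypothetical.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (bit value, run length)


def rle_encode(bits: List[int]) -> List[Run]:
    """Compress one row of a binary event frame into (value, length) runs."""
    runs: List[Run] = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((bit, 1))              # start a new run
    return runs


def rle_decode(runs: List[Run]) -> List[int]:
    """Reconstruct the binary row from its runs."""
    row: List[int] = []
    for value, length in runs:
        row.extend([value] * length)
    return row


if __name__ == "__main__":
    row = [0, 0, 0, 1, 1, 0]
    encoded = rle_encode(row)
    print(encoded)
    print(rle_decode(encoded) == row)
```

Event frames are mostly zeros with a few short bursts of activity, which is the regime where run-length coding is compact.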

Multi-modal sensing platforms are equally important for calibration-grade evaluation. MMAUD combines stereo vision, LiDAR, radar, and audio arrays, provides 50 sequences and over 1700 seconds of multi-modal recordings, and uses a Leica Nova MS60 MultiStation with a crystal prism mounted on the UAV to generate ground truth at 5 Hz. In baseline experiments, YOLOX achieves 85.9 mAP for 2D detection, while visual 3D estimation models achieve around 0.54–0.57 m relative error and audio-based estimation reaches errors of 2.64–3.43 m under ambient heavy machinery noise (Yuan et al., 2024).

Recent work has also attacked the data bottleneck directly. A training-free, web-scale pipeline collects UAV-related Internet videos from YouTube, TikTok, and Bilibili, filters them with GPT-4o and CLIP, and produces 3D trajectory hypotheses and UAV type cues. The retained corpus is about 200,000 seconds of video across 2,245 sequences. On MMAUD, the paper reports zero-shot transfer results as four error measures in metres together with classification accuracy and FPS. It also reports a clear data scaling behavior: as the amount of online video data increases, zero-shot transfer performance on the target dataset improves consistently, without any target-domain training (Lei et al., 10 Mar 2026).

The surveys converge on a consistent set of unresolved problems. Persistent gaps include real-time performance, stealth detection, and swarm-based scenarios; benchmark and deployment issues include synchronization, calibration, latency, and compute costs in multi-modal fusion; and method-level directions include multimodal data fusion and heterogeneous sensor collaboration, improving generalization across diverse environments, handling target confusion and occlusion in multi-UAV scenarios, improving robustness and suppressing false detections, and balancing real-time performance and accuracy (Dong et al., 16 Apr 2025, Ding et al., 14 Jul 2025). A plausible implication is that future UAV-Anti-UAV systems will be judged less by isolated tracking accuracy on clean sequences than by how well they combine tiny-target perception, absence reasoning, reappearance handling, multi-sensor collaboration, and deployable computation in realistic counter-UAV conditions.