VisDrone Dataset Benchmark
- VisDrone dataset is a large-scale UAV benchmark featuring richly annotated imagery for object detection, tracking, and crowd counting.
- The dataset covers urban and suburban scenes across 14 Chinese cities with varying weather, lighting, and object scales.
- VisDrone facilitates robust evaluation of visual algorithms under challenges such as occlusion, scale variation, and dense object distribution.
The VisDrone dataset is a large-scale computer-vision benchmark designed for visual object detection, multiple-object tracking, single-object tracking, and crowd counting using aerial imagery from unmanned aerial vehicles (UAVs). Captured in urban and suburban scenes from 14 Chinese cities using consumer-grade drones under various weather and lighting conditions, VisDrone provides richly annotated data—including bounding boxes, object categories, occlusion, and truncation information—across images and video sequences. It is widely regarded as the largest drone-acquired dataset for such tasks to date, enabling high-fidelity evaluation and development of visual analysis algorithms under real-world constraints characterized by significant occlusion, extreme scale and pose variations, dense object distributions, and fast motion (Zhu et al., 2018, Zhu et al., 2020, Du et al., 2021).
1. Data Acquisition, Coverage, and Composition
VisDrone comprises imagery obtained from DJI Mavic and Phantom UAV platforms (models 3, 3A, 3SE, 3P, 4, 4A, 4P) across Tianjin, Hong Kong, Daqing, Ganzhou, Guangzhou, Jinchang, Liuzhou, Nanjing, Shaoxing, Shenyang, Nanyang, Zhangjiakou, Suzhou, and Xuzhou. The collected data span diverse urban and suburban scenes, sampled under a wide spectrum of weather (cloudy, sunny, night) and varying altitudes/viewpoints, yielding variable object scales and densities.
VisDrone dataset variants include:
- VisDrone (Object Analysis):
- Static images: 10 209 (up to 2000×1500 px, independent of video frames)
- Video clips: 263, totaling 179 264 frames (up to 3840×2160 px)
- Total annotated instances: over 2.5 million
- Task-based breakdown: see Table 1
- VisDrone-CC2020 (Crowd Counting):
- Images: 3 360 frames at 1920×1080 px, from 70 scenarios in five cities; annotated with pedestrian head points
- Total annotated heads: 486 155; mean count per image: 144.7
| Variant | Modality | #Samples | Annotations/Labels |
|---|---|---|---|
| VisDrone2018 | Images | 10 209 | Bounding boxes, 10 categories |
| VisDrone2018 | Video | 263 clips / 179 264 frames | Frame-wise boxes, IDs, occlusion |
| VisDrone-CC2020 | Images | 3 360 | Head points, scale, illumination |
The dataset’s breadth covers dense traffic, crowded environments, and sparsely populated regions, increasing generalization potential for algorithms (Zhu et al., 2018, Du et al., 2021).
2. Annotation Protocols and Object Taxonomy
Annotations for object detection and tracking consist of axis-aligned bounding boxes for 10 object categories: pedestrian, person (sitting/riding), car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Each object is additionally annotated with:
- occlusion ratio: none (0%), partial (1–50%), heavy (>50%)
- truncation ratio: proportion of bounding box outside the image; objects with truncation >50% are ignored in evaluation
Per-object annotation fields (for object detection/tracking tasks):
```
[frame_id, object_id, bbox_left, bbox_top, width, height, category_id, occlusion, truncation]
```
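The record above can be read into a typed structure with a few lines of Python. This is a minimal illustrative parser, not part of the official VisDrone toolkit; the helper name and the example line are assumptions:

```python
# Sketch: parse one comma-separated VisDrone annotation line into a typed
# record, following the field order listed above.
from typing import NamedTuple

class VisDroneRecord(NamedTuple):
    frame_id: int
    object_id: int
    bbox_left: int
    bbox_top: int
    width: int
    height: int
    category_id: int
    occlusion: int   # 0 = none, 1 = partial (1-50%), 2 = heavy (>50%)
    truncation: int  # objects with truncation >50% are ignored in evaluation

def parse_line(line: str) -> VisDroneRecord:
    """Parse one annotation line; values are integers in the order above."""
    fields = [int(v) for v in line.strip().split(",")[:9]]
    return VisDroneRecord(*fields)

rec = parse_line("1,5,684,8,273,116,4,0,0")  # illustrative values
```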
3. Benchmark Tasks and Definitions
VisDrone supports the following computer-vision task definitions:
- Object Detection in Images: Detect all instances of the 10 predefined categories in single static images, outputting confidence-scored bounding boxes.
- Object Detection in Videos: Detect all object instances per frame in video clips with the same 10 categories, delivering per-frame bounding boxes and confidence scores.
- Single-Object Tracking (SOT): Given an initial bounding box in the first frame, localize the specified target in all subsequent frames (online).
- Multi-Object Tracking (MOT):
  - Track 4A (detection-free): the tracker must both detect and associate all objects across frames.
  - Track 4B (detections provided): the tracker uses detection boxes supplied from Task 2, associating them temporally into consistent object identities.
- Crowd Counting (VisDrone-CC2020): Estimate the number of people visible using point-level annotations (head centers), ranking algorithms by mean absolute error (MAE) over the test set (Zhu et al., 2018, Du et al., 2021, Zhu et al., 2020).
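The association step required by Track 4B can be illustrated with a greedy IoU-matching baseline: detections in the current frame are linked to last-frame tracks by best box overlap. This is a toy sketch for exposition, not the VisDrone reference tracker; box format and threshold are assumptions:

```python
# Sketch: greedy IoU-based association of supplied detections to existing
# tracks, in the spirit of MOT Track 4B (detections given, IDs to assign).

def iou(a, b):
    """IoU of two axis-aligned boxes given as (left, top, width, height)."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def associate(prev_tracks, detections, thresh=0.5):
    """Greedily match each track to its best-overlapping unused detection."""
    matches, unmatched = {}, list(enumerate(detections))
    for tid, tbox in prev_tracks.items():
        if not unmatched:
            break
        j, best = max(enumerate(unmatched), key=lambda p: iou(tbox, p[1][1]))
        if iou(tbox, best[1]) >= thresh:
            matches[tid] = best[0]   # track id -> detection index
            unmatched.pop(j)
    return matches

prev = {1: (10, 10, 50, 80)}                     # last-frame track boxes
dets = [(12, 11, 50, 80), (200, 40, 30, 30)]     # current-frame detections
# associate(prev, dets) -> {1: 0}
```

Practical trackers replace greedy matching with Hungarian assignment and add motion models, but the data flow is the same.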
4. Dataset Splits and Evaluation Procedures
Splits for each benchmark task are defined as follows:
| Task | Train | Validation | Test |
|---|---|---|---|
| Detection in Images | 6 471 images | 548 images | 3 190 images |
| Detection/Tracking in Videos | 56 clips / 24 201 frames | 7 clips / 2 819 frames | 33 clips / 12 968 frames |
| Single-Object Tracking | 86 seq / 69 941 frames | 11 seq / 7 046 frames | 70 seq / 62 289 frames |
| Crowd Counting (CC2020) | 2 460 images | – | 900 images |
Ground-truth annotations are available for training and validation sets; test-set ground-truth is withheld, and participants must submit predictions via the online evaluation server on www.aiskyeye.com (Zhu et al., 2018, Zhu et al., 2020).
5. Evaluation Metrics
Detection (Tasks 1, 2):
- Intersection over Union (IoU): $\mathrm{IoU}(B_p, B_{gt}) = \dfrac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$
- Average Precision (AP) per class $c$: $\mathrm{AP}_c = \int_0^1 p_c(r)\,dr$, where $p_c(r)$ is precision as a function of recall $r$
- Mean Average Precision (mAP): $\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c$, where $C = 10$ is the number of object categories
- COCO-style metrics: $\mathrm{AP}$ averaged over IoU thresholds $0.50{:}0.05{:}0.95$ (primary), $\mathrm{AP}_{50}$, $\mathrm{AP}_{75}$, and average recall $\mathrm{AR}$ at up to 1, 10, 100, and 500 detections
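The per-class AP definition can be sketched as the area under the precision-recall curve computed from score-ranked detections. This is an illustrative step-function integration; the official VisDrone/COCO evaluators use interpolated, multi-IoU-threshold AP:

```python
# Sketch: per-class Average Precision as the area under the precision-
# recall curve, from detection scores and true-positive flags.
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP for one class: rank by score, accumulate TP/FP, integrate p(r)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # step-function integral of p(r) dr (interpolation omitted for brevity)
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

ap = average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_gt=2)  # ≈ 0.833
```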
Multi-Object Tracking (CLEAR-MOT):
- Multiple Object Tracking Accuracy (MOTA): $\mathrm{MOTA} = 1 - \dfrac{\sum_t (\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t)}{\sum_t \mathrm{GT}_t}$
- MOTP, IDF1, and standard counts (FAF, MT, ML, FP, FN, IDS, FM, Hz)
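MOTA reduces to a single ratio over per-frame error counts, as a minimal sketch shows (the counts below are illustrative values, not VisDrone data):

```python
# Sketch: CLEAR-MOT accuracy from per-frame false negatives (FN), false
# positives (FP), identity switches (IDS), and ground-truth counts (GT).
def mota(fn, fp, ids, gt):
    """MOTA = 1 - (sum FN + sum FP + sum IDS) / (sum GT)."""
    return 1.0 - (sum(fn) + sum(fp) + sum(ids)) / sum(gt)

score = mota(fn=[2, 1], fp=[1, 0], ids=[0, 1], gt=[10, 10])  # 1 - 5/20 = 0.75
```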
Single-Object Tracking:
- Success: Area under curve (AUC) of overlap-threshold plot
- Precision: Percentage of frames with center error ≤20 pixels
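Both SOT measures follow directly from per-frame overlaps and center errors. The sketch below assumes (left, top, width, height) boxes and a 21-point threshold sweep; it is an illustration of the metric definitions, not the official evaluation script:

```python
# Sketch: SOT Success (mean of the overlap-threshold success curve, i.e.
# its AUC) and Precision (fraction of frames with center error <= 20 px).
import numpy as np

def center(b):
    return (b[0] + b[2] / 2.0, b[1] + b[3] / 2.0)

def overlap(a, b):
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """Area under the success-rate curve over IoU thresholds."""
    ious = np.array([overlap(p, g) for p, g in zip(pred, gt)])
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def precision_at_20(pred, gt):
    """Fraction of frames whose predicted center is within 20 px of GT."""
    err = [np.hypot(center(p)[0] - center(g)[0], center(p)[1] - center(g)[1])
           for p, g in zip(pred, gt)]
    return float(np.mean(np.array(err) <= 20.0))
```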
Crowd Counting (CC2020):
- MAE: Mean absolute error, $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |z_i - \hat{z}_i|$
- RMSE: Root mean squared error, $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (z_i - \hat{z}_i)^2}$; MAE is used for primary ranking (Du et al., 2021).
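The two counting errors are a few lines each; the predicted and ground-truth counts below are illustrative, not leaderboard data:

```python
# Sketch: crowd-counting MAE and RMSE over a set of images, following the
# definitions above (z_i = ground-truth count, z_hat_i = predicted count).
import numpy as np

def mae(pred, gt):
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(gt))))

def rmse(pred, gt):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(gt)) ** 2)))

pred, gt = [140, 150, 160], [144, 147, 150]
# mae  -> (4 + 3 + 10) / 3       ≈ 5.667
# rmse -> sqrt((16 + 9 + 100)/3) ≈ 6.455
```

RMSE penalizes large per-image errors more heavily, which is why the leaderboard in Section 6 can rank differently under the two measures.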
6. Baseline Results and Leaderboards
The original VisDrone challenge paper reports no algorithmic baselines but notes that results from competition participants are hosted on the official website. State-of-the-art detectors and tracking methods have achieved the following best performances in recent challenge years (Zhu et al., 2020):
- DET 2020: DroneEye2020 (DetectoRS) AP=34.57%; “small object” recall remains under 25% for “bicycle” and “person.”
- VID 2020: DBAI-Det (Cascade R-CNN + deformable convolution + context) AP=29.2%
- SOT 2020: LTNMI (ensemble: ATOM, SiamRPN++, etc.) Success=76.5%, Prec=92.3%
- MOT 2020: COFE (cascade R-CNN det) MOTA ≈ 55%
Crowd counting challenge (VisDrone-CC2020) leading results (Du et al., 2021):
- FPNCC: MAE=11.66, RMSE=15.45
- BVCC: MAE=12.36, RMSE=15.19
- CFF: MAE=13.65, RMSE=17.32

Further methods and detailed architectures are reported in the challenge paper and leaderboard.
7. Access, Protocol, and Research Directions
VisDrone datasets can be accessed at www.aiskyeye.com following institutional registration. Downloadable materials include data, annotation definitions, and evaluation scripts. Test set results must be submitted online for scoring; use of supplementary data is permitted if disclosed.
Key challenges exposed by VisDrone include reliable detection and tracking under small object sizes, heavy occlusion, variable density, severe scale variation, and real-time efficiency constraints. Open problems and expansion points articulated by the organizers are: end-to-end joint detection and tracking frameworks, integration of crowd-level counting and segmentation, balancing accuracy with computational cost (with AutoNAS/FP-NAS for resource constraints), and inclusion of multi-modal (e.g., RGB + IR) and tiny-object annotation (Zhu et al., 2020).
VisDrone remains the definitive large-scale drone vision benchmark, catalyzing advances in detection, tracking, and counting in unconstrained aerial imaging scenarios (Zhu et al., 2018, Zhu et al., 2020, Du et al., 2021).