VisDrone Dataset Benchmark
- VisDrone dataset is a large-scale UAV benchmark featuring richly annotated imagery for object detection, tracking, and crowd counting.
- The dataset covers urban and suburban scenes across 14 Chinese cities with varying weather, lighting, and object scales.
- VisDrone facilitates robust evaluation of visual algorithms under challenges such as occlusion, scale variation, and dense object distribution.
The VisDrone dataset is a large-scale computer-vision benchmark designed for visual object detection, multiple-object tracking, single-object tracking, and crowd counting using aerial imagery from unmanned aerial vehicles (UAVs). Captured in urban and suburban scenes from 14 Chinese cities using consumer-grade drones under various weather and lighting conditions, VisDrone provides richly annotated data—including bounding boxes, object categories, occlusion, and truncation information—across images and video sequences. It is widely regarded as the largest drone-acquired dataset for such tasks to date, enabling high-fidelity evaluation and development of visual analysis algorithms under real-world constraints characterized by significant occlusion, extreme scale and pose variations, dense object distributions, and fast motion (Zhu et al., 2018, Zhu et al., 2020, Du et al., 2021).
1. Data Acquisition, Coverage, and Composition
VisDrone comprises imagery obtained from DJI Mavic and Phantom UAV platforms (models 3, 3A, 3SE, 3P, 4, 4A, 4P) across Tianjin, Hong Kong, Daqing, Ganzhou, Guangzhou, Jinchang, Liuzhou, Nanjing, Shaoxing, Shenyang, Nanyang, Zhangjiakou, Suzhou, and Xuzhou. The collected data span diverse urban and suburban scenes, sampled under a wide spectrum of weather (cloudy, sunny, night) and varying altitudes/viewpoints, yielding variable object scales and densities.
VisDrone dataset variants include:
- VisDrone (Object Analysis):
- Static images: 10 209 (up to 2000×1500 px, independent of video frames)
- Video clips: 263, totaling 179 264 frames (up to 3840×2160 px)
- Total annotated instances: over 2.5 million
- Task-based breakdown: see Table 1
- VisDrone-CC2020 (Crowd Counting):
- Images: 3 360 frames at 1920×1080 px, from 70 scenarios in five cities; annotated with pedestrian head points
- Total annotated heads: 486 155; mean count per image: 144.7
| Variant | Modality | #Samples | Annotations/Labels |
|---|---|---|---|
| VisDrone2018 | Images | 10 209 | Bounding boxes, 10 categories |
| VisDrone2018 | Video | 263 clips / 179 264 frames | Frame-wise boxes, IDs, occlusion |
| VisDrone-CC2020 | Images | 3 360 | Head points, scale, illumination |
The dataset’s breadth covers dense traffic, crowded environments, and sparsely populated regions, increasing generalization potential for algorithms (Zhu et al., 2018, Du et al., 2021).
2. Annotation Protocols and Object Taxonomy
Annotations for object detection and tracking consist of axis-aligned bounding boxes for 10 object categories: pedestrian, person (sitting/riding), car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Each object is additionally annotated with:
- occlusion ratio: none (0%), partial (1–50%), heavy (>50%)
- truncation ratio: proportion of bounding box outside the image; objects with truncation >50% are ignored in evaluation
Per-object annotation fields (for object detection/tracking tasks):
```
[frame_id, object_id, bbox_left, bbox_top, width, height, category_id, occlusion, truncation]
```
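The record above can be read into a typed structure with a few lines of Python. This is a minimal illustrative parser, not part of the official VisDrone toolkit; the helper name and the example line are assumptions:

```python
# Sketch: parse one comma-separated VisDrone annotation line into a typed
# record, following the field order listed above.
from typing import NamedTuple

class VisDroneRecord(NamedTuple):
    frame_id: int
    object_id: int
    bbox_left: int
    bbox_top: int
    width: int
    height: int
    category_id: int
    occlusion: int   # 0 = none, 1 = partial (1-50%), 2 = heavy (>50%)
    truncation: int  # objects with truncation >50% are ignored in evaluation

def parse_line(line: str) -> VisDroneRecord:
    """Parse one annotation line; values are integers in the order above."""
    fields = [int(v) for v in line.strip().split(",")[:9]]
    return VisDroneRecord(*fields)

rec = parse_line("1,5,684,8,273,116,4,0,0")  # illustrative values
```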
3. Benchmark Tasks and Definitions
VisDrone supports the following computer-vision task definitions:
- Object Detection in Images: Detect all instances of the 10 predefined categories in single static images, outputting confidence-scored bounding boxes.
- Object Detection in Videos: Detect all object instances per frame in video clips with the same 10 categories, delivering per-frame bounding boxes and confidence scores.
- Single-Object Tracking (SOT): Given an initial bounding box in the first frame, localize the specified target in all subsequent frames (online).
- Multi-Object Tracking (MOT):
  - Track 4A (detection-free): the tracker must both detect and associate all objects across frames.
  - Track 4B (detections provided): the tracker uses detection boxes supplied from Task 2, associating them temporally into consistent object identities.
- Crowd Counting (VisDrone-CC2020): Estimate the number of people visible using point-level annotations (head centers), ranking algorithms by mean absolute error (MAE) over the test set (Zhu et al., 2018, Du et al., 2021, Zhu et al., 2020).
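The association step required by Track 4B can be illustrated with a greedy IoU-matching baseline: detections in the current frame are linked to last-frame tracks by best box overlap. This is a toy sketch for exposition, not the VisDrone reference tracker; box format and threshold are assumptions:

```python
# Sketch: greedy IoU-based association of supplied detections to existing
# tracks, in the spirit of MOT Track 4B (detections given, IDs to assign).

def iou(a, b):
    """IoU of two axis-aligned boxes given as (left, top, width, height)."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def associate(prev_tracks, detections, thresh=0.5):
    """Greedily match each track to its best-overlapping unused detection."""
    matches, unmatched = {}, list(enumerate(detections))
    for tid, tbox in prev_tracks.items():
        if not unmatched:
            break
        j, best = max(enumerate(unmatched), key=lambda p: iou(tbox, p[1][1]))
        if iou(tbox, best[1]) >= thresh:
            matches[tid] = best[0]   # track id -> detection index
            unmatched.pop(j)
    return matches

prev = {1: (10, 10, 50, 80)}                     # last-frame track boxes
dets = [(12, 11, 50, 80), (200, 40, 30, 30)]     # current-frame detections
# associate(prev, dets) -> {1: 0}
```

Practical trackers replace greedy matching with Hungarian assignment and add motion models, but the data flow is the same.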
4. Dataset Splits and Evaluation Procedures
Splits for each benchmark task are defined as follows:
| Task | Train | Validation | Test |
|---|---|---|---|
| Detection in Images | 6 471 images | 548 images | 3 190 images |
| Detection/Tracking in Videos | 56 clips / 24 201 frames | 7 clips / 2 819 frames | 33 clips / 12 968 frames |
| Single-Object Tracking | 86 seq / 69 941 frames | 11 seq / 7 046 frames | 70 seq / 62 289 frames |
| Crowd Counting (CC2020) | 2 460 images | – | 900 images |
Ground-truth annotations are available for training and validation sets; test-set ground-truth is withheld, and participants must submit predictions via the online evaluation server on www.aiskyeye.com (Zhu et al., 2018, Zhu et al., 2020).
5. Evaluation Metrics
Detection (Tasks 1, 2):
- Intersection over Union (IoU): $\mathrm{IoU}(B_p, B_{gt}) = \dfrac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$
- Average Precision (AP) per class $c$: $\mathrm{AP}_c = \int_0^1 p_c(r)\,dr$, where $p_c(r)$ is precision as a function of recall $r$
- Mean Average Precision (mAP): $\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c$, where $C = 10$ is the number of object categories
- COCO-style metrics: $\mathrm{AP}$ averaged over IoU thresholds $0.50{:}0.05{:}0.95$ (primary), $\mathrm{AP}_{50}$, $\mathrm{AP}_{75}$, and average recall $\mathrm{AR}$ at up to 1, 10, 100, and 500 detections
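The per-class AP definition can be sketched as the area under the precision-recall curve computed from score-ranked detections. This is an illustrative step-function integration; the official VisDrone/COCO evaluators use interpolated, multi-IoU-threshold AP:

```python
# Sketch: per-class Average Precision as the area under the precision-
# recall curve, from detection scores and true-positive flags.
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP for one class: rank by score, accumulate TP/FP, integrate p(r)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # step-function integral of p(r) dr (interpolation omitted for brevity)
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

ap = average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_gt=2)  # ≈ 0.833
```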
Multi-Object Tracking (CLEAR-MOT):
- Multiple Object Tracking Accuracy (MOTA): $\mathrm{MOTA} = 1 - \dfrac{\sum_t (\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t)}{\sum_t \mathrm{GT}_t}$
- MOTP, IDF1, and standard counts (FAF, MT, ML, FP, FN, IDS, FM, Hz)
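MOTA reduces to a single ratio over per-frame error counts, as a minimal sketch shows (the counts below are illustrative values, not VisDrone data):

```python
# Sketch: CLEAR-MOT accuracy from per-frame false negatives (FN), false
# positives (FP), identity switches (IDS), and ground-truth counts (GT).
def mota(fn, fp, ids, gt):
    """MOTA = 1 - (sum FN + sum FP + sum IDS) / (sum GT)."""
    return 1.0 - (sum(fn) + sum(fp) + sum(ids)) / sum(gt)

score = mota(fn=[2, 1], fp=[1, 0], ids=[0, 1], gt=[10, 10])  # 1 - 5/20 = 0.75
```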
Single-Object Tracking:
- Success: Area under curve (AUC) of overlap-threshold plot
- Precision: Percentage of frames with center error ≤20 pixels
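Both SOT measures follow directly from per-frame overlaps and center errors. The sketch below assumes (left, top, width, height) boxes and a 21-point threshold sweep; it is an illustration of the metric definitions, not the official evaluation script:

```python
# Sketch: SOT Success (mean of the overlap-threshold success curve, i.e.
# its AUC) and Precision (fraction of frames with center error <= 20 px).
import numpy as np

def center(b):
    return (b[0] + b[2] / 2.0, b[1] + b[3] / 2.0)

def overlap(a, b):
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """Area under the success-rate curve over IoU thresholds."""
    ious = np.array([overlap(p, g) for p, g in zip(pred, gt)])
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def precision_at_20(pred, gt):
    """Fraction of frames whose predicted center is within 20 px of GT."""
    err = [np.hypot(center(p)[0] - center(g)[0], center(p)[1] - center(g)[1])
           for p, g in zip(pred, gt)]
    return float(np.mean(np.array(err) <= 20.0))
```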
Crowd Counting (CC2020):
- MAE: Mean absolute error, $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |z_i - \hat{z}_i|$
- RMSE: Root mean squared error, $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (z_i - \hat{z}_i)^2}$; MAE is used for primary ranking (Du et al., 2021).
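The two counting errors are a few lines each; the predicted and ground-truth counts below are illustrative, not leaderboard data:

```python
# Sketch: crowd-counting MAE and RMSE over a set of images, following the
# definitions above (z_i = ground-truth count, z_hat_i = predicted count).
import numpy as np

def mae(pred, gt):
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(gt))))

def rmse(pred, gt):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(gt)) ** 2)))

pred, gt = [140, 150, 160], [144, 147, 150]
# mae  -> (4 + 3 + 10) / 3       ≈ 5.667
# rmse -> sqrt((16 + 9 + 100)/3) ≈ 6.455
```

RMSE penalizes large per-image errors more heavily, which is why the leaderboard in Section 6 can rank differently under the two measures.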
6. Baseline Results and Leaderboards
The original VisDrone challenge paper reports no algorithmic baselines but notes that results from competition participants are hosted on the official website. State-of-the-art detectors and tracking methods have achieved the following best performances in recent challenge years (Zhu et al., 2020):
- DET 2020: DroneEye2020 (DetectoRS) AP=34.57%; “small object” recall remains under 25% for “bicycle” and “person.”
- VID 2020: DBAI-Det (Cascade R-CNN + deformable convolution + context) AP=29.2%
- SOT 2020: LTNMI (ensemble: ATOM, SiamRPN++, etc.) Success=76.5%, Prec=92.3%
- MOT 2020: COFE (cascade R-CNN det) MOTA ≈ 55%
Crowd counting challenge (VisDrone-CC2020) leading results (Du et al., 2021):
- FPNCC: MAE=11.66, RMSE=15.45
- BVCC: MAE=12.36, RMSE=15.19
- CFF: MAE=13.65, RMSE=17.32

Further methods and detailed architectures are reported in the challenge paper and leaderboard.
7. Access, Protocol, and Research Directions
VisDrone datasets can be accessed at www.aiskyeye.com following institutional registration. Downloadable materials include data, annotation definitions, and evaluation scripts. Test set results must be submitted online for scoring; use of supplementary data is permitted if disclosed.
Key challenges exposed by VisDrone include reliable detection and tracking under small object sizes, heavy occlusion, variable density, severe scale variation, and real-time efficiency constraints. Open problems and expansion points articulated by the organizers are: end-to-end joint detection and tracking frameworks, integration of crowd-level counting and segmentation, balancing accuracy with computational cost (with AutoNAS/FP-NAS for resource constraints), and inclusion of multi-modal (e.g., RGB + IR) and tiny-object annotation (Zhu et al., 2020).
VisDrone remains the definitive large-scale drone vision benchmark, catalyzing advances in detection, tracking, and counting in unconstrained aerial imaging scenarios (Zhu et al., 2018, Zhu et al., 2020, Du et al., 2021).