UCF-ARG Aerial Dataset Benchmark
- UCF-ARG Aerial Dataset is a publicly available video corpus of human actions captured in high-resolution aerial footage, exhibiting real-world environmental challenges such as scale variation, camera motion, and background clutter.
- The dataset supports rigorous evaluation with leave-4-actors-out cross-validation across 10 folds, providing reproducible metrics for deep learning models.
- Preprocessing uses optical flow for patch extraction from 1920×1080 videos, yielding over 26,000 patches for benchmarking human vs. nonhuman detection.
The UCF-ARG Aerial Dataset is a publicly available video corpus designed for the development and benchmarking of human detection and recognition algorithms in aerial video sequences. Acquired with a helium-balloon–mounted nonstatic camera at 1920 × 1080 pixel resolution and 60 frames per second, UCF-ARG presents characteristic challenges such as altitude-induced scale variation, camera jitter, dynamic illumination, and extensive background clutter. In its canonical use for human detection tasks, the dataset comprises actions performed by multiple actors and supports rigorous cross-validation protocols, enabling reproducible and comparative evaluation of feature learning approaches and deep models (AlDahoul et al., 1 Jan 2026).
1. Imaging Platform, Video Content, and Scene Composition
The primary imaging system for the UCF-ARG dataset is a nonstatic aerial camera platform, specifically a helium balloon payload. Each video sequence captures real-world outdoor scenes with three parked cars, which serve as consistent sources of background clutter. The spatial resolution is 1920 × 1080 pixels, and the sensor records at 60 fps. Variations in camera altitude introduce significant scale changes in visible actors, increasing the difficulty for automated detection systems.
Within the dataset, 12 actors are recorded performing 10 predefined actions: boxing, carrying, clapping, digging, jogging, open-close trunk, running, throwing, walking, and waving. For each actor–action pairing, four repetitions are provided, yielding a total of 48 videos per action (12 actors × 4 repetitions). Car placement is fixed, but the background remains nontrivial due to environmental variability and the motion of both subjects and camera.
2. Annotation Schema and Metadata
UCF-ARG includes comprehensive annotation for each sequence, with metadata provided in the original release. Annotations specify the actor ID (1–12), action label (1–10), and repetition index (1–4), and each video filename encodes these identifiers. All media are stored in .avi format at the original 1920 × 1080, 60 fps specification. For human detection studies, such as AlDahoul et al. (AlDahoul et al., 1 Jan 2026), a subset of actions—specifically digging, waving, throwing, walking, and running—is designated as “of interest.” This selection emphasizes actions with substantial human movement, yielding focused yet challenging detection instances.
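Since this summary does not specify the exact naming convention, the sketch below illustrates one way metadata could be recovered from clip names; the regular expression and the assumed filename form (e.g., `actor05_digging_r2.avi`) are illustrative placeholders, not the official UCF-ARG scheme.

```python
import re
from dataclasses import dataclass

# Hypothetical filename pattern for illustration only; check against the
# official UCF-ARG release for the actual naming convention.
FILENAME_RE = re.compile(r"actor(?P<actor>\d{2})_(?P<action>[a-z_]+)_r(?P<rep>\d)\.avi")

@dataclass
class ClipMeta:
    actor_id: int      # 1-12
    action: str        # one of the 10 action labels
    repetition: int    # 1-4

def parse_clip_name(name: str) -> ClipMeta:
    """Parse actor, action, and repetition identifiers from an assumed clip name."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized clip name: {name}")
    return ClipMeta(int(m.group("actor")), m.group("action"), int(m.group("rep")))
```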
3. Data Partitioning and Evaluation Protocol
Standard experimental design for UCF-ARG uses leave-4-actors-out cross-validation across ten unique folds. In each fold, 4 of the 12 actors constitute the held-out test set (4 actors × 5 actions × 4 repetitions = 80 videos per fold), while the remaining 8 actors are used for training (8 × 5 × 4 = 160 videos per fold). For computational tractability and controlled redundancy, only 1 out of every 10 frames of each .avi is processed for patch extraction, rather than processing every frame.
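A minimal sketch of this partitioning logic follows; because the benchmark's specific 10 held-out actor subsets are not listed here, a seeded random draw stands in for them, and the constants simply mirror the counts quoted above.

```python
import random

ACTORS = list(range(1, 13))                       # 12 actor IDs
ACTIONS_OF_INTEREST = ["digging", "waving", "throwing", "walking", "running"]
REPETITIONS = 4
N_FOLDS = 10
HELD_OUT_ACTORS = 4
FRAME_STRIDE = 10                                 # process 1 of every 10 frames

def make_folds(seed: int = 0):
    """Illustrative leave-4-actors-out folds.

    The benchmark's actual actor subsets are not specified in this summary,
    so a seeded random selection is used purely for illustration.
    """
    rng = random.Random(seed)
    folds = []
    for _ in range(N_FOLDS):
        test_actors = sorted(rng.sample(ACTORS, HELD_OUT_ACTORS))
        train_actors = [a for a in ACTORS if a not in test_actors]
        folds.append((train_actors, test_actors))
    return folds

for train_actors, test_actors in make_folds():
    n_train = len(train_actors) * len(ACTIONS_OF_INTEREST) * REPETITIONS  # 160
    n_test = len(test_actors) * len(ACTIONS_OF_INTEREST) * REPETITIONS    # 80
    assert (n_train, n_test) == (160, 80)
```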
The aggregated training and test sets across all folds include 26,541 raw patches in total: 5,862 labeled as containing human actors and 20,679 as nonhuman (background or other moving objects). Performance is quantified via overall classification accuracy, computed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ denote counts of true positives, true negatives, false positives, and false negatives, respectively. Fold-wise results are averaged over all cross-validation folds for robust assessment.
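As a minimal sketch of how reported numbers are aggregated, the helper below computes per-fold accuracy from confusion counts and averages it over folds; the function names and input layout are illustrative, not the benchmark's evaluation code.

```python
def fold_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall classification accuracy for a single fold."""
    return (tp + tn) / (tp + tn + fp + fn)

def mean_accuracy(fold_counts) -> float:
    """Average of fold-wise accuracies over all cross-validation folds.

    `fold_counts` is an iterable of (TP, TN, FP, FN) tuples, one per fold.
    """
    accs = [fold_accuracy(*c) for c in fold_counts]
    return sum(accs) / len(accs)
```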
4. Preprocessing, Patch Extraction, and Data Augmentation
Preprocessing in the benchmark workflow is grounded in motion-based segmentation via Horn–Schunck optical flow. The process consists of computing dense flow fields between consecutive frames, thresholding the magnitude to identify moving regions, and applying morphological closing to enhance connectedness. Blob analysis is then performed to extract bounding-box patches corresponding to moving “objects,” presumed to include both human and nonhuman (e.g., artifact, vehicle, or clutter) candidates.
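The sketch below illustrates this motion-segmentation pipeline with OpenCV. The benchmark uses Horn–Schunck optical flow, for which OpenCV's modern API offers no direct routine, so Farnebäck dense flow is substituted purely for illustration; the magnitude threshold, structuring-element size, and minimum blob area are assumed values, not the paper's.

```python
import cv2
import numpy as np

def extract_motion_patches(prev_gray, curr_gray, frame_bgr,
                           mag_thresh=1.0, min_area=200):
    """Sketch of motion-based patch extraction (Farneback flow as a stand-in
    for Horn-Schunck; thresholds and kernel size are illustrative)."""
    # 1. Dense optical flow between consecutive grayscale frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)

    # 2. Threshold flow magnitude to obtain a binary motion mask
    mask = (magnitude > mag_thresh).astype(np.uint8) * 255

    # 3. Morphological closing to connect fragmented motion regions
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # 4. Blob analysis: bounding boxes of connected moving regions
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    patches = []
    for c in contours:
        if cv2.contourArea(c) < min_area:
            continue            # discard tiny, noise-like blobs
        x, y, w, h = cv2.boundingRect(c)
        patches.append(frame_bgr[y:y + h, x:x + w])
    return patches
```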
For subsequent feature learning, patch images are resized according to the requirements of specific models: the S-CNN and Hierarchical Extreme Learning Machine (H-ELM) modules receive fixed-size grayscale inputs, while the pretrained AlexNet extractor operates on RGB patches resized to its expected input dimensions. No synthetic augmentation techniques—such as spatial flipping, cropping, or color perturbation—are applied at any stage.
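A brief sketch of the per-model resizing step follows; the grayscale side length is a placeholder since the exact dimensions are not given here, and 227×227 is AlexNet's canonical input resolution rather than a value confirmed by the benchmark.

```python
import cv2

GRAY_SIDE = 32        # assumed S-CNN / H-ELM input side length (placeholder)
ALEXNET_SIDE = 227    # canonical AlexNet input resolution (not benchmark-confirmed)

def prepare_patch(patch_bgr):
    """Produce a grayscale input (S-CNN / H-ELM) and an RGB input (AlexNet)."""
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    gray_in = cv2.resize(gray, (GRAY_SIDE, GRAY_SIDE))
    rgb = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2RGB)
    rgb_in = cv2.resize(rgb, (ALEXNET_SIDE, ALEXNET_SIDE))
    return gray_in, rgb_in
```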
5. Characteristics Table
Key quantitative and descriptive attributes of the UCF-ARG dataset, as used in recent deep learning detection studies, are summarized below:
| Feature | Detail |
|---|---|
| Platform | Helium balloon (nonstatic camera mount) |
| Resolution / Frame rate | 1920×1080 px @ 60 fps |
| Actors | 12 |
| Actions (total / used) | 10 total; 5 used (digging, waving, throwing, walking, running) |
| Repetitions per action/actor | 4 |
| Videos per action | 48 |
| Total videos (5 actions) | 5 × 48 = 240 |
| Train / Test split per fold | 160 videos / 80 videos |
| Cross-validation | Leave-4-actors-out (10 folds) |
| Human vs. nonhuman patches | 5,862 human vs. 20,679 nonhuman |
| Preprocessing | Optical flow → blob → patch crop → resizing |
| Augmentation | None |
| Evaluation metric | Classification accuracy (%) |
6. Benchmarking and Research Utility
The UCF-ARG dataset’s complexity derives from the combination of varying camera viewpoints, altitude changes affecting human scale, motion blur, and static as well as dynamic background clutter. In the 2026 benchmark presented by AlDahoul et al. (AlDahoul et al., 1 Jan 2026), deep feature learning systems—including supervised convolutional neural networks (S-CNN), pretrained CNN feature extractors (AlexNet), and hierarchical extreme learning machines (H-ELM)—are evaluated on this dataset. The pretrained CNN model achieved an average accuracy of 98.09%, S-CNN reached 95.6% (softmax) and 91.7% (SVM), and H-ELM posted 95.9% accuracy. H-ELM’s total training time was 445 seconds on a standard CPU; S-CNN training required 770 seconds with a high-performance GPU.
Consistently high performance across all ten cross-validation folds demonstrates the dataset’s reliability as a benchmark for human/nonhuman classification in aerial video, particularly for data-driven approaches where realistic diversity in scale, subject identity, and background is essential. The absence of synthetic augmentation further situates reported results as direct outcomes of model and preprocessing choices rather than of data inflation.
7. Access, Provenance, and Application Scope
The official UCF-ARG dataset is available at http://crcv.ucf.edu/data/UCF-ARG.php, as originally released by K. Reddy and collaborators at the University of Central Florida in 2017. While the dataset was designed for multi-action recognition and remains broadly relevant to human-motion understanding, its principal contemporary utility is in aerial human detection under real-world conditions requiring robustness to environmental variation and minimal scene control. The dataset’s patch-level structure and grounded annotation scheme permit reproducible algorithmic evaluation and comparison, constituting a reference benchmark for subsequent advances in deep-learning–based video detection methodologies (AlDahoul et al., 1 Jan 2026).