
UCF-ARG Aerial Dataset Benchmark

Updated 8 January 2026
  • UCF-ARG Aerial Dataset is a publicly available video corpus capturing human actions in aerial footage with high resolution and real-world environmental challenges.
  • The dataset supports rigorous evaluation with leave-4-actors-out cross-validation across 10 folds, providing reproducible metrics for deep learning models.
  • Preprocessing uses optical flow for patch extraction from 1920×1080 videos, yielding over 26,000 patches for benchmarking human vs. nonhuman detection.

The UCF-ARG Aerial Dataset is a publicly available video corpus designed for the development and benchmarking of human detection and recognition algorithms in aerial video sequences. Acquired with a helium-balloon–mounted nonstatic camera at 1920 × 1080 pixel resolution and 60 frames per second, UCF-ARG presents characteristic challenges such as altitude-induced scale variation, camera jitter, dynamic illumination, and extensive background clutter. In its canonical use for human detection tasks, the dataset comprises actions performed by multiple actors and supports rigorous cross-validation protocols, enabling reproducible and comparative evaluation of feature learning approaches and deep models (AlDahoul et al., 1 Jan 2026).

1. Imaging Platform, Video Content, and Scene Composition

The primary imaging system for the UCF-ARG dataset is a nonstatic aerial camera platform, specifically a helium balloon payload. Each video sequence captures real-world outdoor scenes with three parked cars, which serve as consistent sources of background clutter. The spatial resolution is 1920 × 1080 pixels, and the sensor records at 60 fps. Variations in camera altitude introduce significant scale changes in visible actors, increasing the difficulty for automated detection systems.

Within the dataset, 12 actors are recorded performing 10 predefined actions: boxing, carrying, clapping, digging, jogging, open-close trunk, running, throwing, walking, and waving. For each actor–action pairing, four repetitions are provided, yielding a total of 48 unique videos per action (12 actors × 4 repetitions). The background remains static in terms of car placement, but remains nontrivial due to environmental variability and the mobility of both subjects and camera.

2. Annotation Schema and Metadata

UCF-ARG includes comprehensive annotation for each sequence, with metadata provided by the original release. Annotations specify the actor ID (ranging from 1 to 12), action label (1–10), and repetition index (1–4), and each video filename encodes these identifiers. All media are stored in .avi format at the original 1920 × 1080, 60 fps specification. For human detection studies, such as in AlDahoul et al. (AlDahoul et al., 1 Jan 2026), a subset of actions—specifically digging, waving, throwing, walking, and running—is designated as “of interest.” These selection criteria facilitate focused yet challenging detection problem instances by emphasizing actions with substantive human movement.
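Because each filename encodes the actor, action, and repetition, metadata can be recovered by parsing alone. The exact encoding scheme is defined by the original release and is not reproduced here; the pattern below is a purely hypothetical illustration of such a parser, with the range checks taken from the annotation schema above.

```python
import re

# Hypothetical filename pattern -- the actual encoding is defined by the
# dataset release; adjust the regex to match the real scheme.
FILENAME_RE = re.compile(
    r"actor(?P<actor>\d{2})_action(?P<action>\d{2})_rep(?P<rep>\d)\.avi"
)

def parse_metadata(filename):
    """Extract (actor_id, action_label, repetition) and validate ranges."""
    m = FILENAME_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"unrecognised filename: {filename}")
    actor, action, rep = (int(m.group(k)) for k in ("actor", "action", "rep"))
    # Ranges from the UCF-ARG annotation schema: 12 actors, 10 actions, 4 reps
    assert 1 <= actor <= 12 and 1 <= action <= 10 and 1 <= rep <= 4
    return actor, action, rep
```

A parsed triple can then drive actor-based partitioning without any side files.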

3. Data Partitioning and Evaluation Protocol

Standard experimental design for UCF-ARG utilizes leave-4-actors-out cross-validation across ten unique folds. In each fold, 4 actors (out of 12) constitute the held-out test set (providing 4 actors × 5 actions × 4 repetitions = 80 videos per fold), while the remaining 8 actors are used for training (8 × 5 × 4 = 160 videos per fold). For computational tractability and controlled redundancy, only 1 out of every 10 frames in each .avi is processed for “patch” extraction, rather than full-frame or framewise annotation.
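The leave-4-actors-out partitioning above can be sketched as follows. The per-fold arithmetic (8 × 5 × 4 = 160 training videos, 4 × 5 × 4 = 80 test videos) comes from the source; the particular choice of which ten 4-actor subsets form the folds is not specified there, so the random sampling below is illustrative only.

```python
import random
from itertools import combinations

ACTORS = list(range(1, 13))  # 12 actors
ACTIONS = ["digging", "waving", "throwing", "walking", "running"]  # 5 actions of interest
REPETITIONS = [1, 2, 3, 4]

def make_folds(n_folds=10, seed=0):
    """Sample n_folds distinct 4-actor test groups (fold composition is illustrative)."""
    rng = random.Random(seed)
    all_groups = list(combinations(ACTORS, 4))  # all C(12, 4) = 495 candidate groups
    test_groups = rng.sample(all_groups, n_folds)
    folds = []
    for test_actors in test_groups:
        train_actors = [a for a in ACTORS if a not in test_actors]
        folds.append((train_actors, list(test_actors)))
    return folds

def videos(actors):
    """Enumerate (actor, action, repetition) video identifiers for a set of actors."""
    return [(a, act, r) for a in actors for act in ACTIONS for r in REPETITIONS]

folds = make_folds()
train, test = folds[0]
# 8 train actors x 5 actions x 4 reps = 160 videos; 4 test actors x 5 x 4 = 80
```

Because splits are made at the actor level, no individual ever appears in both the training and test sets of a fold, which is what makes the protocol a test of generalization to unseen subjects.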

The resultant aggregated training and test sets, across all folds, comprise 26,541 raw patches: 5,862 labeled as containing human actors and 20,679 as nonhuman (background or other moving objects). Performance is quantified via overall classification accuracy, computed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ denote counts of true positives, true negatives, false positives, and false negatives, respectively. Fold-wise results are averaged over all cross-validation folds for robust assessment.

4. Preprocessing, Patch Extraction, and Data Augmentation

Preprocessing in the benchmark workflow is grounded in motion-based segmentation via Horn–Schunck optical flow. The process consists of computing dense flow fields between consecutive frames, thresholding the magnitude to identify moving regions, and applying morphological closing to enhance connectedness. Blob analysis is then performed to extract bounding-box patches corresponding to moving “objects,” presumed to include both human and nonhuman (e.g., artifact, vehicle, or clutter) candidates.
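The threshold → close → blob → crop pipeline can be sketched compactly. The benchmark uses Horn–Schunck optical flow for the motion-magnitude map; since that algorithm is not in common Python libraries, absolute frame differencing stands in for it below, and the threshold and minimum-area values are arbitrary placeholders.

```python
import numpy as np
from scipy import ndimage

def extract_motion_patches(prev_frame, frame, mag_thresh=15, min_area=25):
    """Motion-based patch extraction: threshold a motion-magnitude map,
    apply morphological closing, then crop bounding boxes around blobs.
    Frame differencing stands in for Horn-Schunck flow magnitude here."""
    motion = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    mask = motion > mag_thresh                                       # moving regions
    mask = ndimage.binary_closing(mask, structure=np.ones((3, 3)))   # connectedness
    labels, n_blobs = ndimage.label(mask)                            # blob analysis
    patches = []
    for sl in ndimage.find_objects(labels):
        region = frame[sl]                                           # bounding-box crop
        if region.size >= min_area:                                  # discard tiny blobs
            patches.append(region)
    return patches
```

The crops returned here correspond to the candidate “objects” described above, which may be human or nonhuman; labeling them is the job of the downstream classifier.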

For subsequent feature learning, patch images are resized according to the requirements of specific models: S-CNN and Hierarchical Extreme Learning Machine (H-ELM) modules receive 100 × 100 pixel grayscale inputs, while the pretrained AlexNet extractor operates on 227 × 227 RGB patches. No synthetic augmentation techniques—such as spatial flipping, cropping, or color perturbation—are applied at any stage.
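A minimal sketch of this per-model resizing step, assuming grayscale patch crops as input: the input sizes are from the source, while the nearest-neighbour resizer is a stand-in for whatever standard image resampler the pipeline actually uses.

```python
import numpy as np

# Input sizes per model (height, width, channels), as stated in the benchmark
MODEL_INPUT = {"s_cnn": (100, 100, 1), "h_elm": (100, 100, 1), "alexnet": (227, 227, 3)}

def resize_nn(patch, out_h, out_w):
    """Nearest-neighbour resize (stand-in for any standard image resampler)."""
    h, w = patch.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows][:, cols]

def prepare_patch(patch, model):
    """Resize a grayscale patch crop to the named model's expected input shape."""
    h, w, c = MODEL_INPUT[model]
    resized = resize_nn(patch, h, w)
    if c == 3 and resized.ndim == 2:
        resized = np.stack([resized] * 3, axis=-1)  # replicate gray into 3 channels
    return resized
```

Note that, consistent with the no-augmentation protocol, this step only rescales; it introduces no flips, crops, or color perturbations.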

5. Characteristics Table

Key quantitative and descriptive attributes of the UCF-ARG dataset, as used in recent deep learning detection studies, are summarized below:

| Feature | Detail |
| --- | --- |
| Platform | Helium balloon (nonstatic camera mount) |
| Resolution / frame rate | 1920 × 1080 px @ 60 fps |
| Actors | 12 |
| Actions (total / used) | 10 total; 5 used (digging, waving, throwing, walking, running) |
| Repetitions per actor–action pair | 4 |
| Videos per action | 48 |
| Total videos (5 actions) | 5 × 48 = 240 |
| Train / test split per fold | 160 videos / 80 videos |
| Cross-validation | Leave-4-actors-out (10 folds) |
| Human vs. nonhuman patches | 5,862 positive vs. 20,679 negative |
| Preprocessing | Optical flow → blob analysis → patch crop → resizing |
| Augmentation | None |
| Evaluation metric | Classification accuracy (%) |

6. Benchmarking and Research Utility

The UCF-ARG dataset’s complexity derives from the combination of varying camera viewpoints, altitude changes affecting human scale, motion blur, and static as well as dynamic background clutter. In the 2026 benchmark presented by AlDahoul et al. (AlDahoul et al., 1 Jan 2026), deep feature learning systems—including supervised convolutional neural networks (S-CNN), pretrained CNN feature extractors (AlexNet), and hierarchical extreme learning machines (H-ELM)—are evaluated on this dataset. The pretrained CNN model achieved an average accuracy of 98.09%, S-CNN reached 95.6% (softmax) and 91.7% (SVM), and H-ELM posted 95.9% accuracy. H-ELM’s total training time was 445 seconds on a standard CPU; S-CNN training required 770 seconds with a high-performance GPU.

Consistent high-performance results across all ten cross-validation folds demonstrate the dataset’s reliability as a benchmark for human/nonhuman classification in aerial video, particularly for data-driven approaches where realistic diversity in scale, subject identity, and background is essential. The absence of synthetic augmentation further situates reported results as direct outcomes of model and preprocessing selection, rather than data inflation.

7. Access, Provenance, and Application Scope

The official UCF-ARG dataset is available at http://crcv.ucf.edu/data/UCF-ARG.php, as originally released by K. Reddy and collaborators at the University of Central Florida in 2017. While the dataset was designed for multi-action recognition and remains broadly relevant to human-motion understanding, its principal contemporary utility is in aerial human detection under real-world conditions requiring robustness to environmental variation and minimal scene control. The dataset’s patch-level structure and grounded annotation scheme permit reproducible algorithmic evaluation and comparison, constituting a reference benchmark for subsequent advances in deep-learning–based video detection methodologies (AlDahoul et al., 1 Jan 2026).
