UCF-Crime Dataset: Video Anomaly Recognition
- UCF-Crime is a comprehensive surveillance corpus comprising 1,900 untrimmed CCTV videos annotated for 13 crime categories plus normal events.
- It features multi-granular annotations—from video-level binary labels to fine-grained frame and sentence descriptions—facilitating temporal localization and captioning tasks.
- The dataset underpins diverse research tasks including anomaly classification, temporal grounding, multimodal QA, and anomaly detection with domain-adapted augmentation techniques.
The UCF-Crime dataset is a foundational resource for research in video-based anomaly recognition, particularly in surveillance contexts. It consists of a large-scale, real-world collection of untrimmed CCTV footage labeled at the video or frame level for events ranging from everyday activity to diverse criminal anomalies. Subsequent derivations, notably UCF-Crime Annotation (UCA), have extended its annotation granularity to enable multimodal, sentence-level event localization and description, further fueling advancements in temporal localization, captioning, and video–language reasoning tasks.
1. Dataset Composition and Structure
The original UCF-Crime corpus comprises 1,900 untrimmed surveillance videos sampled from genuine CCTV environments, spanning a total duration of approximately 129 hours and featuring wide variance in both environmental context and visual quality. Videos are captured at native resolutions of approximately 320×240 pixels with frame rates between 25–30 fps (Maqsood et al., 2021). The dataset includes 13 real-world crime categories—Abuse, Arrest, Arson, Assault, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, Vandalism, and Road Accident—plus a “Normal” category for non-anomalous footage. Class distributions reflect real-world frequency imbalances.
Subsets have been curated for focused use, such as the 1,699 videos adopted for the UCF-Crime Annotation (UCA) campaign after removing corrupted or low-quality sequences (Yuan et al., 2023, Chen et al., 13 Feb 2025).
2. Annotation Protocols and Schema
2.1 Original Video-Level and Frame-Level Annotations
Initially, UCF-Crime supplied only video-level binary anomaly labels. Several studies introduced finer frame-level annotation, wherein human evaluators marked the start and end frames of visible anomalies. All frames in this interval received label = 1, and all other frames (including entire normal videos) received label = 0. Overlapping or semantically entangled events within a clip were merged unless clearly distinct, a protocol that mitigates ambiguity in event demarcation (Maqsood et al., 2021).
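A minimal sketch of this labeling convention, assuming per-video anomaly annotations are supplied as inclusive (start_frame, end_frame) pairs (the on-disk annotation format is not reproduced here):

```python
import numpy as np

def frame_level_labels(num_frames, anomaly_intervals):
    """Build per-frame binary labels for one video.

    anomaly_intervals: list of inclusive (start_frame, end_frame) pairs marked
    by annotators; an empty list denotes a fully normal video.
    """
    labels = np.zeros(num_frames, dtype=np.int64)
    for start, end in anomaly_intervals:
        labels[start:end + 1] = 1  # every frame inside the interval is anomalous
    return labels

# Example: a 300-frame clip with one annotated anomaly spanning frames 120-180.
labels = frame_level_labels(300, [(120, 180)])
```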
2.2 Sentence-Level and Multigranular Annotations (UCA)
UCA extended this framework with linguistically rich, temporally precise event descriptions: 23,542 sentence-level annotations over 1,854 videos, with segment boundaries recorded at 0.1 s resolution, allowance for overlapping events, and multi-granular description where rapid scene changes demand hierarchical treatment. Annotations cover all 13 anomaly types and normal events, using comprehensive guidelines to ensure consistency and informativeness (average sentence length ≈20 words; noun:verb:adjective ratio ≈2:2:1). Ten annotators with computer science backgrounds, overseen by dedicated reviewers, produced the corpus over approximately two months, achieving high inter-annotator agreement (Yuan et al., 2023).
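For illustration only, a single UCA-style record might look like the sketch below; the field names and values are hypothetical and do not reproduce the released schema:

```python
# Hypothetical UCA-style sentence-level annotation record.
# Field names and values are illustrative, not the official format.
annotation = {
    "video_id": "Arson012_x264",    # example identifier
    "category": "Arson",
    "start_time": 34.2,             # seconds, 0.1 s granularity
    "end_time": 51.7,
    "sentence": "A man pours liquid on a parked car and sets it on fire.",
}
```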
3. Dataset Splits, Augmentation, and Statistics
Splitting conventions vary with experimental setup. A typical supervised split comprises 38 training and 12 test clips per class (532 train, 168 test; no explicit validation set) (Maqsood et al., 2021). UCA provides a 3-way split—train/val/test of 1,165/379/310 videos—covering 73.7, 16.4, and 20.6 hours of annotated footage, respectively (Yuan et al., 2023). Within the UCVL benchmark, the distribution is 1,030 train, 369 val, and 300 test videos, together covering the 1,699 curated videos (Chen et al., 13 Feb 2025).
Augmentation protocols include horizontal and vertical flips on each 16-frame training segment (no rotations or color jitter) to preserve spatiotemporal coherence, yielding a roughly threefold increase in clip count post-augmentation. Input frames are typically resized to 170×170, normalized to [0,1], and segmented into non-overlapping 16-frame “cubes.”
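A minimal preprocessing and augmentation sketch consistent with this description, assuming frames are already decoded as a (T, H, W, C) NumPy array; the exact pipeline in the cited work may differ:

```python
import cv2
import numpy as np

def make_cubes(frames, size=170, cube_len=16):
    """Resize frames to size x size, scale to [0, 1], and cut non-overlapping cubes."""
    resized = np.stack([cv2.resize(f, (size, size)) for f in frames]).astype(np.float32) / 255.0
    n = (len(resized) // cube_len) * cube_len          # drop the trailing partial cube
    return resized[:n].reshape(-1, cube_len, size, size, resized.shape[-1])

def flip_augment(cube):
    """Return the original 16-frame cube plus horizontal and vertical flips (3x increase)."""
    return [cube, cube[:, :, ::-1], cube[:, ::-1, :]]
```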
Table 1: UCF-Crime Core Splits (example statistics)

| Subset      | Videos | Duration (hours) | Annotation Granularity            |
|-------------|--------|------------------|-----------------------------------|
| UCF-Crime   | 1,900  | ~129             | Video-level, later frame-level    |
| UCA (train) | 1,165  | 73.7             | Sentence-level, temporal boundary |
| UCA (val)   | 379    | 16.4             | Sentence-level, temporal boundary |
| UCA (test)  | 310    | 20.6             | Sentence-level, temporal boundary |
4. Supported Research Tasks and Baselines
UCF-Crime and the benchmarks derived from it underpin a wide spectrum of video understanding and anomaly localization tasks:
4.1 Anomaly Detection and Classification: Early work employed binary or multiclass prediction (e.g., 3D ConvNets for 14-way crime recognition) using softmax outputs and cross-entropy loss (Maqsood et al., 2021), with frame- or video-level AUC as the primary performance metric.
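The snippet below is a hedged, minimal sketch of such a pipeline: a toy 3D ConvNet with a 14-way softmax head (13 anomalies plus Normal) trained with cross-entropy. It is not the architecture reported in Maqsood et al. (2021).

```python
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    """Toy 3D ConvNet for 14-way clip classification (13 anomalies + Normal)."""
    def __init__(self, num_classes=14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):            # x: (batch, 3, 16, 170, 170)
        h = self.features(x).flatten(1)
        return self.classifier(h)    # logits; softmax is applied inside the loss

model = Tiny3DConvNet()
criterion = nn.CrossEntropyLoss()    # combines log-softmax and negative log-likelihood
logits = model(torch.randn(2, 3, 16, 170, 170))
loss = criterion(logits, torch.tensor([0, 13]))
```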
4.2 Temporal Localization and Sentence Grounding: UCA enables Temporal Sentence Grounding in Video (TSGV), requiring the model to output start/end timestamps for textual queries, evaluated via R@K/IoU metrics. Baselines include CTRL, SCDM, 2D-TAN, LGI, MMN, and MomentDiff. Reported R@1 at IoU=0.3 remains below 9% for all methods, indicating persistent challenge (Yuan et al., 2023).
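A minimal sketch of the R@1, IoU=0.3 computation used in TSGV evaluation, assuming each textual query yields a single predicted (start, end) segment in seconds:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, iou_threshold=0.3):
    """Fraction of queries whose top-1 predicted segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```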
4.3 Captioning and Multimodal Tasks: Video Captioning (VC) and Dense Video Captioning (DVC) tasks generate sentence-level summaries for trimmed or untrimmed clips, with metrics such as BLEU, METEOR, ROUGE-L, and CIDEr. Surveillance-specific methods (e.g., SGN, SwinBERT, CoCap) achieve substantially lower scores on UCF-Crime than in open-domain video, reflecting the domain’s complexity (Yuan et al., 2023).
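For orientation, a single-caption BLEU-4 score can be computed with NLTK as below; this is an illustrative scoring call, not the benchmark's official captioning evaluation toolkit:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a man breaks a shop window and grabs items from the display".split()]
hypothesis = "a man breaks the window and takes items".split()

# BLEU-4 with smoothing, since short surveillance captions often lack 4-gram overlap.
bleu4 = sentence_bleu(reference, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```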
4.4 Multimodal Anomaly Detection (MAD): Incorporating vision and language signals through models like TEVAD—combining I3D representations with SwinBERT-generated caption embeddings—raises AUC from 83.1% (visual-only) to 84.9% (adding generic captions) and 85.3% with domain-pretrained “Surveillance SwinBERT” captions (Yuan et al., 2023).
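A hedged sketch of late fusion in this spirit: per-snippet visual and caption embeddings are concatenated and passed through an anomaly-scoring MLP. Feature dimensions are illustrative, and TEVAD's actual fusion is more elaborate.

```python
import torch
import torch.nn as nn

class FusionScorer(nn.Module):
    """Concatenate per-snippet visual (e.g., I3D) and text (caption) features,
    then score each snippet for anomaly. Dimensions are illustrative."""
    def __init__(self, vis_dim=1024, txt_dim=768, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, vis_feat, txt_feat):   # each: (batch, snippets, dim)
        fused = torch.cat([vis_feat, txt_feat], dim=-1)
        return torch.sigmoid(self.mlp(fused)).squeeze(-1)  # per-snippet anomaly score

scores = FusionScorer()(torch.randn(2, 32, 1024), torch.randn(2, 32, 768))
```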
4.5 Large-Model QA Benchmarking: UCVL restructures UCF-Crime and UCA into a large-model–style video QA benchmark with six question types (binary detection, classification, temporal grounding, MCQ, event description, anomaly description). Questions and answers are fully LLM-generated (Qwen2-72B), supporting assessment of multimodal LLMs (MLLMs) at scale (Chen et al., 13 Feb 2025).
5. Evaluation Metrics and Experimental Protocols
Multiple evaluation metrics are formalized for benchmarking on UCF-Crime and UCA:
- Classification accuracy: overall accuracy plus class-wise precision, recall, and F1 score (explicit formulas in Maqsood et al., 2021).
- AUC (Area Under ROC): the area under the curve of true-positive rate plotted against false-positive rate, with both micro- and macro-averaging; a micro-AUC of 82% is reported for the multiclass 3D ConvNet baseline (Maqsood et al., 2021).
- TSGV metrics: Recall@K at various IoU thresholds; e.g., R@1, IoU=0.3 top value is 8.68% (MMN) (Yuan et al., 2023).
- Captioning metrics: BLEU-n, METEOR, ROUGE-L, CIDEr; typical BLEU-4 scores <7 (Yuan et al., 2023).
- Composite QA scores (UCVL): an overall score aggregated from the six per-task subscores, where each subscore is accuracy, top-3 accuracy, mean IoU, or a GPT-4o-assigned grade in the case of open-ended responses (Chen et al., 13 Feb 2025); a minimal aggregation sketch follows this list.
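As a minimal sketch of the metrics above, frame-level AUC can be computed with scikit-learn, and a UCVL-style composite score can be formed by weighting per-task subscores. The equal weighting used here is an assumption, not the official protocol:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Frame-level AUC: per-frame binary ground truth vs. predicted anomaly scores.
frame_labels = np.array([0, 0, 1, 1, 1, 0])
frame_scores = np.array([0.1, 0.2, 0.7, 0.9, 0.6, 0.3])
auc = roc_auc_score(frame_labels, frame_scores)

def composite_qa_score(subscores, weights=None):
    """Aggregate per-task subscores (accuracy, top-3 accuracy, mean IoU, or a
    GPT-4o-assigned grade, each normalized to [0, 1]) into one overall score.
    Equal weighting is an assumption; UCVL's official weights may differ."""
    weights = weights or {k: 1.0 / len(subscores) for k in subscores}
    return sum(weights[k] * v for k, v in subscores.items())

# Illustrative subscore values only.
overall = composite_qa_score({
    "detection": 0.81, "classification": 0.46, "grounding_iou": 0.22,
    "mcq": 0.58, "event_description": 0.40, "anomaly_description": 0.37,
})
```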
6. Research Insights and Challenges
Analysis of UCF-Crime reveals considerable inter- and intra-class variability. Extended, visually salient events (e.g., abuse, explosion, fight, road accident, shoplifting) are recognized robustly (AUC ≈ 0.90–0.95), whereas short-duration or overlapped anomalies (arrest, burglary) yield lower AUC (<0.60). Fine-grained frame-level annotations and simple spatial augmentations (horizontal/vertical flips) substantially improve model accuracy and generalizability by increasing the effective training set and forcing spatiotemporal invariance (Maqsood et al., 2021).
In the multimodal regime, leveraging domain-matched captioners (e.g., Surveillance SwinBERT) measurably improves anomaly detection beyond purely visual baselines (Yuan et al., 2023). However, real-world CCTV is characterized by low resolution, challenging lighting, and sparsity of anomalous actions, making robust temporal segmentation, long-range reasoning, and language grounding highly non-trivial.
Reported results on sentence grounding, captioning, and open-ended video QA indicate that conventional action-recognition and captioning architectures, when ported directly from open-domain datasets, perform poorly in this context. This underscores the ecological complexity of surveillance video and motivates research into hierarchical temporal modeling, domain-adapted pretraining, and human-in-the-loop annotation methodologies.
7. Extensions, Benchmarks, and Directions
Recent developments have transformed UCF-Crime into more demanding resources:
- UCF-Crime Annotation (UCA): Enabling four core multimodal tasks—TSGV, VC, DVC, MAD—by delivering fine-grained, linguistically informed, temporally grounded event descriptions (Yuan et al., 2023).
- UCVL Benchmark: Organizing UCF-Crime and UCA into a large-model–oriented video QA benchmark with 16,990 LLM-generated question–answer pairs spanning detection, classification, temporal segmentation, reasoning, and open-ended synthesis. Includes detailed, automated scoring pipelines and supports benchmarking of MLLMs, with evaluation synthesized into a unified scoring protocol (Chen et al., 13 Feb 2025).
Ongoing directions include multimodal domain adaptation, efficient semi- and weakly-supervised learning for rare events, real-time retrieval across large surveillance streams, and explanatory anomaly detection with causal and rationale-aware outputs. Incorporation of audio, scene metadata, and cross-camera information further presents opportunities for richer environmental modeling. A plausible implication is that continual research attention will focus on developing foundation models that natively encode both visual and linguistic traits specifically tuned to surveillance video domains.