
UCF-Crime Annotation (UCA)

  • UCF-Crime Annotation (UCA) is a comprehensive labeling system that provides segment-level temporal boundaries, anomaly classifications, and natural language descriptions for surveillance videos.
  • UCA supports both binary (normal/abnormal) and multiclass protocols, enabling robust anomaly detection and diverse downstream applications in computer vision and multimodal AI.
  • Its integration of manual, AI-assisted, and hybrid workflows reduces annotation time while improving label precision and consistency for reliable evaluation.

UCF-Crime Annotation (UCA) is a set of comprehensive, temporally localized, and linguistically rich labels for the established surveillance video benchmark UCF-Crime, driving both conventional anomaly detection and a new wave of video-and-language understanding methodologies. UCA provides segment-level event boundaries, multiclass anomaly categories, and natural language descriptions—enabling precise training, robust evaluation, and diverse downstream tasks in computer vision and multimodal AI. Multiple independent efforts have contributed complementary UCA protocols, models, and evaluation frameworks, significantly enhancing the utility and granularity of UCF-Crime for academic and applied research.

1. Construction and Scope of UCF-Crime Annotation

The original UCF-Crime dataset consists of 1,900 real-world surveillance videos spanning 13 anomaly types (abuse, arrest, arson, assault, burglary, explosion, fighting, road accident, robbery, shooting, shoplifting, stealing, vandalism) plus a "normal" class. UCA augments this resource with temporally precise, segment-level annotations:

  • Manual Temporal Labeling: Annotators watch each video, marking start and end frames (or timestamps) of each anomaly. This transforms coarse video-level tags into fine-grained event intervals. Annotators follow protocols to exclude ambiguous scenes, focusing on clear evidence and aligning boundaries to shot changes (Park et al., 2022, Maqsood et al., 2021).
  • Linguistic Annotation: Each temporal segment is paired with a natural language sentence (mean 20.15 words) that describes the event, actors, appearance, and context. The corpus totals 23,542 sentences over 1,854 videos, covering 110.7 hours (Yuan et al., 2023).
  • Quality Assurance: Multi-phase review by senior researchers and iterative annotator training (e.g., 2-week course, gold-standard clips) ensure consistency, style harmonization, and precision at the 0.1-second level (Yuan et al., 2023).

Segment records take the form $(\text{video\_id}, \text{start}, \text{end}, \text{sentence})$; file outputs are provided as both plain text and JSON, supporting easy parsing and downstream integration. In specialized protocols (e.g., the action-recognition split (Park et al., 2022)), only missing segments are annotated to complete the test set for evaluation.
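
As an illustration, a minimal parsing sketch for such records is shown below. Field names follow the JSON example in Section 2; the file layout and helper names are assumptions, not an official UCA loader.

# Minimal sketch (not the official UCA loader): parse segment records of the
# form (video_id, start, end, sentence) from a hypothetical JSON file.
import json
from dataclasses import dataclass

@dataclass
class SegmentRecord:
    video_id: str
    start: float   # seconds
    end: float     # seconds
    sentence: str

def load_segments(path: str) -> list[SegmentRecord]:
    """Load a list of {"video_id", "start_time", "end_time", "sentence"} objects."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    records = []
    for item in raw:
        rec = SegmentRecord(
            video_id=str(item["video_id"]),
            start=float(item["start_time"]),
            end=float(item["end_time"]),
            sentence=item["sentence"].strip(),
        )
        if rec.end <= rec.start:          # basic sanity check on boundaries
            raise ValueError(f"Empty or inverted segment in {rec.video_id}")
        records.append(rec)
    return records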

2. Annotation Schema and Multiclass Protocols

UCA supports label granularities suitable for both binary anomaly detection and multiclass event recognition:

  • Binary Labels (Normal/Abnormal): Segment assignment flags a frame/interval as either normal ($0$) or anomalous ($1$) (Gutiérrez et al., 20 Oct 2025, Boekhoudt et al., 2021).
  • Multiclass Categories: Each anomalous interval is further associated with one of 13 behavioral classes or "normal," enabling multiclass training and evaluation. In some AI-augmented setups, labels are distilled to a 5-class schema for annotation efficiency (normal activity, fighting, traffic accident, robbery/burglary, other abnormal) (Gutiérrez et al., 20 Oct 2025); a mapping sketch appears at the end of this section.
  • Event Description: Sentences include detailed context, actor roles, and object descriptions, facilitating multimodal tasks such as temporal sentence grounding, video captioning, and anomaly reasoning (Yuan et al., 2023, Chen et al., 13 Feb 2025).

An example JSON-style annotation record:

{
  "video_id": 0456,
  "label": "Robbery",
  "start_time": 174.6,
  "end_time": 190.2,
  "sentence": "A masked assailant enters the store from the left, rushes to the counter, and threatens the cashier with a handgun. The cashier surrenders the money. This occurs between 174.6 and 190.2 seconds."
}
Editor's term: "Event annotation record."
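
Building on the record above, the sketch below derives binary and distilled 5-class labels from the 14 original categories, following the protocol bullets earlier in this section. The exact grouping of classes under "other abnormal" is an assumption, not a specification from the cited papers.

# Sketch: derive binary and distilled 5-class labels from a record's category.
UCF_CLASSES = [
    "Abuse", "Arrest", "Arson", "Assault", "Burglary", "Explosion", "Fighting",
    "RoadAccident", "Robbery", "Shooting", "Shoplifting", "Stealing",
    "Vandalism", "Normal",
]

# Distilled 5-class schema: normal activity, fighting, traffic accident,
# robbery/burglary, other abnormal (grouping of remaining classes assumed).
FIVE_CLASS = {
    "Normal": "normal activity",
    "Fighting": "fighting",
    "RoadAccident": "traffic accident",
    "Robbery": "robbery/burglary",
    "Burglary": "robbery/burglary",
}

def binary_label(category: str) -> int:
    """1 = anomalous segment, 0 = normal."""
    return 0 if category == "Normal" else 1

def five_class_label(category: str) -> str:
    return FIVE_CLASS.get(category, "other abnormal")

assert binary_label("Robbery") == 1
assert five_class_label("Robbery") == "robbery/burglary"
assert five_class_label("Shoplifting") == "other abnormal"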

3. Annotation Techniques: Human, AI-Assisted, and Hybrid Workflows

UCA utilizes both human-centric and AI-augmented annotation protocols:

  • Manual GUI-Based Annotation: Annotators employ custom timeline GUIs to set in/out points at frame granularity. Quality is ensured by redundant review and harmonization of phrasing (Maqsood et al., 2021, Yuan et al., 2023).
  • Human-in-the-Loop (HITL) Pipeline: AI-powered zero-shot pre-annotations generated via CLIP ViT-32 encoder (cross-modal image-text similarity) pre-fill likely anomalous segments. Human annotators then verify, merge, relabel, or delete suggestions in a single-iteration scheme, with final correction producing gold annotations (Gutiérrez et al., 20 Oct 2025).
  • Zero-Shot Model Details: Each frame's embedding is scored against "normal event" and "abnormal event" text prototypes; a frame is classified if $\max(s_\text{normal}, s_\text{abnormal}) \geq \tau = 0.30$. Smoothing is applied by majority vote over 5-frame windows, adjacent boundaries are merged, and key frames are selected for visual anchoring (Gutiérrez et al., 20 Oct 2025); see the sketch after this list.
  • Linguistic Generation: Annotators/baselines generate natural language sentences for each event segment as both human and automatic description (Yuan et al., 2023, Chen et al., 13 Feb 2025).
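
The following is a minimal sketch of the zero-shot pre-annotation step, assuming frame and prompt embeddings have already been extracted with a CLIP-style encoder. The threshold and 5-frame majority vote follow the description above; variable names and the segment-merging details are illustrative assumptions.

# Sketch of zero-shot pre-annotation: score frames against text prototypes,
# threshold, smooth with a 5-frame majority vote, and merge into segments.
# Assumes `frame_emb` (N x D) and `text_emb` (2 x D, order: normal, abnormal)
# are L2-normalized embeddings from a CLIP-style encoder.
import numpy as np

TAU = 0.30          # similarity threshold from the HITL protocol
WINDOW = 5          # majority-vote window (frames)

def propose_segments(frame_emb: np.ndarray, text_emb: np.ndarray, fps: float):
    sims = frame_emb @ text_emb.T                 # (N, 2) cosine similarities
    confident = sims.max(axis=1) >= TAU           # keep only confident frames
    abnormal = (sims[:, 1] > sims[:, 0]) & confident

    # Majority vote in a sliding 5-frame window to suppress flicker.
    pad = WINDOW // 2
    padded = np.pad(abnormal.astype(int), pad, mode="edge")
    smoothed = np.array([
        padded[i:i + WINDOW].sum() > WINDOW // 2
        for i in range(len(abnormal))
    ])

    # Merge consecutive abnormal frames into (start_sec, end_sec) proposals.
    segments, start = [], None
    for i, flag in enumerate(smoothed):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(smoothed) / fps))
    return segments   # handed to annotators for verify/merge/relabel/delete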

AI-boosted annotation yields a 23.1% reduction in annotation time (mean), with median time savings of 35% for 72% of annotators, and greater annotation homogeneity (Adjusted Mutual Information: 0.62→0.67; Silhouette Score: 0.28→0.41) (Gutiérrez et al., 20 Oct 2025).

4. Downstream Tasks, Benchmarks, and Evaluation Metrics

UCA enables extensive benchmarking on classical and multimodal tasks:

  • Anomaly Detection: Frame-level binary or multiclass classification via 3D ConvNets (C3D, I3D), RNNs (MPED-RNN for skeleton motion), or meta-learning MIL schemes (Maqsood et al., 2021, Boekhoudt et al., 2021, Park et al., 2022). Metrics include ROC-AUC, multiclass accuracy, and action-specific AUROCs (e.g., Assault: 0.7487, Road Accident: 0.4171) (Boekhoudt et al., 2021).
  • Multimodal Tasks:
    • Temporal Sentence Grounding in Videos (TSGV): Predict temporal boundaries from text queries; metrics: Recall@K@IoU thresholds (Yuan et al., 2023).
    • Video Captioning/Dense Captioning: Generate natural language descriptions for events; metrics: BLEU@n, METEOR, ROUGE-L, CIDEr, SODA_c (Yuan et al., 2023).
    • Multimodal Anomaly Detection: Fuse visual (I3D features) and caption-derived text features; micro-averaged AUC (e.g., up to 85.3%) (Yuan et al., 2023).
    • MLLM-Based QA Tasks (UCVL): Detection (TF), Classification (AC), Temporal Grounding (TG), Multiple Choice (MCQ), Event Description (ED/AD); metrics: accuracy, top-3 AC, IoU, and rubric-based open-ended assessment (Chen et al., 13 Feb 2025).
  • Inter-Annotator Agreement and Coherence: Cohen's $\kappa$ for label agreement, Adjusted Mutual Information, and semantic Silhouette Score for clustering consistency (Gutiérrez et al., 20 Oct 2025); a computation sketch appears at the end of this section.

| Task | Metric Type | Representative Value |
|---|---|---|
| Annotation Time Reduction | % savings | 23.1% mean; 35% median |
| Inter-Annotator Homogeneity | Adjusted MI | Manual: 0.62; AI-assisted: 0.67 |
| Event Clustering Coherence | Silhouette Score | Manual: 0.28; AI-assisted: 0.41 |
| Anomaly Detection (MAD) | AUC | Visual: 83.1%; Multimodal: 85.3% |

No significant correlation is observed between time savings and clip duration or class complexity, indicating broad applicability of the workflow (Gutiérrez et al., 20 Oct 2025).
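
A minimal sketch of the agreement and coherence metrics above, computed with scikit-learn under the assumption that two annotators' per-segment labels and sentence embeddings of the descriptions are available; the embedding model and the exact protocol of the cited study may differ.

# Sketch: inter-annotator agreement and semantic clustering coherence.
# `labels_a`, `labels_b`: per-segment class labels from two annotators.
# `sentence_emb`: (n_segments, d) embeddings of the description sentences
# (from any sentence encoder); `labels_a` doubles as the cluster assignment.
import numpy as np
from sklearn.metrics import (
    adjusted_mutual_info_score,
    cohen_kappa_score,
    silhouette_score,
)

def annotation_consistency(labels_a, labels_b, sentence_emb):
    kappa = cohen_kappa_score(labels_a, labels_b)          # label agreement
    ami = adjusted_mutual_info_score(labels_a, labels_b)   # homogeneity
    # Semantic coherence: how well descriptions cluster by assigned class.
    sil = silhouette_score(sentence_emb, labels_a, metric="cosine")
    return {"cohen_kappa": kappa, "adjusted_mi": ami, "silhouette": sil}

# Example with toy data:
rng = np.random.default_rng(0)
labels_a = rng.integers(0, 5, size=50)
labels_b = labels_a.copy()
labels_b[:5] = (labels_b[:5] + 1) % 5          # simulate a few disagreements
emb = rng.normal(size=(50, 32))
print(annotation_consistency(labels_a, labels_b, emb))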

5. Technical Foundations and Model Architectures

UCA supports a suite of architectures for supervised and multimodal learning:

  • 3D ConvNet Pipeline: Frame-level labeled cubes (16 frames, $170\times170$) feed into a fine-tuned C3D network, with normalized convolutional layers and two-stage batch normalization. Cross-entropy loss on 14 classes guides training; spatial augmentation triples data volume (Maqsood et al., 2021).
  • Cross-Modal Embedding Networks: CLIP (ViT-32) encoder pairs visual and text features, enabling zero-shot anomaly segment proposal (Gutiérrez et al., 20 Oct 2025).
  • Meta-Learning MIL Detector: MIL ranking loss with a MAML meta-update that initializes the anomaly detector for faster adaptation to unseen subclasses (Park et al., 2022); a loss sketch follows this list.
  • Multimodal (MLLM) Benchmarks: LLaVA-OneVision, GPT-4o used for open-ended question-answering, language generation, and rubric evaluation (Chen et al., 13 Feb 2025).
  • Skeleton-Based RNNs: MPED-RNN reconstructs and predicts pose trajectories (AlphaPose, YOLOv3-spp, PoseFlow) to encode movement anomalies on the HR-Crime human-centric subset (Boekhoudt et al., 2021).
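
The following is a minimal sketch of a standard MIL ranking loss of the kind referenced above (hinge ranking over bag maxima with temporal-smoothness and sparsity terms). The margin and regularization weights are common defaults rather than values from (Park et al., 2022), and the MAML meta-update that wraps this loss in inner/outer optimization loops is omitted.

# Sketch of a multiple-instance ranking loss over bag-level segment scores.
# `scores_abnormal`, `scores_normal`: (num_segments,) anomaly scores in [0, 1]
# for one anomalous video (positive bag) and one normal video (negative bag).
import torch

def mil_ranking_loss(scores_abnormal: torch.Tensor,
                     scores_normal: torch.Tensor,
                     lambda_smooth: float = 8e-5,
                     lambda_sparse: float = 8e-5) -> torch.Tensor:
    # Hinge ranking: the top segment of the anomalous bag should outscore
    # the top segment of the normal bag by a margin of 1.
    ranking = torch.relu(1.0 - scores_abnormal.max() + scores_normal.max())
    # Temporal smoothness: penalize abrupt changes between adjacent segments.
    smooth = ((scores_abnormal[1:] - scores_abnormal[:-1]) ** 2).sum()
    # Sparsity: anomalies should occupy few segments of the positive bag.
    sparse = scores_abnormal.sum()
    return ranking + lambda_smooth * smooth + lambda_sparse * sparse

# Usage with dummy scores from some segment-level detector:
loss = mil_ranking_loss(torch.rand(32), torch.rand(32))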

Data augmentation includes horizontal/vertical flips (without temporal jittering), tripling effective training samples (Maqsood et al., 2021).
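
As an illustration of this input pipeline, the sketch below slices a video into 16-frame, 170×170 cubes and triples the sample count with horizontal and vertical flips; the array layout and slicing strategy are assumptions, not the authors' exact preprocessing.

# Sketch: build 16-frame training cubes and triple them via spatial flips.
# `frames`: (T, H, W, C) uint8 video frames already resized to 170x170.
import numpy as np

CLIP_LEN = 16

def make_cubes(frames: np.ndarray) -> np.ndarray:
    """Split a video into non-overlapping 16-frame cubes."""
    n = frames.shape[0] // CLIP_LEN
    return frames[:n * CLIP_LEN].reshape(n, CLIP_LEN, *frames.shape[1:])

def augment(cubes: np.ndarray) -> np.ndarray:
    """Original + horizontal flip + vertical flip (no temporal jittering)."""
    h_flip = cubes[:, :, :, ::-1, :]   # flip width axis
    v_flip = cubes[:, :, ::-1, :, :]   # flip height axis
    return np.concatenate([cubes, h_flip, v_flip], axis=0)

video = np.zeros((160, 170, 170, 3), dtype=np.uint8)   # dummy 160-frame clip
cubes = make_cubes(video)          # (10, 16, 170, 170, 3)
train = augment(cubes)             # (30, 16, 170, 170, 3): 3x the samples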

6. Integration into Benchmarks, QA Frameworks, and Challenges

UCA annotations have been reorganized for advanced QA tasks and integrated into benchmarks such as UCVL:

  • Unified QA Benchmark: Each video is mapped to a record $(v, y_v, t_v^{(s)}, t_v^{(e)}, \mathrm{summary}_v)$, supporting anomaly detection, top-3 classification, timestamp prediction, multi-choice queries, and open-ended event description (Chen et al., 13 Feb 2025).
  • Evaluation Aggregation: A weighted sum of accuracy, IoU, and description-quality scores quantifies model performance across tasks (a computation sketch follows this list):

$$\mathrm{Total} = 0.15\,S_{\mathrm{TF}} + 0.10\,S_{\mathrm{AC}} + 0.20\,S_{\mathrm{TG}} + 0.15\,S_{\mathrm{ED}} + 0.15\,S_{\mathrm{AD}} + 0.25\,S_{\mathrm{MCQ}}$$

  • Data Quality: Corrupted videos are excluded; labels and descriptions are harmonized. Each category remains as defined in the original (no loss of anomaly granularity) (Chen et al., 13 Feb 2025).
  • Identified Challenges: Surveillance video characteristics—low resolution, occlusions, long idle intervals—degrade general-purpose model performance. Fine-grained semantics, ROIs, and long-range dependencies challenge temporal modeling. Handling rare anomalies and making use of weak/noisy labels remain open problems (Yuan et al., 2023).
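
A small sketch of the aggregation above, together with the temporal IoU that underlies the grounding term; per-task scores are assumed to be normalized to [0, 1], and the helper names are illustrative rather than the benchmark's official API.

# Sketch: temporal IoU for the grounding score and the UCVL weighted total.
WEIGHTS = {"TF": 0.15, "AC": 0.10, "TG": 0.20, "ED": 0.15, "AD": 0.15, "MCQ": 0.25}

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def total_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-task scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[task] * scores[task] for task in WEIGHTS)

# Example: a prediction of (170.0, 195.0) against the (174.6, 190.2) segment.
s_tg = temporal_iou((170.0, 195.0), (174.6, 190.2))   # ~0.62
print(total_score({"TF": 1.0, "AC": 0.9, "TG": s_tg, "ED": 0.7, "AD": 0.6, "MCQ": 0.8}))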

7. Extensions, Limitations, and Recommendations

UCA enables a broad research agenda but each protocol has distinct limitations:

  • Scope and Granularity: Simplified annotation schemas may not capture all 13 anomalous categories or spatial bounding boxes. Overlapping events, subtle actions, and crowded scenes present annotation ambiguity (Gutiérrez et al., 20 Oct 2025, Yuan et al., 2023).
  • Model Adaptivity: Single-iteration AI pre-annotation workflows lack online adaptation or iterative fine-tuning. Annotator corrections do not retrain the underlying model (Gutiérrez et al., 20 Oct 2025).
  • Systematic Biases: Zero-shot CLIP models may introduce biases, especially under ambiguous scenes if annotators are not vigilant (Gutiérrez et al., 20 Oct 2025).
  • Best Practices: Use high-confidence pre-annotations, open-source annotation interfaces (e.g. Label Studio), annotator training, and content filtering for sensitive footage (Gutiérrez et al., 20 Oct 2025).
  • Future Directions:
    • Incorporate active learning loops for model refinement.
    • Extend to spatial and spatio-temporal localization (bounding boxes/tracks).
    • Deploy multimodal frameworks (audio, transcripts, visual features).
    • Pretrain large video-LLMs on surveillance-specific data.
    • Evaluate on professional populations and rare events, expanding robustness and generalization (Yuan et al., 2023, Gutiérrez et al., 20 Oct 2025).

These protocols position UCA as a central resource for advancing anomaly detection, video-language understanding, and multimodal reasoning in surveillance AI.
