Evaluation of Vision-LLMs in Surveillance Video

Published 27 Oct 2025 in cs.CV (arXiv:2510.23190v1)

Abstract: The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability of an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-LLMs (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision-LLMs can act as spatially-grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions. Few-shot exemplars can improve accuracy for some models but may increase false positives, and privacy filters, especially full-body GAN transforms, introduce inconsistencies that degrade accuracy. These results chart where current vision-LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). Looking forward, we outline concrete paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This positions zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding. Our implementation for evaluating VLMs is publicly available at: https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition

Summary

  • The paper introduces a novel zero-shot anomaly detection method by reframing video analysis as a language-driven inference task.
  • The methodology generates textual descriptions from video frames and applies Natural Language Inference for flexible, prompt-based classification.
  • Results highlight benefits from few-shot prompting while noting that privacy filters can degrade accuracy due to video inconsistencies.

Introduction

The paper "Evaluation of Vision-LLMs in Surveillance Video" (2510.23190) addresses the critical challenge of detecting anomalous or criminal events from vast surveillance video datasets. Due to the overwhelming volume of video data, manual monitoring becomes impractical, necessitating advanced automated systems capable of real-world video understanding. This work investigates the capabilities of Vision-LLMs (VLMs) for zero-shot anomaly detection, assessing their ability to recognize unexpected events through spatial reasoning grounded in language descriptions.

Spatial Reasoning in Vision-LLMs

The authors propose a novel approach in which small, pre-trained vision-LLMs serve as anomaly detectors by transforming video content into textual descriptions and leveraging textual entailment for label scoring. Such models handle spatially salient events well but struggle with complex scenarios involving noisy spatial cues and identity obfuscation. This offers a promising pathway for zero-shot anomaly detection, especially in settings where labeled data is sparse.

Figure 1: Impact of few-shot prompting on model accuracy.

Methodology

The core methodology reconfigures anomaly classification as a language-grounded reasoning task rather than conventional pixel-label mapping. The framework operates in two main stages: generating textual descriptions from video input and applying zero-shot classification through Natural Language Inference (NLI). The innovation lies in reframing video understanding so that it bypasses extensive model retraining, relying instead on the inherent world knowledge of LLMs.
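A minimal sketch of these two stages, with a stubbed caption step and a toy keyword scorer standing in for a real entailment model (in practice one would use an actual vision-LLM for captioning and an NLI classifier for scoring; all names here are illustrative):

```python
# Toy sketch of the two-stage pipeline: (1) caption the clip,
# (2) score candidate labels against the caption. A real system
# would use a vision-LLM for step 1 and an NLI (textual entailment)
# model for step 2; this stand-in is purely keyword-based.

KEYWORDS = {
    "Fighting": {"fists", "shove", "punches"},
    "Stealing": {"grabs", "snatches", "bag"},
    "Normal": set(),
}

def describe_clip(frames):
    # Placeholder for the vision-LLM captioning step.
    return "Two people shove each other and swing fists near a car"

def toy_entail(description, label):
    # Stand-in for an NLI entailment score: keyword overlap.
    return len(KEYWORDS[label] & set(description.lower().split()))

def classify(frames, labels):
    description = describe_clip(frames)
    scores = {label: toy_entail(description, label) for label in labels}
    return max(scores, key=scores.get), description

label, caption = classify(frames=None, labels=list(KEYWORDS))
print(label)  # "Fighting": the caption mentions "shove" and "fists"
```

Because classification happens purely in text, swapping the caption model or extending the label set requires no retraining, which is exactly the modularity the paper emphasizes.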

Derivation and Practical Implications

The framework operates on a sequence of RGB frames $X = (x_t)_{t=1}^{T}$, generating descriptions with a vision-LLM $F_{\theta}$ and scoring labels with an NLI classifier $g_{\phi}$. Practical advantages include true zero-shot flexibility, where new anomaly types can simply be added to the textual label set, and a modular architecture permitting seamless model upgrades.
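Restated in symbols consistent with this notation (our restatement, not a quotation from the paper), the pipeline composes the description step and the entailment-scoring step:

```latex
d = F_{\theta}(X), \qquad
\hat{y} = \arg\max_{c \in \mathcal{C}} \, g_{\phi}(d, c),
\qquad X = (x_t)_{t=1}^{T}
```

where $\mathcal{C}$ is the textual label set; adding a new anomaly type amounts to adding one more label $c$ to $\mathcal{C}$.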

Figure 2: Images from UCF-Crime dataset used for few-shot prompting.

Experimental Setup

Experiments utilize the benchmark datasets UCF-Crime and RWF-2000, evaluating models' zero-shot capabilities under varied conditions, including few-shot prompting and privacy filters. Metrics such as class-averaged Top-1 accuracy reveal how well the models adapt and generalize across these conditions. The experimental design ensures that any performance differences arise from the prompting or privacy transformations rather than extraneous factors.
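Class-averaged Top-1 accuracy can be computed as the mean of per-class accuracies, a common choice for imbalanced datasets like UCF-Crime (the paper's exact definition may differ); a minimal sketch:

```python
from collections import defaultdict

def class_averaged_top1(y_true, y_pred):
    # Mean of per-class Top-1 accuracies, so rare classes count as
    # much as frequent ones. Assumed definition; the paper may
    # specify details differently.
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# "Normal" dominates the sample, but both classes weigh equally:
truth = ["Fighting", "Normal", "Normal", "Normal"]
preds = ["Fighting", "Normal", "Fighting", "Normal"]
print(class_averaged_top1(truth, preds))  # (1/1 + 2/3) / 2 = 0.8333...
```

Plain (micro) accuracy on the same data would be 3/4, which illustrates why class averaging matters when anomalous clips are rare.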

Figure 3: Example A.

Results

Prompting Effects

Analysis of prompting techniques demonstrates that few-shot examples can enhance accuracy but may induce higher false-positive rates. This trade-off underscores the difficulty of tuning model sensitivity while maintaining high precision in anomaly detection.

Figure 4: All classes compared across prompting experiments; the few-shot examples include Fighting, RoadAccidents, Shooting, and Stealing.

Privacy Filters

Privacy-preserving filters are critical for real-world deployment yet degrade accuracy, with GAN-based full-body anonymization causing the most significant disruption because it introduces frame-to-frame inconsistencies. Some models, such as VideoLLaMA-3, show potential for reducing false positives under these conditions, pointing to improved video consistency as a direction for future research.

Conclusions

The investigation into small vision-LLMs for zero-shot anomaly detection elucidates key considerations in model prompting and privacy-preservation strategies. While such models show efficacy in simple tasks such as fight detection, there remains substantial room for improvement in handling complex scenarios autonomously. Future work may emphasize enhancing privacy methods for increased temporal consistency and optimizing model configurations to balance sensitivity and precision effectively. This work contributes a foundational understanding of the applicability and limitations of vision-LLMs in dynamic, real-world video analysis settings.


Explain it Like I'm 14

What this paper is about

This paper asks a big question: Can small AI models that “look” at videos and “talk” about them (called vision–LLMs, or VLMs) spot unusual or criminal events in security camera footage without any extra training? The authors test these AIs on real surveillance datasets and also check how privacy tools (like blurring faces) affect what the AIs can recognize.

What questions did the researchers ask?

  • Can current vision–LLMs recognize rare, unusual actions (like fighting or theft) in videos they’ve never been trained on, just by using smart instructions (prompts) and a few examples?
  • If we hide people’s identities with privacy filters (face blur or swapping faces/bodies), does the AI still recognize the actions correctly?

How did they do it?

Turning video into words

Instead of training a special model for each crime (which needs lots of labeled data), the team uses a “zero-shot” approach—meaning no extra training. They:

  1. Feed a short video clip into a vision–LLM.
  2. Ask the model for a short description of what’s happening (for example: “Two people shove and swing fists near a car.”).

Think of it like having the AI write a quick caption for each video.

Matching words to action labels

Next, they check whether the caption fits one of the target action labels (like “Fighting,” “Robbery,” or “Normal”). They use a language tool (an NLI classifier) that answers: “Does this description support the label?” The label with the strongest match is picked.

Example: If the caption says “Two people throw punches,” it should strongly match “Fighting.”

Prompts and “few-shot” examples

  • Prompts are the instructions they give the AI (unguided vs guided). Guided prompts list the possible classes and ask the AI to choose one.
  • “Few-shot prompting” means showing the AI a handful of labeled examples first (like a picture and label for “Shooting” or “Stealing”) to set the pattern. This sometimes helps the AI make better choices.
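As a toy illustration (the paper's actual prompt wording is not reproduced here, so all strings below are hypothetical), guided and few-shot prompts could be assembled like this:

```python
# Hypothetical prompt construction in the spirit of the "guided" and
# "few-shot" conditions described above; wording is illustrative only.

CLASSES = ["Fighting", "Robbery", "Shooting", "Stealing", "Normal"]

UNGUIDED = "Describe what is happening in this video clip."

GUIDED = (
    "Describe what is happening in this video clip, then pick exactly "
    "one label from: " + ", ".join(CLASSES) + "."
)

def few_shot_prefix(examples):
    # examples: (caption, label) pairs shown before the query clip.
    return "\n".join(f"Example: {cap} -> Label: {lab}" for cap, lab in examples)

prompt = few_shot_prefix([("Two men exchange punches.", "Fighting")]) + "\n" + GUIDED
print(prompt)
```

The guided prompt constrains the answer space to the known classes, while the few-shot prefix sets the expected pattern, which is the mechanism the results section credits for both the accuracy gains and the extra false alarms.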

Privacy filters

Because surveillance often needs to protect people’s identities, they tested three privacy methods:

  • Blur faces.
  • Replace faces with AI-generated faces (a “GAN,” which is a kind of image-generation tool).
  • Replace the full body with an AI-generated version.

These methods hide identity, but they can also change visual details the AI uses to understand actions—especially if the replacements flicker or look inconsistent across frames.

What they tested on

  • Datasets: UCF-Crime (13 types of unusual activities plus “Normal”) and RWF-2000 (videos labeled “Fighting” or “Normal”).
  • Models: Four small, open-source vision–LLMs (4–8 billion parameters).
  • Metrics: Mainly Top-1 accuracy (how often the top guess is correct) and false positives (how often the model raises an alarm on normal videos).

What did they find?

  • Few-shot examples can help—but can also backfire:
    • Showing a few labeled examples before testing sometimes boosted accuracy for certain models (especially Gemma-3 and NVILA).
    • However, it also often increased false alarms, meaning the AI was more likely to call a normal video “suspicious.”
  • Privacy filters reduce accuracy:
    • Blurred or AI-replaced faces caused small to moderate drops in accuracy.
    • Full-body AI replacement (GAN full-body) hurt the most, likely because it made people look inconsistent from frame to frame, which confuses motion and action cues.
    • In many cases, false alarms went up when privacy filters were used.
  • Works better for obvious actions:
    • The models were more reliable on easy-to-spot events with clear movement and space cues (like fighting).
    • They struggled when the scene was noisy, the cues were subtle, or when identity-hiding changed the look or motion too much.
  • Models aren’t all the same:
    • Different models reacted differently to prompts and privacy filters. For example, one model showed fewer false alarms with certain GAN filters, while others got worse.

Why this matters: In real life, too many false alarms waste operators’ time, and any privacy protection should not break the system’s ability to spot danger.

Why this research matters and what could come next

This study shows that small, off-the-shelf vision–LLMs can be a helpful starting point for understanding surveillance video without extra training. That’s useful when you have tons of cameras and limited labeled data. But there are trade-offs:

  • Few-shot prompting can raise accuracy but also raise false alarms.
  • Stronger privacy protections (especially full-body replacements) currently make recognition harder.

The authors suggest practical ways to improve without retraining:

  • Better, structure-aware prompts that guide the AI to focus on who is where and doing what.
  • A simple memory across video clips so the AI tracks actions over time, not just frame by frame.
  • Using extra hints like scene graphs or 3D body poses to keep the action’s geometry clear.
  • Privacy tools that hide identity but keep the motion and body layout consistent.
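The "simple memory across video clips" idea above could be as light as carrying the last few clip descriptions into the next prompt. This is entirely a hypothetical design sketch, not the authors' implementation:

```python
from collections import deque

class ClipMemory:
    # Keep the most recent clip descriptions and prepend them as
    # context for the next clip's prompt, so actions that span clips
    # are not judged frame-by-frame in isolation. Hypothetical sketch.
    def __init__(self, max_clips=3):
        self.buffer = deque(maxlen=max_clips)

    def add(self, description):
        self.buffer.append(description)

    def as_context(self):
        if not self.buffer:
            return ""
        lines = [f"Earlier clip {i + 1}: {d}" for i, d in enumerate(self.buffer)]
        return "\n".join(lines) + "\nCurrent clip: "

mem = ClipMemory(max_clips=2)
mem.add("A man loiters near a parked car.")
mem.add("The same man tries the car door handle.")
print(mem.as_context())
```

A bounded buffer keeps the prompt short while still letting the model connect, say, loitering in one clip with a break-in attempt in the next.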

In short, this approach is promising for straightforward cases and could help human operators keep up with massive video streams. With smarter prompts, lightweight memory, and improved privacy methods that keep action cues intact, these zero-shot, language-grounded systems could become more reliable building blocks for real-world, privacy-aware video understanding. The team also shared their evaluation code, which helps others test and improve these ideas.
