Aligning Effective Tokens with Video Anomaly in Large Language Models (2508.06350v1)
Abstract: Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal LLMs (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where redundant information often leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision LLMs (VLMs) and LLMs, we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.
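The abstract does not detail how SETS operates, but a minimal sketch of one plausible reading, keeping only the patch tokens that deviate most from a frame's average appearance to exploit the spatial sparsity of anomalies, is given below. The function name, the scoring rule, and the keep ratio are assumptions for illustration, not the authors' implementation.

```python
# Speculative sketch (not the paper's code): score each visual token by its
# distance from the frame's mean token, then keep only the top-scoring ones,
# so redundant background patches are dropped before reaching the LLM.
import torch


def select_effective_tokens(frame_tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the most salient patch tokens of a single frame.

    frame_tokens: (num_patches, dim) patch embeddings from a vision encoder.
    Returns a (k, dim) tensor of selected tokens, k = keep_ratio * num_patches.
    """
    num_patches, _ = frame_tokens.shape
    k = max(1, int(num_patches * keep_ratio))
    # Distance from the frame's mean token: background patches score low,
    # spatially sparse salient regions score high (assumed criterion).
    mean_token = frame_tokens.mean(dim=0, keepdim=True)
    scores = (frame_tokens - mean_token).norm(dim=-1)
    top_idx = scores.topk(k).indices
    return frame_tokens[top_idx]


if __name__ == "__main__":
    # Toy usage: 196 patch tokens of dimension 768, keep the top 25%.
    tokens = torch.randn(196, 768)
    selected = select_effective_tokens(tokens)
    print(selected.shape)  # torch.Size([49, 768])
```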