VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection (2308.11681v3)

Published 22 Aug 2023 in cs.CV and cs.MM

Abstract: The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features are released at https://github.com/nwpu-zxr/VadCLIP.

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

This paper presents VadCLIP, a paradigm that leverages the pre-trained CLIP model for weakly supervised video anomaly detection (WSVAD). The authors address the challenge of transferring the capabilities of vision-language models, originally trained on image-text pairs, to the more nuanced task of video anomaly detection.

The key innovation of VadCLIP lies in its dual branch structure, which exploits both coarse-grained and fine-grained visual representations. One branch handles visual features for traditional binary classification, while the other employs vision-language alignment to harness semantic associations between video content and textual descriptions. This approach is intended to maximize the utility of CLIP's learned knowledge without further pre-training or fine-tuning, a significant departure from conventional WSVAD methods that predominantly rely on feature extraction and binary classification paradigms.
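To make the dual-branch idea concrete, the following minimal PyTorch sketch scores frames in two ways: a coarse binary head over visual features and a fine-grained similarity against frozen text embeddings of class prompts. The 512-d feature size, the 14-class count, and all layer names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchHead(nn.Module):
    """Hypothetical sketch of the dual-branch design described above.

    - Coarse branch: maps each frame feature to a binary anomaly score.
    - Fine branch: scores each frame against frozen CLIP text embeddings
      of the class prompts (vision-language alignment).
    Dimensions assume CLIP ViT-B/16-style 512-d features (an assumption).
    """

    def __init__(self, dim: int = 512, num_classes: int = 14):
        super().__init__()
        self.coarse_head = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1)
        )
        # Placeholder for frozen CLIP text embeddings of the class prompts.
        self.register_buffer(
            "class_embeds", F.normalize(torch.randn(num_classes, dim), dim=-1)
        )

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, num_frames, dim) frozen CLIP frame features
        coarse_scores = torch.sigmoid(self.coarse_head(frame_feats)).squeeze(-1)  # (B, T)
        v = F.normalize(frame_feats, dim=-1)
        fine_logits = v @ self.class_embeds.t()  # (B, T, num_classes) frame-class similarity
        return coarse_scores, fine_logits

# Usage: scores for a batch of 2 clips with 32 frames each
coarse, fine = DualBranchHead()(torch.randn(2, 32, 512))
print(coarse.shape, fine.shape)  # torch.Size([2, 32]) torch.Size([2, 32, 14])
```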

Empirical results substantiate the effectiveness of VadCLIP. In experiments conducted on the XD-Violence and UCF-Crime datasets, VadCLIP achieved an average precision (AP) of 84.51% and an area under the curve (AUC) of 88.02%, respectively, outperforming state-of-the-art methods by notable margins. These improvements underscore VadCLIP's advantage over both weakly supervised and semi-supervised techniques by fully leveraging cross-modal associations.

From a theoretical standpoint, VadCLIP represents a meaningful step towards domain adaptation in the video context, where temporal dependencies and semantic alignments play a critical role. Noteworthy components contributing to the system's performance include the Local-Global Temporal Adapter (LGT-Adapter) for capturing temporal relations and novel prompt mechanisms that effectively bridge the visual-language gap. The learnable and anomaly-focused visual prompts dynamically refine class embeddings with contextual information, thereby improving the model's discriminative power in distinguishing anomalies.
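The paper's exact LGT-Adapter layers are not reproduced here; the sketch below only illustrates the local-global composition the summary describes, pairing a depth-wise temporal convolution (local relations) with a single self-attention layer (global relations) and a residual connection that preserves the frozen CLIP features. All layer choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LGTAdapterSketch(nn.Module):
    """Illustrative local-global temporal adapter (not the paper's exact layers)."""

    def __init__(self, dim: int = 512, kernel_size: int = 3, heads: int = 8):
        super().__init__()
        # Depth-wise 1D convolution over time models local temporal relations.
        self.local = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        # One self-attention layer over all frames models global relations.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) frozen CLIP frame features
        local = self.local(x.transpose(1, 2)).transpose(1, 2)  # local temporal smoothing
        glob, _ = self.attn(local, local, local)               # global frame-to-frame relations
        return self.norm(x + glob)                             # residual keeps CLIP semantics

adapted = LGTAdapterSketch()(torch.randn(2, 32, 512))
print(adapted.shape)  # torch.Size([2, 32, 512])
```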

The MIL-Align mechanism further optimizes vision-language alignment under weak supervision, selecting the most confident frame-to-text matches so that video-level labels can supervise frame-level alignment (a sketch follows). This methodological shift not only extends CLIP's capabilities to the video domain but also sets a precedent for similar adaptations across other modalities.
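A hedged sketch of how a MIL-style alignment objective can be built from the fine-grained frame-class similarities: for each class, the top-k frame similarities are averaged into a video-level score and trained against the video-level label. The top-k pooling and temperature values are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def mil_align_loss(fine_logits: torch.Tensor, video_labels: torch.Tensor,
                   k: int = 8, temperature: float = 0.07) -> torch.Tensor:
    """MIL-style alignment sketch under weak (video-level) supervision.

    fine_logits: (batch, num_frames, num_classes) frame-to-text similarities.
    video_labels: (batch,) video-level class indices.
    """
    topk = fine_logits.topk(k, dim=1).values       # (B, k, C) most confident frames per class
    video_logits = topk.mean(dim=1) / temperature  # (B, C) video-level alignment scores
    return F.cross_entropy(video_logits, video_labels)

loss = mil_align_loss(torch.randn(2, 32, 14), torch.tensor([0, 5]))
print(loss.item())
```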

Looking ahead, the insights from this work open new avenues for enhancing video anomaly detection systems by integrating state-of-the-art vision-language models. Such advancements could contribute significantly to the development of intelligent surveillance and video analysis systems with improved detection accuracy and reduced dependency on extensive labeled datasets.

Future research could explore the implications of leveraging multi-modal data in open-set conditions or incorporating additional modalities, such as audio, for a more holistic understanding of video contexts necessary for precise anomaly detection. This line of investigation will be crucial for further advancing the potential of pre-trained models in complex, real-world anomaly detection scenarios.

Authors (7)
  1. Peng Wu (119 papers)
  2. Xuerong Zhou (3 papers)
  3. Guansong Pang (82 papers)
  4. Lingru Zhou (2 papers)
  5. Qingsen Yan (33 papers)
  6. Peng Wang (831 papers)
  7. Yanning Zhang (170 papers)
Citations (41)