MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection (2104.01633v1)

Published 4 Apr 2021 in cs.CV

Abstract: Weakly supervised video anomaly detection (WS-VAD) is to distinguish anomalies from normal events based on discriminative representations. Most existing works are limited in insufficient video representations. In this work, we develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations with only video-level annotations. In particular, MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder that aims to automatically focus on anomalous regions in frames while extracting task-specific representations. Moreover, we adopt a self-training scheme to optimize both components and finally obtain a task-specific feature encoder. Extensive experiments on two public datasets demonstrate the efficacy of our method, and our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC 94.83% on ShanghaiTech.


Analyzing the MIST Framework for Video Anomaly Detection

The paper "MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection" proposes a new approach to weakly supervised video anomaly detection (WS-VAD). Its significance lies in refining the feature encoder efficiently when video-level annotations are the only available labels, which addresses the insufficient video representations inherited from task-agnostic encoders.

Core Contributions and Methodology

MIST (Multiple Instance Self-Training) uses a two-stage approach to produce task-specific feature representations. It consists of two interconnected components: a multiple instance pseudo label generator and a self-guided attention boosted feature encoder (E_SGA). The generator employs a sparse continuous sampling strategy, which yields more reliable clip-level pseudo labels and reduces the noise introduced when a video-level label is assigned directly to every clip.
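The sampling idea can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names (`sample_sub_bags`, `sub_bag_scores`, `mil_ranking_loss`), the rule of drawing one window per equal-length segment, the use of the mean clip score as the sub-bag score, and the hinge margin of 1.0 are all choices made here for clarity.

```python
import random


def sample_sub_bags(num_clips, num_bags, bag_len, rng=None):
    """Return num_bags lists, each holding bag_len consecutive clip indices.

    Assumption: the video is split into num_bags equal segments and one
    window of consecutive clips is drawn per segment, so the windows are
    sparse across the video but continuous inside each sub-bag.
    """
    rng = rng or random.Random(0)
    if num_clips < bag_len:
        raise ValueError("video is shorter than one sub-bag")
    bags = []
    for b in range(num_bags):
        seg_start = b * num_clips // num_bags
        seg_end = (b + 1) * num_clips // num_bags
        # Latest start that keeps the window inside the segment and the video.
        last_start = min(max(seg_start, seg_end - bag_len), num_clips - bag_len)
        first_start = min(seg_start, last_start)
        start = rng.randint(first_start, last_start)
        bags.append(list(range(start, start + bag_len)))
    return bags


def sub_bag_scores(clip_scores, bags):
    """Score each sub-bag as the mean anomaly score of its clips."""
    return [sum(clip_scores[i] for i in bag) / len(bag) for bag in bags]


def mil_ranking_loss(abnormal_scores, normal_scores, margin=1.0):
    """Hinge ranking loss between the top sub-bag of an abnormal and a normal video."""
    return max(0.0, margin - max(abnormal_scores) + max(normal_scores))


if __name__ == "__main__":
    rng = random.Random(0)
    clip_scores = [rng.random() for _ in range(64)]  # stand-in for model outputs
    bags = sample_sub_bags(num_clips=64, num_bags=8, bag_len=4, rng=rng)
    print(sub_bag_scores(clip_scores, bags))
```

Scoring a window of consecutive clips rather than a single clip reflects the intuition that anomalies tend to span several neighbouring clips, so one noisy clip score is less likely to dominate the ranking.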

In the second stage, these pseudo labels are used to refine the feature encoder E_SGA so that it becomes task-specific. The self-guided attention module within E_SGA learns to emphasize anomalous regions in frames, which refines the representations without requiring explicit region-level annotations. The attention mechanism thus serves two purposes: it makes the features more discriminative, and it supports more robust anomaly detection.
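To make the attention step less abstract, here is a minimal sketch of spatial attention applied to a feature map. It is a generic illustration, not the E_SGA architecture, whose layers this summary does not specify: the per-channel projection weights, the sigmoid gate, and the residual form `feature * (1 + attention)` are assumptions, and the names `spatial_attention` and `apply_attention` are hypothetical.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def spatial_attention(feat, weights, bias=0.0):
    """Compute an H x W attention map from a C x H x W feature map.

    feat is a nested list indexed as feat[channel][row][col]; weights holds
    one projection weight per channel (the analogue of a 1x1 convolution).
    """
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    attn = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            logit = bias + sum(weights[c] * feat[c][y][x] for c in range(channels))
            attn[y][x] = _sigmoid(logit)
    return attn


def apply_attention(feat, attn):
    """Residual re-weighting: each location is scaled by (1 + attention)."""
    return [
        [[v * (1.0 + attn[y][x]) for x, v in enumerate(row)]
         for y, row in enumerate(channel)]
        for channel in feat
    ]


if __name__ == "__main__":
    # Two channels on a 2 x 2 grid; only the top-left location is strongly active.
    feat = [[[10.0, 0.0], [0.0, 0.0]],
            [[10.0, 0.0], [0.0, 0.0]]]
    attn = spatial_attention(feat, weights=[1.0, 1.0])
    print(attn)
    print(apply_attention(feat, attn))
```

The residual form keeps the original features intact where attention is low, so the module can only amplify regions rather than erase information. In a trained encoder the projection weights would be learned from the clip-level pseudo labels instead of being fixed by hand.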

Performance Evaluation

The authors report extensive experiments on the ShanghaiTech and UCF-Crime datasets, in which MIST performs comparably to or better than existing supervised and weakly supervised methods. In particular, MIST achieves a frame-level AUC of 94.83% on ShanghaiTech. This indicates that it can derive discriminative representations from weakly labeled data, helping to close both the domain gap and the gap left by task-agnostic feature encoders.
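For readers unfamiliar with the metric, the sketch below shows one way a frame-level AUC can be computed from clip-level scores. It is not the authors' evaluation script: the helper names are hypothetical, the 16 frames per clip default is an assumption, and the AUC is computed with the standard rank-based (Mann-Whitney) formula, with tied scores receiving their average rank.

```python
def expand_to_frames(clip_scores, frames_per_clip=16):
    """Repeat each clip score for every frame in that clip (assumed clip length)."""
    return [s for s in clip_scores for _ in range(frames_per_clip)]


def roc_auc(labels, scores):
    """Area under the ROC curve via average ranks; labels are 0 (normal) or 1 (anomalous)."""
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative frame")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of frames that share the same score.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    clip_scores = [0.1, 0.2, 0.9, 0.3]          # toy per-clip anomaly scores
    frame_scores = expand_to_frames(clip_scores, frames_per_clip=2)
    frame_labels = [0, 0, 0, 0, 1, 1, 0, 0]     # toy frame-level ground truth
    print(roc_auc(frame_labels, frame_scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of anomalous from normal frames, which is the scale on which the reported 94.83% should be read.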

Discussion and Implications

The paper has implications for both theoretical research and practical application. Theoretically, MIST's success questions the necessity of the intricate label noise learning mechanisms used in earlier methods and highlights the efficiency of pseudo labels derived from more granular sampling approaches. Practically, the results suggest that integrating MIST into surveillance systems could substantially reduce manual annotation effort while maintaining high detection accuracy.

By demonstrating a consistent generation of informative feature encoders, MIST opens up potential for use in related fields such as weakly supervised action recognition and highlight detection. The proposed framework can also be viewed as a baseline for future research, proposing a paradigm where self-training frameworks are customized to cater to domain-specific requirements in video analysis without the explicit need for dense annotations.

Conclusions

Overall, the MIST framework offers valuable insights into WS-VAD, emphasizing the importance of leveraging pseudo labels and self-guided attention mechanisms to ameliorate the challenge posed by task-agnostic feature encoders. This approach provides a clear path forward for creating more scalable, efficient, and effective video anomaly detection systems applicable across diverse surveillance contexts. Future work could focus on expanding the application of MIST to varied video domains and exploring more sophisticated self-attention mechanisms to further boost its performance.


Authors (3)

Jia-Chang Feng
Fa-Ting Hong
Wei-Shi Zheng

Citations (213)
