
MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection

Published 4 Apr 2021 in cs.CV (arXiv:2104.01633v1)

Abstract: Weakly supervised video anomaly detection (WS-VAD) aims to distinguish anomalies from normal events based on discriminative representations. Most existing works are limited by insufficient video representations. In this work, we develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations with only video-level annotations. In particular, MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder that aims to automatically focus on anomalous regions in frames while extracting task-specific representations. Moreover, we adopt a self-training scheme to optimize both components and finally obtain a task-specific feature encoder. Extensive experiments on two public datasets demonstrate the efficacy of our method, which performs comparably to or even better than existing supervised and weakly supervised methods, notably obtaining a frame-level AUC of 94.83% on ShanghaiTech.

Citations (213)

Summary

  • The paper demonstrates that integrating multiple instance learning with a self-guided attention mechanism significantly improves anomaly detection accuracy.
  • It introduces a pseudo label generator using sparse continuous sampling to mitigate noisy video-level labels.
  • Experiments on benchmarks show a frame-level AUC of 94.83% on ShanghaiTech, highlighting enhanced model robustness and efficiency.

Introduction

The paper "MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection" introduces a framework that addresses the challenges of video anomaly detection (VAD) in surveillance settings. VAD is a critical component of intelligent surveillance systems, particularly as the deployment of surveillance cameras continues to grow. The framework leverages weakly supervised learning to efficiently produce discriminative representations using only video-level annotations.

Figure 1: Illustration of the proposed MIST framework. MIST comprises a multiple instance pseudo label generator $G$ and a self-guided attention boosted feature encoder $E_{SGA}$.

Technical Approach

Multiple Instance Learning (MIL)

MIST deploys a multiple instance learning approach through a novel pseudo label generator designed to enhance the reliability of clip-level annotations in video data. The generator uses sparse continuous sampling to focus attention on contextually significant anomalous segments rather than isolated frames, mitigating the label noise common in methods that naively assign video-level labels to clips.

Figure 2: The workflow of our multiple instance pseudo label generator.
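The sampling-and-labeling idea above can be sketched in a few lines. The sketch below is illustrative, not the paper's code: function names, the min-max normalisation, and the 0.5 threshold are assumptions, and the real generator scores sub-bag features through a MIL ranking head rather than taking raw per-clip scores.

```python
import random

def sparse_continuous_sample(num_clips, num_subbags, run_length):
    """Divide a video's clip indices into equal sub-bags, then draw one short
    consecutive run of clips from each sub-bag: sparse across the video,
    continuous within each sub-bag."""
    bounds = [round(i * num_clips / num_subbags) for i in range(num_subbags + 1)]
    runs = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        run = min(run_length, hi - lo)
        start = lo + random.randrange(hi - lo - run + 1)
        runs.append(list(range(start, start + run)))
    return runs

def clip_pseudo_labels(clip_scores, video_label, threshold=0.5):
    """Turn per-clip anomaly scores into clip-level pseudo labels. A normal
    video (video_label == 0) propagates label 0 to every clip; in an anomalous
    video, clips whose min-max-normalised score clears the threshold are
    marked anomalous."""
    if video_label == 0:
        return [0] * len(clip_scores)
    lo, hi = min(clip_scores), max(clip_scores)
    norm = [(s - lo) / (hi - lo + 1e-8) for s in clip_scores]
    return [int(s >= threshold) for s in norm]
```

Sampling a short consecutive run from every sub-bag keeps temporal context within each sample while still covering the whole video, which is what makes the resulting clip-level pseudo labels less noisy than naive per-clip labeling.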

Self-Guided Attention Mechanism

The framework incorporates a self-guided attention mechanism within a feature encoder, $E_{SGA}$. This module autonomously emphasizes anomalous regions in videos, allowing the encoder to focus on extracting relevant task-specific features without external annotations. The benefit of such an attention mechanism is its ability to adaptively refine the anomaly detection process in complex surveillance video data.

Figure 3: The structure of the self-guided attention boosted feature encoder $E_{SGA}$.
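To make the self-guidance idea concrete, here is a minimal sketch of attention-based feature re-weighting: an attention map is computed from the features themselves and used to amplify salient locations. This is an assumption-laden simplification; the module in the paper is built from convolutional layers inside $E_{SGA}$, and the linear projection `w_att` here is purely illustrative.

```python
import numpy as np

def self_guided_attention(features, w_att):
    """Compute a per-location attention map from the features themselves and
    use it to re-weight them, so salient (potentially anomalous) regions are
    amplified without any external supervision.

    features: (N, C) array of N spatial locations with C channels.
    w_att:    (C,) projection producing one attention logit per location.
    """
    logits = features @ w_att                  # (N,) attention logits
    att = 1.0 / (1.0 + np.exp(-logits))        # sigmoid map, values in (0, 1)
    boosted = features * (1.0 + att[:, None])  # residual re-weighting
    return boosted, att
```

The residual form `features * (1 + att)` leaves all locations intact while boosting attended ones, so the encoder never zeroes out context it might still need.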

Experiments and Results

The efficiency and efficacy of the MIST framework are validated through extensive experiments on benchmark datasets, including UCF-Crime and ShanghaiTech. In particular, MIST achieves a frame-level AUC of 94.83% on the ShanghaiTech dataset, performing comparably to or better than existing methods, both supervised and weakly supervised.

Figure 4: Comparison with the state-of-the-art encoder-based method of Zhong et al.
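The frame-level AUC reported above is the standard VAD metric: the ROC AUC of per-frame anomaly scores against per-frame ground truth. For reference, a minimal implementation (not code from the paper) via the Mann-Whitney U formulation:

```python
def frame_level_auc(scores, labels):
    """Frame-level ROC AUC computed as the fraction of (anomalous, normal)
    frame pairs that the scores rank correctly, counting ties as half.
    scores: per-frame anomaly scores; labels: 1 = anomalous, 0 = normal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 94.83% thus means that, for a randomly drawn anomalous/normal frame pair, the model scores the anomalous frame higher about 95% of the time.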

Feature Visualizations

To better understand the discriminative representation learning, the authors visualize the feature space using t-SNE. These visualizations illustrate the refinements in feature encoding achieved through MIST's self-guided attention compared to a vanilla pretrained encoder.

Figure 5: Feature space visualization of pretrained vanilla feature encoder I3D and the MIST fine-tuned encoder via t-SNE.

Ablation Studies

Ablation studies further corroborate the importance of each component of the MIST framework. Removing the pseudo label generator or the self-guided attention module significantly degrades performance, underscoring their critical roles in enhancing anomaly detection. Experiments also show that the sparse continuous sampling strategy within the MIL framework improves model robustness.

Figure 6: The effect of $T$ with a fixed number of 32 sub-bags on the ShanghaiTech dataset with $I3D^{RGB}$ features.

Conclusion

The MIST framework represents an important advancement in the field of video anomaly detection by successfully integrating self-training with MIL to train task-specific encoders. This combination leads to more accurate and reliable anomaly detection in video data using weak supervision. Future developments are likely to focus on refining pseudo label generation processes and optimizing encoder methods to further improve the efficiency and applicability of anomaly detection in various real-world scenarios.
