- The paper presents an unsupervised framework that iteratively trains a binary classifier on consecutive video segments, removing the most discriminative features at each step; persistently high training accuracy unmasks an abnormal event.
- It combines motion features (3D gradients over spatio-temporal cubes) with appearance features from a pre-trained deep convolutional network and runs in real time at up to 20 fps.
- Empirical results on multiple benchmarks show significant improvements over current unsupervised methods and competitive results with supervised approaches.
Unmasking the Abnormal Events in Video: A Comprehensive Overview
The paper "Unmasking the abnormal events in video" introduces an innovative framework for detecting abnormal events in video sequences without the necessity for training data. Utilizing the concept of unmasking from the domain of authorship verification, the authors adapt it to suit the needs of computer vision tasks, specifically abnormal event detection.
The core idea is to iteratively train a binary classifier to distinguish two consecutive video segments while gradually removing the most discriminative features at each step. If the training accuracy of the intermediate classifiers remains high despite these removals, the second segment is flagged as abnormal. The method is fully unsupervised and runs in real time at up to 20 frames per second.
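To make the mechanics concrete, the following is a minimal sketch of such an unmasking loop in Python. The classifier choice (scikit-learn's LogisticRegression), the regularization strength, the number of iterations, and the number of features removed per step are illustrative assumptions rather than the paper's exact settings.

```python
# A minimal sketch of the unmasking loop described above, not the authors'
# reference implementation. Classifier, regularization strength, iteration
# count, and features removed per step are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression


def unmasking_score(prev_segment, curr_segment, n_iter=10, n_drop=50):
    """Abnormality score for curr_segment relative to prev_segment.

    Both inputs are (n_frames, n_features) arrays of frame-level feature
    vectors. The score is the mean training accuracy of the binary classifier
    across iterations; after each iteration the most discriminative features
    (largest absolute weights) are removed.
    """
    X = np.vstack([prev_segment, curr_segment])
    y = np.concatenate([np.zeros(len(prev_segment)), np.ones(len(curr_segment))])
    active = np.arange(X.shape[1])  # indices of features still in play
    accuracies = []

    for _ in range(n_iter):
        clf = LogisticRegression(C=0.1, max_iter=1000)  # strongly regularized linear classifier
        clf.fit(X[:, active], y)
        accuracies.append(clf.score(X[:, active], y))   # training accuracy on this round

        # Remove the top-weighted features so that the next round has to rely
        # on progressively weaker differences between the two segments.
        order = np.argsort(-np.abs(clf.coef_[0]))
        active = np.delete(active, order[:n_drop])
        if active.size == 0:
            break

    return float(np.mean(accuracies))
```

A score close to 1 means the classifier keeps separating the two segments even after the strongest features have been removed, which is the persistence of accuracy the paper associates with an abnormal event.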
Key Contributions and Methodology
- Novelty in Approach: This is the first framework to transfer the unmasking technique, originally developed for authorship verification in text, to a computer vision problem: detecting anomalies in video sequences without prior training data. The adapted technique, which repeatedly eliminates the most discriminative features, provides a principled way to separate normal from abnormal sequences.
- Feature Extraction: The framework combines motion and appearance features to represent video frames. Motion features are 3D gradient features extracted from spatio-temporal cubes, while appearance features come from a pre-trained deep convolutional network, VGG-f, used without fine-tuning since no labeled data is available in the unsupervised setting (a rough sketch of the motion-feature extraction follows this list).
- Unmasking Technique: The iterative process trains a linear classifier with high regularization to separate the current video segment from the preceding one; after each iteration, the top-weighted features are eliminated, probing how deep the differences between the two segments run. The core hypothesis is that a genuine anomaly keeps the training accuracy high across iterations, because its distinguishing features are not exhausted after a few removals.
- Real-time Processing: The method processes video sequences in real time, an essential requirement for practical use in surveillance and security settings. Efficiency is achieved, without compromising accuracy, by applying a frame stride and dividing each frame into spatial bins.
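The motion side of the pipeline above, including the frame stride and the spatial bins, can be sketched roughly as follows; the cube depth, the 2x2 bin grid, and the stride value are assumptions for illustration, and the appearance stream based on VGG-f activations is omitted.

```python
# A rough sketch of the motion-feature side only: frames are sampled with a
# stride, a short spatio-temporal cube ending at each sampled frame is taken,
# 3D gradients are computed, and their magnitudes are pooled over a grid of
# spatial bins. Cube depth, bin grid, and stride are illustrative assumptions.
import numpy as np


def motion_features(frames, stride=2, bins=(2, 2), cube_depth=5):
    """frames: (T, H, W) grayscale video as a NumPy array.

    Returns one feature vector per sampled frame, built from the mean absolute
    3D gradient magnitude inside each spatial bin of the spatio-temporal cube.
    """
    T, H, W = frames.shape
    bh, bw = H // bins[0], W // bins[1]
    feats = []
    for t in range(cube_depth, T, stride):
        cube = frames[t - cube_depth:t].astype(np.float32)  # temporal slab of frames
        gt, gy, gx = np.gradient(cube)                      # gradients along time, height, width
        mag = np.abs(gt) + np.abs(gy) + np.abs(gx)
        vec = [mag[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].mean()
               for i in range(bins[0]) for j in range(bins[1])]
        feats.append(vec)
    return np.asarray(feats)
```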
Empirical Evaluation and Results
The framework's performance was evaluated on four benchmark datasets: Avenue, Subway, UCSD, and UMN. The results show significant improvements over existing unsupervised methods and competitive performance against several supervised approaches:
- For instance, on the Avenue dataset, the method improves the frame-level AUC of the state-of-the-art unsupervised method by 2.3%, matching the performance of some supervised models (the frame-level AUC metric is illustrated after this list).
- On the challenging UCSD dataset, the framework, despite using no training data, achieves results comparable to older supervised methods.
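For readers unfamiliar with the metric, frame-level AUC is the area under the ROC curve obtained by comparing per-frame abnormality scores against per-frame ground-truth labels. The snippet below shows the standard computation with scikit-learn on hypothetical values; it is not taken from the paper's evaluation code.

```python
# Frame-level AUC: per-frame abnormality scores are compared against
# per-frame ground-truth labels. The values below are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score

frame_labels = np.array([0, 0, 1, 1, 0])             # hypothetical ground truth (1 = abnormal frame)
frame_scores = np.array([0.2, 0.4, 0.9, 0.7, 0.3])   # hypothetical per-frame abnormality scores
print("Frame-level AUC:", roc_auc_score(frame_labels, frame_scores))
```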
Implications and Future Directions
The implications of this research are manifold: it provides a pathway for anomaly detection that avoids the need for training data, and the methodology suits environments where building a comprehensive model of all possible normal behavior is impractical.
For future work, potential improvements lie in the fusion of motion and appearance features, as the reported results suggest only marginal benefits from the late fusion strategy employed. More advanced fusion schemes or unsupervised deep feature learning may offer further gains (a simple late-fusion baseline is sketched below).
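As a point of reference, a simple late-fusion baseline of the kind discussed here can be as plain as min-max normalizing the two per-frame score streams and averaging them; the normalization and the equal weighting below are assumptions, not the paper's exact scheme.

```python
# A hedged sketch of a simple late-fusion strategy: min-max normalize the two
# per-frame score streams and average them with equal weights.
import numpy as np


def late_fuse(motion_scores, appearance_scores):
    def normalize(s):
        s = np.asarray(s, dtype=np.float32)
        return (s - s.min()) / (s.max() - s.min() + 1e-8)
    return 0.5 * normalize(motion_scores) + 0.5 * normalize(appearance_scores)
```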
This paper underscores not only the feasibility of applying unmasking to a new domain but also the prospect of robust unsupervised systems capable of operating effectively in real-time applications. Its findings could notably influence the development of intelligent surveillance systems by offering a scalable, efficient solution for anomaly detection in dynamic environments.