
Anomaly Detection in Video Sequence with Appearance-Motion Correspondence (1908.06351v1)

Published 17 Aug 2019 in cs.CV, cs.LG, and cs.NE

Abstract: Anomaly detection in surveillance videos is currently a challenge because of the diversity of possible events. We propose a deep convolutional neural network (CNN) that addresses this problem by learning a correspondence between common object appearances (e.g. pedestrian, background, tree, etc.) and their associated motions. Our model is designed as a combination of a reconstruction network and an image translation model that share the same encoder. The former sub-network determines the most significant structures that appear in video frames and the latter one attempts to associate motion templates to such structures. The training stage is performed using only videos of normal events and the model is then capable of estimating frame-level scores for an unknown input. The experiments on 6 benchmark datasets demonstrate the competitive performance of the proposed approach with respect to state-of-the-art methods.

Citations (324)

Summary

  • The paper introduces a novel CNN architecture that integrates a Conv-AE and U-Net via a shared encoder to capture appearance-motion correspondence in video.
  • It employs a patch-based anomaly score to localize irregular events, enhancing detection sensitivity in complex surveillance scenarios.
  • Experimental results on benchmark datasets demonstrate improved AUC and AP metrics, confirming the method’s robustness in real-world applications.

Anomaly Detection in Video Sequence with Appearance-Motion Correspondence

The paper by Trong-Nguyen Nguyen and Jean Meunier presents a deep convolutional neural network (CNN) approach to anomaly detection in surveillance video. The proposed model combines a convolutional auto-encoder (Conv-AE) and a U-Net that share a single encoder, establishing a correspondence between typical object appearances and their associated motions. This work addresses a core challenge of anomaly detection, the diversity of possible anomalous events, which makes manual surveillance a resource-intensive process.

Model Architecture

The model comprises two streams: an appearance learning stream and a motion prediction stream. The appearance stream is a Conv-AE that learns the regular spatial structures of video frames; at test time, anomalies surface where the reconstruction deviates significantly from the input. In parallel, the motion stream uses a U-Net to predict the optical flow associated with each frame, capturing the typical motions of the encoded appearances.
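To make the shared-encoder design concrete, below is a minimal PyTorch sketch of the two-stream idea: one encoder feeding both a reconstruction decoder and a flow-prediction decoder. Channel widths, depths, input resolution, and loss weighting are illustrative assumptions rather than the paper's exact configuration, and the U-Net skip connections are omitted for brevity.

```python
# Minimal sketch of the shared-encoder, two-stream idea (illustrative sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

def down(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

def up(c_in, c_out):
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class AppearanceMotionNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder: compresses a grayscale frame into a latent map.
        self.encoder = nn.Sequential(down(1, 64), down(64, 128), down(128, 256))
        # Appearance stream (Conv-AE): reconstructs the input frame.
        self.appearance = nn.Sequential(
            up(256, 128), up(128, 64),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Tanh(),
        )
        # Motion stream: predicts a 2-channel optical-flow map
        # (U-Net-style in the paper; skip connections omitted here).
        self.motion = nn.Sequential(
            up(256, 128), up(128, 64),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1),
        )

    def forward(self, frame):
        z = self.encoder(frame)
        return self.appearance(z), self.motion(z)

# Training uses normal videos only: reconstruction error plus flow error
# against precomputed ground-truth flow (e.g. from an off-the-shelf FlowNet).
model = AppearanceMotionNet()
frame = torch.randn(8, 1, 128, 192)    # batch of normal frames (stand-in data)
flow_gt = torch.randn(8, 2, 128, 192)  # stand-in for precomputed optical flow
recon, flow = model(frame)
loss = F.mse_loss(recon, frame) + F.l1_loss(flow, flow_gt)
loss.backward()
```

Because only normal videos are seen during training, large reconstruction or flow-prediction errors at test time indicate appearance-motion pairs the model has never had to explain.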

To bolster performance, Inception-style modules replace the plain convolutions, letting the network choose its receptive field dynamically and accommodating the variations in object size and position that arise under the fixed camera perspective typical of surveillance systems. This design also reduces the information loss caused by the bottleneck of conventional encoder-decoder models.
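The multi-scale idea can be sketched as a small Inception-style block: parallel convolutions with different kernel sizes whose outputs are concatenated along the channel axis, so later layers can weight receptive fields per location. The branch widths and kernel sizes below are assumptions for illustration.

```python
# Sketch of an Inception-style multi-scale convolution block.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, c_in, c_branch, kernels=(1, 3, 5)):
        super().__init__()
        # padding = k // 2 keeps the spatial size identical across branches,
        # so the branch outputs can be concatenated channel-wise.
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_in, c_branch, k, padding=k // 2) for k in kernels]
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)

block = MultiScaleConv(64, 32)           # output has 3 * 32 = 96 channels
y = block(torch.randn(1, 64, 32, 48))
assert y.shape == (1, 96, 32, 48)
```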

Methodological Innovations

Nguyen and Meunier introduce a score-estimation scheme that departs from traditional methods relying on whole-frame analysis. Instead, they propose a patch-based anomaly score that evaluates the most anomalous patch, focusing on localized regions of interest. This method enhances detection sensitivity, preventing a small anomaly from being averaged away when the rest of a large frame is normal or noisy.
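A minimal sketch of such a score, assuming a per-pixel error map has already been computed (e.g. combined reconstruction and flow errors): the frame-level score is the mean error of the worst sliding-window patch rather than the average over the whole frame. The patch size and 50% overlap are illustrative assumptions, not the paper's exact settings.

```python
# Patch-based frame score: take the most anomalous local window.
import numpy as np

def patch_anomaly_score(error_map: np.ndarray, patch: int = 16) -> float:
    """Return the mean error of the most anomalous sliding-window patch."""
    h, w = error_map.shape
    step = patch // 2  # 50% overlap between neighboring patches
    best = 0.0
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            best = max(best, float(error_map[y:y + patch, x:x + patch].mean()))
    return best
```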

Experimental Evaluation

The model was evaluated on six benchmark datasets, including CUHK Avenue and UCSD Ped2. It demonstrated competitive performance, surpassing other state-of-the-art methods in several cases on AUC and AP metrics. These results highlight the potential of appearance-motion correspondence for capturing anomalies that would otherwise blend into complex video sequences.
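For context, frame-level AUC and AP on these benchmarks are typically computed by comparing per-frame anomaly scores against binary frame labels. The snippet below is a generic scikit-learn sketch with synthetic stand-in data, not the authors' evaluation code.

```python
# Generic frame-level evaluation sketch (synthetic data, not the paper's).
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)              # 1 = anomalous frame
scores = labels + rng.normal(scale=0.5, size=200)  # stand-in anomaly scores
print("frame-level AUC:", roc_auc_score(labels, scores))
print("frame-level AP :", average_precision_score(labels, scores))
```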

Theoretical Implications and Future Directions

This paper contributes to the theoretical understanding of how spatial-temporal features can be coupled for improved anomaly detection. The design suggests that sharing a common encoder between appearance and motion analysis can efficiently capture dependencies that are informative in recognizing abnormal events.

Future directions may include more advanced multi-task learning strategies to strengthen representational power, as well as further refining the model's robustness to the lighting variations and occlusions commonly encountered in real-world environments.

Further research can also extend to unsupervised and semi-supervised learning paradigms where labeled anomalies are scarce. Additionally, combining this approach with anomaly explanation procedures could yield more interpretable results, enhancing its applicability in dynamic and complex surveillance scenarios.

In summary, this paper proposes a sophisticated framework for anomaly detection built on the joint modeling of visual appearance and motion. The results showcase the effectiveness of the architecture and highlight significant opportunities for both practical applications and further theoretical exploration in video analytics.