
Learning to Detect Violent Videos using Convolutional Long Short-Term Memory (1709.06531v1)

Published 19 Sep 2017 in cs.CV

Abstract: Developing a technique for the automatic analysis of surveillance videos in order to identify the presence of violence is of broad interest. In this work, we propose a deep neural network for the purpose of recognizing violent videos. A convolutional neural network is used to extract frame level features from a video. The frame level features are then aggregated using a variant of the long short term memory that uses convolutional gates. The convolutional neural network along with the convolutional long short term memory is capable of capturing localized spatio-temporal features which enables the analysis of local motion taking place in the video. We also propose to use adjacent frame differences as the input to the model thereby forcing it to encode the changes occurring in the video. The performance of the proposed feature extraction pipeline is evaluated on three standard benchmark datasets in terms of recognition accuracy. Comparison of the results obtained with the state of the art techniques revealed the promising capability of the proposed method in recognizing violent videos.

Citations (203)

Summary

  • The paper introduces an end-to-end CNN and convLSTM model that captures spatio-temporal features for robust violence detection.
  • The model uses adjacent-frame differences as input to emphasize dynamic changes, achieving 100% accuracy on the Movies dataset.
  • The convLSTM approach significantly reduces parameters compared to traditional LSTMs, illustrating its efficiency in video analysis.

A Comprehensive Analysis of "Learning to Detect Violent Videos using Convolutional Long Short-Term Memory"

The paper "Learning to Detect Violent Videos using Convolutional Long Short-Term Memory" by Swathikiran Sudhakaran and Oswald Lanz presents an innovative approach to the automatic identification of violent activities in videos using deep learning techniques. The authors propose a convolutional neural network (CNN) paired with a convolutional long short-term memory (convLSTM) model to automatically recognize violent content, offering significant improvements over traditional methods that rely heavily on hand-crafted features.

Technical Summary

Core Methodology

The cornerstone of the proposed approach is the integration of a CNN and a convLSTM, which together form an end-to-end trainable deep neural network. The CNN extracts spatial features from individual video frames, while the convLSTM captures spatio-temporal patterns by aggregating these frame-level features in the temporal domain. Notably, the convLSTM is preferred over a traditional LSTM because its convolutional gates preserve the spatial layout of the feature maps, encoding spatial and temporal variations more precisely and with far fewer parameters.
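
To make the gate structure concrete, here is a minimal ConvLSTM cell sketch in PyTorch (an assumed framework; the paper's exact layer sizes, kernel choices, and naming may differ):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed by convolutions, so the hidden
    state keeps a spatial layout (channels x height x width)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keep spatial dimensions unchanged
        # One convolution produces all four gates (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state  # hidden and cell states, both (B, hidden_channels, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g      # update cell state
        h = o * torch.tanh(c)  # new hidden state retains spatial structure
        return h, c
```

Because the gates are convolutions rather than fully connected layers, the hidden state remains a spatial feature map, which is what allows the model to localize motion within the frame.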

The authors also propose using the difference between consecutive video frames as the input to the model, rather than the raw frames themselves. This strategy emphasizes modeling the dynamic changes occurring in the video, which are instrumental in detecting violent behavior.
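
A minimal sketch of this preprocessing step follows; the tensor shape and function name are illustrative assumptions, not taken from the paper:

```python
import torch

def frame_differences(video: torch.Tensor) -> torch.Tensor:
    """Return differences between adjacent frames: video[t+1] - video[t].

    The output has T-1 "frames"; static background largely cancels out,
    forcing the network to encode motion and appearance change."""
    return video[1:] - video[:-1]

# Example: a 16-frame RGB clip at 224x224 yields 15 difference frames.
clip = torch.randn(16, 3, 224, 224)   # (T, C, H, W)
diffs = frame_differences(clip)       # shape (15, 3, 224, 224)
```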

Evaluation and Results

The effectiveness of the proposed model is validated across three benchmark datasets: the Hockey Fight Dataset, the Movies Dataset, and the Violent-Flows Crowd Violence Dataset. The authors adopt a rigorous 5-fold cross-validation scheme to ensure robustness in their evaluation.
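
For reference, a skeleton of such a 5-fold evaluation loop using scikit-learn's KFold (a stand-in utility; the file names and training routines below are assumptions, not details from the paper):

```python
import numpy as np
from sklearn.model_selection import KFold

video_paths = np.array([f"clip_{i:04d}.avi" for i in range(1000)])  # hypothetical clip list
labels = np.random.randint(0, 2, size=len(video_paths))             # 0 = non-violent, 1 = violent

fold_accuracies = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(video_paths):
    # train_model / evaluate_model are hypothetical stand-ins for the
    # real training and evaluation routines:
    #   model = train_model(video_paths[train_idx], labels[train_idx])
    #   fold_accuracies.append(evaluate_model(model, video_paths[test_idx]))
    pass

# The reported figure is the mean accuracy over the five held-out folds:
# np.mean(fold_accuracies)
```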

Numerical results demonstrate the efficacy of the model, with the proposed method outperforming state-of-the-art techniques in two out of the three datasets. For instance, the model achieves a classification accuracy of 97.1% on the Hockey Fight Dataset and attains a perfect score of 100% on the Movies Dataset. Although the model's performance on the Violent-Flows Dataset (94.57%) did not surpass all existing approaches, it remains competitive and highlights the need for further improvements in scenarios involving large crowds.

Comparative Analysis

A direct comparison with a model incorporating a traditional LSTM (with 1000 units) further accentuates the advantages of convLSTM. The convLSTM-based model not only exhibits superior accuracy but also involves substantially fewer parameters (9.6 million versus 77.5 million), demonstrating the convLSTM's ability to generate effective video representations while mitigating the risk of overfitting.
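
A back-of-the-envelope calculation shows where the savings come from: a fully connected LSTM flattens its input, so its weights scale with the product of input and hidden sizes, whereas a convLSTM's weights scale only with kernel size and channel counts. The feature-map sizes below are illustrative assumptions, not the paper's exact figures:

```python
def lstm_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def convlstm_params(in_ch, hidden_ch, k=3):
    # 4 gates computed by k x k convolutions over concatenated channels.
    return 4 * (hidden_ch * (in_ch + hidden_ch) * k * k + hidden_ch)

# e.g. a 7x7x256 feature map flattened into a 1000-unit LSTM, versus a
# 256-channel ConvLSTM operating directly on the same map:
print(lstm_params(7 * 7 * 256, 1000))  # 54,180,000 -> ~54.2M in the LSTM alone
print(convlstm_params(256, 256))       # 4,719,616  -> ~4.7M
```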

Implications and Future Directions

The implications of this research span both practical and theoretical dimensions. On the practical side, developing an automated system capable of real-time violence detection in surveillance videos could significantly bolster public safety and reduce the burden of manual video monitoring. Theoretically, this work contributes to the broader field of understanding complex human interactions in videos using deep learning frameworks.

However, challenges persist, particularly in handling videos with subregions indicating violence while the majority remain passive—a scenario common in crowd violence incidents. Future research may delve into region-based analysis within video frames, potentially splitting them into sub-regions and individually assessing their violent content.

In conclusion, this paper makes a noteworthy contribution to the computational violence detection domain, leveraging advancements in deep learning to address a critical societal need. The proposed methodology sets a promising precedent for subsequent research and development in video analysis and surveillance technologies.