- The paper presents a multimodal neural network that fuses audio and visual cues to achieve a 5% average precision improvement over state-of-the-art methods.
- The authors introduce the XD-Violence dataset with 4754 untrimmed videos spanning 217 hours and six violence categories for comprehensive training.
- The HL-Net model combines holistic, localized, and score branches to capture both long- and short-range dependencies; a lightweight HLC approximator additionally enables online, real-time detection.
Multimodal Violence Detection under Weak Supervision: Key Insights and Implications
"Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision" presents an approach to violence detection in video that leverages both visual and auditory information. The paper addresses the limitations of existing methodologies by introducing a novel large-scale dataset, XD-Violence, and a neural network architecture designed to process and interpret this multimodal input under a weakly supervised learning regime.
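The weakly supervised setting means only video-level labels (violent or not) are available, with no frame-level annotations. A common recipe for this setting, sketched below, is multiple-instance learning: pool the top-k snippet scores into a single video-level prediction and train against the video label. The choice of k, the scores, and the pooling scheme here are illustrative, not the paper's exact loss.

```python
import numpy as np

def video_level_loss(snippet_scores, video_label, k=3):
    """MIL-style loss sketch for weak supervision: only a video-level
    label is available, so the k highest snippet scores are averaged
    into one video-level prediction before computing binary
    cross-entropy. k=3 is an illustrative hyperparameter."""
    top_k = np.sort(snippet_scores)[-k:]        # k highest snippet scores
    video_pred = float(np.mean(top_k))          # video-level prediction
    eps = 1e-7                                  # numerical safety for log
    video_pred = min(max(video_pred, eps), 1 - eps)
    return -(video_label * np.log(video_pred)
             + (1 - video_label) * np.log(1 - video_pred))

# A violent video in which only a few snippets contain violence still
# yields a confident video-level prediction via top-k pooling.
scores = np.array([0.05, 0.1, 0.9, 0.95, 0.1, 0.85])
loss_pos = video_level_loss(scores, 1)  # label: violent
loss_neg = video_level_loss(scores, 0)  # label: non-violent
```

Top-k pooling is what makes weak labels usable: even when most snippets of a violent video are benign, the few high-scoring snippets dominate the video-level prediction.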
Dataset Release and Importance
The XD-Violence dataset is a significant contribution due to its comprehensive coverage, comprising 4754 untrimmed videos spanning 217 hours and incorporating both video and audio data. The dataset's diversity is critical, representing multiple scenarios such as movies and wild scenes, and encompassing six distinct categories of violence: Abuse, Car Accidents, Explosion, Fighting, Riot, and Shooting. This scale and variety present a distinct advantage over previous datasets, which tend to be limited in scope, often relying on short, well-trimmed sequences, and lacking audio information. The inclusion of audio signals addresses a crucial gap, as certain violent cues may be more discernible aurally than visually, and vice versa. By providing a platform for more comprehensive training, the dataset facilitates the development of more robust violence detection models.
Model Architecture
The proposed architecture, HL-Net, integrates three parallel branches: a holistic branch that captures long-range dependencies through similarity priors, a localized branch that models short-range interactions using proximity priors, and a score branch that weights snippet interactions dynamically according to the predicted violence scores. The architecture is complemented by an HLC approximator for online detection, addressing the practical challenge of real-time violence detection, which is vital for applications in surveillance and content moderation. Together, these components let the network balance global context against localized nuances within video data.
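The two priors can be pictured as adjacency matrices over video snippets, which the branches then use for graph-style feature aggregation. The sketch below shows one plausible construction; the cosine similarity, softmax temperature `tau`, and decay rate `sigma` are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def holistic_adjacency(features, tau=0.5):
    """Similarity-prior adjacency (sketch): snippets with similar
    features are connected regardless of temporal distance, so
    aggregation over this graph captures long-range dependencies.
    Cosine similarity + softmax normalization are assumptions."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                              # pairwise cosine similarity
    e = np.exp(sim / tau)
    return e / e.sum(axis=1, keepdims=True)    # row-normalize

def localized_adjacency(n, sigma=1.0):
    """Proximity-prior adjacency (sketch): edge weight decays with
    temporal distance |i - j|, so aggregation draws mostly from
    neighboring snippets (short-range interactions)."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    a = np.exp(-dist / sigma)
    return a / a.sum(axis=1, keepdims=True)

# Three snippets: the first two share features, the third differs.
A_hol = holistic_adjacency(np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
A_loc = localized_adjacency(4)
```

The key contrast: the holistic matrix links snippet 0 strongly to the feature-similar snippet 1 no matter how far apart they are in time, while the localized matrix always favors temporal neighbors.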
Empirical Findings and Implications
The empirical evaluation demonstrates notable efficacy: the proposed method outperforms state-of-the-art techniques on both the new XD-Violence dataset and established benchmarks such as UCF-Crime, with a reported average precision (AP) advantage of approximately 5% over leading approaches. These results underscore the effectiveness of multimodal information fusion and of explicitly modeling relationships among video snippets.
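For readers unfamiliar with the metric, average precision ranks frames by predicted violence score and summarizes precision across recall levels. A minimal sketch of the standard ranking-based computation:

```python
import numpy as np

def average_precision(scores, labels):
    """Frame-level average precision (sketch): rank frames by
    predicted score, then average the precision measured at each
    positive (violent) frame in the ranking."""
    order = np.argsort(-scores)                 # highest score first
    labels = labels[order]
    tp = np.cumsum(labels)                      # true positives so far
    precision = tp / np.arange(1, len(labels) + 1)
    # Sum precision at each positive, divided by total positives.
    return float((precision * labels).sum() / labels.sum())

# Perfect ranking (all violent frames scored above non-violent ones)
# gives AP = 1.0; a reversed ranking gives a lower value.
ap_perfect = average_precision(np.array([0.9, 0.8, 0.2, 0.1]),
                               np.array([1, 1, 0, 0]))
ap_bad = average_precision(np.array([0.9, 0.1]), np.array([0, 1]))
```

AP is threshold-free, which is why it is preferred over accuracy for detection tasks where violent frames are rare.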
The paper also highlights the differential impact of various modalities and network components on detection performance. The multimodal integration of audio and visual signals significantly enhances violence detection capabilities, with a 3-4% improvement in AP relative to unimodal inputs.
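In the simplest fusion scheme, per-snippet visual and audio feature vectors are concatenated before entering the network. The sketch below assumes this early-fusion setup; the feature dimensions are placeholders (deep visual features such as I3D and audio embeddings such as VGGish are typical choices, but the exact sizes here are illustrative).

```python
import numpy as np

def fuse_features(visual, audio):
    """Early-fusion sketch: concatenate per-snippet visual and audio
    feature vectors along the channel axis, so downstream layers see
    both modalities jointly. Dimensions are placeholders."""
    assert visual.shape[0] == audio.shape[0]    # same number of snippets
    return np.concatenate([visual, audio], axis=1)

# 5 snippets: 1024-d visual features fused with 128-d audio features.
fused = fuse_features(np.zeros((5, 1024)), np.zeros((5, 128)))
```

Concatenation is the least committal fusion choice: it lets the network learn cross-modal interactions rather than hand-designing them, at the cost of a wider input layer.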
Broader Implications
From a theoretical perspective, the paper contributes to ongoing conversations around multimodal learning, offering insights into how audio-visual integration can enhance feature representation and anomaly detection in video. Practically, the advancements enable more accurate and efficient surveillance systems and content moderation technologies, potentially transforming safety and security measures across diverse domains.
Prospective Developments
Future research inspired by this work may explore augmentations of the HL-Net architecture to include richer contextual insights from supplementary data modalities, or to refine online detection strategies further. Moreover, expanding the scope of violence categories in datasets like XD-Violence could improve generalizability and applicability in more nuanced and complex real-world scenarios.