Improving Video Violence Recognition with Human Interaction Learning on 3D Skeleton Point Clouds (2308.13866v1)

Published 26 Aug 2023 in cs.CV

Abstract: Deep learning has proved to be very effective in video action recognition. Video violence recognition attempts to learn the human multi-dynamic behaviours in more complex scenarios. In this work, we develop a method for video violence recognition from a new perspective of skeleton points. Unlike the previous works, we first formulate 3D skeleton point clouds from human skeleton sequences extracted from videos and then perform interaction learning on these 3D skeleton point clouds. Specifically, we propose two types of Skeleton Points Interaction Learning (SPIL) strategies: (i) Local-SPIL: by constructing a specific weight distribution strategy between local regional points, Local-SPIL aims to selectively focus on the most relevant parts of them based on their features and spatial-temporal position information. In order to capture diverse types of relation information, a multi-head mechanism is designed to aggregate different features from independent heads to jointly handle different types of relationships between points. (ii) Global-SPIL: to better learn and refine the features of the unordered and unstructured skeleton points, Global-SPIL employs the self-attention layer that operates directly on the sampled points, which can help to make the output more permutation-invariant and well-suited for our task. Extensive experimental results validate the effectiveness of our approach and show that our model outperforms the existing networks and achieves new state-of-the-art performance on video violence datasets.

Summary

The paper presents a novel framework using 3D skeleton point clouds to capture spatial-temporal human interactions for improved violence recognition.
It employs Local-SPIL and Global-SPIL modules that utilize weight distribution and self-attention to model both local and global point relationships.
The approach achieves state-of-the-art performance, reaching 90.0% accuracy on the RWF-2000 dataset, and distinguishes violent actions from easily confused non-violent ones.

Introduction

The paper presents a novel approach for video violence recognition through the formulation of 3D skeleton point clouds derived from human skeleton sequences in videos. Traditional video action recognition models encounter significant challenges in accurately identifying violent actions due to complex multi-dynamic behaviors and the presence of multiple subjects in the video. By leveraging 3D skeleton point clouds, this approach focuses on capturing the spatial-temporal dynamics of human interactions more effectively.
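As a rough illustration of what "formulating a 3D skeleton point cloud" can mean, the sketch below flattens per-frame 2D skeletons into one unordered set of (x, y, t) points. The input layout `(T, M, J, 3)` with `(x, y, confidence)` per joint, the function name `skeletons_to_point_cloud`, and the `time_scale` parameter are assumptions made for this example; the paper's exact preprocessing may differ.

```python
import numpy as np

def skeletons_to_point_cloud(poses, time_scale=1.0):
    """Flatten per-frame skeletons into one unordered 3D point set.

    poses: array of shape (T, M, J, 3) holding (x, y, confidence) for
           T frames, M people and J joints (a hypothetical detector output).
    Returns (points, feats): points is (T*M*J, 3) with columns (x, y, t);
    feats is (T*M*J, 1) with the detection confidence of each joint.
    """
    poses = np.asarray(poses, dtype=np.float64)
    T, M, J, _ = poses.shape
    # Frame index, broadcast to every joint of every person in that frame.
    t = np.broadcast_to(np.arange(T, dtype=np.float64)[:, None, None], (T, M, J))
    xyt = np.stack([poses[..., 0], poses[..., 1], t * time_scale], axis=-1)
    points = xyt.reshape(-1, 3)          # person and joint identity are dropped
    feats = poses[..., 2].reshape(-1, 1)
    return points, feats

rng = np.random.default_rng(0)
poses = rng.random((4, 2, 17, 3))        # 4 frames, 2 people, 17 joints (toy values)
points, feats = skeletons_to_point_cloud(poses)
print(points.shape, feats.shape)         # (136, 3) (136, 1)
```

Because every joint of every person lands in the same point set, interactions between people become relations between points, which is the property the interaction-learning modules rely on.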

Methods

The paper introduces two Skeleton Points Interaction Learning (SPIL) strategies designed to enhance recognition accuracy:

Local-SPIL: This module applies a weight distribution strategy among the points in a local region, selectively focusing on the most relevant ones based on their features and spatial-temporal positions. A multi-head mechanism aggregates features from independent heads to capture different types of relationships between points.
Global-SPIL: This module applies a self-attention layer directly to the sampled points to learn and refine the features of the unordered, unstructured skeleton points, which helps make the output more permutation-invariant and better suited to the task.
Figure 1: Overview of the framework utilizing pose detection to extract skeleton coordinates, which are processed as point clouds in Local-SPIL and Global-SPIL modules.
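The local weighting idea can be sketched as follows. This is a simplified stand-in, not the paper's formulation: it assumes k-nearest-neighbour regions in (x, y, t) space, a dot-product feature score plus a linear position score, and random matrices in place of learned weights. All names (`local_interaction`, `knn_indices`, `heads`, `out_dim`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knn_indices(points, k):
    """Indices of the k nearest neighbours of each point (self excluded)."""
    diff = points[:, None, :] - points[None, :, :]
    d2 = (diff ** 2).sum(-1)                  # (N, N) squared distances
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def local_interaction(points, feats, k=8, heads=2, out_dim=4, seed=0):
    """Weight each point's neighbours by feature and position, per head."""
    rng = np.random.default_rng(seed)         # random stand-ins for learned weights
    N, C = feats.shape
    idx = knn_indices(points, k)              # (N, k)
    nbr_feats = feats[idx]                    # (N, k, C)
    rel_pos = points[idx] - points[:, None, :]  # (N, k, 3) spatial-temporal offsets
    outputs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(C, out_dim)) for _ in range(3))
        Wp = rng.normal(size=(3, 1))
        q = feats @ Wq                        # (N, out_dim)
        kf = nbr_feats @ Wk                   # (N, k, out_dim)
        feat_score = np.einsum('nd,nkd->nk', q, kf) / np.sqrt(out_dim)
        pos_score = (rel_pos @ Wp)[..., 0]    # (N, k)
        w = softmax(feat_score + pos_score, axis=1)   # weights over neighbours
        outputs.append(np.einsum('nk,nkd->nd', w, nbr_feats @ Wv))
    return np.concatenate(outputs, axis=1)    # heads are concatenated

rng = np.random.default_rng(1)
points = rng.random((32, 3))                  # toy (x, y, t) points
feats = rng.random((32, 5))                   # toy per-point features
out = local_interaction(points, feats)
print(out.shape)                              # (32, 8)
```

Each head draws its own projections, so different heads can assign different weights to the same neighbourhood; concatenating them is one common way to combine heads and is assumed here.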

Experimental Evaluation

The proposed method achieves state-of-the-art performance across several video violence datasets. Extensive experiments validate the superiority of the combination of Local-SPIL and Global-SPIL modules compared to traditional RGB and flow-based methods.

A key observation is the robustness of the model across diverse datasets, with notable improvements in scenarios with complex backgrounds and multiple interacting subjects.

Figures

Figure 2: Demonstration of how the Local-SPIL module operates on human skeleton point clouds, focusing attention on correlated points.

Figure 3: Depiction of how the Global-SPIL module processes the globally unordered points using self-attention.
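The link between self-attention and permutation invariance can be checked numerically. The sketch below uses plain single-head dot-product self-attention with random (not learned) projections; the final max-pooling step is an assumption of this example, chosen because it is a symmetric function, and is not taken from the paper.

```python
import numpy as np

def self_attention(feats, Wq, Wk, Wv):
    """Single-head dot-product self-attention over all points, no positional encoding."""
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    scores = q @ k.T / np.sqrt(q.shape[1])            # (N, N) pairwise scores
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)              # each row sums to 1
    return w @ v

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 6))                      # 16 toy points, 6 features each
Wq, Wk, Wv = (rng.normal(size=(6, 6)) for _ in range(3))

out = self_attention(feats, Wq, Wk, Wv)
perm = rng.permutation(16)
out_perm = self_attention(feats[perm], Wq, Wk, Wv)

# Reordering the input points only reorders the output rows (equivariance).
print(np.allclose(out_perm, out[perm]))                     # True
# A symmetric pooling of the outputs is unchanged by the reordering (invariance).
print(np.allclose(out_perm.max(axis=0), out.max(axis=0)))   # True
```

The layer itself is permutation-equivariant; invariance of the final descriptor comes from combining it with an order-independent aggregation such as the max used here.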

Performance Comparison

On the RWF-2000 dataset, the method outperforms existing methods like TSN and I3D, achieving an accuracy of 90.0%. Additionally, experiments with confusing non-violent actions confirm the model's ability to differentiate nuanced patterns in human skeletal dynamics.

Figure 4: Confusion matrices comparing models, illustrating improved classification accuracy on hard-to-distinguish examples.
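For readers unfamiliar with the metric, the snippet below shows how a confusion matrix and accuracy are computed for a binary violent / non-violent classifier. The labels are made-up toy values for illustration only and are not results from the paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=2):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels: 1 = violent, 0 = non-violent (illustrative, not real data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()    # correct predictions lie on the diagonal
print(cm.tolist(), accuracy)          # [[3, 1], [1, 3]] 0.75
```

Off-diagonal entries are the confusions: here one non-violent clip is predicted violent and one violent clip is missed.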

Conclusion

The introduction of 3D skeleton point clouds and the two SPIL strategies addresses the limitations of traditional video action recognition when applied to violence recognition. The work opens a new direction for handling complex multi-dynamic video scenarios and leaves room for future work on understanding human interactions through point cloud analysis. The results point to potential applications in surveillance systems, where more reliable and fine-grained video violence recognition is needed.