Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks (1510.00562v1)

Published 2 Oct 2015 in cs.CV

Abstract: Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNN) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has been reported. This has triggered us to investigate in this paper a new deep architecture which can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FstCN) that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make factorization in FstCN possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FstCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost the performance, FstCN outperforms existing CNN based methods and achieves comparable performance with a recent method that benefits from using auxiliary training videos.

Authors (4)
  1. Lin Sun (65 papers)
  2. Kui Jia (125 papers)
  3. Dit-Yan Yeung (78 papers)
  4. Bertram E. Shi (28 papers)
Citations (526)

Summary

Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks

The paper "Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks" presents an innovative approach for tackling the challenges associated with recognizing human actions in video sequences. It introduces a new architectural paradigm, Factorized Spatio-Temporal Convolutional Networks (F), designed to efficiently handle the three-dimensional (3D) spatio-temporal signals inherent in video data.

Core Contributions

The authors propose a novel factorization of 3D convolutional kernels, decomposing the convolution process into two sequential stages: 2D spatial convolutions followed by 1D temporal convolutions. This decomposition substantially reduces the computational complexity and the number of parameters required, making it feasible to use existing large-scale image datasets for training spatial filters, thereby mitigating the data insufficiency challenge typically faced in video-based tasks.
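To make the decomposition concrete, below is a minimal sketch in PyTorch (an assumption of ours; the paper predates today's frameworks), with illustrative channel counts and kernel sizes rather than the paper's exact FstCN configuration. It also shows the transpose/permute step needed between the spatial and temporal stages.

```python
import torch
import torch.nn as nn

class FactorizedSTConv(nn.Module):
    """Approximates a 3D convolution as a 2D spatial conv followed by
    a 1D temporal conv (illustrative sizes, not the paper's exact net)."""

    def __init__(self, in_ch, mid_ch, out_ch, k_spatial=3, k_temporal=3):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, mid_ch, k_spatial, padding=k_spatial // 2)
        self.temporal = nn.Conv1d(mid_ch, out_ch, k_temporal, padding=k_temporal // 2)

    def forward(self, x):
        # x: (B, C, T, H, W) -- a batch of clips with T frames each
        b, c, t, h, w = x.shape
        # Fold time into the batch so every frame shares the 2D spatial filters.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)                                    # (B*T, M, H, W)
        m, h2, w2 = x.shape[1:]
        # Transform/permute: expose the time axis so 1D kernels slide over it
        # at every spatial location.
        x = x.reshape(b, t, m, h2, w2).permute(0, 3, 4, 2, 1)  # (B, H, W, M, T)
        x = x.reshape(b * h2 * w2, m, t)
        x = self.temporal(x)                                   # (B*H*W, O, T)
        o = x.shape[1]
        return x.reshape(b, h2, w2, o, t).permute(0, 3, 4, 1, 2)  # (B, O, T, H, W)

# Example: 2 clips of 16 RGB frames at 112x112 -> output (2, 64, 16, 112, 112)
y = FactorizedSTConv(3, 32, 64)(torch.randn(2, 3, 16, 112, 112))
```

Because the 2D spatial stage sees single frames, it can be trained (or pre-trained) on ordinary image data, which is the source of the parameter and data savings described above.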

Key innovations in this paper include:

  1. Factorized Convolutional Architecture: By adopting a layered approach where spatial features are extracted first, followed by temporal features, the architecture circumvents the need for massive amounts of video data typically needed for training traditional 3D CNNs.
  2. Novel Transformation and Permutation Operator: This operator enables effective temporal convolution by rearranging the spatial feature maps so that 1D kernels can run along the time axis (the permute/reshape steps in the sketch above illustrate the idea), enhancing the network's ability to recognize complex motion patterns.
  3. Video Clip Sampling Technique: To address the issue of sequence alignment, the authors propose sampling multiple video clips from a single video sequence (see the sampling sketch after this list). This technique aids in learning robust spatio-temporal features, even for misaligned or varied-speed actions.
  4. Sparsity Concentration Index (SCI) Based Score Fusion: The paper introduces a score fusion method that gives more weight to score vectors with higher sparsity concentration (see the fusion sketch after this list), thereby improving classification accuracy.
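
As referenced in item 3, here is a minimal sketch of multi-clip sampling in NumPy. The clip length, clip count, and jitter are hypothetical choices of ours, not the paper's exact sampling scheme.

```python
import numpy as np

def sample_clips(num_frames, clip_len=16, num_clips=10, rng=None):
    """Return frame indices for num_clips clips of clip_len consecutive
    frames, spread across the video (assumes num_frames >= clip_len)."""
    rng = rng or np.random.default_rng()
    max_start = num_frames - clip_len
    # Evenly spaced anchors with small random jitter around each one.
    anchors = np.linspace(0, max_start, num_clips)
    jitter = rng.integers(-2, 3, size=num_clips)
    starts = np.clip(anchors + jitter, 0, max_start).astype(int)
    return [list(range(s, s + clip_len)) for s in starts]

# Example: ten 16-frame clips from a 120-frame action video
clips = sample_clips(120)
```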
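And for item 4, a hedged sketch of SCI-weighted fusion: each sampled clip is classified independently, and peaked (confident) score vectors dominate the fused prediction. The SCI below follows the standard sparsity concentration definition, SCI(s) = (C * max(s) / sum(s) - 1) / (C - 1) for a nonnegative score vector s over C classes; the paper's exact fusion rule may differ in detail.

```python
import numpy as np

def sci(s):
    """Sparsity concentration of a nonnegative score vector (e.g. a softmax)."""
    c = len(s)
    return (c * s.max() / s.sum() - 1.0) / (c - 1.0)

def fuse_scores(clip_scores):
    """clip_scores: (num_clips, num_classes) array of per-clip class scores.
    Weight each clip's scores by its SCI so that sharply peaked clips
    contribute more to the final prediction."""
    clip_scores = np.asarray(clip_scores, dtype=float)
    weights = np.array([sci(s) for s in clip_scores])
    fused = (weights[:, None] * clip_scores).sum(axis=0)
    return int(fused.argmax()), fused
```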

Experimental Evaluation

The effectiveness of the proposed network is validated on benchmark datasets, UCF-101 and HMDB-51, where it shows superior performance over traditional CNN-based approaches. Notably, the FstCN architecture achieves comparable results to methods that leverage auxiliary training datasets, without relying on such additional data. Specifically, the network achieves an accuracy of 88.1% on the UCF-101 dataset and 59.1% on the HMDB-51 dataset, outperforming or matching existing state-of-the-art methods.

Implications and Future Directions

This work has significant implications for the field of action recognition:

  • Reduced Computational Burden: By reducing the complexity of 3D convolutional operations, the approach facilitates the deployment of sophisticated video understanding systems on resource-constrained devices.
  • Enhanced Robustness: The novel use of video clip sampling bolsters the network's resilience to temporal variability in video data, which is critical for applications in diverse real-world environments.

Looking ahead, the techniques introduced could be extended to explore more granular spatio-temporal patterns in videos, potentially integrating multi-modal data. Additionally, the fusion of factorized architectures with transformer models could catalyze further enhancements in capturing long-range dependencies in video sequences.

In summary, the paper provides a solid contribution to human action recognition, demonstrating a thoughtful balance between architectural innovation and practical applicability. The proposed techniques are promising pathways for developing more efficient and robust systems in the field of computer vision, particularly in scenarios involving dynamic human activities.