Insights into Temporal 3D ConvNets for Video Classification
This paper presents a novel approach to video classification, focusing on temporal cues for human action recognition. The proposed architecture, Temporal 3D ConvNets (T3D), advances traditional spatio-temporal convolutional networks by addressing their limited ability to capture long-range temporal dependencies. The T3D network introduces a Temporal Transition Layer (TTL) that applies temporal convolutions of variable kernel depth in parallel, allowing it to model temporal dynamics over varying durations rather than being constrained to a single fixed temporal depth.
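To make the TTL idea concrete, the following is a minimal PyTorch sketch of a TTL-style block: several 3D convolution branches with different temporal kernel depths run in parallel, and their outputs are concatenated along the channel axis. The specific depths (1, 3, 5), the even channel split, and the batch-norm/ReLU placement are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TemporalTransitionLayer(nn.Module):
    """TTL-style block (illustrative sketch): parallel 3D convolutions
    with variable temporal kernel depths, concatenated along channels.
    The depths (1, 3, 5) and the even channel split are assumptions."""

    def __init__(self, in_channels, out_channels, temporal_depths=(1, 3, 5)):
        super().__init__()
        branch_channels = out_channels // len(temporal_depths)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(in_channels, branch_channels,
                          kernel_size=(d, 3, 3),
                          padding=(d // 2, 1, 1),  # preserves (T, H, W) for odd d
                          bias=False),
                nn.BatchNorm3d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for d in temporal_depths
        ])

    def forward(self, x):
        # x: (batch, channels, time, height, width). Each branch covers a
        # different temporal extent; output shapes match, so concatenation
        # along the channel dimension is valid.
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```

As a quick shape check, `TemporalTransitionLayer(64, 96)(torch.randn(2, 64, 16, 56, 56))` returns a tensor of shape `(2, 96, 16, 56, 56)`: each of the three branches contributes 32 channels while the temporal and spatial extents are preserved.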
The experimental evaluation validates the effectiveness of T3D over existing methods, with notable improvements demonstrated on the HMDB51, UCF101, and Kinetics datasets. T3D outperforms key contemporary networks, including Inception3D and ResNet3D, reaching 90.3% accuracy on UCF101 and 59.2% on HMDB51. These results underline T3D's ability to capture both the appearance and the temporal features inherent in action recognition tasks.
Another pivotal contribution of the paper is the application of transfer learning to 3D ConvNets. By transferring knowledge from pre-trained 2D CNN architectures, such as those trained on ImageNet, to initialize 3D CNNs, the authors sidestep much of the computational cost of training 3D ConvNets from scratch. This transfer learning approach reduces computational demands and improves performance on video datasets of varying size and complexity, as seen in the supervised transfer results on UCF101, which reached 91.7% accuracy.
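As an illustration of the 2D-to-3D initialization idea, the sketch below uses weight inflation: a pre-trained 2D kernel is replicated along a new temporal axis and rescaled so that the resulting 3D kernel initially responds to a static clip the way the 2D kernel responds to a single frame. This is the inflation scheme popularized by I3D, shown here as a simple stand-in for the concept; the ResNet-50 backbone and the temporal depth of 3 are assumptions for demonstration, not the paper's exact procedure.

```python
import torch
import torchvision

def inflate_conv2d_to_conv3d(weight_2d: torch.Tensor, temporal_depth: int) -> torch.Tensor:
    # weight_2d: (out_ch, in_ch, h, w). Replicate along a new temporal axis
    # to get (out_ch, in_ch, t, h, w), then divide by t so the response to a
    # temporally constant input matches the original 2D response.
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, temporal_depth, 1, 1)
    return weight_3d / temporal_depth

# Illustrative usage: bootstrap a 3D stem from an ImageNet-pretrained ResNet-50.
resnet2d = torchvision.models.resnet50(weights="IMAGENET1K_V1")
w2d = resnet2d.conv1.weight.data           # shape (64, 3, 7, 7)
w3d = inflate_conv2d_to_conv3d(w2d, 3)     # shape (64, 3, 3, 7, 7)
```

Initializing a 3D network this way gives it sensible appearance features from the first training step, so optimization mostly has to learn temporal structure, which is one reason such transfer reduces training cost on video datasets.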
The research contributes to both practical implementations and theoretical understanding, emphasizing the role of temporal information in video understanding. Practically, the efficiency gains from the TTL and the transfer learning strategy hold potential for reducing training time and computational demand, both critical constraints when deploying deep learning models at scale. Theoretically, the model aligns with ongoing research into increasing the temporal sensitivity of deep networks in dynamic video contexts.
Future developments in this area may involve architectural enhancements that allow more granular temporal modeling, as well as extending the TTL concept to other domains where temporal dynamics are crucial. Integrating multimodal inputs and leveraging unsupervised learning methods for greater efficiency could also broaden the impact and applicability of the T3D framework. Finally, the release of the T3D codebase invites exploration and experimentation by the wider research community, potentially sparking new innovations in this rapidly evolving field.