Resource Efficient 3D Convolutional Neural Networks
The paper "Resource Efficient 3D Convolutional Neural Networks" investigates the transition from 2D to 3D convolutional neural network (CNN) architectures with a focus on resource efficiency, addressing a gap in existing research. Despite the increased attention towards 3D CNNs due to their superior spatio-temporal feature extraction capabilities, most benchmarks aim for higher accuracies without considering resource constraints. This paper aims to fill the void by evaluating the conversion of existing resource-efficient 2D models to 3D.
Introduction and Methodology
3D CNNs have become increasingly popular for video recognition, driven by large video datasets such as Sports-1M and Kinetics-400, but these architectures tend to be computationally intensive and parameter-heavy. Common strategies for resource efficiency in 2D CNNs, such as those used by SqueezeNet, MobileNet, and ShuffleNet, reduce model size and complexity, making them suitable for real-world applications that require fast processing on hardware with limited resources.
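To make the 2D-to-3D conversion concrete, the sketch below shows a MobileNet-style depthwise separable convolution extended to three dimensions, where the kernel spans time as well as space. It assumes PyTorch and is illustrative only, not the authors' implementation; the class name and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """MobileNet-style factorization in 3D: a per-channel (depthwise)
    spatio-temporal convolution followed by a 1x1x1 (pointwise) projection.
    This replaces one expensive full 3D convolution with two cheap ones."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv3d(
            in_channels, in_channels, kernel_size=3, stride=stride,
            padding=1, groups=in_channels, bias=False)  # one 3x3x3 filter per channel
        self.bn1 = nn.BatchNorm3d(in_channels)
        self.pointwise = nn.Conv3d(
            in_channels, out_channels, kernel_size=1, bias=False)  # channel mixing
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# A clip of 16 RGB frames at 112x112 resolution
clip = torch.randn(1, 3, 16, 112, 112)
block = DepthwiseSeparableConv3d(3, 64)
out = block(clip)  # -> torch.Size([1, 64, 16, 112, 112])
```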
The authors introduce 3D versions of these resource-efficient architectures to address the shortcomings of existing methods, which either rely heavily on optical flow computation or impose high computational demands. The converted models are evaluated on Kinetics-600 to gauge learning capacity, Jester to test motion pattern recognition, and UCF-101 to assess transfer learning. Each model's runtime is also measured on NVIDIA Titan XP and Jetson TX2 platforms and reported alongside its FLOPs and parameter count.
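For reference, the sketch below shows how parameter counts and multiply-accumulate (MAC) estimates, the basis of FLOP figures, are commonly derived for 3D convolutions. The helper names are hypothetical and the authors' exact accounting may differ.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def conv3d_macs(conv: nn.Conv3d, out_shape) -> int:
    """Multiply-accumulate operations for one Conv3d layer.
    out_shape is (frames, height, width) of the layer's output."""
    t, h, w = out_shape
    kt, kh, kw = conv.kernel_size
    # each output element needs k_t*k_h*k_w * (in_channels / groups) MACs
    macs_per_out = kt * kh * kw * conv.in_channels // conv.groups
    return conv.out_channels * t * h * w * macs_per_out

conv = nn.Conv3d(3, 64, kernel_size=3, padding=1)
print(count_parameters(conv))             # 3*64*27 + 64 = 5248
print(conv3d_macs(conv, (16, 112, 112)))  # per-clip MACs at 16x112x112
```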
Results
The converted 3D architectures vary in performance across benchmarks and complexity levels. Deeper architectures such as 3D-MobileNetV2 outperformed shallower models like 3D-SqueezeNet, especially on the Jester motion-recognition benchmark, indicating the advantage of depthwise convolutions for dynamic tasks. Relative rankings on Kinetics-600 largely carried over to UCF-101, suggesting that the learned features transfer effectively.
Interestingly, runtime did not always correlate with FLOPs or parameter count, underscoring the importance of factors such as memory access cost and degree of parallelism that FLOPs alone do not capture. 3D-SqueezeNet, for example, showed high runtime efficiency, attributable to its use of standard convolutions, which are well optimized in cuDNN.
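Because measured latency can diverge from FLOP counts in this way, runtime is usually established empirically rather than derived. The sketch below shows a common way to time a model's forward pass on a CUDA device, assuming PyTorch; the function is illustrative, not the paper's benchmarking harness. The explicit synchronization matters: GPU execution is asynchronous, so timings taken without it are meaningless.

```python
import time
import torch

def measure_latency(model, input_shape, device="cuda", warmup=10, iters=100):
    """Average wall-clock forward-pass latency in seconds."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):           # warm up kernels / cuDNN autotuning
            model(x)
        torch.cuda.synchronize()          # drain pending GPU work before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()          # wait for all timed work to finish
    return (time.perf_counter() - start) / iters
```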
Implications and Future Directions
These findings highlight that resource-efficient 3D CNNs should retain sufficient depth and complexity to avoid significant accuracy drops, and that architectural choices should be informed by the requirements of the target task. Depthwise convolution layers show particular promise for motion-centric tasks, underscoring the importance of selecting architectural components methodically.
Future research could focus on more comprehensive optimization frameworks that improve runtime across platforms while minimizing FLOPs and power usage. As hardware processing capabilities evolve, the potential for deploying these models in real-world settings, from mobile applications to embedded systems, will only increase. Optimizing neural network architectures for specific hardware and task combinations will remain a critical area of development.
Overall, this paper provides a valuable framework for advancing research on 3D CNNs, advocating a balanced trade-off between accuracy and resource efficiency. By publicly releasing code and pretrained models, the authors also encourage continued research and collaboration in this domain.