Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification (1712.04851v2)

Published 13 Dec 2017 in cs.CV

Abstract: Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist: spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).

Speed-Accuracy Trade-offs in Video Classification: A Reappraisal of Spatiotemporal Feature Learning

The paper "Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification" addresses the computational complexities and efficacy of spatiotemporal feature learning in video classification tasks. Despite advancements driven by Convolutional Neural Networks (CNNs) in image classification, video classification presents unique challenges that hinder comparable progress. This paper primarily targets the balance between computational efficiency and classification accuracy through a refined network design.

Challenges in Spatiotemporal Feature Learning

The paper identifies three core challenges in video classification:

  1. Spatial feature representation.
  2. Temporal information representation.
  3. Model and computational complexity.

The widely referenced I3D model introduced by Carreira and Zisserman, which inflates 2D convolutional layers into 3D, demonstrates potential in both spatial and temporal domains but remains computationally expensive and prone to overfitting.
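The inflation trick at the heart of I3D is simple to state: each pretrained 2D kernel is copied along a new temporal axis and rescaled, so that the 3D network initially reproduces the 2D network's outputs on a video of identical frames. Below is a minimal PyTorch-style sketch of this idea; the function name and the 1/T rescaling convention are illustrative assumptions, not the authors' released code.

```python
import torch

def inflate_2d_to_3d(weight_2d: torch.Tensor, t: int) -> torch.Tensor:
    """Inflate a 2D conv weight (out_ch, in_ch, kH, kW) into a 3D conv
    weight (out_ch, in_ch, t, kH, kW) by repeating it t times in time."""
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, t, 1, 1)
    # Divide by t so that, on a clip of t identical frames, the 3D conv
    # produces the same response as the original 2D conv on one frame.
    return weight_3d / t

# Example: inflate an ImageNet-pretrained 3x3 kernel into a 3x3x3 kernel.
w2d = torch.randn(64, 3, 3, 3)     # stand-in for a pretrained 2D weight
w3d = inflate_2d_to_3d(w2d, t=3)   # shape: (64, 3, 3, 3, 3)
```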

Experimental Setup and Methodology

The authors systematically explore various architectural modifications of the I3D model, leading to significant empirical insights. Their approach incorporates:

  • Bottom-Heavy and Top-Heavy Models: By altering the placement of 2D and 3D convolutions, the paper introduces Bottom-Heavy I3D (3D convolutions in the lower layers) and Top-Heavy I3D (3D convolutions in the higher layers) models. The Top-Heavy variants achieved better speed and accuracy, running counter to the expected importance of low-level motion cues.
  • Separable 3D Convolutions (S3D): Rather than keeping full 3D kernels, the authors factorize each 3D convolution into a spatial convolution followed by a temporal convolution, significantly reducing computational load without sacrificing accuracy (see the sketch after this list). This design leads to the S3D model, which demonstrated better performance and efficiency than the original I3D model.
  • Feature Gating (S3D-G): Incorporating a spatio-temporal feature gating mechanism further enhances performance by capturing interdependencies among feature channels (a sketch of the gating operation appears below).
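To make the factorization concrete, the sketch below shows a temporally separable 3D convolution in the spirit of S3D, written in PyTorch. The block replaces one full k×k×k convolution with a 1×k×k spatial convolution followed by a k_t×1×1 temporal convolution; the exact placement of normalization and activation here is an assumption, not the paper's precise block definition.

```python
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """S3D-style separable block (sketch): spatial conv, then temporal conv,
    standing in for a single full 3D convolution."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, k_t: int = 3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k_t, 1, 1),
                                  padding=(k_t // 2, 0, 0))
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (N, C, T, H, W)
        return self.relu(self.bn(self.temporal(self.spatial(x))))
```

A rough cost estimate shows where the savings come from: with C input and C output channels, a full 3×3×3 convolution costs about 27·C² multiply-adds per output position, while the factorized pair costs about (9 + 3)·C² = 12·C², a little over 2× fewer.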

Combining these strategies yields the S3D-G model, an architecture that optimizes the trade-off between accuracy and computational efficiency.
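The gating component is equally compact. The paper's self-gating replaces a feature map X with σ(W·pool(X) + b) ⊙ X; the PyTorch sketch below assumes global average pooling over time and space and a single linear layer, with initialization details left out.

```python
import torch
import torch.nn as nn

class FeatureGating(nn.Module):
    """Self-gating sketch: per-channel gates computed from globally pooled
    features are multiplied back onto the feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (N, C, T, H, W)
        context = x.mean(dim=(2, 3, 4))           # pool over T, H, W -> (N, C)
        gate = torch.sigmoid(self.fc(context))    # per-channel weights in (0, 1)
        return x * gate.view(x.size(0), -1, 1, 1, 1)
```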

Results

The modified architectures were evaluated on standard benchmarks including Kinetics, Something-something, UCF101, and HMDB51. Key findings include:

  • Performance Gains: The S3D and S3D-G models achieved higher accuracy at significantly reduced computational cost compared to I3D. For example, on the Kinetics-Full validation set, S3D-G outperformed I3D with a top-1 accuracy of 74.7% vs. 71.1%, while using fewer parameters (11.56M vs. 12.06M) and fewer FLOPs (71.38G vs. 107.89G).
  • Transfer Learning: S3D-G demonstrated strong performance when transferred to other datasets, such as UCF101 and HMDB51, achieving competitive results even against models pretrained on much larger datasets.
  • Action Detection: In action detection tasks on the JHMDB and UCF101-24 datasets, S3D-G combined with Faster R-CNN outperformed existing state-of-the-art methods, showcasing its applicability beyond classification.

Implications and Future Directions

The findings underscore the effectiveness of top-heavy and separable convolution designs in balancing computational efficiency and classification accuracy. The feature gating mechanism adds further accuracy at modest cost, suggesting that these architectural choices scale and generalize well.

Future research may investigate optimizing these models for real-time applications, adapting them to other modalities such as depth maps or LiDAR data, or extending their applicability to dense prediction tasks like semantic segmentation in videos. Additionally, exploration into hardware-accelerated implementations could further reduce computational demands, facilitating deployment in resource-constrained environments.

In conclusion, the paper presents a methodologically rigorous re-examination of spatiotemporal feature learning, contributing significant advancements in efficient video classification models. It addresses critical aspects of speed and accuracy trade-offs, presenting robust architectures adaptable to future AI developments in video analysis.

Authors (5)
  1. Saining Xie (60 papers)
  2. Chen Sun (187 papers)
  3. Jonathan Huang (46 papers)
  4. Zhuowen Tu (80 papers)
  5. Kevin Murphy (87 papers)
Citations (1,268)