Deep Networks for Video Classification: Beyond Short Snippets
The paper "Beyond Short Snippets: Deep Networks for Video Classification" presents an in-depth exploration of deep neural network architectures designed to extend the capabilities of video classification beyond short video snippets. The authors propose two principal methods to effectively utilize temporal information over longer video sequences: Feature Pooling and Long Short-Term Memory (LSTM) networks.
Overview of Methods
- Feature Pooling Networks: These architectures aggregate frame-level CNN features across time. The authors evaluate several pooling variants, including Conv Pooling, Late Pooling, Slow Pooling, Local Pooling, and Time-Domain Convolution. Conv Pooling, which max-pools frame features over time immediately after the last convolutional layer, performed best.
- Recurrent Neural Networks: Focusing on LSTMs, these networks process the sequence of CNN activations from video frames. Their memory cells let them learn long-range temporal relationships, providing a robust way to integrate frame-level information over extended time spans (see the sketch after this list).
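To make the two aggregation strategies concrete, here is a minimal PyTorch sketch, not the paper's implementation: the paper uses GoogLeNet/AlexNet frame towers, pools after the last convolutional layer, and stacks five 512-cell LSTM layers with per-step predictions, whereas this sketch aggregates already-extracted frame features with a single LSTM layer and a last-step prediction. The 487-class output (Sports-1M) is from the paper's setting; the 1024-d feature size and classifier head are placeholders.

```python
# Minimal sketch of the two aggregation heads over per-frame CNN features.
# Assumed shapes and layer sizes are illustrative, not the paper's exact networks.
import torch
import torch.nn as nn

class ConvPoolingHead(nn.Module):
    """Max-pools per-frame features over the time axis, then classifies."""
    def __init__(self, feat_dim=1024, num_classes=487):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):            # (batch, time, feat_dim)
        pooled, _ = frame_feats.max(dim=1)     # element-wise max over time
        return self.fc(pooled)

class LSTMHead(nn.Module):
    """Runs an LSTM over the frame-feature sequence and classifies its last state."""
    def __init__(self, feat_dim=1024, hidden=512, num_classes=487):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):            # (batch, time, feat_dim)
        outputs, _ = self.lstm(frame_feats)
        return self.fc(outputs[:, -1])         # predict from the final time step

# Dummy per-frame features for two 120-frame clips:
feats = torch.randn(2, 120, 1024)
logits_pool = ConvPoolingHead()(feats)         # (2, 487)
logits_lstm = LSTMHead()(feats)                # (2, 487)
```

Note the design contrast this exposes: max-pooling over time is order-invariant and simply keeps the strongest response per feature, while the LSTM is order-sensitive and can model how frame evidence evolves across the clip.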
Key Results
The paper demonstrates significant performance improvements from these methods on two benchmark datasets: Sports-1M and UCF-101.
- Sports-1M: The authors report state-of-the-art results, raising top-1 video-level accuracy from 60.9% (prior work) to 73.1% with LSTMs and 72.4% with Conv-Pooling architectures that process up to 120 frames. These results underscore the advantage of leveraging long-term temporal dependencies in video data.
- UCF-101: On this dataset, the authors reach 88.6% accuracy by combining Conv-Pooling and LSTM architectures with optical flow, surpassing previous methods, including those built on handcrafted features.
Implementation Details
The increase in performance is attributed to several key innovations and careful design choices:
- Parameter Sharing: Both the feature-pooling and LSTM architectures share CNN parameters across frames, so the parameter count does not grow with the number of frames processed (sketched after this list).
- Frame-Rate Considerations: Evaluating different frame rates (1 fps, 6 fps, 30 fps) showed that lower frame rates can give slightly better results because, for a fixed number of frames, they cover a longer span of temporal context.
- Optical Flow Integration: Optical flow substantially helps on the relatively clean, trimmed UCF-101, but its benefit on the unconstrained Sports-1M is muted by that dataset's noisy, varied footage; LSTM networks integrate the optical-flow signal more effectively in that setting.
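The parameter-sharing point can be illustrated by applying a single set of CNN weights to every frame, folding the time axis into the batch axis. In this sketch the torchvision ResNet-18 backbone is only a stand-in for the paper's GoogLeNet/AlexNet towers, and the 30-frame, 1 fps clip shape is illustrative.

```python
# Sketch of CNN parameter sharing across frames: one backbone, applied to all frames.
# The backbone choice and tensor shapes are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision import models

class SharedFrameEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()           # keep 512-d features, drop the classifier
        self.backbone = backbone

    def forward(self, clips):                 # (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)          # (batch*time, 3, H, W): same weights for every frame
        feats = self.backbone(frames)         # (batch*time, 512)
        return feats.view(b, t, -1)           # (batch, time, 512), ready for pooling or an LSTM

clips = torch.randn(2, 30, 3, 224, 224)       # e.g. 30 frames sampled at 1 fps
frame_feats = SharedFrameEncoder()(clips)     # (2, 30, 512)
```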
Implications and Future Directions
This work has several practical and theoretical implications:
- Scalability in Video Classification: By demonstrating the effectiveness of processing longer video sequences, this work pushes the boundary of video classification capabilities. The strategies proposed can be adapted to various real-world applications, from surveillance to autonomous driving.
- Integration of Temporal Dynamics: The success of LSTM architectures highlights the importance of modeling temporal evolution in video data. This sets a precedent for further exploring recurrent architectures and their variants in video and other time-series classification tasks.
- Dataset Characteristics: The differing performance with optical flow on Sports-1M versus UCF-101 suggests that video quality and dataset characteristics significantly influence the choice of models and preprocessing steps. This insight can guide future dataset-specific optimizations in video analysis.
The paper paves the way for future developments such as integrating temporal sequence information deeper into CNN layers or using hybrid models that combine recurrent and convolutional operations seamlessly. These advancements could lead to even more accurate and computationally efficient solutions for video classification tasks.