- The paper introduces a novel pipeline that adapts image-trained CNNs to video classification through optimal CNN layer selection and pooling techniques.
- It demonstrates that hidden layers outperform output layers, with max pooling significantly enhancing performance by filtering irrelevant frames.
- The integration of motion features via late fusion with CNN outputs achieves state-of-the-art results on TRECVID MED'14 and UCF-101 datasets.
An Exploration of Image-Trained CNN Architectures for Video Classification
The paper "Exploiting Image-trained CNN Architectures for Unconstrained Video Classification" presents a comprehensive examination of leveraging convolutional neural network (CNN) architectures, originally trained for image classification on the ImageNet dataset, to address the challenges of video classification. The discourse is rich with methodological insights and experimental evaluations that demonstrate the viability of this approach for event detection in videos.
Core Contributions and Methodology
The primary contribution of this paper is a detailed pipeline for adapting image-trained CNNs to video classification tasks. The authors propose several strategies to improve the performance of such networks:
- Selecting Optimal CNN Layers: The paper compares the output layer and hidden layers as feature extractors, finding that features from hidden layers (the fully connected layers preceding the output) transfer better to video classification than output-layer predictions (illustrated, together with pooling, in the first sketch after this list).
- Pooling Techniques: Pooling strategies for aggregating spatial and temporal information are investigated. Experiments with average and max pooling yield notable performance gains, with max pooling consistently outperforming average pooling because it can disregard video frames irrelevant to the event.
- Normalization and Classifiers: Different feature normalization techniques ($\ell_1$, $\ell_2$, and root normalization) are paired with a range of classifiers. The results show that the best normalization depends on the feature type, and the authors find that a Gaussian RBF-kernel SVM often surpasses linear models (see the second sketch below).
- Feature Fusion: The paper makes a compelling case for integrating motion information through late fusion of CNN-based scores with motion features such as Improved Dense Trajectories (IDT) encoded as Fisher vectors (see the fusion sketch below). This fusion significantly boosts classification performance, achieving state-of-the-art results on the challenging TRECVID MED'14 and UCF-101 datasets.
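As a concrete illustration of the first two steps, the sketch below extracts hidden-layer features per frame with an ImageNet-pretrained AlexNet from torchvision and pools them across frames. The network choice, the fc7 cut-off, and the preprocessing constants are assumptions standing in for the authors' exact setup, not a reproduction of it:

```python
# Minimal sketch: per-frame hidden-layer features from an image-trained CNN,
# pooled over frames into a single video-level descriptor. AlexNet and the
# fc7 cut-off are illustrative stand-ins for the paper's exact network.
import torch
import torchvision.models as models
import torchvision.transforms as T

model = models.alexnet(weights="IMAGENET1K_V1").eval()

# Stop the forward pass at a hidden fully connected layer (fc7) rather than
# the 1000-way ImageNet output layer.
feature_extractor = torch.nn.Sequential(
    model.features,
    model.avgpool,
    torch.nn.Flatten(),
    *list(model.classifier.children())[:6],  # through fc7 and its ReLU
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def video_descriptor(frames, mode="max"):
    """frames: list of PIL images sampled from one video -> (4096,) tensor."""
    feats = feature_extractor(torch.stack([preprocess(f) for f in frames]))
    # Max pooling keeps the strongest response per dimension, so frames
    # irrelevant to the event contribute little; average pooling dilutes
    # the descriptor with them.
    return feats.max(dim=0).values if mode == "max" else feats.mean(dim=0)
```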
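For the normalization-and-classifier step, a minimal scikit-learn sketch follows. The "root" recipe here (signed square root followed by $\ell_2$) is one common variant and an assumption on our part, as are the SVM hyperparameters:

```python
# Minimal sketch: feature normalization variants paired with an RBF-kernel SVM.
# scikit-learn stands in for the authors' toolchain; C and gamma are not tuned.
import numpy as np
from sklearn.svm import SVC

def normalize(x, scheme="l2"):
    """Row-normalize descriptors x of shape (n_videos, dim)."""
    if scheme == "l1":
        return x / (np.abs(x).sum(axis=1, keepdims=True) + 1e-12)
    if scheme == "l2":
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    if scheme == "root":
        # Signed square root then l2 -- one common "root normalization" recipe.
        return normalize(np.sign(x) * np.sqrt(np.abs(x)), "l2")
    raise ValueError(f"unknown scheme: {scheme}")

clf = SVC(kernel="rbf", C=10.0, gamma="scale")
# clf.fit(normalize(train_feats, "root"), train_labels)
# preds = clf.predict(normalize(test_feats, "root"))
```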
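Late fusion itself reduces to combining the score matrices of independently trained classifiers, for example by weighted averaging. The equal weights below are hypothetical and need not match the paper's fusion scheme:

```python
# Minimal sketch: late fusion of independently trained classifiers' scores
# (e.g., CNN features and IDT Fisher vectors). Weights are hypothetical.
import numpy as np

def late_fusion(score_list, weights=None):
    """score_list: list of (n_videos, n_classes) score matrices."""
    scores = np.stack(score_list)                        # (n_models, n, k)
    w = np.ones(len(score_list)) if weights is None else np.asarray(weights, float)
    return np.tensordot(w / w.sum(), scores, axes=1)     # weighted average

# fused = late_fusion([cnn_scores, idt_fv_scores], weights=[0.5, 0.5])
```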
Experimental Evaluation
The approach is evaluated on two prominent datasets: TRECVID MED'14 and UCF-101. On TRECVID MED'14, the method outperforms several state-of-the-art non-CNN models, with hidden-layer CNN features providing both higher classification accuracy and faster feature extraction. On UCF-101, a human action recognition benchmark, the pipeline yields significant performance gains, both with CNN features alone and when fused with motion-based features.
Theoretical Implications and Future Directions
The research highlights the versatility of image-trained CNN architectures beyond static image analysis, particularly when their layers are repurposed and augmented with additional modalities such as motion. This suggests avenues for integrating spatiotemporal dynamics more directly into CNNs. The paper also shows that robust performance is achievable despite the domain mismatch between the training data (ImageNet) and the application domain (video), hinting at further gains from fine-tuning and domain adaptation.
Looking ahead, there are considerable opportunities to evolve these architectures. The authors advocate fine-tuning CNN models on video-specific datasets to better align them with the characteristics of moving imagery. Newer architectures incorporating 3D convolutions or long short-term memory (LSTM) units could also capture temporal dependencies more directly, further improving video classification.
In conclusion, the paper provides an effective blueprint for transferring insights from image-based learning models to the more complex domain of video. It underscores the practicality of leveraging existing image-trained CNN models supplemented with novel methods and fusion strategies, propelling the field toward deeper integration of diverse data modalities for video analytics.