Discriminative CNN Video Representation for Event Detection: An In-Depth Analysis
The paper "A Discriminative CNN Video Representation for Event Detection" by Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann presents a method to enhance video event detection using Convolutional Neural Networks (CNNs). This work focuses on efficient video representation in large datasets with constrained hardware resources, leveraging CNNs to surpass previous methodologies such as improved Dense Trajectories in performance.
Core Contributions
The paper proposes two significant advancements in CNN-based video representation:
- Advanced Encoding Methods: The authors challenge conventional aggregation methods such as average and max pooling, demonstrating that more sophisticated encoding significantly improves performance. This finding suggests that encoding techniques traditionally applied to local descriptors, such as Fisher vectors and VLAD, transfer directly to CNN features at the video level (a sketch of VLAD encoding appears in the Methodology section below).
- Latent Concept Descriptors: Rather than treating a deep layer's output as a single frame-level feature, the method treats the response at each spatial location of a deep convolutional layer as a separate descriptor, with each filter interpreted as a latent visual concept. This enriches the description of a frame's visual content without excessive computational cost (a minimal sketch of the reshaping step follows this list).
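To make the latent-concept idea concrete, here is a minimal numpy sketch of the core reshaping step, assuming pool5-style activations of shape (num_filters, H, W); the function name and dimensions are illustrative, not taken from the paper's implementation:

```python
import numpy as np

# Illustrative sketch: a deep conv/pooling layer emits a (num_filters, H, W)
# response map; each of the H*W spatial locations becomes one descriptor of
# dimension num_filters, so one frame yields H*W latent concept descriptors.
def frame_to_latent_concept_descriptors(conv_map: np.ndarray) -> np.ndarray:
    """conv_map: (num_filters, H, W) activations from a deep layer."""
    num_filters, h, w = conv_map.shape
    # one row per spatial location, one column per filter (latent concept)
    return conv_map.reshape(num_filters, h * w).T

# Example with pool5-like dimensions from an AlexNet-style network (256 x 6 x 6):
pool5 = np.random.rand(256, 6, 6).astype(np.float32)
descriptors = frame_to_latent_concept_descriptors(pool5)
print(descriptors.shape)  # (36, 256)
```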
Together, these contributions achieve state-of-the-art performance, improving Mean Average Precision (mAP) from 27.6% to 36.8% on the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% on MEDTest 13.
Methodology and Results
The proposed pipeline first extracts frame-level CNN descriptors using the Caffe toolkit, then aggregates them into video-level representations with encoding techniques such as Fisher vectors and VLAD. Among these, VLAD encoding of the CNN descriptors proved the most discriminative, outperforming both average pooling and Fisher vectors.
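As an illustration of the encoding step, the sketch below computes a VLAD representation over a video's pooled frame-level descriptors, using scikit-learn's KMeans as the codebook; the codebook size and the normalization (signed square-root followed by L2) are common defaults here and may differ from the paper's exact configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors: np.ndarray, kmeans: KMeans) -> np.ndarray:
    """VLAD-encode (n, d) descriptors against a fitted k-center codebook."""
    centers = kmeans.cluster_centers_          # (k, d)
    assignments = kmeans.predict(descriptors)  # nearest center per descriptor
    k, d = centers.shape
    vlad = np.zeros((k, d), dtype=np.float64)
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members):
            # accumulate residuals between descriptors and their center
            vlad[i] = (members - centers[i]).sum(axis=0)
    vlad = vlad.ravel()
    # signed square-root ("power") normalization, then L2
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Usage: fit a small codebook on training descriptors, then encode one video.
train = np.random.rand(1000, 256)
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train)
video_repr = vlad_encode(np.random.rand(120, 256), kmeans)
print(video_repr.shape)  # (32 * 256,) = (8192,)
```

The intuition is that VLAD keeps first-order residual statistics per codeword rather than a single pooled vector, preserving far more of the descriptor distribution than averaging does.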
The work also applies Spatial Pyramid Pooling (SPP) to the filter response maps of the convolutional layers, pooling them over spatial bins at multiple scales so that coarse location information is retained in the latent concept descriptors. This approach not only yields higher mAP but also remains computationally efficient, allowing even smaller research groups to process large-scale datasets.
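The following is an illustrative sketch of spatial pyramid pooling over a single frame's convolutional response map; the pyramid levels and even-split bin boundaries are assumptions for demonstration, not necessarily the paper's exact pooling windows:

```python
import numpy as np

def spatial_pyramid_pool(conv_map: np.ndarray, levels=(1, 2, 3)) -> np.ndarray:
    """conv_map: (num_filters, H, W); assumes H, W >= max(levels).
    Returns one max-pooled descriptor per spatial bin across all levels."""
    num_filters, h, w = conv_map.shape
    pooled = []
    for n in levels:                      # n x n grid at this pyramid level
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = conv_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))  # max over the bin
    return np.stack(pooled)               # (num_bins, num_filters)

# Example on a conv5-like response map (256 filters over a 13 x 13 grid):
conv5 = np.random.rand(256, 13, 13).astype(np.float32)
bins = spatial_pyramid_pool(conv5)
print(bins.shape)  # (1 + 4 + 9, 256) = (14, 256)
```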
Practical Implications
The integration of CNNs into event detection frameworks offers a pathway to significantly better detection performance with reduced reliance on heavy computational resources, enabling researchers with limited hardware to reach state-of-the-art performance on single-machine setups.
Moreover, by incorporating Product Quantization (PQ) to compress the video representations, the paper shows how the event search process can be accelerated, making large-scale evaluations fast while keeping storage requirements minimal.
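As a rough illustration of the compression idea, the toy sketch below trains a product quantizer that splits each representation into sub-vectors and stores one codeword index per sub-space; all sizes (dimensionality, number of sub-spaces, codebook size) are made up for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(data: np.ndarray, m: int, k: int = 16):
    """data: (n, d) with d divisible by m; fits one KMeans per sub-space."""
    sub_dim = data.shape[1] // m
    return [KMeans(n_clusters=k, n_init=4, random_state=0)
            .fit(data[:, i * sub_dim:(i + 1) * sub_dim]) for i in range(m)]

def pq_encode(vectors: np.ndarray, codebooks) -> np.ndarray:
    """Compress (n, d) vectors to (n, m) codeword indices, one per sub-space."""
    m = len(codebooks)
    sub_dim = vectors.shape[1] // m
    codes = [cb.predict(vectors[:, i * sub_dim:(i + 1) * sub_dim])
             for i, cb in enumerate(codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

reprs = np.random.rand(500, 128)          # 500 videos, 128-dim representations
codebooks = pq_train(reprs, m=8)          # 8 sub-spaces, 16 centers each (toy)
codes = pq_encode(reprs, codebooks)
print(codes.shape, codes.dtype)           # (500, 8) uint8 -> 8 bytes per video
```

With m sub-spaces and codebooks of up to 256 entries, each vector shrinks to m bytes, and distances during search can be approximated from small precomputed codeword tables instead of the full vectors.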
Future Directions
Looking forward, advances in CNN architectures and training techniques such as domain-specific fine-tuning could further boost performance. There is also potential in hybrid models that combine diverse CNN architectures, and in extending the latent concept idea to other multimedia analysis domains.
In summary, this paper lays the groundwork for efficient video event detection with CNNs, addressing both theoretical underpinnings and practical deployment, and offering a robust framework for further innovation in artificial intelligence and multimedia analysis.