
Compressed Video Action Recognition (1712.00636v2)

Published 2 Dec 2017 in cs.CV

Abstract: Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by the fact that this superfluous information can be reduced by up to two orders of magnitude by video compression (using H.264, HEVC, etc.), we propose to train a deep network directly on the compressed video. This representation has a higher information density, and we found the training to be easier. In addition, the signals in a compressed video provide free, albeit noisy, motion information. We propose novel techniques to use them effectively. Our approach is about 4.6 times faster than Res3D and 2.7 times faster than ResNet-152. On the task of action recognition, our approach outperforms all the other methods on the UCF-101, HMDB-51, and Charades datasets.

Compressed Video Action Recognition

The paper "Compressed Video Action Recognition" presents a novel approach to efficiently recognizing actions in video content by leveraging the inherent structure of compressed video formats. The methodology focuses on training deep neural networks specifically on compressed video data like motion vectors and residuals, rather than the raw RGB frames that are conventionally used in video analysis. This work aims to address two core challenges in video action recognition: the overwhelming data size of raw videos and the high temporal redundancy that can obfuscate meaningful signals in the video content.

Core Approach and Methodology

The authors propose a method that operates directly on the compressed video stream, specifically the motion vectors and residuals produced by codecs such as H.264. These signals have a higher information density than raw frames and carry motion information that static RGB frames lack. The approach thereby exploits the temporal structure already captured by video compression, which encodes only the changes between consecutive frames.
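
To make these compressed-domain signals concrete, the sketch below reconstructs a P-frame from its reference frame using per-pixel motion vectors and a residual, the basic motion-compensation step that H.264-style codecs perform. This is a minimal NumPy sketch under simplifying assumptions (per-pixel integer motion vectors; real codecs work on macroblocks with sub-pixel displacements), not the authors' code.

```python
import numpy as np

def reconstruct_p_frame(reference, motion_vectors, residual):
    """Minimal motion compensation: each pixel of the P-frame copies the
    reference pixel displaced by its motion vector, then adds the residual.

    reference:      (H, W, 3) previous decoded frame
    motion_vectors: (H, W, 2) per-pixel integer (dy, dx) displacements
    residual:       (H, W, 3) correction signal stored in the bitstream
    """
    H, W, _ = reference.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Look up the referenced location in the previous frame (clipped to bounds).
    ref_y = np.clip(ys - motion_vectors[..., 0], 0, H - 1)
    ref_x = np.clip(xs - motion_vectors[..., 1], 0, W - 1)
    predicted = reference[ref_y, ref_x]
    return predicted + residual
```

The decoder therefore needs only the sparse I-frames plus these two lightweight signals, which is exactly the information the proposed networks consume.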

A significant technical contribution of the paper is the accumulation of motion vectors and residuals, which removes the chain of dependencies between P-frames (predictive frames) inherent in compressed video formats. By back-tracing motion vectors through the frame sequence, the authors express each P-frame directly relative to its preceding I-frame (intra-coded frame), which simplifies the computation and makes the extracted features more robust.
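
The back-tracing idea can be written as a simple recursion: at each P-frame, chain the previously accumulated displacement through the location the current motion vector points to, so every P-frame ends up expressed relative to the I-frame. The sketch below follows that recurrence, again assuming per-pixel integer motion vectors for simplicity.

```python
import numpy as np

def accumulate_to_iframe(mv_list, res_list):
    """Accumulate per-frame motion vectors and residuals so that every
    P-frame refers directly to the preceding I-frame instead of its
    immediate predecessor.

    mv_list:  list of (H, W, 2) integer motion vectors, one per P-frame
    res_list: list of (H, W, 3) residuals, one per P-frame
    Returns the lists of accumulated motion vectors and residuals.
    """
    H, W, _ = mv_list[0].shape
    ys, xs = np.mgrid[0:H, 0:W]
    D = np.zeros((H, W, 2), dtype=np.int64)    # accumulated displacement
    R = np.zeros((H, W, 3), dtype=np.float64)  # accumulated residual
    acc_mvs, acc_res = [], []
    for mv, res in zip(mv_list, res_list):
        # Location each pixel references in the previous frame.
        ref_y = np.clip(ys - mv[..., 0], 0, H - 1)
        ref_x = np.clip(xs - mv[..., 1], 0, W - 1)
        # Chain the prior accumulated signals through that location.
        D = D[ref_y, ref_x] + mv
        R = R[ref_y, ref_x] + res
        acc_mvs.append(D.copy())
        acc_res.append(R.copy())
    return acc_mvs, acc_res
```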

The proposed methodology deploys multiple convolutional neural networks (CNNs), one per signal: motion vectors, residuals, and the occasional I-frame. The model thus concentrates on the changes between consecutive frames rather than redundantly processing static content, and the accumulated signals help suppress noise and nuisance patterns arising from camera movement or slight lighting variations.
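
This three-stream design can be sketched in PyTorch as below. The specific backbones, the two-channel first convolution for motion vectors, and the plain score averaging are assumptions for illustration; the paper's actual architectures, preprocessing, and fusion scheme differ in detail.

```python
import torch.nn as nn
from torchvision import models

class CompressedVideoNet(nn.Module):
    """Three streams: a heavy CNN for sparse I-frames and lightweight CNNs
    for accumulated motion vectors (2 channels) and residuals (3 channels).
    A minimal sketch; backbones and fusion weights are assumptions."""

    def __init__(self, num_classes):
        super().__init__()
        self.iframe_net = models.resnet152(num_classes=num_classes)
        self.residual_net = models.resnet18(num_classes=num_classes)
        self.mv_net = models.resnet18(num_classes=num_classes)
        # Motion vectors have 2 channels, so replace the first conv layer.
        self.mv_net.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2,
                                      padding=3, bias=False)

    def forward(self, iframe, motion_vectors, residual):
        # Late fusion: average the per-stream class scores.
        return (self.iframe_net(iframe)
                + self.mv_net(motion_vectors)
                + self.residual_net(residual)) / 3.0
```

Since I-frames occur only sparsely, the expensive stream runs rarely while the two lightweight streams handle the remaining frames, which is consistent with the efficiency argument below.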

Numerical Results and Performance

In terms of computational efficiency, the proposed approach significantly reduces processing time for video action recognition. The paper reports that the method is approximately 4.6 times faster than the Res3D model and 2.7 times faster than ResNet-152. These gains come from the lightweight networks used for motion vectors and residuals, which avoid much of the cost of running a heavy network over every decoded frame.
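
As a rough illustration of why lighter backbones translate into such gains, the hypothetical timing harness below compares per-frame inference speed of a heavy and a light backbone. It is not the paper's measurement protocol, which also accounts for decoding and preprocessing costs.

```python
import time
import torch
from torchvision import models

def frames_per_second(net, channels=3, n=20, size=224):
    """Crude throughput estimate for a single-image forward pass."""
    net.eval()
    x = torch.randn(1, channels, size, size)
    with torch.no_grad():
        net(x)  # warm-up
        start = time.perf_counter()
        for _ in range(n):
            net(x)
    return n / (time.perf_counter() - start)

print("ResNet-152 (I-frame stream):  ", frames_per_second(models.resnet152()))
print("ResNet-18  (motion/residual): ", frames_per_second(models.resnet18()))
```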

From an accuracy standpoint, the method outperforms traditional RGB-based methods and several state-of-the-art systems, including 3D CNN architectures (e.g., Res3D, C3D), on the UCF-101, HMDB-51, and Charades benchmarks. The empirical evaluation supports the claim that compressed video is a potent input for action recognition models, and accuracy improves further when the compressed-domain streams are combined with optical flow.

Implications and Future Directions

Practically, this methodology could have substantial implications in areas where video data is inherently stored or transmitted in a compressed format, such as online video platforms, surveillance systems, and real-time video analytics. This approach aligns with real-world needs to process videos in their native compressed form, thereby eschewing the costly and sometimes infeasible requirement to decode them into raw RGB formats for analysis.

Theoretically, this framework suggests a reevaluation of how video signals are conventionally treated within the computer vision domain, proposing a paradigm shift from static frame-based analysis to signal-specific learning from compressed content. It opens avenues for further research in handling other compressed data formats and extending similar methodologies to enhance video content understanding tasks such as detection, segmentation, and anomaly recognition.

Moving forward, advancements could incorporate additional elements of video compression, such as B-frames, which were not the focus of this paper but might provide complementary information or further efficiency gains. Exploring the approach under other learning frameworks, including unsupervised or zero-shot learning on video, could also be valuable.

In conclusion, the work on compressed video action recognition represents a significant step toward efficient video analysis, showing promising results that surpass previous benchmarks while reducing the computational overhead of processing large-scale video data.

Authors (6)
  1. Chao-Yuan Wu (19 papers)
  2. Manzil Zaheer (89 papers)
  3. Hexiang Hu (48 papers)
  4. R. Manmatha (31 papers)
  5. Alexander J. Smola (33 papers)
  6. Philipp Krähenbühl (55 papers)
Citations (311)