Real-time Action Recognition with Enhanced Motion Vector CNNs (1604.07669v1)

Published 26 Apr 2016 in cs.CV

Abstract: The deep two-stream architecture has exhibited excellent performance on video-based action recognition. The most computationally expensive step in this approach is the calculation of optical flow, which prevents it from running in real time. This paper accelerates the architecture by replacing optical flow with motion vectors, which can be obtained directly from compressed videos without extra calculation. However, motion vectors lack fine structures and contain noisy and inaccurate motion patterns, leading to an evident degradation of recognition performance. Our key insight for relieving this problem is that optical flow and motion vectors are inherently correlated. Transferring the knowledge learned with the optical flow CNN to the motion vector CNN can significantly boost the performance of the latter. Specifically, we introduce three strategies for this: initialization transfer, supervision transfer, and their combination. Experimental results show that our method achieves recognition performance comparable to the state of the art, while processing 390.7 frames per second, which is 27 times faster than the original two-stream method.

Citations (412)

Summary

  • The paper introduces teacher initialization and supervision transfer techniques to improve motion vector CNN performance for real-time action recognition.
  • It replaces computationally expensive optical flow calculations with efficient motion vectors extracted from compressed videos.
  • Experimental results on UCF101 and THUMOS14 demonstrate a processing speed of 390.7 fps with negligible accuracy loss.

Real-time Action Recognition with Enhanced Motion Vector CNNs

The paper addresses the critical issue of computational efficiency in action recognition from video, specifically targeting the widely adopted two-stream convolutional neural network (CNN) architecture. Its focus is on removing the computationally expensive optical flow calculation, the main obstacle to real-time processing, by instead using motion vectors that are already produced by video compression algorithms and can be read directly from the compressed stream.
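For readers unfamiliar with the two-stream setup, the sketch below shows the basic idea: a spatial CNN scores an RGB frame, a temporal CNN scores a stack of motion fields (optical flow in the original architecture, motion vectors in this paper), and the per-class scores are fused late. This is a minimal illustration only; the network definitions, input shapes, and fusion weighting are assumptions, not the paper's exact configuration.

```python
# Minimal two-stream late-fusion sketch (illustrative assumptions, not the
# paper's exact architecture or fusion weights).
import torch
import torch.nn as nn

class TwoStreamClassifier(nn.Module):
    def __init__(self, spatial_cnn: nn.Module, temporal_cnn: nn.Module,
                 temporal_weight: float = 0.5):
        super().__init__()
        self.spatial_cnn = spatial_cnn      # consumes RGB frames
        self.temporal_cnn = temporal_cnn    # consumes stacked motion fields
        self.temporal_weight = temporal_weight

    def forward(self, rgb_frame: torch.Tensor, motion_stack: torch.Tensor):
        spatial_scores = torch.softmax(self.spatial_cnn(rgb_frame), dim=1)
        temporal_scores = torch.softmax(self.temporal_cnn(motion_stack), dim=1)
        # Late fusion: weighted average of the per-class scores from both streams.
        w = self.temporal_weight
        return (1 - w) * spatial_scores + w * temporal_scores
```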

Core Innovations and Methodology

The primary innovation lies in using motion vectors in place of optical flow in the temporal stream of the two-stream CNN framework. Motion vectors are attractive because they are readily available in compressed video formats and impose essentially no additional computational burden. Their key drawback is that they are coarser and noisier than optical flow, lacking fine structure and precise motion boundaries, which can degrade recognition performance. The authors mitigate this through knowledge transfer from an optical flow CNN (OF-CNN) to a motion vector CNN (MV-CNN).
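To illustrate how motion vectors can be read from a compressed stream with no extra motion estimation, here is a minimal sketch using PyAV (Python bindings to FFmpeg). It relies on FFmpeg's export-motion-vectors mechanism and assumes PyAV exposes the exported vectors as frame side data; the exact calls may differ across library versions, and the paper's own pipeline decodes motion vectors directly rather than through this library.

```python
# Hypothetical sketch: reading per-macroblock motion vectors from a compressed
# video via PyAV. The "flags2": "+export_mvs" decoder option and the
# MOTION_VECTORS side-data key follow FFmpeg's export mechanism; treat the
# exact PyAV API details as assumptions that may vary by version.
import av

container = av.open("video.mp4")
stream = container.streams.video[0]
stream.codec_context.options = {"flags2": "+export_mvs"}  # ask decoder to export MVs

for frame in container.decode(stream):
    mvs = frame.side_data.get("MOTION_VECTORS")
    if mvs is None:
        continue  # e.g. intra-coded frames carry no motion vectors
    arr = mvs.to_ndarray()  # structured array with src/dst block coordinates
    # (dst - src) displacements can be rasterized into a dense 2-channel motion
    # field, which then plays the role of stacked optical flow in the temporal CNN.
```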

The paper introduces three knowledge transfer strategies:

  1. Teacher Initialization: Initializing MV-CNN with weights from a pre-trained OF-CNN to leverage the shared characteristics between optical flows and motion vectors.
  2. Supervision Transfer: During MV-CNN training, supervision is provided by outputs from OF-CNN, aligning with techniques akin to knowledge distillation.
  3. Combination of Initialization and Supervision: Employs both strategies, pairing the strong weight initialization with continued teacher supervision during training, to further strengthen MV-CNN's learning; a minimal sketch of this combined setup follows the list.
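To make the combined strategy concrete, the following is a minimal, illustrative PyTorch training step. It treats supervision transfer as a standard knowledge-distillation loss on the frozen OF-CNN teacher's softened outputs, added to the usual cross-entropy on ground-truth labels; the temperature, loss weighting, and model objects are assumptions rather than the paper's exact hyperparameters.

```python
# Sketch of the combined strategy: initialize the MV-CNN (student) from the
# OF-CNN (teacher), then train with a distillation loss plus cross-entropy.
# Temperature, alpha, and model classes are illustrative assumptions.
import torch
import torch.nn.functional as F

def train_step(student, teacher, mv_clip, flow_clip, labels,
               optimizer, temperature=4.0, alpha=0.5):
    """One supervision-transfer training step; the teacher stays frozen."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(flow_clip)   # OF-CNN on optical flow

    student_logits = student(mv_clip)         # MV-CNN on motion vectors

    # Soft targets from the teacher, compared with KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    distill_loss = F.kl_div(log_probs, soft_targets,
                            reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the ground-truth action labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * distill_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Teacher initialization: copy OF-CNN weights into the MV-CNN before training,
# assuming the two networks share the same architecture.
# student.load_state_dict(teacher.state_dict())
```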

Experimental Results

Empirical results show that the proposed motion vector-based approach achieves recognition performance comparable to optical flow-based methods while running dramatically faster. On the UCF101 dataset, a benchmark with pre-defined training and testing splits, the enhanced motion vector CNN (EMV-CNN) processes video at 390.7 frames per second, roughly 27 times faster than the original two-stream method, with only a negligible reduction in accuracy relative to state-of-the-art methods, a testament to the effectiveness of the knowledge transfer strategies.

Furthermore, the method was evaluated on the challenging THUMOS14 dataset, where the EMV-CNN maintained robust performance despite the dataset's untrimmed videos containing large amounts of extraneous content, demonstrating good generalization.

Discussion and Future Directions

The paper makes a significant contribution to real-time video-based action recognition by successfully balancing the trade-off between computational efficiency and recognition accuracy. The knowledge transfer techniques presented not only raise the performance of MV-CNNs but do so without requiring extensive additional computation.

From a practical standpoint, the reduced computational requirements make this approach promising for deployment in environments where resources are limited or real-time processing is of the essence, such as surveillance systems or human-computer interaction applications.

Future research could extend these transfer learning strategies to more intricate neural architectures and explore other compressed-domain signals as input streams for models tailored to more specialized tasks. Improving the precision of motion vector analysis may further help these networks maintain high accuracy while benefiting from reduced processing loads. Moreover, evaluating the scalability and adaptability of these approaches across diverse action recognition datasets would offer broader validation of the proposed methodology.