Robust and Efficient Video Representation for Action Recognition
The paper under review targets video representation for recognizing and detecting actions in video streams. The authors improve dense trajectory features, which are widely used in action recognition tasks, proposing methods that raise both their robustness and their efficiency. The work centers on two innovations: the explicit estimation of camera motion, and the use of Fisher vectors (FVs) in place of the traditional bag-of-words (BOW) feature encoding.
Key Advancements
- Camera Motion Compensation: The authors improve dense trajectory features by explicitly estimating and compensating for camera motion, which is known to introduce noise into motion-based descriptors. They estimate a homography between consecutive frames, warp the frames to cancel the camera motion before recomputing optical flow, and remove trajectories that are consistent with the estimated camera motion (i.e., background trajectories). This reduces background noise and improves the reliability of motion descriptors such as histograms of optical flow (HOF) and motion boundary histograms (MBH); histograms of oriented gradients (HOG), which capture static appearance, are largely unaffected.
- Use of Fisher Vectors: The paper demonstrates the superior performance of Fisher vectors for encoding trajectory-based features. Unlike BOW, which captures only the counts of quantized feature occurrences, FVs encode first- and second-order statistics of the features relative to a Gaussian mixture model, offering a richer representation. The paper finds that FVs significantly outperform BOW while needing far fewer mixture components than BOW needs visual words, making the encoding more efficient.
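The core of the compensation step, fitting a single homography to point matches between consecutive frames, can be sketched with a plain direct linear transform. This is a minimal NumPy sketch; the paper additionally combines SURF matches with optical-flow matches and uses RANSAC to reject matches lying on moving people, which is omitted here.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform (DLT): estimate H such that dst ~ H @ src.

    src, dst: (N, 2) arrays of matched points, N >= 4.
    Minimal sketch -- no RANSAC outlier rejection.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on h.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A)
    # Solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so H[2, 2] == 1

def warp_points(H, pts):
    """Apply homography H to (N, 2) points."""
    p = np.hstack([pts, np.ones((len(pts), 1))])
    q = p @ H.T
    return q[:, :2] / q[:, 2:3]
```

In the full pipeline, the estimated homography is used to warp the second frame before optical flow is recomputed, so that flow induced purely by camera motion becomes near zero.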
Methodological Insights
- Spatio-Temporal Descriptors: By computing motion descriptors from the stabilized (warped) optical flow, the authors improve descriptors such as HOF and MBH, increasing their robustness to camera motion; HOF benefits most, since MBH is already partially invariant to constant camera motion.
- Feature Encoding: The paper systematically evaluates BOW and FVs across different parameter settings, finding that FVs consistently outperform BOW across datasets despite using a much smaller codebook.
- Action Recognition: The research evaluates the improvements on action recognition benchmarks such as Hollywood2 and HMDB51, where the improved trajectory features significantly surpass previously published results, underscoring the efficacy of the proposed methods.
- Action Localization: For localizing actions in time, the authors apply non-maximum suppression with a re-scoring mechanism that favors longer detection windows, balancing the temporal extent of detected actions against their scores.
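The Fisher vector encoding described above can be sketched as follows. This is a minimal NumPy sketch assuming the diagonal-covariance GMM parameters (weights, means, per-dimension variances) were fit offline on training descriptors; the resulting 2KD-dimensional output is why far fewer mixture components suffice than BOW visual words.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Fisher vector of descriptors X (N, D) under a diagonal GMM with
    K components: weights (K,), means (K, D), variances (K, D).
    Returns a 2*K*D vector of first- and second-order statistics."""
    N, D = X.shape
    # Posterior responsibilities gamma (N, K) under the diagonal GMM.
    diff = X[:, None, :] - means[None, :, :]            # (N, K, D)
    log_p = (-0.5 * np.sum(diff**2 / variances, axis=2)
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
             + np.log(weights))
    log_p -= log_p.max(axis=1, keepdims=True)           # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradients w.r.t. means (first order) and variances (second order).
    sigma = np.sqrt(variances)
    g_mu = (gamma[:, :, None] * diff / sigma).sum(0) / (N * np.sqrt(weights)[:, None])
    g_var = ((gamma[:, :, None] * (diff**2 / variances - 1)).sum(0)
             / (N * np.sqrt(2 * weights)[:, None]))
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    # Power and L2 normalization, as used in the paper.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

The power and L2 normalization steps at the end are the standard post-processing that makes FVs work well with linear classifiers.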
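The localization step, greedy temporal non-maximum suppression after re-scoring, can be sketched as below. This is a minimal sketch: the logarithmic length bonus and its weight are hypothetical placeholders standing in for the paper's own re-scoring formula, which likewise favors longer windows.

```python
import math

def rescore_and_nms(windows, overlap_thresh=0.5, length_weight=0.1):
    """Temporal NMS over candidate detections with length-based re-scoring.

    windows: list of (start, end, score) tuples, in frame units.
    Returns the kept (start, end, rescored_score) tuples.
    """
    # Re-score: add a bonus that grows with window duration (hypothetical
    # formula; the paper uses its own re-scoring scheme).
    rescored = [(s, e, sc + length_weight * math.log(e - s))
                for s, e, sc in windows]
    rescored.sort(key=lambda w: w[2], reverse=True)
    kept = []
    for s, e, sc in rescored:
        suppressed = False
        for ks, ke, _ in kept:
            inter = max(0, min(e, ke) - max(s, ks))
            union = (e - s) + (ke - ks) - inter
            if inter / union > overlap_thresh:  # temporal IoU
                suppressed = True
                break
        if not suppressed:
            kept.append((s, e, sc))
    return kept
```

Without the re-scoring step, plain NMS tends to prefer short, high-scoring windows that cover only a fragment of the action.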
Implications and Speculations
The implications of this paper are both practical and theoretical. Practically, the work leads to more accurate and computationally efficient action recognition systems, particularly in dynamic environments where camera movement is a factor. Theoretically, it clarifies how motion compensation and stronger feature encoding combine to enhance video representation.
The authors' release of code for the improved trajectory computation aids replication and further exploration of these results. Future research could build on this foundation by exploring deeper neural architectures for feature encoding or by pursuing real-time applications of these advances in video surveillance or multimedia retrieval.
Conclusion
Through meticulous experimentation and innovative strategy, this paper establishes a new benchmark in video action recognition. The methods proposed offer compelling improvements in both accuracy and robustness, contributing significantly to the ongoing development of intelligent video analysis systems.