- The paper proposes a hybrid BoVW representation that integrates multiple encoding methods, achieving state-of-the-art accuracy on HMDB51, UCF50, and UCF101.
- It thoroughly dissects the BoVW pipeline, evaluating five critical components, from feature extraction to pooling and normalization, to optimize action recognition.
- The study finds that representation-level fusion best exploits feature complementarity, guiding future improvements in video-based action recognition systems.
Overview of "Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice"
This paper presents a detailed analysis of the Bag of Visual Words (BoVW) model as applied to video-based action recognition, examining each step of its pipeline and various fusion methods. Authored by Xiaojiang Peng, Limin Wang, Xingxing Wang, and Yu Qiao, this paper evaluates a multitude of strategies to refine action recognition systems, using datasets such as HMDB51, UCF50, and UCF101. The key contribution is the proposal of a hybrid representation that combines different BoVW frameworks and local descriptors to achieve state-of-the-art results.
Pipeline Analysis
The authors dissect the BoVW model into five critical components: feature extraction, feature pre-processing, codebook generation, feature encoding, and pooling and normalization.
- Feature Extraction: The paper employs Space Time Interest Points (STIPs) and Improved Dense Trajectories (iDTs) as representative local features, exploring their performance characteristics with various descriptors.
- Feature Pre-processing: PCA and whitening are utilized to stabilize feature representation, which significantly enhances encoding performance. This step addresses the high dimensionality and correlation issues in video features.
- Codebook Generation: Both k-means clustering and Gaussian Mixture Models (GMMs) are considered for constructing the codebook. The comparative analysis highlights the benefits of soft assignments in capturing local feature distributions.
- Feature Encoding: The paper evaluates thirteen encoding methods categorized into voting, reconstruction, and super vector-based methods. Super vector methods, particularly Fisher Vector (FV), demonstrate superior performance due to their ability to capture higher-order statistics.
- Pooling and Normalization: Various techniques are assessed, with sum pooling combined with power ℓ2-normalization emerging as optimal. Intra-normalization is also explored for handling feature bursts in densely sampled data.
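The good practices identified above can be sketched end-to-end in a few lines. This is a minimal illustration, not the authors' implementation: the random arrays stand in for real local descriptors (e.g. HOG/HOF computed along iDTs), and the descriptor dimension, PCA dimension, and codebook size are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-ins for local video descriptors; real ones would be
# extracted from video volumes (e.g. HOG/HOF along iDTs).
train_descriptors = rng.normal(size=(2000, 96))
video_descriptors = rng.normal(size=(300, 96))

# Feature pre-processing: PCA with whitening de-correlates the
# dimensions and equalizes their variance.
pca = PCA(n_components=64, whiten=True).fit(train_descriptors)
train_proc = pca.transform(train_descriptors)
video_proc = pca.transform(video_descriptors)

# Codebook generation: k-means builds the visual vocabulary
# (a GMM would be used instead for soft assignments / FV).
codebook = KMeans(n_clusters=128, n_init=10, random_state=0).fit(train_proc)

# Feature encoding (hard voting) + sum pooling: count how many
# descriptors of this video fall on each visual word.
assignments = codebook.predict(video_proc)
histogram = np.bincount(assignments, minlength=128).astype(float)

# Normalization: power normalization (signed square root), then l2.
histogram = np.sign(histogram) * np.sqrt(np.abs(histogram))
histogram /= np.linalg.norm(histogram)

print(histogram.shape)  # final per-video representation: (128,)
```

The resulting unit-norm vector is the per-video representation that would be fed to a classifier such as a linear or kernel SVM.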
Fusion Methods
The paper examines fusion strategies at three levels: descriptor, representation, and score. Representation-level fusion proves generally the most effective, as it leverages the complementarity among features better than descriptor-level or score-level fusion, particularly for dense feature sets such as iDTs.
Hybrid Representation
Building on the insights from their evaluations, the authors propose a hybrid representation that utilizes multiple encoding methods and descriptors. This representation integrates the strengths of distinct BoVW models, exploiting their complementarity to enhance recognition accuracy. On the HMDB51, UCF50, and UCF101 datasets, this approach achieves accuracy rates of 61.1%, 92.3%, and 87.9% respectively, setting new benchmarks.
Implications and Future Directions
The findings underscore the intricacies involved in designing robust action recognition systems and the critical importance of each step in the BoVW framework. The hybrid representation offers a potent baseline for future research, suggesting that ongoing enhancements in encoding strategies and fusion techniques could yield further improvements. The paper’s comprehensive approach provides a strong foundation for further exploration into adaptive and efficient recognition systems, potentially incorporating advances in neural architectures and real-time processing capabilities. The implications for AI and computer vision are significant, particularly in applications requiring fine-grained action detection in complex environments.