Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors (1505.04868v1)

Published 19 May 2015 in cs.CV

Abstract: Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with those hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features and deep-learned features. Our method also achieves superior performance to the state of the art on these datasets (HMDB51 65.9%, UCF101 91.5%).

Citations (1,146)

Summary

  • The paper presents TDD, a novel approach that combines trajectory-constrained pooling of deep features with improved dense trajectories to enhance action recognition.
  • It leverages multi-scale convolutional feature maps and normalization techniques to robustly capture the spatiotemporal dynamics of video data.
  • Experimental results on HMDB51 and UCF101 show state-of-the-art performance, with further gains when TDD is coupled with traditional hand-crafted features.

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors

The paper presents a novel approach to video representation for action recognition, termed Trajectory-Pooled Deep-Convolutional Descriptor (TDD). The method combines the strengths of hand-crafted and deep-learned visual features to improve action recognition accuracy. By leveraging improved dense trajectories together with two-stream Convolutional Networks (ConvNets), the authors aim to bridge the gap between traditional methods and modern deep learning techniques.

Key Components and Methodology

The core innovation of TDD lies in its unique approach to aggregating convolutional features via trajectory-constrained sampling and pooling. The main contributions can be summarized as:

  1. Integration of Hand-Crafted and Deep-Learned Features:
    • TDD utilizes deep architectures to extract discriminative convolutional feature maps.
    • The improved dense trajectories algorithm is used to extract point trajectories, which inherently capture the temporal continuity of actions.
    • By pooling convolutional responses along these trajectories, TDD effectively combines spatial and temporal information.
  2. Normalization Techniques:
    • Two normalization methods, spatiotemporal normalization and channel normalization, are introduced to transform convolutional feature maps, ensuring robust and consistent feature representations.
  3. Multi-Scale Extension:
    • TDD computes multi-scale convolutional feature maps for each video, thereby allowing the extraction of more comprehensive and invariant descriptors.
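The pooling and normalization steps above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the feature-map layout `(T, H, W, C)`, the small epsilon term, and the coordinate-scaling convention are assumptions made for clarity.

```python
import numpy as np

def spatiotemporal_normalize(fmaps):
    """Divide each channel by its maximum over the whole video volume,
    bringing every channel's activations into a common range."""
    # fmaps: (T, H, W, C) convolutional feature maps for one video
    max_per_channel = fmaps.max(axis=(0, 1, 2), keepdims=True)
    return fmaps / (max_per_channel + 1e-8)

def channel_normalize(fmaps):
    """Divide each spatio-temporal position by its maximum across channels,
    emphasizing which channels respond most strongly at each position."""
    max_per_position = fmaps.max(axis=3, keepdims=True)
    return fmaps / (max_per_position + 1e-8)

def trajectory_pool(fmaps, trajectory, scale=1.0):
    """Sum-pool feature vectors at the points of one trajectory.

    trajectory: iterable of (t, x, y) points in video coordinates;
    `scale` maps video coordinates to feature-map coordinates
    (feature maps are smaller than the input frames).
    """
    T, H, W, C = fmaps.shape
    descriptor = np.zeros(C)
    for t, x, y in trajectory:
        xi = min(int(round(x * scale)), W - 1)
        yi = min(int(round(y * scale)), H - 1)
        descriptor += fmaps[t, yi, xi]
    return descriptor
```

One descriptor is produced per trajectory per convolutional layer; in the paper these descriptors are then encoded with Fisher vectors and classified with a linear SVM.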

Experimental Validation

The paper's experimental validation is carried out on two extensively used action recognition datasets: HMDB51 and UCF101. Here are the significant outcomes and observations:

  • Performance on HMDB51 and UCF101:
    • TDD achieves state-of-the-art performance, with a recognition accuracy of 63.2% on HMDB51 and 90.3% on UCF101.
    • When combined with Improved Dense Trajectories (iDT), the results are further enhanced to 65.9% on HMDB51 and 91.5% on UCF101.
  • Layer-wise Performance:
    • Different convolutional layers from the spatial and temporal ConvNets were evaluated, with conv4 and conv5 layers from spatial nets and conv3 and conv4 from temporal nets yielding the highest performance.
  • Complementarity with Hand-Crafted Features:
    • The combination of TDD with iDT features demonstrates a notable performance boost, indicating that TDD effectively captures complementary information to traditional hand-crafted features.
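The complementarity noted above is typically exploited through late fusion of classifier outputs. A minimal sketch, assuming per-class scores from two separately trained classifiers; the weighting scheme here is an illustrative assumption, not the paper's exact fusion recipe.

```python
import numpy as np

def fuse_scores(tdd_scores, idt_scores, w_tdd=1.0, w_idt=1.0):
    """Weighted sum of per-class scores from the TDD and iDT channels.

    Both arguments are arrays of shape (num_classes,); the weights are
    illustrative hyperparameters to be tuned on validation data.
    """
    return w_tdd * np.asarray(tdd_scores) + w_idt * np.asarray(idt_scores)

def predict(tdd_scores, idt_scores):
    """Predicted class index under the fused scores."""
    return int(np.argmax(fuse_scores(tdd_scores, idt_scores)))
```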

Implications and Future Directions

The implications of TDD are multi-faceted, spanning both practical and theoretical realms. Practically, the TDD framework can be directly applied to enhance existing action recognition systems in various applications such as video surveillance, human-computer interaction, and automated video content analysis. Theoretically, the concept of trajectory-constrained sampling and pooling can inspire further research into integrating temporal dynamics within deep learning frameworks.

Future developments might focus on:

  • Refinement of Convolutional Features:
    • Employing more advanced deep learning architectures (e.g., 3D ConvNets or Transformers).
    • Enhancing the representation of motion by utilizing warped optical flow for temporal TDDs.
  • Scaling to Larger Datasets:
    • Training on larger and more diverse datasets could improve generalization and robustness.
  • Real-time Implementation:
    • Optimizing the computational efficiency to enable real-time action recognition applications.

In conclusion, the TDD framework represents a substantial advancement in the domain of action recognition, offering a robust and discriminative approach that effectively integrates and enhances the capabilities of both hand-crafted and deep-learned features.
