- The paper presents an extensive review of the evolution from handcrafted features to advanced deep learning approaches in action recognition.
- It details how early techniques such as Motion Energy Images and space-time trajectories have given way to methods that integrate dense trajectories with deep architectures.
- The authors highlight challenges and future directions in enhancing model generalization and efficiency for video-based action recognition.
Action Recognition: A Comprehensive Survey of Methodologies
In "Going Deeper into Action Recognition: A Survey," Herath, Harandi, and Porikli provide an extensive review of advancements in action recognition, a domain of computer vision critical for applications ranging from video surveillance to human-computer interaction. The paper traces the evolution of methodologies from handcrafted feature representations to deep learning paradigms, offering insights into both the progress and the open challenges in the field.
Handcrafted Representations
The early stages of action recognition relied heavily on handcrafted features. These techniques included holistic models, such as the Motion Energy Image (MEI) and Motion History Image (MHI), which capture motion as compact 2D templates. As the field progressed, local features became prevalent, exemplified by Space-Time Interest Points (STIPs) and trajectories that capture detailed movement patterns.
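The holistic-template idea can be sketched in a few lines. The minimal NumPy example below builds a Motion History Image by frame differencing; the `tau` (memory length) and `threshold` (motion detection level) values are illustrative choices, not figures from the survey:

```python
import numpy as np

def motion_history_image(frames, tau=10, threshold=30):
    """Compute a Motion History Image over a grayscale frame sequence.

    Pixels with recent motion receive high values; older motion decays
    linearly toward zero. Thresholding the result (mhi > 0) yields the
    corresponding Motion Energy Image.
    """
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(1, len(frames)):
        # detect motion via absolute frame differencing
        diff = np.abs(frames[t].astype(np.int16) - frames[t - 1].astype(np.int16))
        moving = diff > threshold
        mhi[moving] = tau                                # refresh history where motion occurs
        mhi[~moving] = np.maximum(mhi[~moving] - 1, 0)   # decay elsewhere
    return mhi
```

The resulting 2D template summarizes where and how recently motion occurred, which is exactly the cue these holistic methods feed to a classifier.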
Despite their historical significance, handcrafted approaches have been gradually overshadowed by the robustness of algorithms harnessing dense trajectories and Fisher Vector (FV) encodings, capable of achieving remarkable accuracy on challenging datasets. However, these methods sometimes lack the adaptability seen in deep learning models.
Deep Learning Approaches
The shift to deep learning has significantly impacted action recognition. Architectures such as 3D Convolutional Networks (ConvNets) and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells have shown promise in capturing spatiotemporal dynamics. The introduction of two-stream networks, which separate spatial and temporal information using RGB and optical-flow inputs, has further bolstered performance by mimicking the two processing streams of the human visual cortex.
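The fusion step of a two-stream network can be sketched as follows, assuming each stream has already produced per-class logits (the spatial stream from RGB frames, the temporal stream from stacked optical flow). Weighted averaging of softmax scores is one common late-fusion choice; the fusion weight here is an illustrative value, not the survey's:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_stream_fusion(spatial_logits, temporal_logits, w_temporal=0.5):
    """Late fusion of a two-stream network's class scores.

    Each stream's logits are converted to class probabilities and
    combined by a weighted average; the argmax gives the predicted action.
    """
    s = softmax(spatial_logits)
    t = softmax(temporal_logits)
    fused = (1 - w_temporal) * s + w_temporal * t
    return fused.argmax(axis=-1), fused
```

Keeping the two streams separate until this final averaging is what lets each specialize: appearance cues in the spatial stream, motion cues in the temporal one.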
Recent advancements leverage 3D ConvNets and temporal pooling strategies to capture fine-grained action details across extended video sequences. The paper highlights how contemporary models often blend these deep architectures with handcrafted features, like trajectories, to enhance recognition accuracy.
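The temporal pooling idea above reduces to collapsing a sequence of per-frame deep features into one clip-level descriptor. A toy sketch with average and max pooling, two simple strategies from this family (feature shapes are illustrative):

```python
import numpy as np

def temporal_pool(frame_features, mode="avg"):
    """Collapse a (T x D) sequence of per-frame features into one
    D-dimensional clip-level descriptor by pooling over time."""
    if mode == "avg":
        return frame_features.mean(axis=0)   # order-insensitive summary
    if mode == "max":
        return frame_features.max(axis=0)    # keeps the strongest activation per dimension
    raise ValueError(f"unknown pooling mode: {mode}")
```

Average pooling discards frame ordering entirely, which is why more elaborate pooling and recurrent schemes are explored for long, fine-grained actions.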
Evaluation and Comparison
The authors provide a quantitative analysis across several action recognition datasets, including HMDB-51, UCF-101, and Sports-1M. They observe comparable performance between deep learning approaches and enhanced handcrafted methods, noting the effectiveness of hybrid solutions that combine the two. In particular, the integration of dense trajectories with deep features remains a robust paradigm for achieving state-of-the-art results.
Future Directions
Looking ahead, the paper outlines several challenges and potential research directions. These include improving the generalization of deep models across different datasets, exploiting unsupervised learning for massive video archives, and implementing more efficient architectures that reduce computational demands. Additionally, addressing action localization remains pivotal for real-world applications, as does refining models for fine-grained action classification.
Conclusion
Herath et al.'s survey provides a meticulous examination of the field's journey from handcrafted methodologies to deep learning solutions, emphasizing the intricate balance between spatial and temporal data processing. The paper acts as a catalyst for future research, underscoring the field's dynamic nature and the ongoing quest for models that not only understand actions but do so with remarkable accuracy and efficiency.