
Human Action Recognition and Prediction: A Survey (1806.11230v3)

Published 28 Jun 2018 in cs.CV

Abstract: Derived from rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition is to infer human actions (present state) based upon complete action executions, and action prediction is to predict human actions (future state) based upon incomplete action executions. These two tasks have become particularly prevalent topics recently because of their explosively emerging real-world applications, such as visual surveillance, autonomous driving, entertainment, and video retrieval. Many attempts have been made in the last few decades to build a robust and effective framework for action recognition and prediction. In this paper, we survey the complete state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also provided with systematic discussions.

Human Action Recognition and Prediction: A Survey

In their survey "Human Action Recognition and Prediction: A Survey," Yu Kong and Yun Fu present a comprehensive analysis of methods for human action recognition, with a focus on Temporal Segment Networks (TSNs). The paper outlines the challenges and advances in extracting long-range temporal dynamics from video data, which is needed to recognize complex human actions accurately.

Temporal Segment Networks (TSNs)

The core method discussed is the TSN, which uses sparse temporal sampling to reduce redundancy across video frames. Traditional approaches often rely on dense sampling, which is computationally expensive and processes many near-duplicate frames. TSNs instead randomly sample frames from different segments of a video. The sampled frames span the whole video at a fixed cost, so the model captures both short-term and long-term action characteristics.
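The segment-based sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_segment_indices`, the choice of equal-length segments, and the choice of one frame per segment are assumptions made for the example.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of `num_segments` segments.

    Hypothetical helper: segments are assumed to be contiguous and of
    (nearly) equal length, with exactly one frame drawn per segment.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Segment k covers frames [start, end); integer arithmetic keeps
        # the segments contiguous and non-overlapping.
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments
        indices.append(rng.randrange(start, end))
    return indices


# Example: a 300-frame clip sampled with 3 segments yields one index from
# each of the ranges 0-99, 100-199, and 200-299.
frame_indices = sample_segment_indices(300, 3, random.Random(0))
```

The number of frames processed depends only on `num_segments`, not on the clip length, which is where the saving over dense sampling comes from.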

In the original framework, TSNs employ a two-stream architecture derived from earlier work and use pre-trained deep CNNs for feature extraction. A consensus function, initially a simple pooling operation, combines the predictions from the individual segments into a single video-level prediction.
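As a rough sketch of a pooling-style consensus, the example below averages per-segment class scores and then applies a softmax. Average pooling is assumed here as one possible "simple pooling operation"; the function names and the toy scores are illustrative and do not come from the paper.

```python
import math


def average_consensus(segment_scores):
    """Average per-segment class scores into one video-level score vector."""
    if not segment_scores:
        raise ValueError("need at least one segment")
    num_segments = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [
        sum(scores[c] for scores in segment_scores) / num_segments
        for c in range(num_classes)
    ]


def softmax(scores):
    """Turn a score vector into class probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores for 3 segments and 2 action classes (made-up numbers).
segment_scores = [[2.0, 0.5], [1.5, 1.0], [2.5, 0.0]]
video_probs = softmax(average_consensus(segment_scores))
```

Averaging treats the segments as an unordered set, so this consensus ignores the order in which the segments occur.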

Later enhancements to the TSN framework, as noted in the paper, include adaptive sampling at varied temporal scales. These variants also replace the pooling-based consensus with fully connected networks, which improve the encoding of frame sequences and raise recognition accuracy. With these adaptations, TSNs can be integrated into broader action recognition frameworks.
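One way to picture a fully connected consensus over several temporal scales is sketched below. The source does not specify the exact design, so everything here is an assumption for illustration: the helper names `dense` and `multiscale_consensus`, the use of all ordered frame subsets at each scale, and the summation of the per-subset outputs.

```python
from itertools import combinations


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def multiscale_consensus(frame_features, layers):
    """Sum fully connected outputs over ordered frame subsets per scale.

    `layers` maps a scale k (frames per subset) to a hypothetical
    (weights, bias) pair whose input size is k * feature_dim.
    """
    total = None
    for k, (weights, bias) in sorted(layers.items()):
        # combinations() keeps the original temporal order in each subset.
        for subset in combinations(frame_features, k):
            x = [v for frame in subset for v in frame]  # concatenate features
            out = dense(x, weights, bias)
            total = out if total is None else [a + b for a, b in zip(total, out)]
    return total


# Three frames with 1-D features and a single scale-2 layer computing
# (earlier frame - later frame); the weights are made up for the example.
features = [[1.0], [2.0], [3.0]]
layers = {2: ([[1.0, -1.0]], [0.0])}
forward = multiscale_consensus(features, layers)
backward = multiscale_consensus(features[::-1], layers)
```

Unlike average pooling, this consensus is sensitive to frame order: reversing the frames flips the sign of the output in this toy setup.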

Implications and Future Directions

The survey describes the progress made in human action recognition through TSNs, particularly in handling complex, temporally extended actions. The discussion points to several implications and avenues for further research:

  1. Practical Implications: The refined video analysis techniques promise significant improvements in performance for real-time applications, including surveillance, human-computer interaction, and video understanding.
  2. Theoretical Implications: Understanding and optimizing temporal dynamics fosters advances in sequential data modeling and neural network architectures.
  3. Future Research: The evolution of TSNs invites exploration into more sophisticated consensus functions and sampling methodologies. Additionally, integrating TSNs with emerging AI paradigms could yield considerable advancements in predictive accuracy and computational efficiency.

In conclusion, the survey by Kong and Fu gives an overview of the current state and future potential of TSNs in human action recognition.

Authors (2)
  1. Yu Kong (37 papers)
  2. Yun Fu (131 papers)
Citations (523)