- The paper introduces MST-AOG, a hierarchical model that robustly recognizes actions from unknown views.
- It leverages 3D skeleton data projected to 2D, integrating geometry, appearance, and motion for comprehensive action representation.
- Experiments on the Multiview Action3D dataset demonstrate superior accuracy and robustness compared to state-of-the-art methods.
Cross-view Action Modeling, Learning and Recognition
The paper "Cross-view Action Modeling, Learning and Recognition" by Jiang Wang et al. presents an innovative approach to video-based action recognition through a novel framework called the multiview spatio-temporal AND-OR graph (MST-AOG). This framework addresses the challenges of recognizing actions across varying, unseen viewpoints, a limitation in traditional video-based action recognition methods.
Overview of MST-AOG
The MST-AOG is a generative, compositional representation designed to handle the complexities of cross-view action recognition. Unlike existing methods, which generally rely on view-dependent feature extraction, it recognizes actions from unknown and unseen views without requiring explicit 3D input at inference time.
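To make the AND-OR semantics concrete, here is a minimal sketch in Python: AND nodes compose all of their children (e.g., a pose composed of body parts), while OR nodes select the best-scoring alternative (e.g., an action choosing among candidate views or poses). The node structure, names, and scoring functions below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative AND-OR graph scoring (a sketch, not the authors' code).
# AND nodes compose all children; OR nodes select the best alternative.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    leaf_score: Optional[Callable[[dict], float]] = None  # scores raw features
    kind: str = "AND"  # "AND" composes children, "OR" selects among them

    def score(self, features: dict) -> float:
        if self.leaf_score is not None:                # terminal node
            return self.leaf_score(features)
        child_scores = [c.score(features) for c in self.children]
        return sum(child_scores) if self.kind == "AND" else max(child_scores)


# Hypothetical fragment: an action ORs over views; each view ANDs body parts.
torso = Node("torso", leaf_score=lambda f: f.get("torso", 0.0))
arms = Node("arms", leaf_score=lambda f: f.get("arms", 0.0))
frontal = Node("frontal_pose", [torso, arms], kind="AND")
side = Node("side_pose", [torso, arms], kind="AND")
action = Node("pick_up", [frontal, side], kind="OR")

print(action.score({"torso": 0.8, "arms": 0.5}))  # best view's composed score
```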
Key characteristics of the MST-AOG include:
- Hierarchical Structure: The model represents actions hierarchically, with nodes for actions, poses, views, body parts, and features. This hierarchy gives the model the flexibility to capture diverse spatio-temporal action patterns.
- Geometry, Appearance, and Motion Modeling: The model integrates geometry, appearance, and motion features to create a holistic representation of actions. The inclusion of these three aspects provides robustness against view variations.
- 3D to 2D Projection: During training, 3D skeleton data from Kinect cameras models the geometric relations among body parts, and these relations are projected onto 2D views (a minimal projection sketch follows this list). This avoids the extensive manual annotation that traditional methods require.
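As a concrete illustration of the 3D-to-2D step, the sketch below rotates a skeleton about the vertical axis and orthographically projects it onto a virtual 2D view. The rotation model, joint count, and angles are assumptions for illustration, not the paper's exact camera geometry.

```python
# Minimal sketch of projecting 3D skeleton joints onto virtual 2D views.
import numpy as np


def project_skeleton(joints_3d: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Rotate 3D joints (N, 3) about the vertical (y) axis, then drop depth."""
    theta = np.radians(azimuth_deg)
    rot_y = np.array([
        [np.cos(theta), 0.0, np.sin(theta)],
        [0.0,            1.0, 0.0],
        [-np.sin(theta), 0.0, np.cos(theta)],
    ])
    rotated = joints_3d @ rot_y.T
    return rotated[:, :2]  # orthographic projection: keep (x, y), drop z


# Hypothetical 20-joint skeleton, as from a Kinect-style sensor.
skeleton = np.random.rand(20, 3)
# Synthesize 2D training geometry for several viewpoints.
views_2d = {angle: project_skeleton(skeleton, angle) for angle in (0, 45, 90)}
```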
Training and Inference
- Training: Using 3D skeleton information, the model learns to interpolate between viewpoints, yielding a structure that covers a range of orientations. A discriminative data-mining step identifies frequent and significant poses, which enables efficient part sharing across different views.
- Inference: Inference uses dynamic programming for efficient action classification and pose detection, together with a spatio-temporal pyramid to improve action prediction accuracy; a minimal DP sketch follows this list.
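The following is a minimal sketch of what such a dynamic program might look like: a Viterbi-style pass over per-frame pose scores with pairwise transition weights, recovering the best pose sequence and an overall action score. The score arrays and uniform transitions are placeholder assumptions, not the authors' parsing algorithm.

```python
# Viterbi-style DP over a pose sequence (an illustrative sketch).
import numpy as np


def best_pose_path(frame_scores: np.ndarray, transition: np.ndarray):
    """frame_scores is (T, P): T frames, P candidate poses; transition is (P, P)."""
    T, P = frame_scores.shape
    dp = np.full((T, P), -np.inf)
    back = np.zeros((T, P), dtype=int)
    dp[0] = frame_scores[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + transition   # rows: prev pose, cols: next pose
        back[t] = cand.argmax(axis=0)            # best predecessor per pose
        dp[t] = cand.max(axis=0) + frame_scores[t]
    path = [int(dp[-1].argmax())]                # backtrack from best final pose
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return dp[-1].max(), path[::-1]              # action score, pose labels


scores = np.random.rand(30, 5)        # 30 frames, 5 candidate poses (assumed)
trans = np.log(np.full((5, 5), 0.2))  # uniform transition weights (assumed)
action_score, poses = best_pose_path(scores, trans)
```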
Experimental Evaluation
The authors introduce a new dataset, the Multiview Action3D dataset, and evaluate their approach under cross-subject, cross-view, and cross-environment settings. The MST-AOG consistently outperforms state-of-the-art methods in recognition accuracy and robustness, most notably on novel viewpoints.
Implications and Future Directions
The MST-AOG's ability to generalize to unseen viewpoints marks a meaningful advance in action recognition: it reduces dependence on extensive data annotation and offers a structured way to recognize complex actions across varying conditions.
Future research could extend the model to more diverse real-world scenarios and integrate deep learning architectures to further improve recognition.
In conclusion, the MST-AOG model presents a substantial contribution to the field, providing a robust framework for cross-view action recognition that balances compositional flexibility with computational efficiency.