
2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning (1802.09232v2)

Published 26 Feb 2018 in cs.CV

Abstract: Action recognition and human pose estimation are closely related, but both problems are generally handled as distinct tasks in the literature. In this work, we propose a multitask framework for joint 2D and 3D pose estimation from still images and human action recognition from video sequences. We show that a single architecture can solve the two problems efficiently and still achieve state-of-the-art results. Additionally, we demonstrate that end-to-end optimization leads to significantly higher accuracy than separate learning. The proposed architecture can be trained seamlessly with data from different categories simultaneously. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU) demonstrate the effectiveness of our method on the targeted tasks.

Citations (465)

Summary

  • The paper presents a unified multitask framework that jointly tackles 2D/3D pose estimation and action recognition using an end-to-end trainable deep learning model.
  • The paper leverages a differentiable soft-argmax for robust 3D pose estimation, achieving competitive results on MPII, Human3.6M, Penn Action, and NTU RGB+D datasets.
  • The paper demonstrates how integrating pose estimation with action recognition can enhance learning efficiency and open new avenues for integrated computer vision systems.

Overview of the Paper "2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning"

The paper presents a multitask deep learning framework for addressing two interconnected problems in computer vision: human pose estimation and action recognition. The proposed framework aims to solve these tasks jointly by leveraging a unified architecture that can handle both 2D and 3D pose estimation from images and action recognition from video sequences.

Joint Pose Estimation and Action Recognition

Traditionally, human pose estimation and action recognition have been approached as separate tasks. However, the authors argue for a joint solution, proposing an end-to-end trainable architecture that can learn both 2D/3D human poses and actions simultaneously. The architecture optimizes pose and action recognition tasks collectively, taking advantage of related tasks to improve overall performance.
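The shared-backbone idea behind such a joint architecture can be sketched minimally as follows. This is an illustrative numpy sketch with hypothetical dimensions (`D_IN`, `D_SHARED`, joint and action counts), not the paper's actual network: one shared feature extractor feeds two task-specific heads, so gradients from both losses would update the common representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
D_IN, D_SHARED = 16, 32      # input and shared-feature dimensions
N_JOINTS, N_ACTIONS = 17, 10  # body joints and action classes

# Shared backbone weights, plus one head per task.
W_shared = rng.normal(size=(D_IN, D_SHARED))
W_pose = rng.normal(size=(D_SHARED, N_JOINTS * 2))  # (x, y) per joint
W_action = rng.normal(size=(D_SHARED, N_ACTIONS))   # action logits

def forward(x):
    """One shared representation feeds both task heads."""
    feat = np.maximum(0.0, x @ W_shared)           # ReLU backbone feature
    pose = (feat @ W_pose).reshape(N_JOINTS, 2)    # pose-estimation head
    action_logits = feat @ W_action                # action-recognition head
    return pose, action_logits

pose, logits = forward(rng.normal(size=D_IN))
```

In a real training loop, the pose loss and the action loss would both backpropagate into `W_shared`, which is the mechanism by which the related tasks reinforce each other.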

Key Contributions

  1. Efficient Multitask Learning: The architecture unifies pose estimation and action recognition, handling both 2D and 3D poses with the same model. This multitask framework leads to more efficient learning and utilization of data.
  2. Soft-argmax for Differentiable Learning: By extending the differentiable Soft-argmax function to support 3D pose estimation, the architecture circumvents the typical drawbacks of non-differentiable post-processing steps like argmax, enabling effective end-to-end training.
  3. State-of-the-Art Results: The model achieves competitive or superior results compared to existing methods across multiple datasets, including MPII, Human3.6M, Penn Action, and NTU RGB+D, in both pose estimation and action recognition tasks.
  4. Robust Training with Mixed Datasets: The ability to train with mixed datasets, including both 2D and 3D annotations, highlights the model's flexibility and robustness in learning diverse visual features.
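The soft-argmax in contribution 2 replaces the hard, non-differentiable argmax over a heatmap with a softmax-weighted expectation of the coordinates, so the joint location can be regressed end-to-end. The sketch below is a 2D numpy illustration of that operation, not the paper's implementation (which extends it to 3D volumetric heatmaps):

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Differentiable soft-argmax: softmax over the heatmap, then the
    expected (x, y) coordinate under the resulting distribution."""
    h, w = heatmap.shape
    # Softmax over all spatial locations (numerically stabilized).
    flat = heatmap.reshape(-1)
    probs = np.exp(flat - flat.max())
    probs /= probs.sum()
    probs = probs.reshape(h, w)
    # Expected coordinates: sum of position * probability.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = float((probs * xs).sum())
    y = float((probs * ys).sum())
    return x, y
```

For a sharply peaked heatmap the result coincides with the hard argmax, but unlike argmax it is smooth in the heatmap values, so gradients flow from the coordinate loss back into the network producing the heatmap.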

Experimental Evaluation

The framework was extensively tested on several standard datasets:

  • Pose Estimation: The model was evaluated on MPII for 2D pose estimation and Human3.6M for 3D pose estimation. The proposed approach achieved state-of-the-art scores, particularly notable among regression-based methods.
  • Action Recognition: For 2D action recognition, the system was tested on the Penn Action dataset, and for 3D action recognition, on NTU RGB+D. The model exhibited high accuracy rates, underscoring its ability to leverage pose information effectively for recognizing actions.

Implications and Future Work

The proposed multitask approach offers significant implications for future developments in AI and computer vision. By demonstrating the effectiveness of joint training for related tasks, the research provides a pathway for further exploration into more integrated models that can potentially improve efficiency and accuracy across multiple domains. The use of mixed datasets also presents opportunities for more generalized models that perform robustly in varied real-world scenarios.

Looking ahead, exploring further optimizations in architecture design and extending this approach to other related tasks (e.g., scene understanding) could present fruitful research avenues. Moreover, enhancing the generalization capabilities by incorporating more diverse and challenging datasets could further solidify the framework's applicability.

In summary, this paper contributes a robust and efficient multitask framework for jointly addressing pose estimation and action recognition, showcasing impressive results and setting a foundation for further research in integrated deep learning models.
