- The paper presents a redesigned TCN architecture with residual connections that directly maps model parameters to explainable human actions.
- It visualizes convolutional filters to link specific 3D skeleton movements to the decision process, enhancing transparency over traditional black-box methods.
- It achieves competitive performance on the NTU RGB+D dataset, with accuracies of 74.3% and 83.1% in the Cross-Subject and Cross-View settings, respectively.
Interpretable 3D Human Action Analysis with Temporal Convolutional Networks
The paper "Interpretable 3D Human Action Analysis with Temporal Convolutional Networks" explores the use of Temporal Convolutional Networks (TCN) for 3D human action recognition, building upon the recent progress in leveraging 3D skeleton data for this task. A significant contribution of this work is its focus on model interpretability, addressing the often black-box nature of state-of-the-art learning-based methods.
Methodology
The authors propose a redesigned TCN model, referred to as Res-TCN, which incorporates residual connections to improve interpretability. The design allows model parameters to be mapped directly to human-understandable actions, providing greater visibility into the decision-making process than the LSTM-based RNNs traditionally used in this domain. The TCN framework processes input sequences of 3D skeleton features, where each frame-wise feature is a concatenation of joint positions in Euclidean space.
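The structure described above can be illustrated with a minimal NumPy sketch of one residual temporal-convolution block. This is not the paper's implementation: the joint count follows the NTU RGB+D skeleton format, while the filter width, channel-shared convolution, and initialization are simplifying assumptions made for illustration.

```python
import numpy as np

# Assumed sizes: 25 joints (as in NTU RGB+D skeletons), 100 frames.
T, J = 100, 25
D = 3 * J  # frame-wise feature: concatenated (x, y, z) of every joint

# Input sequence X: one row of concatenated joint coordinates per frame.
X = np.random.randn(T, D)

def temporal_conv(x, w):
    """1D convolution along the time axis, shared across feature channels.

    x: (T, D) sequence, w: (k,) temporal filter. 'Same' padding keeps the
    output length T so the residual addition lines up frame-by-frame.
    """
    k = len(w)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([w @ xp[t:t + k] for t in range(x.shape[0])])

# One residual unit in the spirit of Res-TCN: the block output is the
# identity shortcut plus a ReLU-gated temporal convolution of the input.
w = np.random.randn(9) / 3.0                      # assumed 9-frame filter
residual = np.maximum(temporal_conv(X, w), 0.0)   # ReLU nonlinearity
out = X + residual                                # skip connection preserves the input
```

Because the shortcut carries the input through unchanged, each block only adds a non-negative correction on top of it, which is what makes the per-block contributions separable.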
Key Contributions
- Model Design: The Res-TCN architecture is specifically constructed to yield interpretable hidden representations. By factoring deeper layers into residual components, the model can distinctly show how each component contributes to the final decision.
- Visualization: The paper offers a novel perspective on understanding learned filters. The interpretability is achieved by linking convolutional filter parameters from deeper layers to explainable movements and positions within the skeleton data.
- Practical Evaluation: Res-TCN achieves state-of-the-art results on the NTU RGB+D dataset, the largest available dataset for 3D human action recognition, showcasing its competitive performance along with improved interpretability.
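The interpretability claim in the contributions above rests on an additive decomposition: with residual blocks, the final hidden representation equals the input plus the sum of each block's residual term, so every layer's share of the decision can be isolated. The following sketch demonstrates that decomposition on a toy stack of blocks; all sizes, filter widths, and the norm-based "contribution" measure are assumptions for illustration, not the paper's exact analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, k = 100, 75, 9  # frames, feature dim, assumed filter width

def temporal_conv(x, w):
    """'Same'-padded 1D convolution along time, shared across channels."""
    pad = len(w) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([w @ xp[t:t + len(w)] for t in range(x.shape[0])])

# A stack of residual blocks: x_{l+1} = x_l + F_l(x_l).
filters = [rng.standard_normal(k) / 3.0 for _ in range(3)]
x = rng.standard_normal((T, D))
h = x
contributions = []
for w in filters:
    f = np.maximum(temporal_conv(h, w), 0.0)  # block's residual term
    contributions.append(f)
    h = h + f

# The final representation decomposes exactly into the input plus the
# per-block residual terms, so each block's contribution is additive.
reconstructed = x + sum(contributions)
assert np.allclose(h, reconstructed)

# Relative magnitude of each block's term: one simple handle for asking
# which layers drive the learned representation.
shares = [np.linalg.norm(f) / np.linalg.norm(h) for f in contributions]
```

Linking a large per-block term back to the temporal filter that produced it is what lets the learned parameters be read as statements about specific skeleton movements.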
Numerical Results
The Res-TCN model outperforms prior methods in both the Cross-Subject and Cross-View settings, with accuracies of 74.3% and 83.1%, respectively. This is a notable improvement over previous approaches such as STA-LSTM, which reported accuracies of 73.4% and 81.2%.
Implications and Future Directions
The ability to interpret the model parameters and understand the underlying representations has profound implications for the deployment of deep learning models in sensitive applications where validation and trust are paramount. The approach taken in this paper enhances model transparency and could facilitate broader acceptance in areas requiring explainable AI.
For future work, the methodology can be expanded to explore other forms of interpretable input data. Additionally, adopting similar architecture designs for other complex temporal tasks could lead to advancements in domains such as autonomous driving and robotics.
Overall, this paper stands as a substantive contribution to the field of 3D human action recognition, highlighting a path towards harmonizing performance with interpretability. As AI systems increasingly integrate into various facets of society, the importance of such research cannot be overstated, potentially steering future developments in more transparent and accountable AI systems.