Revisiting Skeleton-based Action Recognition (2104.13586v2)

Published 28 Apr 2021 in cs.CV

Abstract: Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt graph convolutional networks (GCN) to extract features on top of human skeletons. Despite the positive results shown in previous works, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseC3D, a new approach to skeleton-based action recognition, which relies on a 3D heatmap stack instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseC3D is more effective in learning spatiotemporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings. Also, PoseC3D can handle multiple-person scenarios without additional computation cost, and its features can be easily integrated with other modalities at early fusion stages, which provides a great design space to further boost the performance. On four challenging datasets, PoseC3D consistently obtains superior performance, when used alone on skeletons and in combination with the RGB modality.



Summary

The paper introduces PoseConv3D (named PoseC3D in the abstract above), a framework that converts skeleton data into 3D heatmap volumes for action recognition.
It demonstrates superior accuracy and scalability compared to GCN-based methods, achieving state-of-the-art results on multiple benchmarks.
The effective integration with other modalities paves the way for more versatile, real-world action recognition systems.

An Evaluation of PoseConv3D for Skeleton-Based Action Recognition

The paper "Revisiting Skeleton-based Action Recognition" presents PoseConv3D, a framework for skeleton-based action recognition. It addresses key limitations of current Graph Convolutional Network (GCN)-based methods, particularly in robustness, interoperability, and scalability. PoseConv3D uses a 3D heatmap volume as the base representation of human skeletons, instead of the graph sequence that GCN-based methods take as input.

Framework and Methodology

PoseConv3D changes the input representation for skeleton-based action recognition from graph sequences, as consumed by GCNs, to 3D heatmap volumes processed by a 3D-CNN. These heatmap volumes support spatiotemporal feature learning and are more resilient to pose estimation errors. This is particularly advantageous in cross-dataset scenarios, where generalization is crucial. Unlike GCNs, whose computational cost grows with each additional person in the frame, PoseConv3D handles multiple-person scenarios without additional computation cost.
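The heatmap-volume representation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each joint is rendered as a Gaussian centered on its 2D coordinate and scaled by its confidence score, and that several persons are merged by a per-pixel maximum so the volume size does not depend on the person count. The function name `heatmap_volume`, the array layouts, and the default `sigma` are assumptions made for this example.

```python
import numpy as np


def heatmap_volume(keypoints, scores, height, width, sigma=0.6):
    """Build a (K, T, H, W) heatmap volume from 2D keypoints.

    keypoints: (M, T, K, 2) array of (x, y) pixel coordinates
               for M persons, T frames, K joints.
    scores:    (M, T, K) array of per-joint confidence scores.
    sigma:     Gaussian spread in pixels (an assumed default).
    """
    keypoints = np.asarray(keypoints, dtype=np.float32)
    scores = np.asarray(scores, dtype=np.float32)
    num_persons, num_frames, num_joints, _ = keypoints.shape

    # Pixel grids that broadcast to an (H, W) map.
    ys = np.arange(height, dtype=np.float32)[:, None]  # (H, 1)
    xs = np.arange(width, dtype=np.float32)[None, :]   # (1, W)

    volume = np.zeros((num_joints, num_frames, height, width), dtype=np.float32)
    for m in range(num_persons):
        for t in range(num_frames):
            for k in range(num_joints):
                x, y = keypoints[m, t, k]
                gauss = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
                # Merge persons with a per-pixel max: adding a person
                # changes the values, never the shape of the volume.
                np.maximum(volume[k, t], scores[m, t, k] * gauss, out=volume[k, t])
    return volume


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    one = heatmap_volume(rng.uniform(0, 56, (1, 8, 17, 2)), np.ones((1, 8, 17)), 56, 56)
    two = heatmap_volume(rng.uniform(0, 56, (2, 8, 17, 2)), np.ones((2, 8, 17)), 56, 56)
    print(one.shape, two.shape)  # both (17, 8, 56, 56)
```

Because the downstream network only sees the fixed-size volume, its cost in this sketch is the same for one person or two; only the rendering loop grows with the number of persons.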

The authors provide empirical evidence of PoseConv3D's superior performance across various skeleton-based action recognition benchmarks. When fused with other modalities, PoseConv3D achieves state-of-the-art results on all multi-modality action recognition benchmarks considered. Its features can also be integrated with other modalities early in the processing pipeline, which offers a flexible design space for further performance gains.
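One way to picture early-stage integration is a feature-level merge between a pose pathway and an RGB pathway. The sketch below is hypothetical and is not taken from the paper: it assumes both pathways emit feature maps shaped (channels, frames, height, width) with matching spatial size, resamples the pose features in time by nearest-neighbor indexing, and concatenates along the channel axis. The name `lateral_fuse` and the resampling rule are illustrative assumptions.

```python
import numpy as np


def lateral_fuse(rgb_feat, pose_feat):
    """Concatenate pose features onto RGB features along the channel axis.

    rgb_feat:  (C_rgb, T_rgb, H, W) feature map from an RGB pathway.
    pose_feat: (C_pose, T_pose, H, W) feature map from a pose pathway.
    Returns an array of shape (C_rgb + C_pose, T_rgb, H, W).
    """
    if rgb_feat.shape[2:] != pose_feat.shape[2:]:
        raise ValueError("spatial sizes of the two feature maps must match")

    t_rgb, t_pose = rgb_feat.shape[1], pose_feat.shape[1]
    # Nearest-neighbor temporal resampling (an assumed, simple choice):
    # pick one pose frame index for every RGB frame index.
    idx = np.arange(t_rgb) * t_pose // t_rgb
    return np.concatenate([rgb_feat, pose_feat[:, idx]], axis=0)


if __name__ == "__main__":
    rgb = np.zeros((64, 8, 14, 14), dtype=np.float32)
    pose = np.ones((32, 32, 14, 14), dtype=np.float32)
    print(lateral_fuse(rgb, pose).shape)  # (96, 8, 14, 14)
```

Since heatmap volumes are image-like grids, their features can be aligned with RGB features this way, which is harder to do with graph-structured features.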

Experimental Outcomes

PoseConv3D outperforms existing GCN-based methods in both skeleton-based and multi-modality action recognition. Specifically, it achieves leading performance on five out of six skeleton-based benchmarks. With multi-modality fusion, it is effective on all eight investigated datasets, which supports the claims of robustness and generalization.

The paper also explores the effectiveness of different design choices in the context of pose extraction and representation. It concludes that high-quality 2D pose representations, when processed as 3D heatmap volumes, lead to better recognition performance than traditional 3D reconstruction methods or coordinate-based input formats.

Theoretical and Practical Implications

The move from graph sequences processed by GCNs to 3D heatmap volumes represents a substantive methodological shift for skeleton-based action recognition. By addressing the key drawbacks of GCNs in robustness and scalability, PoseConv3D could change how models for human action recognition are designed. Moreover, the successful integration of pose data with other modalities suggests broader applicability across domains that need joint action and contextual understanding.

Speculation on Future Developments

Future work in action recognition might extend PoseConv3D to more complex, real-world environments, where many actions and interactions occur at once. The interplay between modalities beyond RGB and pose could also be explored, potentially involving depth sensors, audio data, or contextual scene understanding, building further on the interoperability that PoseConv3D emphasizes.

In conclusion, the introduction of PoseConv3D marks a significant step forward in skeleton-based action recognition. It effectively utilizes 3D-CNNs to overcome the limitations seen in GCNs, offering a more robust, scalable, and versatile solution for action recognition tasks. This work lays the groundwork for further innovations, possibly leading to systems that are not only more accurate but also more adaptable to varied and complex datasets.
