Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition (2208.10741v3)

Published 23 Aug 2022 in cs.CV

Abstract: Graph convolutional networks (GCNs) are the most commonly used methods for skeleton-based action recognition and have achieved remarkable performance. Generating adjacency matrices with semantically meaningful edges is particularly important for this task, but extracting such edges is a challenging problem. To solve this, we propose a hierarchically decomposed graph convolutional network (HD-GCN) architecture with a novel hierarchically decomposed graph (HD-Graph). The proposed HD-GCN effectively decomposes every joint node into several sets to extract major structurally adjacent and distant edges, and uses them to construct an HD-Graph containing those edges in the same semantic spaces of a human skeleton. In addition, we introduce an attention-guided hierarchy aggregation (A-HA) module to highlight the dominant hierarchical edge sets of the HD-Graph. Furthermore, we apply a new six-way ensemble method, which uses only joint and bone streams without any motion stream. The proposed model is evaluated and achieves state-of-the-art performance on four large, popular datasets. Finally, we demonstrate the effectiveness of our model with various comparative experiments.

Summary

The paper introduces a novel hierarchically decomposed graph convolution network that captures both adjacent and distant skeletal joint relationships for improved action recognition.
It employs an attention-guided hierarchy aggregation module using RSAP and H-EdgeConv to emphasize critical features across node hierarchies.
Experiments on four benchmarks show HD-GCN outperforming prior state-of-the-art models while using only joint and bone streams.

Overview of Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition

The paper presents a novel architecture for skeleton-based action recognition using hierarchically decomposed graph convolutional networks (HD-GCN). The research addresses the challenges faced by traditional Graph Convolutional Networks (GCNs) in detecting semantically significant edges within human skeleton graphs. Particularly, it focuses on extracting meaningful structural relationships, both adjacent and remote, between human skeletal joints to improve recognition accuracy.

Methodology

The authors propose the HD-Graph, which decomposes the skeleton's joint nodes into hierarchical node sets. Connecting these sets through both structurally adjacent and distant edges, within the same semantic space, enables better human action recognition. Because the resulting edges can link joints that are not physically connected, the HD-Graph mitigates the long-range dependency limitations of conventional skeleton graphs.

The representation of hierarchy in the HD-Graph fosters the extraction of robust structural relationships by dynamically connecting nodes in adjacent hierarchy levels. A rooted tree structure serves as the basis for establishing these connections while accommodating the variations across different skeletal data. Notably, this method allows the incorporation of both physically connected (PC) and fully connected (FC) edges, augmenting the receptive fields of GCNs significantly.
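The decomposition described above can be illustrated with a minimal sketch. It assumes that hierarchy levels correspond to tree depth from a chosen root (standing in for the CoM node) and that every joint in one level is fully connected to every joint in the next. The 7-joint skeleton, the function names, and the choice of root are hypothetical and are not taken from the paper or its repository.

```python
from collections import deque

# Hypothetical 7-joint toy skeleton (not a real dataset layout).
# Each pair is a physically connected (PC) edge.
PC_EDGES = [(0, 1), (1, 2), (1, 3), (1, 4), (0, 5), (0, 6)]
NUM_JOINTS = 7


def hierarchy_sets(edges, num_joints, root):
    """Group joints by tree depth from the root (the assumed CoM node)."""
    neighbors = {j: [] for j in range(num_joints)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    depth = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first traversal of the rooted tree
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    levels = [[] for _ in range(max(depth.values()) + 1)]
    for joint in sorted(depth):
        levels[depth[joint]].append(joint)
    return levels


def adjacent_level_edges(levels):
    """Fully connect every joint in level k with every joint in level k + 1."""
    edge_sets = []
    for upper, lower in zip(levels, levels[1:]):
        edge_sets.append([(u, v) for u in upper for v in lower])
    return edge_sets


levels = hierarchy_sets(PC_EDGES, NUM_JOINTS, root=1)
edge_sets = adjacent_level_edges(levels)
print(levels)
print(edge_sets)
```

In this toy case the second edge set contains pairs such as (2, 5), which are not physically connected. This is the sense in which fully connecting adjacent levels widens the receptive field beyond the PC edges.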

Complementing the HD-Graph is the attention-guided hierarchy aggregation (A-HA) module, which emphasizes the most significant hierarchical edge sets. It consists of two submodules: representative spatial average pooling (RSAP) and hierarchical edge convolution (H-EdgeConv). Together they identify the dominant hierarchical features and weight them accordingly.
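The general shape of such an aggregation step can be sketched as follows. This is a heavily simplified illustration, not the authors' module. RSAP is approximated as a mean over a few chosen joints per level, a plain dot product with a hypothetical weight vector stands in for H-EdgeConv, and the use of a softmax to normalize the scores is an assumption. All names and numbers are made up.

```python
import math


def rsap(level_map, representative_joints):
    """Mean feature vector over the representative joints of one level."""
    channels = len(level_map[0])
    count = len(representative_joints)
    return [
        sum(level_map[j][c] for j in representative_joints) / count
        for c in range(channels)
    ]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels(level_maps, representatives, score_weights):
    """Attention-weighted sum of the per-level feature maps."""
    descriptors = [rsap(m, r) for m, r in zip(level_maps, representatives)]
    # Placeholder scoring: a dot product stands in for H-EdgeConv.
    scores = [sum(w * x for w, x in zip(score_weights, d)) for d in descriptors]
    attention = softmax(scores)
    num_joints, channels = len(level_maps[0]), len(level_maps[0][0])
    fused = [
        [sum(a * m[j][c] for a, m in zip(attention, level_maps))
         for c in range(channels)]
        for j in range(num_joints)
    ]
    return descriptors, attention, fused


# Toy input: 2 hierarchy levels, 3 joints, 2 channels (made-up values).
level_maps = [
    [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]],
    [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0]],
]
representatives = [[0, 1], [1, 2]]  # hypothetical representative joints
descriptors, attention, fused = fuse_levels(level_maps, representatives, [1.0, 0.0])
print(descriptors, attention)
```

The point of the sketch is only the data flow: pool each hierarchy level to a compact descriptor, score the descriptors against each other, and use the normalized scores to weight each level's contribution.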

Results

The authors validate their approach on four benchmarks: NTU-RGB+D 60, NTU-RGB+D 120, Kinetics-Skeleton, and Northwestern-UCLA. HD-GCN outperforms existing state-of-the-art methods in classification accuracy on these datasets without relying on motion stream data. These results suggest that the model not only localizes but also accentuates key features across the hierarchical node sets.
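For readers unfamiliar with the stream terminology, the sketch below shows how the streams are commonly derived in skeleton-based recognition. A bone is the vector from a joint's parent to the joint, and motion is the frame-to-frame difference. The 4-joint chain, the `PARENT` table, and the coordinates are hypothetical, and the paper's own preprocessing may differ in detail.

```python
# Hypothetical 4-joint chain; PARENT[j] is the joint that j hangs from.
# Joint 0 is the root and is listed as its own parent.
PARENT = [0, 0, 1, 2]


def bone_stream(frames, parent):
    """Bone vector per joint: joint position minus its parent's position."""
    return [
        [[c - p for c, p in zip(frame[j], frame[parent[j]])]
         for j in range(len(frame))]
        for frame in frames
    ]


def motion_stream(frames):
    """Temporal difference between consecutive frames (one frame shorter)."""
    return [
        [[b - a for a, b in zip(joint_prev, joint_next)]
         for joint_prev, joint_next in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


# Two frames of made-up 2-D joint coordinates: frames[t][joint] = [x, y].
frames = [
    [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [1.0, 2.0]],
    [[0.0, 0.0], [0.0, 1.0], [1.0, 2.0], [2.0, 2.0]],
]
bones = bone_stream(frames, PARENT)
motion = motion_stream(frames)
print(bones[0])
print(motion[0])
```

HD-GCN, as summarized here, uses only the first two kinds of input (joint positions and bone vectors) and omits the motion stream.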

A six-way ensemble method further improves the accuracy of HD-GCN. Unlike conventional four-stream ensembles, which incorporate motion data, the proposed ensemble uses only joint and bone stream data. The six members come from pairing these two streams with three different HD-Graphs, each built around a different center-of-mass (CoM) node. The varied graphs give the ensemble members diverse views of the skeleton, which improves the model's accuracy and adaptability.
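A minimal sketch of how such a six-way ensemble could be fused at test time is shown below. It assumes equal-weight summation of per-model class scores. The CoM labels, the score values, and the equal weighting are all hypothetical, and the authors' actual fusion rule may use different weights.

```python
from itertools import product

STREAMS = ("joint", "bone")
COM_NODES = ("com_a", "com_b", "com_c")  # hypothetical labels for three CoM choices


def ensemble_predict(scores_by_model):
    """Sum class scores over all models and return (best class, fused scores)."""
    num_classes = len(next(iter(scores_by_model.values())))
    fused = [0.0] * num_classes
    for scores in scores_by_model.values():
        for cls, score in enumerate(scores):
            fused[cls] += score  # equal weighting is an assumption
    best = max(range(num_classes), key=fused.__getitem__)
    return best, fused


# Two streams x three graphs = six ensemble members.
models = list(product(STREAMS, COM_NODES))

# Made-up class scores for a 3-class problem, one row per model.
toy_scores = {
    ("joint", "com_a"): [0.6, 0.3, 0.1],
    ("joint", "com_b"): [0.2, 0.5, 0.3],
    ("joint", "com_c"): [0.3, 0.4, 0.3],
    ("bone", "com_a"): [0.5, 0.4, 0.1],
    ("bone", "com_b"): [0.1, 0.6, 0.3],
    ("bone", "com_c"): [0.4, 0.4, 0.2],
}
predicted, fused = ensemble_predict(toy_scores)
print(len(models), predicted, fused)
```

In the toy data the single model ("joint", "com_a") prefers class 0, but the summed scores favor class 1, which illustrates how diverse members can correct one another.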

Implications and Future Directions

The HD-GCN sets a promising direction for future AI applications in human action recognition, human-computer interactions, and surveillance systems by providing an enriched understanding of skeletal semantics. The hierarchical decomposition strategy underscores the importance of integrating advanced architectural considerations in graph-based neural networks, highlighting potential adaptations in other skeletal recognition applications.

Future developments may explore further refinements in HD-Graph's construction, such as adaptive CoM node selection, to enhance computational efficiency and accuracy. Additionally, the expansion of this framework to other modalities like RGB-D data could pave the way for a more holistic approach to action recognition, offering deeper insights into multimodal learning systems.

Overall, this paper marks a substantial step forward in skeleton-based action recognition, combining hierarchical decomposition with advanced attention mechanisms to achieve state-of-the-art results in a computationally efficient manner.

GitHub

GitHub - Jho-Yonsei/HD-GCN: [ICCV 2023] Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition (149 stars)