Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition (2003.14111v2)

Published 31 Mar 2020 in cs.CV

Abstract: Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.

Citations (723)

Summary

  • The paper introduces a disentangled multi-scale aggregation scheme that effectively captures long-range joint dependencies.
  • It presents the G3D module, a unified spatial-temporal operator that enables dense cross-spacetime information flow with dilated temporal windows.
  • Experimental results on NTU RGB+D and Kinetics datasets demonstrate significant accuracy improvements over previous state-of-the-art methods.

Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

The paper "Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition" addresses two critical limitations in existing skeleton-based action recognition methods: the inability to model unbiased long-range joint relationships effectively under multi-scale operators, and the restricted information flow across spacetime which hinders the capture of complex spatial-temporal dependencies. The authors present a method that disentangles multi-scale graph convolutions and introduce a unified spatial-temporal graph convolutional operator named G3D.

Technical Approach

The methodological advancements in this paper center around two main contributions:

  1. Disentangled Multi-Scale Aggregation:
    • Biased Weighting Problem: Traditional approaches to multi-scale aggregation in graph convolutional networks (GCNs) suffer from a bias towards closer nodes due to adjacency matrix powering, which diminishes their effectiveness in capturing long-range dependencies. This problem is exacerbated by the inclusion of node self-loops, amplifying the bias through redundant cyclic walks.
    • Disentangled Neighborhoods: To mitigate the biased weighting issue, the authors propose a new aggregation scheme that disentangles node features from different neighborhoods. By defining a k-adjacency matrix, the method captures k-hop neighborhood relationships separately, without the redundant dependencies contributed by closer neighborhoods. This ensures that multi-scale operators can effectively model long-range relationships (a sketch of this construction follows the list).
  2. G3D: Unified Spatial-Temporal Graph Convolution:
    • Cross-Spacetime Modeling: Recognizing the limitations of factorized spatial and temporal modeling, the paper introduces the G3D module. This module employs spatial-temporal windows, allowing for dense cross-spacetime edges that act as skip connections. These enable unobstructed information flow and facilitate the capture of complex spatial-temporal dependencies.
    • Dilated Temporal Windows: By utilizing dilated temporal windows, the G3D module can efficiently cover larger temporal contexts without a proportional increase in computational complexity. This flexibility supports multi-scale learning directly in the spatial-temporal domain.
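
To make the disentangling concrete, below is a minimal NumPy sketch of the k-adjacency construction (the names `k_adjacency` and `multiscale_gcn_layer` are ours, not the paper's, and the actual model operates on batched tensors with learned, trainable weights). The contrast with plain adjacency powering is that (A + I)^k counts walks, so nearby joints accumulate weight through redundant cyclic walks, whereas an indicator-style construction keeps only the joints at hop distance exactly k:

```python
import numpy as np

def k_adjacency(A, k, with_self_loops=True):
    """Disentangled k-adjacency: 1 where the shortest hop distance
    between two joints is exactly k (plus optional self-loops).

    A pair of nodes is within k hops iff (I + A)^k has a nonzero
    entry, so the pairs at distance exactly k are those reachable
    in k hops but not in k - 1.
    """
    n = A.shape[0]
    I = np.eye(n)
    if k == 0:
        return I
    reach_k   = np.minimum(np.linalg.matrix_power(I + A, k), 1)
    reach_km1 = np.minimum(np.linalg.matrix_power(I + A, k - 1), 1)
    Ak = reach_k - reach_km1              # exactly k hops away
    if with_self_loops:
        Ak += I
    return Ak

def normalize(Ak, eps=1e-6):
    """Symmetric degree normalization D^{-1/2} A D^{-1/2}."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(Ak.sum(axis=1) + eps))
    return D_inv_sqrt @ Ak @ D_inv_sqrt

def multiscale_gcn_layer(X, A, weights):
    """Sum of per-scale graph convolutions: each scale k gets its
    own disentangled adjacency and its own weight matrix W_k."""
    out = 0
    for k, Wk in enumerate(weights):
        out = out + normalize(k_adjacency(A, k)) @ X @ Wk
    return out
```

Summing the per-scale outputs, each with its own weights, puts every distance band on an equal footing, which is what restores unbiased long-range modeling.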

Integrating these two methods, the authors develop the MS-G3D, a powerful feature extractor that combines the strengths of disentangled multi-scale aggregation with unified spatial-temporal graph convolutions.
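
The windowed construction at the heart of G3D can be sketched just as compactly. The snippet below uses our own naming and boundary handling (clamping at sequence edges is our assumption, not necessarily the authors' choice): tiling the self-loop adjacency across a window of tau frames yields the dense cross-spacetime edges described above, and the window's frames are gathered with a dilation factor.

```python
import numpy as np

def window_adjacency(A, tau):
    """Cross-spacetime adjacency for a window of tau frames.

    Tiling (A + I) over a tau x tau grid connects every joint to
    itself and to its spatial neighbors in *every* frame of the
    window, creating direct cross-spacetime skip connections.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)               # add self-loops
    return np.tile(A_tilde, (tau, tau))   # [tau*n, tau*n]

def dilated_window(X, t, tau, dilation):
    """Stack tau frames around frame t, sampled with the given
    dilation, into one windowed node-feature matrix.

    X: [T, N, C] sequence of per-frame joint features.
    Returns: [tau*N, C], ready to multiply with window_adjacency.
    """
    T = X.shape[0]
    half = tau // 2
    idx = [int(np.clip(t + dilation * (i - half), 0, T - 1))
           for i in range(tau)]           # clamp at the boundaries
    return np.concatenate([X[i] for i in idx], axis=0)
```

A G3D layer would then apply an ordinary graph convolution with `window_adjacency` (degree-normalized as in the previous sketch) to each windowed feature matrix. Because tau stays fixed while the dilation grows, larger temporal contexts come at no extra per-window cost, which is the efficiency argument made above.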

Experimental Validation

The proposed model was evaluated on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400. Across all datasets and evaluation settings, the model demonstrated superior performance compared to state-of-the-art methods:

  • NTU RGB+D 120: Achieved classification accuracies of 86.9% in the Cross-Subject setting and 88.4% in the Cross-Setup setting, significantly outperforming previous approaches.
  • NTU RGB+D 60: The model attained accuracies of 91.5% (Cross-Subject) and 96.2% (Cross-View), surpassing the performance of prior state-of-the-art methods including DGNN and GR-GCN.
  • Kinetics Skeleton 400: Achieved a Top-1 accuracy of 38.0% and a Top-5 accuracy of 60.9%, establishing a new benchmark for skeleton-based action recognition on this dataset.

Implications and Future Directions

The introduction of the MS-G3D model addresses crucial limitations in the domain of skeleton-based action recognition, particularly in effectively capturing long-range and complex spatial-temporal dependencies. This work opens up new avenues for improving feature extraction from graph-structured data, with potential applications extending beyond action recognition to other areas such as human pose estimation and interactive systems.

Future research could explore optimizing the computational efficiency of the MS-G3D model, particularly in handling very large-scale datasets or real-time applications. Additionally, investigating the integration of MS-G3D with other modalities, such as raw RGB video inputs or audio signals, could further enhance action recognition performance.

Conclusion

The paper contributes significantly to the advancement of graph convolutional networks for skeleton-based action recognition. By disentangling multi-scale graph convolutions and unifying spatial-temporal graph convolutions through the G3D module, the authors present a method that excels in modeling complex joint correlations across both spatial and temporal domains. The experimental results substantiate the efficacy of their approach, setting new standards in the field and paving the way for further innovations in graph-based action recognition.