Global Skeleton-relational Transformer (GST)

Updated 3 December 2025
  • GST is a transformer-based architecture that models structured spatial and temporal dependencies without explicit graph priors.
  • It employs focal joint selection, body-part abstraction, and joint–part cross-attention to enhance discriminative feature representation.
  • The framework fuses global self-attention with dilated convolutions to capture both long-range and local temporal dynamics in action recognition.

The Global Skeleton-relational Transformer (GST) is a transformer-based architectural paradigm for modeling structured spatial and temporal dependencies in skeleton-based action recognition. GST is realized within the FG-STFormer network, which integrates focal and global spatial-temporal attention across both joints and body-parts, leveraging learned relational graphs and local temporal dynamics without explicit hard-wiring of graph adjacency. The GST mechanism unifies spatial and temporal self-attention, focal joint selection, body-part abstraction, joint–part cross-attention, and dilated convolutional temporal biases, thereby enabling comprehensive and flexible modeling of skeleton dynamics (Gao et al., 2022).

1. Spatial Attention over Skeleton Structures

The spatial module in FG-STFormer, termed FG-SFormer, encapsulates three principal sub-components: Basic-SFormer self-attention, focal joint selection, and joint–part mutual cross-attention (JP-CA). Basic-SFormer computes multi-head self-attention over the skeleton tokens, operating either on selected focal joints ($X^J \in \mathbb{R}^{K\times T\times C}$) or aggregated body parts ($X^P \in \mathbb{R}^{P\times T\times C}$), where $K$ is the number of selected joints, $P$ the number of parts, $T$ the temporal dimension, and $C$ the feature dimension.

Self-attention proceeds as follows:

  • $Q^h = X W_Q^h$, $K^h = X W_K^h$, and $V^h = X W_V^h$ for head $h = 1, \ldots, H$.
  • Attention map: $A^h = \mathrm{softmax}\!\left(\frac{Q^h (K^h)^\top}{\sqrt{d}}\right)$, $O^h = A^h V^h$.
  • Output: $O = \mathrm{Concat}_h(O^h)\, W_O + X$, followed by FFN and LayerNorm.

This structure enables modeling of both inter-joint and inter-part relations for each frame, reflecting a global skeleton relational graph emergent from the attention weights.
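
The per-frame spatial self-attention above can be summarized in a minimal PyTorch sketch; the class name and tensor shapes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class BasicSFormer(nn.Module):
    """Minimal sketch: multi-head self-attention over skeleton tokens, per frame."""
    def __init__(self, c: int, heads: int = 8):
        super().__init__()
        assert c % heads == 0
        self.h, self.d = heads, c // heads
        self.qkv = nn.Linear(c, 3 * c)   # packs W_Q, W_K, W_V for all heads
        self.proj = nn.Linear(c, c)      # W_O
        self.norm1 = nn.LayerNorm(c)
        self.ffn = nn.Sequential(nn.Linear(c, 4 * c), nn.GELU(), nn.Linear(4 * c, c))
        self.norm2 = nn.LayerNorm(c)

    def forward(self, x):                # x: (B, N, T, C) joint or part tokens
        B, N, T, C = x.shape
        tok = x.permute(0, 2, 1, 3).reshape(B * T, N, C)          # attend over joints within each frame
        q, k, v = self.qkv(tok).chunk(3, dim=-1)
        q = q.reshape(B * T, N, self.h, self.d).transpose(1, 2)   # (B*T, H, N, d)
        k = k.reshape(B * T, N, self.h, self.d).transpose(1, 2)
        v = v.reshape(B * T, N, self.h, self.d).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)  # soft relational graph (N x N)
        out = (attn @ v).transpose(1, 2).reshape(B * T, N, C)     # concatenate heads
        tok = self.norm1(tok + self.proj(out))                    # residual + LayerNorm
        tok = self.norm2(tok + self.ffn(tok))                     # position-wise FFN
        return tok.reshape(B, T, N, C).permute(0, 2, 1, 3)
```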

2. Focal Joint Selection and Body-part Abstraction

A focal joint selection mechanism identifies the most informative joints at each time frame. A per-joint, per-frame score map ($S \in \mathbb{R}^{N \times T}$, $N$ being the number of joints) is learned by projecting the joint-wise feature tensor $X_2$:

  • $S = \mathrm{sigmoid}\!\left(\frac{X_2 w_p}{\|w_p\|}\right)$, where $w_p \in \mathbb{R}^C$.

For every frame $t$, the top-$K$ scoring joints are chosen, forming $X_2^J$ for downstream self-attention. In parallel, joints are partitioned into $P$ body-parts, whose feature tokens $X_2^P$ are generated by aggregating their member joint features and projecting to $C$ dimensions. Separate attention blocks on body-parts impart global contextual modeling and increase abstraction, while the selection process enforces sparsification and focus on discriminative regions.
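
A minimal sketch of the per-frame scoring and top-$K$ selection, assuming joint features shaped (B, N, T, C) and a learnable projection vector w_p as defined above (the module name is illustrative):

```python
import torch
import torch.nn as nn

class FocalJointSelect(nn.Module):
    """Score every joint at every frame and keep the top-K most informative ones."""
    def __init__(self, c: int, k: int):
        super().__init__()
        self.w_p = nn.Parameter(torch.randn(c))   # projection vector w_p in R^C
        self.k = k

    def forward(self, x):                                          # x: (B, N, T, C)
        # S = sigmoid(X_2 w_p / ||w_p||): one score per joint per frame
        scores = torch.sigmoid(x @ (self.w_p / self.w_p.norm()))   # (B, N, T)
        top_idx = scores.topk(self.k, dim=1).indices               # (B, K, T)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1))
        x_focal = torch.gather(x, dim=1, index=gather_idx)         # focal joint tokens X_2^J
        return x_focal, scores
```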

3. Joint–Part Mutual Cross-Attention Mechanism

Mutual cross-attention (JP-CA) links the focal joint and body-part branches by sharing information through attention-driven feature fusion. For the "part→joint" direction:

  • $Q_J = X^J W_Q^J$, $K_P = X^P W_K^P$, $V_P = X^P W_V^P$,
  • $A_{J \leftarrow P} = \mathrm{softmax}\!\left(\frac{Q_J K_P^\top}{\sqrt{d}}\right)$,
  • $O_{J \leftarrow P} = A_{J \leftarrow P} V_P$.

This output updates the joint features, symmetrically implemented for "joint→part." Embedding these cross-attentions enables the encoding of kinematic correlations between joints and their respective body-parts, approximating soft edges in the skeleton's relational graph. The pipeline alternates Basic-SFormer and JP-CA in a deeply residual manner, preserving normalization and gradient flow.
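
The "part→joint" direction of JP-CA can be sketched as a standard cross-attention block; a symmetric block with the two inputs swapped would realize "joint→part". The single-head form and the shapes below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PartToJointCrossAttention(nn.Module):
    """Single-head sketch of the part->joint cross-attention (JP-CA)."""
    def __init__(self, c: int):
        super().__init__()
        self.w_q = nn.Linear(c, c)   # W_Q^J on focal joint tokens
        self.w_k = nn.Linear(c, c)   # W_K^P on body-part tokens
        self.w_v = nn.Linear(c, c)   # W_V^P on body-part tokens
        self.norm = nn.LayerNorm(c)

    def forward(self, x_joint, x_part):               # (B*T, K, C), (B*T, P, C)
        q = self.w_q(x_joint)                          # queries from focal joints
        k, v = self.w_k(x_part), self.w_v(x_part)      # keys/values from body parts
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)  # (B*T, K, P)
        return self.norm(x_joint + attn @ v)           # residual update of the joint features
```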

4. Temporal Relational Attention with Dilated Convolutions

Temporal dependencies are modeled by the FG-TFormer module, which combines global temporal self-attention and local temporal aggregation via dilated 1D convolutions. For temporal input $X^l \in \mathbb{R}^{M \times T \times C}$, each attention head generates queries and keys as linear projections, while the values are extracted by:

  • $V^h = \mathrm{DilatedConv}_{k_t, d_t}(X^l)$, projected to the head dimension $d$.

The kernel size $k_t$ and dilation $d_t$ control the local receptive field, imparting an inductive bias toward short-range motion patterns. The attention map over time steps ($A^h \in \mathbb{R}^{T \times T}$) is fused with the local values as $\mathrm{HeadOutput}^h = A^h V^h + V^h$, ensuring that representations capture both global temporal edges and local dynamics. Concatenation, linear projection, residual addition, and feed-forward normalization complete the temporal update.
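
One FG-TFormer attention head can be sketched as below, with the M joint/part tokens flattened into the batch dimension; the kernel size, dilation, and padding choice are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    """One temporal attention head whose values come from a dilated 1D convolution."""
    def __init__(self, c: int, d: int, k_t: int = 3, d_t: int = 2):
        super().__init__()
        self.w_q = nn.Linear(c, d)                    # global temporal queries
        self.w_k = nn.Linear(c, d)                    # global temporal keys
        pad = (k_t - 1) * d_t // 2                    # keeps T unchanged for odd kernels
        self.value_conv = nn.Conv1d(c, d, kernel_size=k_t, dilation=d_t, padding=pad)

    def forward(self, x):                             # x: (B*M, T, C)
        q, k = self.w_q(x), self.w_k(x)
        v = self.value_conv(x.transpose(1, 2)).transpose(1, 2)   # local values, (B*M, T, d)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)  # (B*M, T, T)
        return attn @ v + v                           # HeadOutput = A V + V (global + local fusion)
```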

5. Emergent Skeleton-relational Graphs

GST does not utilize explicit graph adjacency matrices. Instead, the learned attention maps $A^h$ from the spatial and temporal modules naturally induce soft relational graphs:

  • In Basic-SFormer, the spatial maps $A^h_t \in \mathbb{R}^{N\times N}$ for frame $t$ quantify pairwise joint/part affinities.
  • In FG-TFormer, the temporal maps $A^h_n \in \mathbb{R}^{T\times T}$ for joint or part $n$ encode global temporal relations.

The focal selection mechanism determines which nodes (individual joints versus body-parts) enter the graph, JP-CA learns cross-level edges between joints and their body-parts, and the dilated convolution explicitly adds short-range temporal edges. Together, these components configure a dynamic, data-driven skeleton-relational graph exploited by attention.
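
As an illustration of this reading (not part of the original method), a learned spatial attention tensor such as the one computed in the sketch above can be reduced to an explicit soft adjacency over joints or parts:

```python
import torch

def soft_skeleton_graph(attn: torch.Tensor, top_e: int = 3):
    """Reduce per-head spatial attention for one frame, attn: (H, N, N), to a soft graph.

    Returns the top_e strongest attention weights and neighbor indices per node.
    """
    adj = attn.mean(dim=0)                         # average heads -> (N, N) soft edge weights
    weights, neighbors = adj.topk(top_e, dim=-1)   # strongest relations for each joint/part
    return weights, neighbors
```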

6. FG-STFormer Network Pipeline and Classification

Input skeleton sequences $X_{in} \in \mathbb{R}^{N\times T\times C_0}$ are linearly projected and processed through two main stages. Stage 1 (with $L_1$ layers) alternates spatial (all-joint attention) and temporal (FG-TFormer) modules, generating joint features $X_2$. Stage 2 (with $L_2$ layers) bifurcates into joint and part branches, applying FG-SFormer (Basic-SFormer and JP-CA) and FG-TFormer to the two branches in parallel.

After pooling over the spatio-temporal dimensions ($K\times T$ for joints, $P\times T$ for parts), the features are concatenated and passed to an MLP and softmax for classification. This design unifies joint-level sparsification, body-part abstraction, mutual cross-attention, and convolutionally-informed temporal attention atop a pure transformer backbone, enabling flexible and robust skeleton-based action recognition. Attention maps serve as relational graphs in both spatial and temporal domains (Gao et al., 2022).
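
The pooling and classification step described above can be sketched as follows; the mean-pooling operator and the MLP width are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Pool the joint and part streams, concatenate, and classify."""
    def __init__(self, c: int, n_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * c, c), nn.ReLU(), nn.Linear(c, n_classes))

    def forward(self, x_joint, x_part):           # (B, K, T, C) and (B, P, T, C)
        f_joint = x_joint.mean(dim=(1, 2))        # pool over the K x T joint tokens
        f_part = x_part.mean(dim=(1, 2))          # pool over the P x T part tokens
        logits = self.mlp(torch.cat([f_joint, f_part], dim=-1))
        return logits                              # softmax is applied by the loss / at inference
```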

7. Relations to Action Recognition Methodologies and Benchmarks

FG-STFormer has demonstrated competitive performance on NTU-60, NTU-120, and NW-UCLA benchmarks, outperforming previous transformer-based architectures and comparing favorably to state-of-the-art GCN-based methods (Gao et al., 2022). The coupling of focal and global relational modeling distinguishes GST from prior approaches that compute attention uniformly across all joints, as GST selectively accentuates discriminative features and context, incorporates explicit local temporal modeling, and realizes soft relational graphs end-to-end within the transformer framework. This suggests that GST provides increased flexibility and representational capacity compared to approaches with fixed graph priors or undifferentiated attention.

References

Gao, Z., et al. (2022). Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition.