Global Skeleton-relational Transformer (GST)

Updated 3 December 2025
  • GST is a transformer-based architecture that models structured spatial and temporal dependencies without explicit graph priors.
  • It employs focal joint selection, body-part abstraction, and joint–part cross-attention to enhance discriminative feature representation.
  • The framework fuses global self-attention with dilated convolutions to capture both long-range and local temporal dynamics in action recognition.

The Global Skeleton-relational Transformer (GST) is a transformer-based architectural paradigm for modeling structured spatial and temporal dependencies in skeleton-based action recognition. GST is realized within the FG-STFormer network, which integrates focal and global spatial-temporal attention across both joints and body-parts, leveraging learned relational graphs and local temporal dynamics without explicit hard-wiring of graph adjacency. The GST mechanism unifies spatial and temporal self-attention, focal joint selection, body-part abstraction, joint–part cross-attention, and dilated convolutional temporal biases, thereby enabling comprehensive and flexible modeling of skeleton dynamics (Gao et al., 2022).

1. Spatial Attention over Skeleton Structures

The spatial module in FG-STFormer, termed FG-SFormer, encapsulates three principal sub-components: Basic-SFormer self-attention, focal joint selection, and joint–part mutual cross-attention (JP-CA). Basic-SFormer computes multi-head self-attention over the skeleton tokens, operating either on selected focal joints ($X^J \in \mathbb{R}^{K\times T\times C}$) or aggregated body parts ($X^P \in \mathbb{R}^{P\times T\times C}$), where $K$ is the number of selected joints, $P$ the number of parts, $T$ the temporal dimension, and $C$ the feature dimension.

Self-attention proceeds as follows:

  • $Q^h = X W_Q^h$, $K^h = X W_K^h$, and $V^h = X W_V^h$ for head $h = 1, \ldots, H$.
  • Attention map: $A^h = \mathrm{softmax}\!\left(\frac{Q^h (K^h)^\top}{\sqrt{d}}\right)$, $O^h = A^h V^h$.
  • Output: $O = \mathrm{Concat}_h(O^h)\, W_O + X$, followed by FFN and LayerNorm.

This structure enables modeling of both inter-joint and inter-part relations for each frame, reflecting a global skeleton relational graph emergent from the attention weights.
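
The per-frame spatial self-attention above can be summarized in a minimal PyTorch sketch; the class name and tensor shapes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class BasicSFormer(nn.Module):
    """Minimal sketch: multi-head self-attention over skeleton tokens, per frame."""
    def __init__(self, c: int, heads: int = 8):
        super().__init__()
        assert c % heads == 0
        self.h, self.d = heads, c // heads
        self.qkv = nn.Linear(c, 3 * c)   # packs W_Q, W_K, W_V for all heads
        self.proj = nn.Linear(c, c)      # W_O
        self.norm1 = nn.LayerNorm(c)
        self.ffn = nn.Sequential(nn.Linear(c, 4 * c), nn.GELU(), nn.Linear(4 * c, c))
        self.norm2 = nn.LayerNorm(c)

    def forward(self, x):                # x: (B, N, T, C) joint or part tokens
        B, N, T, C = x.shape
        tok = x.permute(0, 2, 1, 3).reshape(B * T, N, C)          # attend over joints within each frame
        q, k, v = self.qkv(tok).chunk(3, dim=-1)
        q = q.reshape(B * T, N, self.h, self.d).transpose(1, 2)   # (B*T, H, N, d)
        k = k.reshape(B * T, N, self.h, self.d).transpose(1, 2)
        v = v.reshape(B * T, N, self.h, self.d).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)  # soft relational graph (N x N)
        out = (attn @ v).transpose(1, 2).reshape(B * T, N, C)     # concatenate heads
        tok = self.norm1(tok + self.proj(out))                    # residual + LayerNorm
        tok = self.norm2(tok + self.ffn(tok))                     # position-wise FFN
        return tok.reshape(B, T, N, C).permute(0, 2, 1, 3)
```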

2. Focal Joint Selection and Body-part Abstraction

A focal joint selection mechanism identifies the most informative joints at each time frame. A per-joint, per-frame score map ($S \in \mathbb{R}^{N \times T}$, $N$ being the number of joints) is learned by projecting the joint-wise feature tensor $X_2$:

  • $S = \mathrm{sigmoid}\!\left(\frac{X_2 w_p}{\|w_p\|}\right)$, where $w_p \in \mathbb{R}^C$.

For every frame $t$, the top-$K$ scoring joints are chosen, forming $X_2^J$ for downstream self-attention. In parallel, joints are partitioned into $P$ body-parts, whose feature tokens $X_2^P$ are generated by aggregating their member joint features and projecting to $C$ dimensions. Separate attention blocks on body-parts impart global contextual modeling and increase abstraction, while the selection process enforces sparsification and focus on discriminative regions.
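
A minimal sketch of the per-frame scoring and top-$K$ selection, assuming joint features shaped (B, N, T, C) and a learnable projection vector w_p as defined above (the module name is illustrative):

```python
import torch
import torch.nn as nn

class FocalJointSelect(nn.Module):
    """Score every joint at every frame and keep the top-K most informative ones."""
    def __init__(self, c: int, k: int):
        super().__init__()
        self.w_p = nn.Parameter(torch.randn(c))   # projection vector w_p in R^C
        self.k = k

    def forward(self, x):                                          # x: (B, N, T, C)
        # S = sigmoid(X_2 w_p / ||w_p||): one score per joint per frame
        scores = torch.sigmoid(x @ (self.w_p / self.w_p.norm()))   # (B, N, T)
        top_idx = scores.topk(self.k, dim=1).indices               # (B, K, T)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1))
        x_focal = torch.gather(x, dim=1, index=gather_idx)         # focal joint tokens X_2^J
        return x_focal, scores
```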

3. Joint–Part Mutual Cross-Attention Mechanism

Mutual cross-attention (JP-CA) links the focal joint and body-part branches by sharing information through attention-driven feature fusion. For the "part→joint" direction:

  • $Q_J = X^J W_Q^J$, $K_P = X^P W_K^P$, $V_P = X^P W_V^P$,
  • $A_{J \leftarrow P} = \mathrm{softmax}\!\left(\frac{Q_J K_P^\top}{\sqrt{d}}\right)$,
  • $O_{J \leftarrow P} = A_{J \leftarrow P} V_P$.

This output updates the joint features, symmetrically implemented for "joint→part." Embedding these cross-attentions enables the encoding of kinematic correlations between joints and their respective body-parts, approximating soft edges in the skeleton's relational graph. The pipeline alternates Basic-SFormer and JP-CA in a deeply residual manner, preserving normalization and gradient flow.
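
The "part→joint" direction of JP-CA can be sketched as a standard cross-attention block; a symmetric block with the two inputs swapped would realize "joint→part". The single-head form and the shapes below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PartToJointCrossAttention(nn.Module):
    """Single-head sketch of the part->joint cross-attention (JP-CA)."""
    def __init__(self, c: int):
        super().__init__()
        self.w_q = nn.Linear(c, c)   # W_Q^J on focal joint tokens
        self.w_k = nn.Linear(c, c)   # W_K^P on body-part tokens
        self.w_v = nn.Linear(c, c)   # W_V^P on body-part tokens
        self.norm = nn.LayerNorm(c)

    def forward(self, x_joint, x_part):               # (B*T, K, C), (B*T, P, C)
        q = self.w_q(x_joint)                          # queries from focal joints
        k, v = self.w_k(x_part), self.w_v(x_part)      # keys/values from body parts
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)  # (B*T, K, P)
        return self.norm(x_joint + attn @ v)           # residual update of the joint features
```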

4. Temporal Relational Attention with Dilated Convolutions

Temporal dependencies are modeled by the FG-TFormer module, which combines global temporal self-attention and local temporal aggregation via dilated 1D convolutions. For temporal input $X^l \in \mathbb{R}^{M \times T \times C}$, each attention head generates queries and keys as linear projections, while the values are extracted by:

  • $V^h = \mathrm{DilatedConv}_{k_t, d_t}(X^l)$, projected to the head dimension $d$.

The kernel size $k_t$ and dilation $d_t$ control the local receptive field, imparting an inductive bias toward short-range motion patterns. The attention map over time steps ($A^h \in \mathbb{R}^{T \times T}$) is fused with the local values as $\mathrm{HeadOutput}^h = A^h V^h + V^h$, ensuring that representations capture both global temporal edges and local dynamics. Concatenation, linear projection, residual addition, and feed-forward normalization complete the temporal update.
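
One FG-TFormer attention head can be sketched as below, with the M joint/part tokens flattened into the batch dimension; the kernel size, dilation, and padding choice are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    """One temporal attention head whose values come from a dilated 1D convolution."""
    def __init__(self, c: int, d: int, k_t: int = 3, d_t: int = 2):
        super().__init__()
        self.w_q = nn.Linear(c, d)                    # global temporal queries
        self.w_k = nn.Linear(c, d)                    # global temporal keys
        pad = (k_t - 1) * d_t // 2                    # keeps T unchanged for odd kernels
        self.value_conv = nn.Conv1d(c, d, kernel_size=k_t, dilation=d_t, padding=pad)

    def forward(self, x):                             # x: (B*M, T, C)
        q, k = self.w_q(x), self.w_k(x)
        v = self.value_conv(x.transpose(1, 2)).transpose(1, 2)   # local values, (B*M, T, d)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)  # (B*M, T, T)
        return attn @ v + v                           # HeadOutput = A V + V (global + local fusion)
```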

5. Emergent Skeleton-relational Graphs

GST does not utilize explicit graph adjacency matrices. Instead, the learned attention maps $A^h$ from the spatial and temporal modules naturally induce soft relational graphs:

  • In Basic-SFormer, the spatial maps $A^h_t \in \mathbb{R}^{N\times N}$ for frame $t$ quantify pairwise joint/part affinities.
  • In FG-TFormer, the temporal maps $A^h_n \in \mathbb{R}^{T\times T}$ for joint or part $n$ encode global temporal relations.

The focal selection mechanism determines which nodes (individual joints versus body-parts) enter the graph, JP-CA learns cross-level edges between joints and their body-parts, and the dilated convolution explicitly adds short-range temporal edges. Together, these components configure a dynamic, data-driven skeleton-relational graph exploited by attention.
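
As an illustration of this reading (not part of the original method), a learned spatial attention tensor such as the one computed in the sketch above can be reduced to an explicit soft adjacency over joints or parts:

```python
import torch

def soft_skeleton_graph(attn: torch.Tensor, top_e: int = 3):
    """Reduce per-head spatial attention for one frame, attn: (H, N, N), to a soft graph.

    Returns the top_e strongest attention weights and neighbor indices per node.
    """
    adj = attn.mean(dim=0)                         # average heads -> (N, N) soft edge weights
    weights, neighbors = adj.topk(top_e, dim=-1)   # strongest relations for each joint/part
    return weights, neighbors
```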

6. FG-STFormer Network Pipeline and Classification

Input skeleton sequences $X_{in} \in \mathbb{R}^{N\times T\times C_0}$ are linearly projected and processed through two main stages. Stage 1 (with $L_1$ layers) alternates spatial (all-joint attention) and temporal (FG-TFormer) modules, generating joint features $X_2$. Stage 2 (with $L_2$ layers) bifurcates into joint and part branches, applying FG-SFormer (Basic-SFormer and JP-CA) and FG-TFormer to the two branches in parallel.

After pooling over the spatio-temporal dimensions ($K\times T$ for joints, $P\times T$ for parts), the features are concatenated and passed to an MLP and softmax for classification. This design unifies joint-level sparsification, body-part abstraction, mutual cross-attention, and convolutionally-informed temporal attention atop a pure transformer backbone, enabling flexible and robust skeleton-based action recognition. Attention maps serve as relational graphs in both spatial and temporal domains (Gao et al., 2022).
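
The pooling and classification step described above can be sketched as follows; the mean-pooling operator and the MLP width are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Pool the joint and part streams, concatenate, and classify."""
    def __init__(self, c: int, n_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * c, c), nn.ReLU(), nn.Linear(c, n_classes))

    def forward(self, x_joint, x_part):           # (B, K, T, C) and (B, P, T, C)
        f_joint = x_joint.mean(dim=(1, 2))        # pool over the K x T joint tokens
        f_part = x_part.mean(dim=(1, 2))          # pool over the P x T part tokens
        logits = self.mlp(torch.cat([f_joint, f_part], dim=-1))
        return logits                              # softmax is applied by the loss / at inference
```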

7. Relations to Action Recognition Methodologies and Benchmarks

FG-STFormer has demonstrated competitive performance on NTU-60, NTU-120, and NW-UCLA benchmarks, outperforming previous transformer-based architectures and comparing favorably to state-of-the-art GCN-based methods (Gao et al., 2022). The coupling of focal and global relational modeling distinguishes GST from prior approaches that compute attention uniformly across all joints, as GST selectively accentuates discriminative features and context, incorporates explicit local temporal modeling, and realizes soft relational graphs end-to-end within the transformer framework. This suggests that GST provides increased flexibility and representational capacity compared to approaches with fixed graph priors or undifferentiated attention.

References

Gao, Z., et al. (2022). Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition.