SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
- The paper introduces SUGAR, a novel paradigm that integrates visual and motion knowledge with skeleton representations to enhance action recognition accuracy.
- It employs contrastive pre-training and a Temporal Query Projection module to align skeleton and semantic text embeddings efficiently.
- Experimental findings show state-of-the-art performance and robust zero-shot transfer across benchmarks, highlighting practical benefits in low-data environments.
Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition (SUGAR) is a paradigm for action recognition that integrates off-the-shelf video models and LLMs to address the limitations of the skeleton modality. While skeleton data is computationally efficient and expressively rich, it lacks the high-level semantic priors derived from visual context and fine-grained motion descriptions that are needed to distinguish subtly different human actions. SUGAR remedies this through three core stages: automatically harvesting text-based visual and motion knowledge, contrastively pre-training a skeleton encoder to produce tokenized representations aligned with this knowledge, and projecting these tokens into an LLM (whose base weights remain frozen) via a lightweight temporal querying module for classification and description.
1. Visual and Motion Knowledge Harvesting
SUGAR leverages two distinct knowledge sources, motion and visual priors, by querying pre-trained generative and vision-language models. For motion knowledge, each action label from the label dictionary (e.g., "Drink From Bottle") is sent as a prompt to GPT-3.5-turbo: “Decompose the action Drink From Bottle into six body-part movements (head, arm, hand, hip, leg, foot). Describe each movement briefly.” The returned description documents fine-grained motion trajectories per body part, such as "hand grasps cup" and "head tilts back."
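A minimal sketch of this harvesting step, assuming the current OpenAI Python SDK; the helper name and sampling temperature are illustrative, while the prompt wording follows the example above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def harvest_motion_knowledge(action_label: str) -> str:
    """Query GPT-3.5-turbo for per-body-part motion descriptions of one action."""
    prompt = (
        f"Decompose the action {action_label} into six body-part movements "
        "(head, arm, hand, hip, leg, foot). Describe each movement briefly."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # assumption: low temperature for consistent descriptions
    )
    return response.choices[0].message.content

# Example: build a motion-knowledge dictionary over the action vocabulary
motion_knowledge = {a: harvest_motion_knowledge(a) for a in ["Drink From Bottle"]}
```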
Visual knowledge is obtained using GPT-4V together with a CLIP-based frame sampler that embeds all video frames and greedily selects a subset of maximal semantic diversity. Each selected frame is then captioned under strict constraints, yielding visual descriptions that focus exclusively on scene elements relevant to action recognition.
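A sketch of one plausible greedy, diversity-maximizing frame selection of the kind described, assuming frames have already been embedded with a CLIP image encoder; the function name, starting frame, and farthest-point criterion are assumptions:

```python
import numpy as np

def select_diverse_frames(frame_embeds: np.ndarray, k: int) -> list[int]:
    """Greedily pick k frame indices whose CLIP embeddings are maximally diverse.

    frame_embeds: (num_frames, dim) array of CLIP image features.
    """
    feats = frame_embeds / np.linalg.norm(frame_embeds, axis=1, keepdims=True)
    selected = [0]  # start from the first frame (assumption)
    while len(selected) < min(k, len(feats)):
        sims = feats @ feats[selected].T          # similarity to each selected frame
        max_sim = sims.max(axis=1)                # closeness to the current subset
        max_sim[selected] = np.inf                # never re-pick a chosen frame
        selected.append(int(max_sim.argmin()))    # frame least similar to the subset
    return sorted(selected)
```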
Both the motion and visual descriptions are embedded with the frozen CLIP text encoder. During training, a "bag" of text descriptors is formed for each instance via random selection.
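A sketch of this step using the CLIP text tower from Hugging Face `transformers`; the checkpoint name and bag size are assumptions:

```python
import random
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_texts(texts: list[str]) -> torch.Tensor:
    """Encode text descriptions with the frozen CLIP text encoder."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    feats = clip.get_text_features(**batch)
    return torch.nn.functional.normalize(feats, dim=-1)

def sample_text_bag(motion_descs: list[str], visual_descs: list[str], bag_size: int = 4):
    """Randomly draw a 'bag' of motion + visual descriptors for one training instance."""
    pool = motion_descs + visual_descs
    bag = random.sample(pool, k=min(bag_size, len(pool)))
    return embed_texts(bag)  # (bag_size, dim)
```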
2. Skeleton Representation Learning and Contrastive Supervision
Raw skeleton input is modeled as a spatio-temporal graph and encoded through a series of graph-convolutional blocks; specifically, the CTR-GCN architecture is adopted without temporal pooling. At layer $l$, the node feature update follows
$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$
where $\tilde{A}$ is the adjacency matrix (with self-loops), $\tilde{D}$ is its degree matrix, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a nonlinear activation.
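A minimal sketch of the normalized graph-convolution update written above, applied to one frame of joint features; this is not the full CTR-GCN block, which additionally refines the topology per channel:

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One spatial graph-convolution step: H' = sigma(D^-1/2 (A+I) D^-1/2 H W)."""

    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))   # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, num_joints, in_dim) node features for one frame
        return self.act(self.a_norm @ self.weight(h))
```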
Supervision employs a multi-instance contrastive InfoNCE loss that aligns each skeleton feature with its bag of text embeddings,
$$\mathcal{L}_{\text{MIL}} = -\sum_{i} \log \frac{\sum_{t \in \mathcal{B}_i} \exp\!\left(\operatorname{sim}(z_i, t)/\tau\right)}{\sum_{j}\sum_{t \in \mathcal{B}_j} \exp\!\left(\operatorname{sim}(z_i, t)/\tau\right)},$$
where $z_i$ is the skeleton embedding of sample $i$, $\mathcal{B}_i$ its bag of motion and visual text embeddings, and $\tau$ a temperature. Minimizing this loss draws each skeleton embedding close to its paired motion and visual descriptors and separates it from unrelated examples. The resulting skeleton encoder outputs a sequence of tokenized representations in the same vector space as the semantic text.
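A sketch of this multi-instance InfoNCE objective, assuming one skeleton embedding per sample and a fixed-size bag of text embeddings; the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def mil_infonce(skel: torch.Tensor, text_bags: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Multi-instance InfoNCE loss.

    skel:      (B, D)     skeleton embeddings
    text_bags: (B, M, D)  M motion/visual text embeddings per sample
    """
    skel = F.normalize(skel, dim=-1)
    text = F.normalize(text_bags, dim=-1)
    # similarity of every skeleton to every text in every bag: (B, B, M)
    sim = torch.einsum("bd,kmd->bkm", skel, text) / tau
    logits = torch.logsumexp(sim, dim=-1)          # pool over each bag -> (B, B)
    targets = torch.arange(skel.size(0), device=skel.device)
    return F.cross_entropy(logits, targets)        # the positive is the matching bag
```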
3. Temporal Query Projection (TQP) and LLM Integration
Directly feeding a long skeleton-token sequence into the LLM raises efficiency and representation-misalignment challenges. The Temporal Query Projection (TQP) module compresses the skeleton-token sequence into a small, fixed set of learnable query vectors, using sequential Q-Former blocks (from BLIP-2) to maintain temporal ordering:
- Skeleton embeddings are sliced into non-overlapping temporal segments of equal length.
- Queries are initialized as a set of learnable vectors.
- Each Q-Former block refines the queries via cross-attention against the corresponding skeleton segment, taking the previous block's queries as input.
After the final block, the resulting queries are projected linearly to the LLM's embedding dimension. This arrangement preserves the temporal structure of the sequence while keeping the LLM input compact.
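A sketch of sequence-aware query compression in the spirit of TQP, using a single shared cross-attention block iterated over segments rather than separately parameterized Q-Former blocks; the segment length, query count, and residual update are assumptions:

```python
import torch
import torch.nn as nn

class TemporalQueryProjection(nn.Module):
    """Compress a long skeleton-token sequence into a few queries, segment by segment."""

    def __init__(self, dim: int, llm_dim: int, num_queries: int = 8,
                 segment_len: int = 16, num_heads: int = 8):
        super().__init__()
        self.segment_len = segment_len
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)  # map into the LLM embedding space

    def forward(self, skel_tokens: torch.Tensor) -> torch.Tensor:
        # skel_tokens: (B, T, dim); split the time axis into non-overlapping segments
        bsz = skel_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(bsz, -1, -1)
        for segment in skel_tokens.split(self.segment_len, dim=1):
            # each step refines the previous queries against the next segment,
            # so temporal order is folded into the query states
            q = q + self.cross_attn(q, segment, segment, need_weights=False)[0]
        return self.proj(q)  # (B, num_queries, llm_dim)
```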
For classification and description, LoRA-based fine-tuning is applied to the LLM (LLaMA-2 7B): low-rank adapters are inserted while the base weights remain frozen. The input prompt concatenates a fixed instruction, the compressed action tokens, and the candidate action list. Training employs a standard autoregressive cross-entropy loss
$$\mathcal{L}_{\text{CE}} = -\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t}, \text{prompt}\right),$$
where the target sequence $y$ encodes the category and description tokens.
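A sketch of the adapter setup with Hugging Face `peft`; the rank, scaling, and target modules are illustrative placeholders, since the exact values are not reproduced above:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=16,                                   # rank (placeholder value)
    lora_alpha=32,                          # scaling (placeholder value)
    target_modules=["q_proj", "v_proj"],    # attention projections, a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)      # base weights stay frozen
model.print_trainable_parameters()          # only the low-rank adapters are trainable
```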
4. Training and Inference Protocol
Training proceeds in two distinct phases:
- The skeleton encoder (CTR-GCN) is trained or fine-tuned for 200 epochs under the multi-instance contrastive objective (SGD, initial LR 0.01, batch size 200).
- The encoder is then frozen; only the LoRA adapters are tuned for one epoch under the cross-entropy objective (AdamW, batch size 128); see the optimizer sketch below.
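A sketch of the corresponding optimizer configuration for the two phases; the momentum, weight decay, and default AdamW learning rate are assumptions, while the SGD learning rate and batch sizes follow the protocol above:

```python
import torch

def build_optimizers(skeleton_encoder: torch.nn.Module, llm_with_lora: torch.nn.Module):
    """Phase 1: SGD for contrastive pre-training of the skeleton encoder.
    Phase 2: AdamW over the (few) trainable LoRA parameters, encoder frozen."""
    enc_opt = torch.optim.SGD(
        skeleton_encoder.parameters(), lr=0.01,
        momentum=0.9, weight_decay=1e-4,      # momentum/decay are assumptions
    )
    for p in skeleton_encoder.parameters():   # phase 2: freeze the encoder
        p.requires_grad_(False)
    adapter_params = [p for p in llm_with_lora.parameters() if p.requires_grad]
    llm_opt = torch.optim.AdamW(adapter_params)  # LR not reproduced above; left at default
    return enc_opt, llm_opt
```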
Inference discards the knowledge-harvesting machinery: a raw skeleton sequence is encoded, projected by TQP, and the LLM (frozen base weights plus the tuned LoRA adapters) emits the action label and a free-form description.
5. Experimental Findings
SUGAR demonstrates competitive and state-of-the-art results on skeleton-based action recognition benchmarks, as shown in the following summary:
| Dataset/Split | SUGAR Accuracy | Prev. Best (LLM-AR) | Gain |
|---|---|---|---|
| Toyota Smarthome (X-sub) | 70.2% | 67.0% | +3.2% |
| Toyota Smarthome (X-view1) | 50.9% | 36.1% | +14.8% |
| NTU-60 (X-sub/X-view) | 95.2% / 97.8% | — | — |
| NTU-120 (X-sub/X-set) | 90.1% / 89.7% | — | — |
In zero-shot transfer:
- NTU-60 → unseen NTU-120 (Top-1/Top-5): 65.3% / 89.8% (LLM-AR: 59.7% / 84.1%, ST-GCN: 30.1% / 45.2%)
- NTU-60 → PKU-MMD (Top-1/Top-5): 53.4% / 77.6% (LLM-AR: 49.4% / 74.2%, ST-GCN: 36.9% / 55.2%)
Ablations reveal complementary benefits from the motion (+2.9%) and visual (+0.2%) priors, with the combination yielding +4.2% over the baseline. Comparing bridging modules, a single linear layer (70.4%), a vanilla Q-Former (70.7%), and TQP (73.4%), demonstrates the efficacy of sequence-aware compression. Token-length sweeps place peak performance at 64–128 tokens; extreme compression or feeding the full-length sequence substantially reduces accuracy.
t-SNE visualizations show effective separation of ambiguous classes after multi-instance contrastive supervision, indicating robust semantic alignment.
6. Technical Significance and Future Implications
SUGAR introduces a method for synthesizing visual-motion knowledge with skeleton representations, moving beyond reliance on manual annotations and closed-form classifier architectures. By constraining skeleton encodings via automatically generated semantic descriptions and leveraging a pretrained LLM with minimal adaptation, SUGAR unlocks state-of-the-art accuracy, robust zero-shot transfer, and natural-language output, all with limited labeled data and a lightweight skeleton modality. This suggests that downstream tasks beyond action classification (e.g., sequence generation, open-vocabulary labeling) may benefit from such knowledge distillation from off-the-shelf models and modular querying strategies. A plausible implication is that further integration of diverse knowledge bases, coupled with advanced token compression, could yield improvements in real-world deployability, interpretability, and adaptability to new tasks.