
SUGAR: Skeleton Recognition with Visual-Motion Knowledge

Updated 15 November 2025
  • The paper introduces SUGAR, a novel paradigm that integrates visual and motion knowledge with skeleton representations to enhance action recognition accuracy.
  • It employs contrastive pre-training and a Temporal Query Projection module to align skeleton and semantic text embeddings efficiently.
  • Experimental findings show state-of-the-art performance and robust zero-shot transfer across benchmarks, highlighting practical benefits in low-data environments.

Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition (SUGAR) is a paradigm for action recognition that integrates off-the-shelf video models and LLMs to address the limitations of the skeleton modality. While skeleton data is computationally efficient and expressively rich, it lacks the high-level semantic priors derived from visual context and fine-grained motion descriptions that are needed to distinguish subtly different human actions. SUGAR remedies this through three core stages: automatically harvesting text-based visual and motion knowledge, contrastively pre-training a skeleton encoder to produce tokenized representations aligned with this knowledge, and projecting these tokens into an LLM (with its pre-trained weights frozen) via a lightweight temporal querying module for classification and description.

1. Visual and Motion Knowledge Harvesting

SUGAR leverages two distinct knowledge sources, motion and visual priors, by docking pre-trained generative LLMs and vision-language models. For motion knowledge, each action label from a dictionary (e.g., "Drink From Bottle") is sent as a prompt to GPT-3.5-turbo: “Decompose the action Drink From Bottle into six body-part movements (head, arm, hand, hip, leg, foot). Describe each movement briefly.” The returned description $\mathcal{T}_m$ documents fine-grained motion trajectories per body part, such as "hand grasps cup" and "head tilts back."
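
As a concrete illustration, the following minimal Python sketch issues such a prompt through the OpenAI chat-completions API; the helper name `harvest_motion_knowledge` and the exact prompt wording follow the description above rather than the authors' released code.

```python
# Minimal sketch of motion-knowledge harvesting (illustrative, not the authors' code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def harvest_motion_knowledge(action_label: str) -> str:
    """Ask GPT-3.5-turbo to decompose an action into six body-part movements."""
    prompt = (
        f"Decompose the action {action_label} into six body-part movements "
        "(head, arm, hand, hip, leg, foot). Describe each movement briefly."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content  # the motion text T_m

motion_text = harvest_motion_knowledge("Drink From Bottle")
```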

Visual knowledge is obtained using GPT-4V, with a CLIP-based frame sampler that embeds all video frames and greedily selects a subset representing maximal semantic diversity. Each selected frame is captioned under strict constraints to yield $\{\mathcal{T}_{v_i}\}$, focusing exclusively on scene elements necessary for action recognition.
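
The precise diversity criterion of the frame sampler is not spelled out above, so the sketch below uses a greedy max-min cosine-distance rule over precomputed CLIP frame embeddings as one plausible instantiation; `select_diverse_frames` and its selection rule are assumptions for illustration.

```python
import torch

def select_diverse_frames(frame_embeds: torch.Tensor, k: int) -> list[int]:
    """Greedily pick k frame indices whose CLIP embeddings are maximally diverse.

    frame_embeds: (N, d) CLIP image embeddings for all frames.
    The max-min cosine-distance rule is an assumption, not the paper's exact criterion.
    """
    embeds = torch.nn.functional.normalize(frame_embeds, dim=-1)
    n = embeds.size(0)
    selected = [0]                                       # start from the first frame
    while len(selected) < min(k, n):
        sims = embeds @ embeds[selected].T               # (N, |selected|) cosine similarities
        max_sim = sims.max(dim=1).values                 # closeness to the current selection
        max_sim[selected] = float("inf")                 # exclude already-chosen frames
        selected.append(int(max_sim.argmin()))           # add the frame farthest from the set
    return selected
```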

Both $\mathcal{T}_m$ and $\mathcal{T}_{v_i}$ are embedded via the frozen CLIP text encoder $E_t(\cdot)$:

$$m = E_t(\mathcal{T}_m), \qquad v_i = E_t(\mathcal{T}_{v_i})$$

During training, a "bag" of text descriptors $\mathbf{t} = \{m, v_i\}$ is formed for each instance via random selection.
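
A minimal sketch of this embedding and bag-formation step, assuming the Hugging Face CLIP interface; the checkpoint name, the example texts, and sampling a single caption per bag are illustrative choices, not the paper's exact configuration.

```python
import random
import torch
from transformers import CLIPModel, CLIPTokenizer

# Frozen CLIP text encoder E_t; the checkpoint name is illustrative.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_texts(texts: list[str]) -> torch.Tensor:
    """Return CLIP text features, one row per description."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return clip.get_text_features(**inputs)             # (len(texts), d)

motion_text = "hand grasps bottle; head tilts back; ..."           # example T_m
frame_captions = ["person holding a bottle near a table", "..."]   # example {T_{v_i}}

m = embed_texts([motion_text])                           # (1, d)
v = embed_texts(frame_captions)                          # (num_frames, d)
# Per-instance "bag" t = {m, v_i}: pair the motion text with one randomly chosen caption.
bag = torch.cat([m, v[random.randrange(len(frame_captions))].unsqueeze(0)], dim=0)
```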

2. Skeleton Representation Learning and Contrastive Supervision

Raw skeleton input is modeled as a spatio-temporal graph $G = (V, E)$ and encoded through a series of graph-convolutional blocks; specifically, the CTR-GCN architecture is adopted without temporal pooling. At layer $l$, the node feature update follows

$$H^{l+1} = \sigma\!\left(D^{-1/2} A D^{-1/2} H^{l} W^{l}\right)$$

where $A$ is the adjacency matrix, $D$ is its degree matrix, and $\sigma$ is a nonlinear activation.
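
For concreteness, the sketch below implements this normalized graph-convolution update as a single PyTorch layer; it is a simplification of a full CTR-GCN block, which additionally refines the joint topology per channel.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One H^{l+1} = sigma(D^{-1/2} A D^{-1/2} H^l W^l) update over skeleton joints."""

    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        deg = adjacency.sum(dim=1)                                # node degrees
        d_inv_sqrt = torch.diag(deg.clamp(min=1e-6).pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ adjacency @ d_inv_sqrt)
        self.weight = nn.Linear(in_dim, out_dim, bias=False)      # W^l
        self.act = nn.ReLU()                                      # sigma

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (..., num_joints, in_dim); message passing over the normalized adjacency.
        return self.act(self.a_norm @ self.weight(h))
```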

Supervision employs a multi-instance contrastive InfoNCE loss to align skeleton features $s \in \mathbb{R}^d$ with the bags of text embeddings $\{t_{k,n}\}$:

$$\mathcal{L}_\mathrm{MIL} = -\frac{1}{|B|} \sum_{i \in B} \log \frac{\sum_{n} \exp\left(s_i^\top t_{i,n}/\tau\right)}{\sum_{k \in B} \sum_{n} \exp\left(s_i^\top t_{k,n}/\tau\right)}$$

where $\tau$ is a temperature parameter. Minimizing $\mathcal{L}_\mathrm{MIL}$ draws each skeleton embedding close to its paired motion and visual descriptors and separates it from unrelated examples. The resulting skeleton encoder outputs tokenized representations (of length $L_s$) in the same vector space as the semantic text.
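
A minimal PyTorch sketch of this multi-instance alignment, assuming each instance carries a fixed-size bag of descriptors and that an instance's own bag supplies the positives; the temperature value and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(skel: torch.Tensor, text: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Multi-instance InfoNCE aligning skeleton features with bags of text descriptors.

    skel: (B, d) skeleton embeddings s_i.
    text: (B, n, d) descriptor bags t_{i,n} (motion + visual) per instance.
    Positives are an instance's own bag; every descriptor in the batch forms the denominator.
    """
    skel = F.normalize(skel, dim=-1)
    text = F.normalize(text, dim=-1)
    sims = torch.einsum("id,knd->ikn", skel, text) / tau   # sims[i, k, n] = s_i . t_{k,n} / tau
    idx = torch.arange(skel.size(0))
    pos = torch.logsumexp(sims[idx, idx], dim=-1)          # sum over the instance's own descriptors
    denom = torch.logsumexp(sims.flatten(1), dim=-1)       # sum over all descriptors in the batch
    return (denom - pos).mean()                            # -log(positives / all)
```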

3. Temporal Query Projection (TQP) and LLM Integration

Directly feeding a long skeleton-token sequence into the LLM presents efficiency and representation-misalignment challenges. The Temporal Query Projection (TQP) module compresses $s \in \mathbb{R}^{L_s \times d}$ into $k$ learnable query vectors, exploiting sequential Q-Former blocks (from BLIP-2) to maintain temporal ordering:

  • Skeleton embeddings are sliced into $t = L_s / k$ non-overlapping segments of length $k$.
  • Queries are initialized as $q^{(0)} \in \mathbb{R}^{k \times d}$.
  • Each Q-Former block refines the queries via $q^{(i)} = f_Q(q^{(i-1)}, s^{(i)})$, for $i = 1, \ldots, t$.

After $t$ iterations, the final $q^{(t)}$ is projected linearly to the LLM’s embedding dimension:

$$\hat{s} = \mathrm{Linear}\left(q^{(t)}\right)$$

This arrangement preserves the sequence structure while enabling efficient LLM input.
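
The sketch below mirrors this sequential querying scheme; for brevity, each Q-Former block is approximated by a single cross-attention plus MLP layer, so this is a structural sketch rather than the BLIP-2 implementation.

```python
import torch
import torch.nn as nn

class TemporalQueryProjection(nn.Module):
    """Compress a skeleton-token sequence (L_s, d) into k query vectors, segment by segment."""

    def __init__(self, d: int, k: int, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.k = k
        self.queries = nn.Parameter(torch.randn(k, d) * 0.02)     # q^(0)
        # d must be divisible by num_heads; one cross-attention block stands in for f_Q.
        self.cross_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.to_llm = nn.Linear(d, llm_dim)                       # final linear projection

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (B, L_s, d) with L_s divisible by k; split into t = L_s / k segments of length k.
        b, l, d = s.shape
        segments = s.view(b, l // self.k, self.k, d)              # (B, t, k, d)
        q = self.queries.expand(b, -1, -1)                        # (B, k, d)
        for i in range(segments.size(1)):                         # q^(i) = f_Q(q^(i-1), s^(i))
            attended, _ = self.cross_attn(q, segments[:, i], segments[:, i])
            q = q + attended
            q = q + self.mlp(q)
        return self.to_llm(q)                                     # \hat{s}: (B, k, llm_dim)
```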

For classification and description, LoRA-based fine-tuning is applied to the LLM (LLaMA-2 7B) with low-rank adapters (rank $r=64$, scaling $\alpha=16$) and frozen main weights. The input prompt concatenates a fixed instruction, the compressed action tokens, and the action list. Training employs a standard cross-entropy loss:

$$\mathcal{L}_\mathrm{LoRA} = \mathrm{CrossEntropy}\left(f_\mathrm{LLM}(\hat{s}),\, y\right)$$

where $y$ encodes the category and description tokens.
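
A minimal sketch of this adapter setup with the Hugging Face peft library, assuming the standard LLaMA-2 7B checkpoint identifier; the target modules and dropout are assumptions, while the rank and scaling values come from the text above.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base LLM with frozen pre-trained weights; only LoRA adapters receive gradients.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

lora_cfg = LoraConfig(
    r=64,                                  # adapter rank (from the text)
    lora_alpha=16,                         # scaling factor (from the text)
    target_modules=["q_proj", "v_proj"],   # assumption; the paper's target layers may differ
    lora_dropout=0.05,                     # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # verifies only adapter weights are trainable
```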

4. Training and Inference Protocol

Training proceeds in two distinct phases:

  1. The skeleton encoder (CTR-GCN) is trained or fine-tuned for 200 epochs under $\mathcal{L}_\mathrm{MIL}$ (SGD, initial LR 0.01, batch size 200).
  2. The encoder is then frozen; only the LoRA adapters are tuned for one epoch under $\mathcal{L}_\mathrm{LoRA}$ (AdamW, LR $2 \times 10^{-5}$, batch size 128), as sketched below.
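
The two optimizer configurations can be set up roughly as follows; the encoder and adapter objects are stand-in placeholders, and any hyperparameters beyond those stated above are not specified here.

```python
import torch
import torch.nn as nn

# Placeholders for the two trainable components (illustrative shapes only).
skeleton_encoder = nn.Linear(3 * 25, 512)             # stands in for the CTR-GCN encoder
lora_adapters = nn.ModuleList([nn.Linear(512, 512)])  # stands in for the LoRA adapter weights

# Phase 1: contrastive pre-training of the skeleton encoder under L_MIL.
opt_encoder = torch.optim.SGD(skeleton_encoder.parameters(), lr=0.01)   # 200 epochs, batch 200

# Phase 2: freeze the encoder; tune only the LoRA adapters under L_LoRA.
for p in skeleton_encoder.parameters():
    p.requires_grad_(False)
opt_lora = torch.optim.AdamW(lora_adapters.parameters(), lr=2e-5)       # 1 epoch, batch 128
```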

Inference requires none of the external text-knowledge machinery: a raw skeleton sequence is encoded, projected by TQP, and the LoRA-adapted LLM (base weights still frozen) emits the action label and a free-form description.

5. Experimental Findings

SUGAR demonstrates competitive and state-of-the-art results on skeleton-based action recognition benchmarks, as shown in the following summary:

| Dataset / Split | SUGAR Accuracy | Prev. Best (LLM-AR) | Gain |
|---|---|---|---|
| Toyota Smarthome (X-sub) | 70.2% | 67.0% | +3.2% |
| Toyota Smarthome (X-view1) | 50.9% | 36.1% | +14.8% |
| NTU-60 (X-sub / X-view) | 95.2% / 97.8% | | |
| NTU-120 (X-sub / X-view) | 90.1% / 89.7% | | |

In zero-shot transfer:

  • NTU-60 → unseen NTU-120 (Top-1/Top-5): 65.3% / 89.8% (LLM-AR: 59.7% / 84.1%, ST-GCN: 30.1% / 45.2%)
  • NTU-60 → PKU-MMD (Top-1/Top-5): 53.4% / 77.6% (LLM-AR: 49.4% / 74.2%, ST-GCN: 36.9% / 55.2%)

Ablations reveal complementary benefits from the motion (+2.9%) and visual (+0.2%) priors, with both combined yielding +4.2% over the baseline. Comparing bridging modules, a single linear layer (70.4%), a vanilla Q-Former (70.7%), and TQP (73.4%) demonstrate the efficacy of sequence-aware compression. Token-length sweeps place peak performance at 64–128 tokens; extreme compression or retaining the full-length sequence substantially reduces accuracy.

Visualization with t-SNE shows effective separation between ambiguous classes post-MIL supervision, indicating robust semantic alignment.

6. Technical Significance and Future Implications

SUGAR introduces a method for synthesizing visual-motion knowledge with skeleton representations, moving beyond reliance on manual annotations and closed-form classifier architectures. By constraining skeleton encodings via automatically generated semantic descriptions and leveraging a pretrained LLM with minimal adaptation, SUGAR unlocks state-of-the-art accuracy, robust zero-shot transfer, and natural-language output—all with limited labeled data and a lightweight skeleton modality. This suggests that downstream tasks beyond action classification (e.g., sequence generation, open-vocabulary labeling) may benefit from such docked knowledge distillation and modular querying strategies. A plausible implication is that further integration of diverse knowledge bases, coupled with advanced token compression, could yield improvements in real-world deployability, interpretability, and adaptability to new tasks.
