SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
- The paper introduces SUGAR, a novel paradigm that integrates visual and motion knowledge with skeleton representations to enhance action recognition accuracy.
- It employs contrastive pre-training and a Temporal Query Projection module to align skeleton and semantic text embeddings efficiently.
- Experimental findings show state-of-the-art performance and robust zero-shot transfer across benchmarks, highlighting practical benefits in low-data environments.
Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition (SUGAR) is a paradigm for action recognition that integrates off-the-shelf video models and LLMs to address the limitations of the skeleton modality. While skeleton data is computationally efficient and expressively rich, it lacks the high-level semantic priors derived from visual context and fine-grained motion descriptions that are needed to distinguish subtly different human actions. SUGAR remedies this through three core stages: automatically harvesting text-based visual and motion knowledge, contrastively pre-training a skeleton encoder to produce tokenized representations aligned with this knowledge, and projecting these tokens into an LLM (whose base weights remain frozen) via a lightweight temporal querying module for classification and description.
1. Visual and Motion Knowledge Harvesting
SUGAR leverages two distinct knowledge sources, motion and visual priors, by querying pre-trained generative and vision-language models. For motion knowledge, each action label from the label dictionary (e.g., "Drink From Bottle") is sent as a prompt to GPT-3.5-turbo: “Decompose the action Drink From Bottle into six body-part movements (head, arm, hand, hip, leg, foot). Describe each movement briefly.” The returned description documents fine-grained motion trajectories per body part, such as "hand grasps cup" and "head tilts back."
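A minimal sketch of this harvesting step, assuming the current OpenAI Python SDK; the helper name and sampling temperature are illustrative, while the prompt wording follows the example above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def harvest_motion_knowledge(action_label: str) -> str:
    """Query GPT-3.5-turbo for per-body-part motion descriptions of one action."""
    prompt = (
        f"Decompose the action {action_label} into six body-part movements "
        "(head, arm, hand, hip, leg, foot). Describe each movement briefly."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # assumption: low temperature for consistent descriptions
    )
    return response.choices[0].message.content

# Example: build a motion-knowledge dictionary over the action vocabulary
motion_knowledge = {a: harvest_motion_knowledge(a) for a in ["Drink From Bottle"]}
```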
Visual knowledge is obtained using GPT-4V together with a CLIP-based frame sampler that embeds all video frames and greedily selects a subset of maximal semantic diversity. Each selected frame is then captioned under strict constraints, yielding visual descriptions that focus exclusively on scene elements relevant to action recognition.
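A sketch of one plausible greedy, diversity-maximizing frame selection of the kind described, assuming frames have already been embedded with a CLIP image encoder; the function name, starting frame, and farthest-point criterion are assumptions:

```python
import numpy as np

def select_diverse_frames(frame_embeds: np.ndarray, k: int) -> list[int]:
    """Greedily pick k frame indices whose CLIP embeddings are maximally diverse.

    frame_embeds: (num_frames, dim) array of CLIP image features.
    """
    feats = frame_embeds / np.linalg.norm(frame_embeds, axis=1, keepdims=True)
    selected = [0]  # start from the first frame (assumption)
    while len(selected) < min(k, len(feats)):
        sims = feats @ feats[selected].T          # similarity to each selected frame
        max_sim = sims.max(axis=1)                # closeness to the current subset
        max_sim[selected] = np.inf                # never re-pick a chosen frame
        selected.append(int(max_sim.argmin()))    # frame least similar to the subset
    return sorted(selected)
```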
Both the motion and visual descriptions are embedded with the frozen CLIP text encoder. During training, a "bag" of text descriptors is formed for each instance via random selection.
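A sketch of this step using the CLIP text tower from Hugging Face `transformers`; the checkpoint name and bag size are assumptions:

```python
import random
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_texts(texts: list[str]) -> torch.Tensor:
    """Encode text descriptions with the frozen CLIP text encoder."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    feats = clip.get_text_features(**batch)
    return torch.nn.functional.normalize(feats, dim=-1)

def sample_text_bag(motion_descs: list[str], visual_descs: list[str], bag_size: int = 4):
    """Randomly draw a 'bag' of motion + visual descriptors for one training instance."""
    pool = motion_descs + visual_descs
    bag = random.sample(pool, k=min(bag_size, len(pool)))
    return embed_texts(bag)  # (bag_size, dim)
```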
2. Skeleton Representation Learning and Contrastive Supervision
Raw skeleton input is modeled as a spatio-temporal graph and encoded through a series of graph-convolutional blocks; specifically, the CTR-GCN architecture is adopted without temporal pooling. At layer $l$, the node feature update follows
$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$
where $\tilde{A}$ is the adjacency matrix (with self-loops), $\tilde{D}$ is its degree matrix, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a nonlinear activation.
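A minimal sketch of the normalized graph-convolution update written above, applied to one frame of joint features; this is not the full CTR-GCN block, which additionally refines the topology per channel:

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One spatial graph-convolution step: H' = sigma(D^-1/2 (A+I) D^-1/2 H W)."""

    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))   # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, num_joints, in_dim) node features for one frame
        return self.act(self.a_norm @ self.weight(h))
```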
Supervision employs a multi-instance contrastive InfoNCE loss that aligns each skeleton feature with its bag of text embeddings,
$$\mathcal{L}_{\text{MIL}} = -\sum_{i} \log \frac{\sum_{t \in \mathcal{B}_i} \exp\!\left(\operatorname{sim}(z_i, t)/\tau\right)}{\sum_{j}\sum_{t \in \mathcal{B}_j} \exp\!\left(\operatorname{sim}(z_i, t)/\tau\right)},$$
where $z_i$ is the skeleton embedding of sample $i$, $\mathcal{B}_i$ its bag of motion and visual text embeddings, and $\tau$ a temperature. Minimizing this loss draws each skeleton embedding close to its paired motion and visual descriptors and separates it from unrelated examples. The resulting skeleton encoder outputs a sequence of tokenized representations in the same vector space as the semantic text.
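A sketch of this multi-instance InfoNCE objective, assuming one skeleton embedding per sample and a fixed-size bag of text embeddings; the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def mil_infonce(skel: torch.Tensor, text_bags: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Multi-instance InfoNCE loss.

    skel:      (B, D)     skeleton embeddings
    text_bags: (B, M, D)  M motion/visual text embeddings per sample
    """
    skel = F.normalize(skel, dim=-1)
    text = F.normalize(text_bags, dim=-1)
    # similarity of every skeleton to every text in every bag: (B, B, M)
    sim = torch.einsum("bd,kmd->bkm", skel, text) / tau
    logits = torch.logsumexp(sim, dim=-1)          # pool over each bag -> (B, B)
    targets = torch.arange(skel.size(0), device=skel.device)
    return F.cross_entropy(logits, targets)        # the positive is the matching bag
```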
3. Temporal Query Projection (TQP) and LLM Integration
Directly feeding a long skeleton-token sequence into the LLM raises efficiency and representation-misalignment challenges. The Temporal Query Projection (TQP) module compresses the skeleton-token sequence into a small, fixed set of learnable query vectors, using sequential Q-Former blocks (from BLIP-2) to maintain temporal ordering:
- Skeleton embeddings are sliced into non-overlapping temporal segments of equal length.
- Queries are initialized as a set of learnable vectors.
- Each Q-Former block refines the queries via cross-attention against the corresponding skeleton segment, taking the previous block's queries as input.
After the final block, the resulting queries are projected linearly to the LLM's embedding dimension. This arrangement preserves the temporal structure of the sequence while keeping the LLM input compact.
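A sketch of sequence-aware query compression in the spirit of TQP, using a single shared cross-attention block iterated over segments rather than separately parameterized Q-Former blocks; the segment length, query count, and residual update are assumptions:

```python
import torch
import torch.nn as nn

class TemporalQueryProjection(nn.Module):
    """Compress a long skeleton-token sequence into a few queries, segment by segment."""

    def __init__(self, dim: int, llm_dim: int, num_queries: int = 8,
                 segment_len: int = 16, num_heads: int = 8):
        super().__init__()
        self.segment_len = segment_len
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)  # map into the LLM embedding space

    def forward(self, skel_tokens: torch.Tensor) -> torch.Tensor:
        # skel_tokens: (B, T, dim); split the time axis into non-overlapping segments
        bsz = skel_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(bsz, -1, -1)
        for segment in skel_tokens.split(self.segment_len, dim=1):
            # each step refines the previous queries against the next segment,
            # so temporal order is folded into the query states
            q = q + self.cross_attn(q, segment, segment, need_weights=False)[0]
        return self.proj(q)  # (B, num_queries, llm_dim)
```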
For classification and description, LoRA-based fine-tuning is applied to the LLM (LLaMA-2 7B): low-rank adapters are inserted while the base weights remain frozen. The input prompt concatenates a fixed instruction, the compressed action tokens, and the candidate action list. Training employs a standard autoregressive cross-entropy loss
$$\mathcal{L}_{\text{CE}} = -\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t}, \text{prompt}\right),$$
where the target sequence $y$ encodes the category and description tokens.
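A sketch of the adapter setup with Hugging Face `peft`; the rank, scaling, and target modules are illustrative placeholders, since the exact values are not reproduced above:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=16,                                   # rank (placeholder value)
    lora_alpha=32,                          # scaling (placeholder value)
    target_modules=["q_proj", "v_proj"],    # attention projections, a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)      # base weights stay frozen
model.print_trainable_parameters()          # only the low-rank adapters are trainable
```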
4. Training and Inference Protocol
Training proceeds in two distinct phases:
- The skeleton encoder (CTR-GCN) is trained or fine-tuned for 200 epochs under the multi-instance contrastive objective (SGD, initial LR 0.01, batch size 200).
- The encoder is then frozen; only the LoRA adapters are tuned for one epoch under the cross-entropy objective (AdamW, batch size 128); see the optimizer sketch below.
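A sketch of the corresponding optimizer configuration for the two phases; the momentum, weight decay, and default AdamW learning rate are assumptions, while the SGD learning rate and batch sizes follow the protocol above:

```python
import torch

def build_optimizers(skeleton_encoder: torch.nn.Module, llm_with_lora: torch.nn.Module):
    """Phase 1: SGD for contrastive pre-training of the skeleton encoder.
    Phase 2: AdamW over the (few) trainable LoRA parameters, encoder frozen."""
    enc_opt = torch.optim.SGD(
        skeleton_encoder.parameters(), lr=0.01,
        momentum=0.9, weight_decay=1e-4,      # momentum/decay are assumptions
    )
    for p in skeleton_encoder.parameters():   # phase 2: freeze the encoder
        p.requires_grad_(False)
    adapter_params = [p for p in llm_with_lora.parameters() if p.requires_grad]
    llm_opt = torch.optim.AdamW(adapter_params)  # LR not reproduced above; left at default
    return enc_opt, llm_opt
```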
Inference discards the knowledge-harvesting machinery: a raw skeleton sequence is encoded, projected by TQP, and the LLM (frozen base weights plus the tuned LoRA adapters) emits the action label and a free-form description.
5. Experimental Findings
SUGAR demonstrates competitive and state-of-the-art results on skeleton-based action recognition benchmarks, as shown in the following summary:
| Dataset/Split | SUGAR Accuracy | Prev. Best (LLM-AR) | Gain |
|---|---|---|---|
| Toyota Smarthome (X-sub) | 70.2% | 67.0% | +3.2% |
| Toyota Smarthome (X-view1) | 50.9% | 36.1% | +14.8% |
| NTU-60 (X-sub/X-view) | 95.2% / 97.8% | — | — |
| NTU-120 (X-sub/X-set) | 90.1% / 89.7% | — | — |
In zero-shot transfer:
- NTU-60 → unseen NTU-120 (Top-1/Top-5): 65.3% / 89.8% (LLM-AR: 59.7% / 84.1%, ST-GCN: 30.1% / 45.2%)
- NTU-60 → PKU-MMD (Top-1/Top-5): 53.4% / 77.6% (LLM-AR: 49.4% / 74.2%, ST-GCN: 36.9% / 55.2%)
Ablations reveal complementary benefits from the motion (+2.9%) and visual (+0.2%) priors, with the combination yielding +4.2% over the baseline. Comparing bridging modules, a single linear layer (70.4%), a vanilla Q-Former (70.7%), and TQP (73.4%), demonstrates the efficacy of sequence-aware compression. Token-length sweeps place peak performance at 64–128 tokens; extreme compression or feeding the full-length sequence substantially reduces accuracy.
t-SNE visualizations show effective separation of ambiguous classes after multi-instance contrastive supervision, indicating robust semantic alignment.
6. Technical Significance and Future Implications
SUGAR introduces a method for synthesizing visual-motion knowledge with skeleton representations, moving beyond reliance on manual annotations and closed-form classifier architectures. By constraining skeleton encodings via automatically generated semantic descriptions and leveraging a pretrained LLM with minimal adaptation, SUGAR unlocks state-of-the-art accuracy, robust zero-shot transfer, and natural-language output, all with limited labeled data and a lightweight skeleton modality. This suggests that downstream tasks beyond action classification (e.g., sequence generation, open-vocabulary labeling) may benefit from such knowledge distillation from off-the-shelf models and modular querying strategies. A plausible implication is that further integration of diverse knowledge bases, coupled with advanced token compression, could yield improvements in real-world deployability, interpretability, and adaptability to new tasks.