Temporal Query Projection Module
- The Temporal Query Projection (TQP) module is a specialized component that distills long-range skeleton signals into concise, context-rich embeddings using iterative cross-attention.
- It segments lengthy skeleton sequences into chunks and processes them with Q-Former blocks to preserve dynamic temporal properties for accurate action recognition.
- Integrated within the SUGAR framework, TQP outperforms baseline methods by maintaining rich temporal context, achieving up to 73.4% accuracy on benchmarks.
Temporal Query Projection (TQP) modules are specialized architectural components designed to efficiently summarize long-range temporal signals in skeleton-based action recognition pipelines. The TQP module serves as the bridging mechanism within the SUGAR framework, compressing frame-wise skeleton features into short sequences of dense embeddings suited for LLM consumption without diluting dynamic temporal context. Its core innovation is iterative cross-attention with Q-Former blocks, permitting rich temporal distillation and alignment with high-level visual-motion semantics.
1. Motivation and Problem Setting
Skeleton-based action recognition presents a unique challenge for LLMs due to the variable-length, high-dimensional nature of temporal skeleton features. While frame-level skeleton encoders (e.g., Graph Convolutional Networks, GCNs) can yield temporally detailed feature maps whose length $T$ may approach 1000 frames, contemporary LLMs cannot process such long input token sequences. Additionally, naive compression methods (linear projection, pooling) risk severe information loss, particularly of long-range or fine-grained dynamics essential for semantic reasoning. The TQP module is introduced to resolve these bottlenecks by systematically "distilling" temporal skeleton signals into short, context-rich token sequences in a discrete-friendly embedding space.
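To make the bottleneck concrete, here is a small, hypothetical shape walk-through in PyTorch; the dimensions and the two naive reductions are illustrative only, not drawn from the paper.

```python
import torch
import torch.nn as nn

T, C = 1000, 256                        # frames x channels from a GCN-style encoder
feats = torch.randn(T, C)               # temporally detailed frame-wise feature map

# Naive reduction 1: mean pooling collapses all temporal order into one token.
pooled = feats.mean(dim=0)              # (C,) - ordering and dynamics erased

# Naive reduction 2: a fixed linear mix over the time axis compresses to 32
# tokens but cannot adapt to which dynamics matter for a given action.
squeezed = nn.Linear(T, 32)(feats.T).T  # (32, C)
```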
2. Temporal Query Projection Workflow
The TQP process executes a structured sequence of operations:
- Chunking: The input feature sequence of length $T$ is divided into $M$ contiguous segments $F_1, \dots, F_M$, each spanning $T/M$ frames.
- Query Bank Initialization: A learnable query bank $Q_0$ is established to act as the initial cross-attention query.
- Iterative Q-Former Distillation: For each chunk $F_i$, a shared Q-Former block cross-attends the previous query $Q_{i-1}$ with chunk $F_i$: $Q_i = \mathrm{QFormer}(Q_{i-1}, F_i)$, for $i = 1, \dots, M$.
Each Q-Former "distills" chunk-level dynamics into updated query vectors, recursively propagating information and maintaining long-range context.
- Final Projection: After $M$ iterations, the aggregated query representation $Q_M$ is mapped through a linear layer to produce the output $Z$, aligned with the LLM input dimension $d_{\mathrm{LLM}}$.
This sequencing ensures that temporal dependencies and salient dynamics are preserved, leveraging cross-attention for context propagation and feature selection.
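The following is a minimal PyTorch sketch of this workflow, using a standard transformer decoder layer as a stand-in for the Q-Former block; the class name, dimensions, and hyperparameters are illustrative assumptions, not the SUGAR implementation.

```python
import torch
import torch.nn as nn

class TemporalQueryProjection(nn.Module):
    """Sketch of TQP: chunking -> iterative Q-Former distillation -> projection."""
    def __init__(self, feat_dim=256, llm_dim=4096, num_queries=32,
                 num_chunks=8, num_heads=8):
        super().__init__()
        self.num_chunks = num_chunks
        # Learnable query bank Q_0, shared across samples.
        self.query_bank = nn.Parameter(torch.randn(num_queries, feat_dim))
        # One shared Q-Former-style block: self-attention over the queries,
        # cross-attention into the current chunk, then a feed-forward layer.
        self.qformer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        # Final projection into the LLM embedding space.
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats):
        """feats: (B, T, feat_dim) frame-wise skeleton features."""
        B = feats.shape[0]
        chunks = feats.chunk(self.num_chunks, dim=1)         # segments F_1..F_M
        q = self.query_bank.unsqueeze(0).expand(B, -1, -1)   # Q_0 for each sample
        for chunk in chunks:
            # Q_i = QFormer(Q_{i-1}, F_i): queries carry context across chunks.
            q = self.qformer(tgt=q, memory=chunk)
        return self.proj(q)  # Z: (B, num_queries, llm_dim) "virtual word" embeddings
```

In this sketch the query bank plays the role of $Q_0$, each loop iteration computes $Q_i = \mathrm{QFormer}(Q_{i-1}, F_i)$, and the final linear layer produces $Z$.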
3. Comparative Analysis of Temporal Bridging Modules
The SUGAR paper includes systematic ablation of bridging strategies, clarifying TQP's unique efficacy. The following table presents per-class accuracy for action recognition (Toyota SmartHome benchmark) under different temporal reduction schemes:
| Bridging Module | Accuracy (%) | Notes |
|---|---|---|
| Cross-attention | 52.1 | Baseline, no temporal distillation |
| Single Q-Former | 70.7 | Non-iterative query pooling |
| Single Linear Proj. | 70.4 | Linear compression |
| Full TQP | 73.4 | Iterative Q-Formers + Linear |
These results establish that full TQP offers quantitatively better retention of action-relevant dynamics and context than alternatives. This suggests iterative Q-Former distillation is crucial for maintaining semantic fidelity in compressed representations.
4. Integration with Discrete-Friendly Skeleton Learning
The TQP module is deployed after a skeleton encoder trained via many-to-many contrastive supervision (MIL-NCE), aligning skeleton representations to multi-instance visual-motion text embeddings generated by GPT-3.5 and GPT-4V and encoded with the CLIP text encoder. This produces skeleton feature maps "friendly" for discrete tokenization, facilitating effective downstream compression without requiring significant LLM modification. The final output $Z$ is treated as a short sequence of "word embeddings" that can be directly integrated into LLM inference pipelines for both classification and description tasks. A plausible implication is that the semantic structure inherited from MIL-NCE training makes TQP's queries more interpretable to the LLM's embedding space.
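To make the shape of this objective concrete, below is a minimal, hedged PyTorch sketch of a MIL-NCE-style loss over skeleton and multi-instance text embeddings; the function name, batching scheme, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(skel_emb, text_embs, temperature=0.07):
    """MIL-NCE-style sketch: skel_emb (B, D) skeleton embeddings;
    text_embs (B, P, D), P candidate text embeddings per sample
    (the multi-instance positive "bag")."""
    skel_emb = F.normalize(skel_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    B = skel_emb.shape[0]
    # Similarity of every skeleton to every text in the batch: (B, B, P).
    sims = torch.einsum('bd,npd->bnp', skel_emb, text_embs) / temperature
    # Numerator: aggregate over the P positives belonging to the same sample.
    pos = torch.logsumexp(sims[torch.arange(B), torch.arange(B)], dim=-1)
    # Denominator: all texts in the batch serve as candidates.
    denom = torch.logsumexp(sims.reshape(B, -1), dim=-1)
    return (denom - pos).mean()  # -log(sum_pos / sum_all), averaged over batch
```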
5. Impact of Token Length and Discrete Tokenization
Experimental ablation on the tokenization granularity, i.e., the number of output tokens, demonstrates that optimal performance arises as the token count decreases from the raw sequence length down to the order of $128$. Performance collapses when the sequence is compressed too aggressively, as excessive compression erases necessary dynamic context. This suggests a nontrivial trade-off: representation compactness vs. preservable detail. The tokenized embeddings function as effective virtual words, yielding both classification logits and natural language action descriptions from an LLM augmented only with Low-Rank Adaptation (LoRA) in its attention modules.
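For orientation, here is a minimal sketch of what Low-Rank Adaptation of a single attention projection typically looks like; the wrapper name, rank, and scaling are illustrative and not SUGAR's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pretrained projection with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # update starts at zero: no-op at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage: wrap a query projection inside an attention module, e.g.
#   attn.q_proj = LoRALinear(attn.q_proj)
```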
6. Experimental Validation Across Benchmarks
The SUGAR framework, incorporating TQP, achieves robust state-of-the-art performance on standard skeleton-based action classification datasets. On Toyota SmartHome, SUGAR outperforms ST-GCN and LLM-AR by margins exceeding 7% in per-class accuracy (X-sub) and offers significant gains in cross-view settings. On PKU-MMD, NTU60, and NTU120, accuracy improvements range from 2% to 9% over prior skeleton-only classifiers and LLM-based approaches. Zero-shot generalization experiments (NTU60→PKU-MMD, NTU60→unseen NTU120 classes) yield substantial improvements in Top-1 and Top-5 accuracy over linear baselines and prior LLM-based methods. t-SNE projections of skeleton embeddings after MIL-NCE training show tight class separation, even for closely related actions, validating the effectiveness of TQP's alignment and compression.
7. Significance and Prospects
The TQP module represents a rigorous solution to the challenge of infusing temporal skeleton dynamics into LLM pipelines. Its architecture enables nuanced distribution of dynamic features across short embedding sequences, accommodating both computational limits and expressive needs of LLM classifiers and describers. As generative and discriminative models continue to integrate multimodal information, similar temporal distillation paradigms may become foundational in bridging variable-length sensory data to fixed-token NLP systems. The success of TQP within SUGAR suggests promising utility in future hybrid pipelines and highlights the importance of learnable, context-propagating compression mechanisms in action recognition and beyond.