- The paper introduces Sigma, a unified framework that employs semantically informed pre-training to improve isolated recognition, continuous recognition, and gloss-free translation tasks.
- The methodology integrates sign-aware early fusion (SignEF) for bidirectional cross-modal feature fusion, hierarchical alignment learning (HAL) for local and global alignment, and a dual-path sign-grounded text (SGT) encoder for sign-text matching and language modeling.
- Comprehensive evaluations across multiple benchmarks demonstrate that Sigma outperforms prior skeleton-based and pose-RGB-based baselines, with gains in accuracy, efficiency, and semantic grounding.
Introduction and Motivation
Sigma addresses three persistent challenges in skeleton-based sign language understanding (SLU): weak semantic grounding of visual features, imbalance between local and global feature modeling, and inefficient cross-modal alignment. Existing skeleton-based SLU models often capture low-level motion patterns but fail to relate them to linguistic meaning, struggle to balance fine-grained gesture details with global context, and are limited in their ability to align visual and textual modalities. Sigma proposes a unified framework that leverages semantically informed pre-training to overcome these limitations, enabling robust performance across isolated sign language recognition (ISLR), continuous sign language recognition (CSLR), and gloss-free sign language translation (SLT).
Figure 1: Overview of Sigma. (a) SignEF enhances visual-linguistic alignment by injecting cross-modal features into sign and text encoders. (b) HAL is used to maximise global and local cluster agreement. (c) SGT encoder jointly optimises sign-text matching and language modelling. During fine-tuning, both the sign and SGT encoders are reused across SLU tasks.
Methodology
Sign-aware Early Fusion (SignEF)
SignEF introduces bidirectional cross-modal interaction at the encoding stage, enabling deep integration of visual (skeletal) and textual features. This is implemented via cross-attention layers that inject textual cues into the visual encoding process and vice versa. The fusion is applied in the final layers of the encoders, with parameter sharing across attention heads to promote fine-grained alignment while maintaining efficiency. Empirical results show that the optimal number of fusion layers is task-dependent: shallow fusion benefits context-sensitive tasks (CSLR, SLT), while deeper fusion is advantageous for visually intensive tasks (ISLR).
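As a rough sketch of what one such fusion layer could look like, the following minimal PyTorch illustration shares a single cross-attention module between the two directions as one possible realization of the parameter sharing described above; all module and parameter names are ours, not the authors' implementation:

```python
# Illustrative sketch of bidirectional cross-modal fusion in the final encoder
# layers; names, shapes, and the shared-attention design are assumptions.
import torch
import torch.nn as nn

class SignEFFusionLayer(nn.Module):
    """One fusion layer: sign features attend to text features and vice versa."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One shared cross-attention module keeps the parameter count low while
        # still letting information flow in both directions.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_sign = nn.LayerNorm(dim)
        self.norm_text = nn.LayerNorm(dim)

    def forward(self, sign_feats, text_feats):
        # sign_feats: (B, T_sign, dim) skeletal token features
        # text_feats: (B, T_text, dim) subword token features
        # Inject textual cues into the sign stream ...
        sign_ctx, _ = self.cross_attn(sign_feats, text_feats, text_feats)
        # ... and visual cues into the text stream, using the same weights.
        text_ctx, _ = self.cross_attn(text_feats, sign_feats, sign_feats)
        sign_out = self.norm_sign(sign_feats + sign_ctx)
        text_out = self.norm_text(text_feats + text_ctx)
        return sign_out, text_out
```

In practice such a layer would be stacked only in the last few encoder blocks, with the fusion depth tuned per task as noted above.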
Hierarchical Alignment Learning (HAL)
HAL maximizes agreement between sign-text pairs at both global and local cluster levels using contrastive objectives. Global alignment is achieved by projecting class-token representations from both modalities into a shared space, while local alignment is enforced by clustering subword-level textual tokens and aligning them with visual segments. The cluster aggregator module dynamically groups tokens into semantically meaningful units, approximating gloss-like groupings without requiring manual annotation.
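The global half of this objective can be illustrated with a standard symmetric in-batch contrastive loss over the projected class tokens; the temperature value and exact formulation below are assumptions rather than the paper's specification:

```python
# Sketch of the global sign-text contrastive objective (symmetric InfoNCE over
# projected class tokens); temperature and projection details are assumptions.
import torch
import torch.nn.functional as F

def global_contrastive_loss(sign_cls, text_cls, temperature: float = 0.07):
    # sign_cls, text_cls: (B, dim) class-token embeddings projected into a
    # shared space, one matched sign-text pair per row of the batch.
    sign = F.normalize(sign_cls, dim=-1)
    text = F.normalize(text_cls, dim=-1)
    logits = sign @ text.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(sign.size(0), device=sign.device)
    # Matched pairs lie on the diagonal; contrast in both directions.
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2t + loss_t2s)
```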

Figure 2: The overview of the cluster aggregator module. It converts sub-word token embeddings into cluster-level representations by grouping tokens, mapping them with offset indices, and aggregating hidden features for semantic alignment with visual inputs.
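A minimal sketch of such an aggregator, assuming per-token cluster indices (e.g. derived from tokenizer word offsets) and mean pooling, is given below; the actual grouping and aggregation scheme in the paper may differ:

```python
# Illustrative cluster aggregator: pools subword-token embeddings into
# cluster-level representations using per-token cluster indices. The grouping
# strategy (offset-based word boundaries) and mean pooling are assumptions.
import torch

def aggregate_clusters(token_embs, cluster_ids, num_clusters: int):
    # token_embs:  (L, dim) subword embeddings for one sentence
    # cluster_ids: (L,) long tensor giving each token's cluster index
    dim = token_embs.size(-1)
    sums = torch.zeros(num_clusters, dim, device=token_embs.device)
    counts = torch.zeros(num_clusters, 1, device=token_embs.device)
    sums.index_add_(0, cluster_ids, token_embs)
    counts.index_add_(0, cluster_ids, torch.ones_like(token_embs[:, :1]))
    # Mean-pool tokens within each cluster to obtain gloss-like units.
    return sums / counts.clamp(min=1.0)
```

For example, a sentence whose four subword tokens carry cluster_ids = [0, 0, 1, 2] would be reduced to three cluster-level embeddings.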
The HAL loss is a weighted sum of global and local contrastive losses, with a trade-off parameter α controlling the balance. Empirical analysis demonstrates that α=0.5 yields optimal results across all tasks, confirming the necessity of modeling both detailed gesture information and high-level semantics.
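Read literally, this description suggests a convex combination of the two terms; the form below is our interpretation rather than a formula quoted from the paper:

```latex
\mathcal{L}_{\mathrm{HAL}} = \alpha \, \mathcal{L}_{\mathrm{global}} + (1 - \alpha) \, \mathcal{L}_{\mathrm{local}}
```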
Figure 4: Illustration of the local cluster-wise contrastive learning process. The similarity matrix is computed between visual tokens and textual clusters, with in-batch local cluster contrastive learning applied to pull semantically aligned pairs closer.
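A sketch of how this local term could be computed, assuming row-wise max scoring over the cluster-token similarity matrix (consistent with the ablation discussed later) and a symmetric in-batch contrast; all implementation details here are illustrative:

```python
# Sketch of the local cluster-wise contrastive term: score each sign-text pair
# by how well its textual clusters are covered by visual tokens, then contrast
# in-batch. Row-wise max scoring and mean aggregation are assumptions.
import torch
import torch.nn.functional as F

def local_cluster_contrastive_loss(visual_tokens, text_clusters, temperature=0.07):
    # visual_tokens: (B, T, dim) per-frame sign features
    # text_clusters: (B, C, dim) cluster-level text features
    v = F.normalize(visual_tokens, dim=-1)
    c = F.normalize(text_clusters, dim=-1)
    # Pairwise similarity between every sample's clusters and every sample's
    # visual tokens: (B, B, C, T).
    sim = torch.einsum('icd,jtd->ijct', c, v)
    # For each cluster, keep its best-matching visual token (row-wise max),
    # then average over clusters to score the (text_i, sign_j) pair.
    pair_scores = sim.max(dim=-1).values.mean(dim=-1)   # (B, B)
    logits = pair_scores / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2s = F.cross_entropy(logits, targets)
    loss_s2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2s + loss_s2t)
```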
Sign-grounded Text (SGT) Encoder
The SGT encoder unifies sign-text matching (STM) and language modeling (LM) within a dual-path architecture. The STM path uses cross-attention to align textual tokens with sign features, while the LM path preserves linguistic fluency via standard transformer layers. The composite SGT loss balances discriminative matching and generative modeling, with the trade-off parameter β tuned per task. This design enables the model to learn representations that are both semantically grounded in visual input and linguistically coherent.
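A minimal sketch of how the two paths could be combined into the composite loss, assuming a binary sign-text matching head and a token-level language-modeling head; both heads and the placement of β are our assumptions, not the paper's exact formulation:

```python
# Illustrative composite SGT objective: discriminative sign-text matching plus
# generative language modeling, weighted by a trade-off parameter beta.
import torch
import torch.nn as nn

class SGTLoss(nn.Module):
    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = beta
        self.match_loss = nn.BCEWithLogitsLoss()                 # STM path
        self.lm_loss = nn.CrossEntropyLoss(ignore_index=-100)    # LM path

    def forward(self, match_logits, match_labels, lm_logits, lm_labels):
        # match_logits: (B,) scores from the cross-attention STM path
        # match_labels: (B,) 1 for true sign-text pairs, 0 for in-batch negatives
        l_stm = self.match_loss(match_logits, match_labels.float())
        # lm_logits: (B, T, vocab), lm_labels: (B, T) token targets (-100 = pad)
        l_lm = self.lm_loss(lm_logits.flatten(0, 1), lm_labels.flatten())
        # Balance discriminative matching against generative modeling.
        return l_stm + self.beta * l_lm
```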
Experimental Results
Sigma is evaluated on four benchmarks: WLASL2000 (ISLR), CSL-Daily (CSLR, SLT), How2Sign (SLT), and OpenASL (SLT). The model consistently achieves state-of-the-art results across all tasks and datasets, outperforming both skeleton-based and pose-RGB-based baselines.
- ISLR (WLASL2000): Sigma achieves 64.40% per-instance and 62.32% per-class Top-1 accuracy, surpassing prior SOTA by a significant margin.
- CSLR (CSL-Daily): Sigma attains a WER of 25.92% on the test set, outperforming the previous best by 0.08–0.28 absolute WER.
- SLT (How2Sign, OpenASL, CSL-Daily): Sigma sets new SOTA on OpenASL and How2Sign, and matches or exceeds gloss-based models on CSL-Daily, despite being gloss-free.
Figure 5: Visualization derived from the WLASL2000 dataset. The right hand and its motion trajectory are highlighted for two sign sequences, “Always” and “Someone.” Disambiguation requires both local detail and global temporal understanding, motivating Sigma’s multimodal representation learning.
Figure 7: t-SNE visualisation of ISLR on WLASL, showing improved cross-modal alignment and cluster compactness with Sigma’s pre-training.
Figure 9: t-SNE visualisation of CSLR on CSL-Daily, demonstrating enhanced alignment between sign and text features after pre-training.
Figure 3: t-SNE visualization of SLT on CSL-Daily, illustrating the effect of pre-training on cross-modal semantic structure.
Qualitative analysis further demonstrates that Sigma’s outputs preserve semantic meaning even when lexical or syntactic variations occur, reflecting robust semantic grounding and cross-modal alignment.
Ablation and Analysis
Ablation studies confirm the necessity of each core component:
- Removing pre-training or omitting the sign encoder/SGT encoder leads to severe performance degradation.
- The cluster aggregator and local cluster-wise contrastive learning are critical for effective alignment, with row-wise max and softmax-based scoring yielding the best results.
- Unfreezing the text encoder during pre-training provides consistent, albeit modest, improvements.
Sigma’s reliance on skeletal data confers significant advantages in storage, computational efficiency, and privacy compared to RGB-based approaches, with negligible loss in recognition or translation quality.
Implications and Future Directions
Sigma demonstrates that semantically informed pre-training, hierarchical alignment, and unified cross-modal modeling can overcome longstanding challenges in skeleton-based SLU. The framework’s ability to generalize across languages, tasks, and datasets without reliance on costly gloss annotations is particularly notable. This has direct implications for scalable, privacy-preserving, and annotation-efficient SLU systems deployable in real-world assistive technologies.
Theoretically, Sigma’s approach to cluster-wise alignment and early cross-modal fusion may inform future work in other multimodal domains where fine-grained and global semantic integration is required. Practically, the modularity of the framework allows for extension to additional modalities (e.g., RGB, depth) or adaptation to low-resource sign languages.
Conclusion
Sigma establishes a new standard for skeleton-based sign language understanding by integrating sign-aware early fusion, hierarchical alignment learning, and unified pre-training objectives. The model achieves robust, transferable, and semantically grounded representations, enabling SOTA performance across ISLR, CSLR, and gloss-free SLT. The results validate the importance of semantically informative pre-training and provide a scalable path forward for multimodal language understanding systems.