SLTT: Sign Language Translation Transformer
- SLTT is a transformer-based model that converts sign language videos into spoken sentences using a hierarchical alignment of frame, segment, and video-level features.
- It leverages pseudo-gloss supervision and contrastive video-language objectives to bridge the visual-linguistic modality gap with minimal annotation.
- Empirical results on benchmarks like RWTH-PHOENIX-Weather 2014T demonstrate improved BLEU-4 scores and reduced trainable parameters compared to earlier gloss-free models.
A Sign Language Translation Transformer (SLTT) is a neural architecture based on the transformer paradigm specifically designed to convert sign language videos into spoken language sentences, typically in a gloss-free, end-to-end, and annotation-light regime. The focus is on leveraging the hierarchical linguistic structure of sign language and bridging the visual-linguistic modality gap, with minimal or no reliance on explicit gloss annotations. Recent research exemplified by (Asasi et al., 9 Jul 2025) has established new state-of-the-art methods for this task by incorporating hierarchical feature alignment, pseudo-gloss supervision, and contrastive video-language objectives.
1. Definition and Hierarchical Feature Alignment
In the SLTT framework, sign language videos are first represented at different linguistic levels:
- Frame Level: Fine-grained visual tokens, each representing a snapshot in the video sequence.
- Segment Level: Groups of frames corresponding approximately to gloss-like or sign-morpheme units, identified by temporal segmentation.
- Video Level: The global, sentence-level semantic embedding that summarizes the entire signing event.
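To make the three granularities concrete, the following sketch shows plausible tensor shapes at each level; the frame count, feature width, and downsampling factor are illustrative assumptions rather than the paper's exact values.

```python
import torch

T, D = 128, 384                       # illustrative: 128 frames, 384-dim ViT-S features
frame_feats = torch.randn(T, D)       # frame level: one visual token per frame

# Segment level: temporal downsampling groups frames into gloss-like units
stride = 4                            # assumed downsampling factor
segment_feats = frame_feats.reshape(T // stride, stride, D).mean(dim=1)

# Video level: a single sentence-level embedding summarizing the whole clip
video_feat = segment_feats.mean(dim=0)

print(frame_feats.shape, segment_feats.shape, video_feat.shape)
# torch.Size([128, 384]) torch.Size([32, 384]) torch.Size([384])
```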
The central architectural principle is explicit hierarchical pretraining in which features are extracted and aligned at both segment and video levels:
- At the segment level, visual features are aligned to pseudo-glosses—automatically derived, lemmatized units from the spoken sentence via POS-tagging and embedded with fastText. Soft, weakly supervised alignment is achieved by maximizing cosine similarity between each segment's projected features and pseudo-gloss prototypes, using a cross-entropy indicator loss.
- At the video level, global visual embeddings are contrastively aligned to spoken language sentence embeddings, using a CLIP-style bidirectional contrastive loss that ensures tight semantic coupling between modalities.
Both levels are incorporated into the total pretraining loss $\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{contrastive}} + \lambda\,\mathcal{L}_{\text{PGP}}$, where $\mathcal{L}_{\text{PGP}}$ is the pseudo-gloss prototype loss and $\lambda$ regulates their balance.
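A minimal sketch of the segment-level term and its combination with the video-level contrastive term (itself sketched under Section 3 below) is given here. The temperature `tau`, the presence-pooling scheme, and the weight `lam` are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pseudo_gloss_loss(segment_feats, prototypes, gloss_present, tau=0.1):
    """Soft-assign segments to pseudo-gloss prototypes and score the predicted
    per-gloss presence against binary indicators derived from the sentence."""
    # cosine similarities between projected segment features and fastText prototypes
    sims = F.normalize(segment_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
    p_gloss = F.softmax(sims / tau, dim=1)     # softmax over prototypes, per segment
    p_time = F.softmax(sims / tau, dim=0)      # softmax over time, per prototype
    # presence score per pseudo-gloss: time-weighted average of segment assignments
    pred_presence = (p_gloss * p_time).sum(dim=0)          # values in (0, 1)
    return F.binary_cross_entropy(pred_presence, gloss_present.float())

def pretraining_loss(contrastive_loss, pgp_loss, lam=1.0):
    # L_pretrain = L_contrastive + lam * L_PGP
    return contrastive_loss + lam * pgp_loss
```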
2. Transformer-based and Hybrid Architectural Components
Modern SLTT systems leverage specialized transformer components:
- Visual Encoder: A Vision Transformer (ViT-S/14 from DINOv2) processes raw video frames, extracting semantically rich representations via learned global self-attention; frame features are then projected and batch-normalized.
- Temporal Encoder: Spatio-temporal transformer layers aggregate features over time using local self-attention (window size 7), temporal downsampling to keep long videos manageable, and rotary position embeddings (RoPE) to account for temporal/spatial structure.
- LLM Encoder/Decoder: An mBART model (12-layer transformer) adapted with LoRA for parameter-efficient fine-tuning, leveraging large language modeling capacity while updating only the adapters and selected encoder layers.
The integration of pre-trained visual and language transformer models ensures both high feature expressivity and fine-grained cross-modal representation learning.
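The sketch below shows one way these components could be composed end to end. The frozen DINOv2 backbone loaded via `torch.hub`, the plain `nn.TransformerEncoder` standing in for the windowed spatio-temporal encoder with RoPE, the checkpoint `facebook/mbart-large-cc25`, and all layer sizes are illustrative stand-ins, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import MBartForConditionalGeneration

class SLTTSketch(nn.Module):
    """Illustrative composition: frozen ViT frame encoder -> temporal encoder -> mBART."""
    def __init__(self, feat_dim=384, llm_dim=1024):
        super().__init__()
        # DINOv2 ViT-S/14 as the frame-level visual encoder (kept frozen here)
        self.visual = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        for p in self.visual.parameters():
            p.requires_grad = False
        self.frame_proj = nn.Sequential(nn.Linear(feat_dim, feat_dim),
                                        nn.BatchNorm1d(feat_dim))
        # stand-in for the local-attention spatio-temporal encoder
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=6, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        # lightweight linear mapper into the language model's embedding space
        self.mapper = nn.Linear(feat_dim, llm_dim)
        self.llm = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

    def forward(self, frames, labels):
        # frames: (B, T, 3, 224, 224) -> per-frame features (B, T, feat_dim)
        B, T = frames.shape[:2]
        feats = self.visual(frames.flatten(0, 1)).view(B, T, -1)
        feats = self.frame_proj(feats.flatten(0, 1)).view(B, T, -1)
        feats = self.temporal(feats)                     # temporal aggregation
        inputs_embeds = self.mapper(feats)               # video tokens fed to mBART
        return self.llm(inputs_embeds=inputs_embeds, labels=labels).loss
```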
3. Pretraining Regime and Pseudo-gloss Supervision
Pretraining combines two levels of alignment:
- Pseudo-gloss Alignment (Segment Level):
- For a segment feature $\mathbf{s}_t$ and pseudo-gloss prototype matrix $\mathbf{P}$, similarity logits $z_{t,g} = \cos(\mathbf{s}_t, \mathbf{p}_g)$ are computed between each segment and every prototype.
- Segment features are softly assigned to pseudo-glosses using softmax and temperature scaling across both time and prototype dimensions.
- The loss is calculated as the BCE between predicted segment-to-gloss assignments and the ground-truth presence indicators (from text-provided pseudo-gloss sets).
- Contrastive Video-Sentence Alignment (Video Level):
- Video and sentence representations are average-pooled and contrastively aligned via a symmetric, CLIP-style objective $\mathcal{L}_{\text{contrastive}} = \tfrac{1}{2}\bigl(\mathcal{L}_{v \to t} + \mathcal{L}_{t \to v}\bigr)$, where each direction is a cross-entropy over temperature-scaled cosine similarities between matched and unmatched video-sentence pairs in the batch.
- This bidirectional framing ensures both modalities are robustly mapped into a joint semantic space.
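Below is a minimal sketch of this video-sentence term, assuming a batch of pooled video embeddings and matching sentence embeddings; the temperature and the symmetric cross-entropy formulation follow the standard CLIP recipe rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, sent_emb, temperature=0.07):
    """Bidirectional InfoNCE: matched video/sentence pairs lie on the diagonal."""
    v = F.normalize(video_emb, dim=-1)        # (B, D) pooled video embeddings
    t = F.normalize(sent_emb, dim=-1)         # (B, D) pooled sentence embeddings
    logits = v @ t.T / temperature            # (B, B) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> sentence direction
    loss_t2v = F.cross_entropy(logits.T, targets)    # sentence -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```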
Ablations reveal that upweighting pseudo-gloss alignment (increasing $\lambda$) yields a roughly linear increase in BLEU-4, confirming the critical value of multi-level guidance.
4. Technical Implementation and Training Efficiency
The SLTT pipeline achieves high empirical efficiency:
- Pretraining: Only adapters (LoRA) in mBART and a subset of encoder layers are trained, keeping parameter updates low (15.2M trainable vs. 215.6M in earlier gloss-free models).
- Fine-tuning: A lightweight linear mapper post-temporal encoder adapts video features for the LLM decoder.
- Optimization: Pretraining runs for 100 epochs, followed by two-phase fine-tuning: first only the mapper and decoder are trained, then all trainable parameters.
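The parameter-efficiency claim can be illustrated with a standard PEFT/LoRA setup on mBART; the rank, target modules, and other hyperparameters shown here are assumptions for the sketch, not the reported configuration.

```python
from transformers import MBartForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# Attach low-rank adapters to the attention projections; everything else stays frozen.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],      # assumed adapter placement
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()            # only the adapter weights are updated
```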
Features, such as the 300-dimensional fastText embeddings for roughly 2,322 pseudo-glosses (PHOENIX14T), are well matched to the visual representation space, and segment-level projections add negligible overhead.
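How the pseudo-gloss vocabulary and its prototypes might be built from the spoken sentences alone can be sketched as follows; the spaCy pipeline, the retained POS tags, and the fastText model file are illustrative assumptions about this text-only extraction step.

```python
import fasttext   # pip install fasttext; cc.de.300.bin from the fastText releases
import spacy      # pip install spacy && python -m spacy download de_core_news_sm

nlp = spacy.load("de_core_news_sm")         # German pipeline (PHOENIX14T is German)
ft = fasttext.load_model("cc.de.300.bin")   # 300-dimensional German word vectors

def pseudo_glosses(sentence, keep=("NOUN", "VERB", "ADJ", "ADV", "NUM", "PROPN")):
    """Lemmatize and POS-filter a spoken sentence into a set of pseudo-glosses."""
    return sorted({tok.lemma_.lower() for tok in nlp(sentence) if tok.pos_ in keep})

glosses = pseudo_glosses("am tag regnet es im süden")         # toy weather sentence
prototypes = [ft.get_word_vector(g) for g in glosses]         # one 300-d prototype each
```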
5. Empirical Results and State-of-the-Art Impact
Experiments conducted on RWTH-PHOENIX-Weather 2014T (German SLT benchmark) yield the strongest gloss-free, end-to-end results to date:
| Model | BLEU-4 | ROUGE |
|---|---|---|
| GFSLT-VLP | 21.44 | 42.49 |
| Sign2GPT | 22.52 | 48.90 |
| FLa-LLM | — | 45.27 |
| (This work) | 23.15 | 49.10 |
- Outperforms previous gloss-free methods on both BLEU-4 and ROUGE, surpassing Sign2GPT and FLa-LLM.
- Only the weakly-supervised VAP model (requiring near-gloss annotations) scores higher.
- Uses far fewer trainable parameters than computationally intensive baselines, supporting practical deployment and scalability.
6. Comparative and Methodological Significance
Relative to previous SLTT developments:
- Earlier models (GFSLT-VLP) lacked explicit fine-grained segment alignment, while Sign2GPT used pseudo-glosses only at the sentence level.
- FLa-LLM implemented a two-step curriculum but did not realize hierarchical (segment + video) feature alignment.
- The presented method is annotation-light but matches or exceeds their translation scores, sharply reducing the gap with gloss-supervised models without the need for manual gloss annotation.
The approach demonstrates that transformer-based SLTT architectures, when designed with hierarchical alignment and weak, text-derived pseudo-gloss supervision, are capable of achieving near-gloss-supervised quality, emphatically narrowing the modality gap in sign language translation.
7. Concluding Remarks
Hierarchical Feature Alignment in modern SLTT systems couples transformer-powered visual encoding, segment-level pseudo-gloss supervision, and global contrastive language alignment, yielding empirically superior performance for gloss-free, end-to-end sign language translation. This trajectory marks a critical innovation, alleviating the annotation burden, improving scalability, and validating the integration of self-supervised vision transformers, LLM-based adaptation (LoRA, mBART), and multi-level loss design for real-world and research-grade SLTT systems (Asasi et al., 9 Jul 2025).