SLTT: Sign Language Translation Transformer
- SLTT is a transformer-based model that converts sign language videos into spoken sentences using a hierarchical alignment of frame, segment, and video-level features.
- It leverages pseudo-gloss supervision and contrastive video-language objectives to bridge the visual-linguistic modality gap with minimal annotation.
- Empirical results on benchmarks like RWTH-PHOENIX-Weather 2014T demonstrate improved BLEU-4 scores and reduced trainable parameters compared to earlier gloss-free models.
A Sign Language Translation Transformer (SLTT) is a neural architecture based on the transformer paradigm specifically designed to convert sign language videos into spoken language sentences, typically in a gloss-free, end-to-end, and annotation-light regime. The focus is on leveraging the hierarchical linguistic structure of sign language and bridging the visual-linguistic modality gap, with minimal or no reliance on explicit gloss annotations. Recent research exemplified by (Asasi et al., 9 Jul 2025) has established new state-of-the-art methods for this task by incorporating hierarchical feature alignment, pseudo-gloss supervision, and contrastive video-language objectives.
1. Definition and Hierarchical Feature Alignment
In the SLTT framework, sign language videos are first represented at different linguistic levels:
- Frame Level: Fine-grained visual tokens, each representing a snapshot in the video sequence.
- Segment Level: Groups of frames corresponding approximately to gloss-like or sign-morpheme units, identified by temporal segmentation.
- Video Level: The global, sentence-level semantic embedding that summarizes the entire signing event.
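To make the three granularities concrete, the following sketch shows plausible tensor shapes at each level; the frame count, feature width, and downsampling factor are illustrative assumptions rather than the paper's exact values.

```python
import torch

T, D = 128, 384                       # illustrative: 128 frames, 384-dim ViT-S features
frame_feats = torch.randn(T, D)       # frame level: one visual token per frame

# Segment level: temporal downsampling groups frames into gloss-like units
stride = 4                            # assumed downsampling factor
segment_feats = frame_feats.reshape(T // stride, stride, D).mean(dim=1)

# Video level: a single sentence-level embedding summarizing the whole clip
video_feat = segment_feats.mean(dim=0)

print(frame_feats.shape, segment_feats.shape, video_feat.shape)
# torch.Size([128, 384]) torch.Size([32, 384]) torch.Size([384])
```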
The central architectural principle is explicit hierarchical pretraining in which features are extracted and aligned at both segment and video levels:
- At the segment level, visual features are aligned to pseudo-glosses—automatically derived, lemmatized units from the spoken sentence via POS-tagging and embedded with fastText. Soft, weakly supervised alignment is achieved by maximizing cosine similarity between each segment's projected features and pseudo-gloss prototypes, using a cross-entropy indicator loss.
- At the video level, global visual embeddings are contrastively aligned to spoken language sentence embeddings, using a CLIP-style bidirectional contrastive loss that ensures tight semantic coupling between modalities.
Both levels are incorporated into the total pretraining loss $\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{contrastive}} + \lambda\,\mathcal{L}_{\text{PGP}}$, where $\mathcal{L}_{\text{PGP}}$ is the pseudo-gloss prototype loss and $\lambda$ regulates their balance.
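A minimal sketch of the segment-level term and its combination with the video-level contrastive term (itself sketched under Section 3 below) is given here. The temperature `tau`, the presence-pooling scheme, and the weight `lam` are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pseudo_gloss_loss(segment_feats, prototypes, gloss_present, tau=0.1):
    """Soft-assign segments to pseudo-gloss prototypes and score the predicted
    per-gloss presence against binary indicators derived from the sentence."""
    # cosine similarities between projected segment features and fastText prototypes
    sims = F.normalize(segment_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
    p_gloss = F.softmax(sims / tau, dim=1)     # softmax over prototypes, per segment
    p_time = F.softmax(sims / tau, dim=0)      # softmax over time, per prototype
    # presence score per pseudo-gloss: time-weighted average of segment assignments
    pred_presence = (p_gloss * p_time).sum(dim=0)          # values in (0, 1)
    return F.binary_cross_entropy(pred_presence, gloss_present.float())

def pretraining_loss(contrastive_loss, pgp_loss, lam=1.0):
    # L_pretrain = L_contrastive + lam * L_PGP
    return contrastive_loss + lam * pgp_loss
```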
2. Transformer-based and Hybrid Architectural Components
Modern SLTT systems leverage specialized transformer components:
- Visual Encoder: A Vision Transformer (ViT-S/14 from DINOv2) processes raw video frames, extracting semantically rich representations via learned global self-attention; frame features are then projected and batch-normalized.
- Temporal Encoder: Spatio-temporal transformer layers aggregate features over time using local self-attention (window size 7), temporal downsampling to keep long videos manageable, and rotary position embeddings (RoPE) to account for temporal/spatial structure.
- LLM Encoder/Decoder: An mBART model (12-layer transformer) adapted with LoRA for parameter-efficient fine-tuning, leveraging large language modeling capacity while updating only the adapters and selected encoder layers.
The integration of pre-trained visual and language transformer models ensures both high feature expressivity and fine-grained cross-modal representation learning.
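The sketch below shows one way these components could be composed end to end. The frozen DINOv2 backbone loaded via `torch.hub`, the plain `nn.TransformerEncoder` standing in for the windowed spatio-temporal encoder with RoPE, the checkpoint `facebook/mbart-large-cc25`, and all layer sizes are illustrative stand-ins, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import MBartForConditionalGeneration

class SLTTSketch(nn.Module):
    """Illustrative composition: frozen ViT frame encoder -> temporal encoder -> mBART."""
    def __init__(self, feat_dim=384, llm_dim=1024):
        super().__init__()
        # DINOv2 ViT-S/14 as the frame-level visual encoder (kept frozen here)
        self.visual = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        for p in self.visual.parameters():
            p.requires_grad = False
        self.frame_proj = nn.Sequential(nn.Linear(feat_dim, feat_dim),
                                        nn.BatchNorm1d(feat_dim))
        # stand-in for the local-attention spatio-temporal encoder
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=6, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        # lightweight linear mapper into the language model's embedding space
        self.mapper = nn.Linear(feat_dim, llm_dim)
        self.llm = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

    def forward(self, frames, labels):
        # frames: (B, T, 3, 224, 224) -> per-frame features (B, T, feat_dim)
        B, T = frames.shape[:2]
        feats = self.visual(frames.flatten(0, 1)).view(B, T, -1)
        feats = self.frame_proj(feats.flatten(0, 1)).view(B, T, -1)
        feats = self.temporal(feats)                     # temporal aggregation
        inputs_embeds = self.mapper(feats)               # video tokens fed to mBART
        return self.llm(inputs_embeds=inputs_embeds, labels=labels).loss
```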
3. Pretraining Regime and Pseudo-gloss Supervision
Pretraining combines two levels of alignment:
- Pseudo-gloss Alignment (Segment Level):
- For a segment feature $\mathbf{s}_t$ and pseudo-gloss prototype matrix $\mathbf{P}$, similarity logits $z_{t,g} = \cos(\mathbf{s}_t, \mathbf{p}_g)$ are computed between each segment and every prototype.
- Segment features are softly assigned to pseudo-glosses using softmax and temperature scaling across both time and prototype dimensions.
- The loss is calculated as the BCE between predicted segment-to-gloss assignments and the ground-truth presence indicators (from text-provided pseudo-gloss sets).
- Contrastive Video-Sentence Alignment (Video Level):
- Video and sentence representations are average-pooled and contrastively aligned via a symmetric, CLIP-style objective $\mathcal{L}_{\text{contrastive}} = \tfrac{1}{2}\bigl(\mathcal{L}_{v \to t} + \mathcal{L}_{t \to v}\bigr)$, where each direction is a cross-entropy over temperature-scaled cosine similarities between matched and unmatched video-sentence pairs in the batch.
- This bidirectional framing ensures both modalities are robustly mapped into a joint semantic space.
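Below is a minimal sketch of this video-sentence term, assuming a batch of pooled video embeddings and matching sentence embeddings; the temperature and the symmetric cross-entropy formulation follow the standard CLIP recipe rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, sent_emb, temperature=0.07):
    """Bidirectional InfoNCE: matched video/sentence pairs lie on the diagonal."""
    v = F.normalize(video_emb, dim=-1)        # (B, D) pooled video embeddings
    t = F.normalize(sent_emb, dim=-1)         # (B, D) pooled sentence embeddings
    logits = v @ t.T / temperature            # (B, B) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> sentence direction
    loss_t2v = F.cross_entropy(logits.T, targets)    # sentence -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```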
Ablations reveal that upweighting pseudo-gloss alignment (increasing $\lambda$) yields a roughly linear increase in BLEU-4, confirming the critical value of multi-level guidance.
4. Technical Implementation and Training Efficiency
The SLTT pipeline achieves high empirical efficiency:
- Pretraining: Only adapters (LoRA) in mBART and a subset of encoder layers are trained, keeping parameter updates low (15.2M trainable vs. 215.6M in earlier gloss-free models).
- Fine-tuning: A lightweight linear mapper post-temporal encoder adapts video features for the LLM decoder.
- Optimization: Pretraining runs for 100 epochs, followed by two-phase fine-tuning: first only the mapper and decoder are trained, then all trainable parameters.
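The parameter-efficiency claim can be illustrated with a standard PEFT/LoRA setup on mBART; the rank, target modules, and other hyperparameters shown here are assumptions for the sketch, not the reported configuration.

```python
from transformers import MBartForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# Attach low-rank adapters to the attention projections; everything else stays frozen.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],      # assumed adapter placement
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()            # only the adapter weights are updated
```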
Features, such as the 300-dimensional fastText embeddings for roughly 2,322 pseudo-glosses (PHOENIX14T), are well matched to the visual representation space, and segment-level projections add negligible overhead.
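How the pseudo-gloss vocabulary and its prototypes might be built from the spoken sentences alone can be sketched as follows; the spaCy pipeline, the retained POS tags, and the fastText model file are illustrative assumptions about this text-only extraction step.

```python
import fasttext   # pip install fasttext; cc.de.300.bin from the fastText releases
import spacy      # pip install spacy && python -m spacy download de_core_news_sm

nlp = spacy.load("de_core_news_sm")         # German pipeline (PHOENIX14T is German)
ft = fasttext.load_model("cc.de.300.bin")   # 300-dimensional German word vectors

def pseudo_glosses(sentence, keep=("NOUN", "VERB", "ADJ", "ADV", "NUM", "PROPN")):
    """Lemmatize and POS-filter a spoken sentence into a set of pseudo-glosses."""
    return sorted({tok.lemma_.lower() for tok in nlp(sentence) if tok.pos_ in keep})

glosses = pseudo_glosses("am tag regnet es im süden")         # toy weather sentence
prototypes = [ft.get_word_vector(g) for g in glosses]         # one 300-d prototype each
```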
5. Empirical Results and State-of-the-Art Impact
Experiments conducted on RWTH-PHOENIX-Weather 2014T (German SLT benchmark) yield the strongest gloss-free, end-to-end results to date:
| Model | BLEU-4 | ROUGE |
|---|---|---|
| GFSLT-VLP | 21.44 | 42.49 |
| Sign2GPT | 22.52 | 48.90 |
| FLa-LLM | — | 45.27 |
| (This work) | 23.15 | 49.10 |
- Outperforms previous gloss-free methods on both BLEU-4 and ROUGE, surpassing Sign2GPT and FLa-LLM.
- Only the weakly-supervised VAP model (requiring near-gloss annotations) scores higher.
- Uses far fewer trainable parameters than computationally intensive baselines, supporting practical deployment and scalability.
6. Comparative and Methodological Significance
Relative to previous SLTT developments:
- Earlier models (GFSLT-VLP) lacked explicit fine-grained segment alignment, while Sign2GPT used pseudo-glosses only at the sentence level.
- FLa-LLM implemented a two-step curriculum but did not realize hierarchical (segment + video) feature alignment.
- The presented method is annotation-light but matches or exceeds their translation scores, sharply reducing the gap with gloss-supervised models without the need for manual gloss annotation.
The approach demonstrates that transformer-based SLTT architectures, when designed with hierarchical alignment and weak, text-derived pseudo-gloss supervision, are capable of achieving near-gloss-supervised quality, emphatically narrowing the modality gap in sign language translation.
7. Concluding Remarks
Hierarchical Feature Alignment in modern SLTT systems couples transformer-powered visual encoding, segment-level pseudo-gloss supervision, and global contrastive language alignment, yielding empirically superior performance for gloss-free, end-to-end sign language translation. This trajectory marks a critical innovation, alleviating the annotation burden, improving scalability, and validating the integration of self-supervised vision transformers, LLM-based adaptation (LoRA, mBART), and multi-level loss design for real-world and research-grade SLTT systems (Asasi et al., 9 Jul 2025).