SLTT: Sign Language Translation Transformer

Updated 31 October 2025
  • SLTT is a transformer-based model that converts sign language videos into spoken sentences using a hierarchical alignment of frame, segment, and video-level features.
  • It leverages pseudo-gloss supervision and contrastive video-language objectives to bridge the visual-linguistic modality gap with minimal annotation.
  • Empirical results on benchmarks like RWTH-PHOENIX-Weather 2014T demonstrate improved BLEU-4 scores and reduced trainable parameters compared to earlier gloss-free models.

A Sign Language Translation Transformer (SLTT) is a transformer-based neural architecture designed to convert sign language videos into spoken language sentences, typically in a gloss-free, end-to-end, annotation-light regime. The focus is on exploiting the hierarchical linguistic structure of sign language and bridging the visual-linguistic modality gap with minimal or no reliance on explicit gloss annotations. Recent research, exemplified by (Asasi et al., 9 Jul 2025), has established new state-of-the-art results for this task by combining hierarchical feature alignment, pseudo-gloss supervision, and contrastive video-language objectives.

1. Definition and Hierarchical Feature Alignment

In the SLTT framework, sign language videos are first represented at three linguistic levels (a minimal pooling sketch follows the list):

  • Frame Level: Fine-grained visual tokens, each representing a snapshot in the video sequence.
  • Segment Level: Groups of frames corresponding approximately to gloss-like or sign-morpheme units, identified by temporal segmentation.
  • Video Level: The global, sentence-level semantic embedding that summarizes the entire signing event.
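
As a concrete illustration of these three levels, the minimal sketch below mean-pools per-frame features into fixed-length segments and then into a single video embedding. The fixed segment length and the use of simple mean-pooling are assumptions made for clarity; the paper's temporal segmentation may be learned or adaptive.

```python
import torch
import torch.nn.functional as F

def hierarchical_features(frame_feats: torch.Tensor, seg_len: int = 8):
    """Sketch of the three representation levels.

    frame_feats: (T, D) per-frame visual tokens.
    seg_len:     assumed fixed segment length; the paper's segmentation
                 may be adaptive rather than fixed-window.
    """
    T, D = frame_feats.shape
    pad = (-T) % seg_len                             # pad so T divides evenly
    x = F.pad(frame_feats, (0, 0, 0, pad))
    # Segment level: mean-pool each window of seg_len frames.
    seg_feats = x.view(-1, seg_len, D).mean(dim=1)   # (ceil(T/seg_len), D)
    # Video level: global average over segments.
    vid_feat = seg_feats.mean(dim=0)                 # (D,)
    return frame_feats, seg_feats, vid_feat
```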

The central architectural principle is explicit hierarchical pretraining in which features are extracted and aligned at both segment and video levels:

  • At the segment level, visual features are aligned to pseudo-glosses: lemmatized units automatically derived from the spoken sentence via POS-tagging and embedded with fastText (a derivation sketch follows this list). Soft, weakly supervised alignment is achieved by maximizing cosine similarity between each segment's projected features and pseudo-gloss prototypes, using a cross-entropy indicator loss.
  • At the video level, global visual embeddings are contrastively aligned to spoken language sentence embeddings, using a CLIP-style bidirectional contrastive loss that ensures tight semantic coupling between modalities.
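
The sketch below illustrates how pseudo-glosses might be derived. The spaCy tagger model, the retained POS classes, and the fastText vector file are illustrative assumptions, not settings confirmed by the paper.

```python
# Pseudo-gloss derivation sketch (assumptions flagged in comments).
import numpy as np
import spacy
import fasttext

nlp = spacy.load("de_core_news_sm")        # German tagger (PHOENIX14T is German)
ft = fasttext.load_model("cc.de.300.bin")  # 300-d German fastText vectors

KEEP = {"NOUN", "VERB", "ADJ", "ADV", "PROPN", "NUM"}  # assumed content POS

def pseudo_glosses(sentence: str):
    """Lemmatized content words of the spoken sentence -> pseudo-glosses,
    plus their fastText prototype matrix P."""
    glosses = [t.lemma_.lower() for t in nlp(sentence) if t.pos_ in KEEP]
    P = (np.stack([ft.get_word_vector(g) for g in glosses])
         if glosses else np.zeros((0, 300), dtype=np.float32))
    return glosses, P
```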

Both levels are incorporated into the total pretraining loss

$$\mathcal{L}_{\text{pre-train}} = \mathcal{L}_{\text{align}} + \lambda\,\mathcal{L}_{\text{psp}}$$

where $\mathcal{L}_{\text{psp}}$ is the pseudo-gloss prototype loss and $\lambda$ regulates their balance.

2. Transformer-based and Hybrid Architectural Components

Modern SLTT systems leverage specialized transformer components:

  • Visual Encoder: a Vision Transformer (ViT-S/14 from DINOv2) processes raw video frames, extracting semantically rich representations via learned global self-attention. Frame features are projected and batch-normalized.
  • Temporal Encoder: spatio-temporal transformer layers aggregate features over time using local self-attention (window size = 7), downsampling for long videos, and rotary position embeddings (RoPE) to account for temporal/spatial structure (a windowed-attention sketch follows this list).
  • LLM Encoder/Decoder: mBART (a 12-layer transformer) with LoRA for parameter-efficient adaptation, leveraging large language modeling capacity while fine-tuning only adapters and selected encoder layers (a LoRA sketch closes this section).
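
As a rough sketch of the temporal encoder, the block below applies self-attention within non-overlapping 7-frame windows and then halves the sequence length with a strided convolution. The head count, residual layout, and downsampling operator are assumptions, and RoPE is omitted for brevity.

```python
import torch
import torch.nn as nn

class LocalTemporalBlock(nn.Module):
    """One temporal-encoder layer (sketch): windowed self-attention
    (window size 7) followed by 2x temporal downsampling. The paper
    additionally applies RoPE inside the attention; omitted here."""
    def __init__(self, dim: int, window: int = 7, heads: int = 8):
        super().__init__()  # note: heads must divide dim
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Conv1d(dim, dim, kernel_size=2, stride=2)  # halve T

    def forward(self, x):                        # x: (B, T, D)
        B, T, D = x.shape
        pad = (-T) % self.window
        x = nn.functional.pad(x, (0, 0, 0, pad))
        w = x.view(-1, self.window, D)           # non-overlapping windows
        h, _ = self.attn(self.norm(w), self.norm(w), self.norm(w))
        x = (w + h).view(B, -1, D)[:, :T]        # residual, drop padding
        return self.down(x.transpose(1, 2)).transpose(1, 2)
```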

The integration of pre-trained visual and language transformer models ensures both high feature expressivity and fine-grained cross-modal representation learning.
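
On the language side, the hedged sketch below shows how mBART can be wrapped with LoRA adapters via HuggingFace's peft library; the checkpoint name, rank, scaling factor, and target modules are illustrative choices rather than the paper's reported configuration.

```python
from transformers import MBartForConditionalGeneration
from peft import LoraConfig, get_peft_model

mbart = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections in mBART
    lora_dropout=0.05,
)
model = get_peft_model(mbart, lora_cfg)
model.print_trainable_parameters()         # only adapter weights update
```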

3. Pretraining Regime and Pseudo-gloss Supervision

Pretraining proceeds in two distinct stages (see the code sketch after this list):

  1. Pseudo-gloss Alignment (Segment Level):

    • For a segment feature $z'_i$ and pseudo-gloss prototype matrix $P$, the similarity is

    $$s_i = \operatorname{sim}(z'_i, P) = \frac{z'_i \cdot P}{\|z'_i\|\,\|P\|}$$

  • Segment features are softly assigned to pseudo-glosses using softmax and temperature scaling across both time and prototype dimensions.
  • The loss is the binary cross-entropy (BCE) between predicted segment-to-gloss assignments and ground-truth presence indicators (from the text-derived pseudo-gloss sets).
  2. Contrastive Video-Sentence Alignment (Video Level):

    • Video and sentence representations are average-pooled and contrastively aligned:

    $$\mathcal{L}_{\text{align}} = -\frac{1}{2B}\left(\sum_{j=1}^{B}\log\frac{\exp(\operatorname{sim}(\tilde{M}_j, \tilde{L}_j)/\tau)}{\sum_{k=1}^{B}\exp(\operatorname{sim}(\tilde{M}_j, \tilde{L}_k)/\tau)} + \sum_{j=1}^{B}\log\frac{\exp(\operatorname{sim}(\tilde{L}_j, \tilde{M}_j)/\tau)}{\sum_{k=1}^{B}\exp(\operatorname{sim}(\tilde{L}_j, \tilde{M}_k)/\tau)}\right)$$

  • This bidirectional framing ensures both modalities are robustly mapped into a joint semantic space.
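
The sketch below implements both objectives in PyTorch. Tensor shapes, temperature values, and the aggregation of soft segment-to-gloss assignments into per-gloss predictions (a max over segments here) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def pseudo_gloss_loss(seg_feats, prototypes, targets, tau=0.1):
    """Segment-level pseudo-gloss prototype loss (sketch).

    seg_feats:  (S, D) projected segment features z'_i.
    prototypes: (G, D) fastText pseudo-gloss prototype matrix P.
    targets:    (G,)   float 0/1 presence indicators from the
                sentence's pseudo-gloss set.
    """
    # Cosine similarity between every segment and every prototype.
    sims = F.normalize(seg_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
    # Soft assignment with temperature; taking a max over segments is an
    # assumption -- the paper also normalizes over the time dimension.
    probs = torch.softmax(sims / tau, dim=-1)        # (S, G)
    pred = probs.max(dim=0).values                   # (G,)
    return F.binary_cross_entropy(pred.clamp(1e-6, 1 - 1e-6), targets)

def clip_align_loss(video_emb, text_emb, tau=0.07):
    """Bidirectional CLIP-style video <-> sentence contrastive loss."""
    v = F.normalize(video_emb, dim=-1)               # (B, D) pooled videos
    t = F.normalize(text_emb, dim=-1)                # (B, D) pooled sentences
    logits = v @ t.T / tau                           # (B, B) similarities
    labels = torch.arange(v.size(0), device=v.device)
    # Average of video->text and text->video InfoNCE terms.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```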

Ablation reveals that upweighting the pseudo-gloss alignment term (increasing $\lambda$) improves BLEU-4 roughly linearly, confirming the value of multi-level guidance.

4. Technical Implementation and Training Efficiency

The SLTT pipeline achieves high empirical efficiency:

  • Pretraining: Only adapters (LoRA) in mBART and a subset of encoder layers are trained, keeping parameter updates low (15.2M trainable vs. 215.6M in earlier gloss-free models).
  • Fine-tuning: A lightweight linear mapper post-temporal encoder adapts video features for the LLM decoder.
  • Optimization: pretraining spans 100 epochs, followed by two-phase fine-tuning: first training only the mapper/decoder, then all trainable parameters (sketched below).
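
A sketch of this freezing schedule follows; the module names and the substring test for adapter weights are placeholders rather than the authors' code.

```python
def set_trainable(model, phase: int):
    """Two-phase fine-tuning schedule (sketch). `mapper`, `temporal_encoder`,
    and `llm` are hypothetical module names on a combined SLTT model."""
    for p in model.parameters():
        p.requires_grad = False
    if phase == 1:
        # Phase 1: only the linear mapper and the LoRA adapters train.
        for p in model.mapper.parameters():
            p.requires_grad = True
        for name, p in model.llm.named_parameters():
            if "lora" in name:               # LoRA adapter weights only
                p.requires_grad = True
    else:
        # Phase 2: unfreeze everything designated as trainable.
        for name, p in model.named_parameters():
            if any(k in name for k in ("lora", "mapper", "temporal_encoder")):
                p.requires_grad = True
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"phase {phase}: {n / 1e6:.1f}M trainable parameters")
```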

Features such as the 300-dimensional fastText embeddings for roughly 2,322 pseudo-glosses (PHOENIX14T) are well matched to the visual representation space, and segment-level projections add negligible overhead.

5. Empirical Results and State-of-the-Art Impact

Experiments conducted on RWTH-PHOENIX-Weather 2014T (German SLT benchmark) yield the strongest gloss-free, end-to-end results to date:

Model         BLEU-4   ROUGE
GFSLT-VLP     21.44    42.49
Sign2GPT      22.52    48.90
FLa-LLM       n/a      45.27
(This work)   23.15    49.10
  • Outperforms previous gloss-free methods on both BLEU-4 and ROUGE, surpassing Sign2GPT and FLa-LLM.
  • Only the weakly-supervised VAP model (requiring near-gloss annotations) scores higher.
  • Uses far fewer trainable parameters than computationally intensive baselines, supporting practical deployment and scalability.

6. Comparative and Methodological Significance

Relative to previous SLTT developments:

  • Earlier models (GFSLT-VLP) lacked explicit fine-grained segment alignment, while Sign2GPT used pseudo-glosses only at the sentence level.
  • FLa-LLM implemented a two-step curriculum but did not realize hierarchical (segment + video) feature alignment.
  • The presented method is annotation-light but matches or exceeds their translation scores, sharply reducing the gap with gloss-supervised models without the need for manual gloss annotation.

The approach demonstrates that transformer-based SLTT architectures, when designed with hierarchical alignment and weak, text-derived pseudo-gloss supervision, are capable of achieving near-gloss-supervised quality, emphatically narrowing the modality gap in sign language translation.

7. Concluding Remarks

Hierarchical Feature Alignment in modern SLTT systems couples transformer-powered visual encoding, segment-level pseudo-gloss supervision, and global contrastive language alignment, yielding empirically superior performance for gloss-free, end-to-end sign language translation. This trajectory marks a critical innovation, alleviating the annotation burden, improving scalability, and validating the integration of self-supervised vision transformers, LLM-based adaptation (LoRA, mBART), and multi-level loss design for real-world and research-grade SLTT systems (Asasi et al., 9 Jul 2025).

References
  1. Asasi et al., 9 Jul 2025.