SLRT: Sign Language Recognition Transformer
- SLRT is a transformer-based model that uses pose triplet tokenization and a coupling d-VAE to convert continuous pose data into discrete semantic tokens.
- A Graph Convolutional Network extracts framewise features which are then modeled by a transformer encoder to capture both short-range and long-range temporal dependencies.
- Empirical results demonstrate that SLRT outperforms existing methods on major SLR benchmarks, highlighting its scalability and precision in sign language tasks.
A Sign Language Recognition Transformer (SLRT) is a transformer-based model engineered to learn and infer structured linguistic concepts from sign language data, leveraging the generalizability and contextual reasoning capabilities of self-attention architectures. SLRT has emerged as a dominant backbone for both isolated and continuous sign language recognition, achieving state-of-the-art performance by tightly coupling spatial pose modeling, discrete semantic tokenization, and contextual sequential modeling.
1. Pose Triplet Tokenization and Semantic Quantization
SLRT-based architectures address the fundamental challenge that pose data, unlike text, lies in a continuous, low-level signal space. To enable transformer-style pre-training, input frames are organized as "pose triplet units," each consisting of an upper-body pose $p^b_t$, a left-hand pose $p^l_t$, and a right-hand pose $p^r_t$. The pose triplet for frame $t$ is $u_t = (p^b_t, p^l_t, p^r_t)$.
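For concreteness, here is a minimal sketch of splitting a frame's keypoints into the triplet. The joint counts (9 upper-body joints, 21 per hand, a common hand-pose convention) and 2-D coordinates are illustrative assumptions, not the paper's exact configuration:

```python
import torch

# Assumed keypoint layout: body joints first, then left hand, then right hand.
N_BODY, N_HAND, DIM = 9, 21, 2  # 2-D keypoints assumed

def pose_triplet(frame_pose: torch.Tensor) -> tuple[torch.Tensor, ...]:
    """Split one frame's stacked keypoints into the (body, left, right) triplet."""
    body = frame_pose[:N_BODY]                     # upper-body joints p_b
    lhand = frame_pose[N_BODY:N_BODY + N_HAND]     # left-hand joints p_l
    rhand = frame_pose[N_BODY + N_HAND:]           # right-hand joints p_r
    return body, lhand, rhand

frame = torch.randn(N_BODY + 2 * N_HAND, DIM)      # one frame of keypoints
u_t = pose_triplet(frame)                          # u_t = (p_b, p_l, p_r)
```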
To bridge the semantic gap between continuous pose and discrete BERT-style tokenization, a discrete variational autoencoder (d-VAE) is employed as a "Coupling Tokenizer". Each part of the triplet (left hand, right hand, body) is independently quantized by its respective codebook, but the tokenization is jointly optimized to maximize correlation between codebooks, ensuring that the resulting indices preserve inter-part dependencies. Quantization is a nearest-codeword lookup, $z_t = \arg\min_k \lVert h_t - e_k \rVert_2$, where $h_t$ are d-VAE latent projections and $e_k$ are codebook entries. The discrete index tuple $(z^b_t, z^l_t, z^r_t)$ per frame forms the "pose pseudo-label."
The coupling tokenizer is optimized with a multi-part loss that combines per-component reconstruction errors with a vector quantization regularization term, using the straight-through estimator to propagate gradients through the non-differentiable quantization step.
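A minimal PyTorch sketch of the per-part quantization and loss, assuming a standard VQ-VAE-style formulation with a commitment weight `beta`; the cross-codebook coupling objective is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def quantize(h: torch.Tensor, codebook: torch.Tensor):
    """Nearest-codeword lookup with a straight-through estimator.

    h:        (T, D) d-VAE latent projections for one triplet part
    codebook: (K, D) learnable codebook entries
    """
    idx = torch.cdist(h, codebook).argmin(dim=-1)   # discrete pseudo-labels z_t
    e = codebook[idx]                               # quantized latents
    e_st = h + (e - h).detach()                     # gradients bypass the argmin
    return e_st, e, idx

def vq_loss(h, e, decoded, target, beta: float = 0.25):
    """Reconstruction + vector-quantization regularization for one part."""
    rec = F.mse_loss(decoded, target)               # pose reconstruction error
    codebook_term = F.mse_loss(e, h.detach())       # pull codebook toward encoder
    commit_term = F.mse_loss(h, e.detach())         # commit encoder to codebook
    return rec + codebook_term + beta * commit_term
```

Summing `vq_loss` over the body, left-hand, and right-hand codebooks yields the multi-part objective described above.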
2. Framewise Embedding and Temporal Modeling
For each frame, a Graph Convolutional Network (GCN) extracts structured pose features, capturing kinematic relations among joints. The sequence of framewise pose embeddings, enhanced by tokenization, is then supplied to a transformer encoder. The transformer leverages multi-head self-attention to model:
- Internal context: Fine short-range dependencies within a clip or sign.
- External context: Long-range dependencies, including coarticulation or co-occurring signs.
This context modeling, enhanced by position encoding, enables the SLRT to disambiguate subtle distinctions in sign execution that may manifest only in extended spatial-temporal correlations.
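A simplified sketch of the framewise embedding and temporal encoder, assuming a single graph-aggregation step in place of a full ST-GCN backbone and standard sinusoidal position encoding; `d_model`, `nhead`, and layer counts are illustrative:

```python
import math
import torch
import torch.nn as nn

class PoseTemporalEncoder(nn.Module):
    """Framewise graph aggregation followed by a temporal transformer encoder."""

    def __init__(self, n_joints: int = 51, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        # Placeholder adjacency (identity); a real model uses the normalized
        # skeleton graph and a deeper GCN stack.
        self.A = nn.Parameter(torch.eye(n_joints), requires_grad=False)
        self.lift = nn.Linear(2, d_model)           # per-joint lift of 2-D keypoints
        self.pool = nn.Linear(n_joints * d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, J, 2) pose sequences
        B, T, J, _ = x.shape
        h = self.lift(self.A @ x)                   # neighbor aggregation, then lift
        h = self.pool(h.reshape(B, T, -1))          # framewise embeddings (B, T, D)
        h = h + self._sinusoid(T, h.size(-1), h.device)  # position encoding
        return self.encoder(h)                      # short- and long-range context

    @staticmethod
    def _sinusoid(T: int, D: int, device) -> torch.Tensor:
        pos = torch.arange(T, device=device, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, D, 2, device=device, dtype=torch.float32)
                        * (-math.log(10000.0) / D))
        pe = torch.zeros(T, D, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe
```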
3. Masked Unit Modeling Pre-Training
Inspired by BERT's masked language modeling, SLRT adopts a Masked Unit Modeling (MUM) task. In each input sequence, a proportion of pose triplet units is masked at random. Masking occurs independently for hands and body, with each part masked with 50% probability. The model is trained to reconstruct the discrete pseudo-labels (quantized indices) of the masked units based only on the unmasked context,

$$\mathcal{L}_{\mathrm{MUM}} = -\sum_{t \in \mathcal{M}} \log p\big(z_t \mid u_{\setminus \mathcal{M}}\big),$$

where the sum decomposes into components for each part (body, left hand, right hand).
Unlike classic BERT (which reconstructs natural language tokens), SLRT reconstructs semantic pseudo-labels derived from pose, requiring the tokenizer's quantized indices as the prediction target. Empirically, pre-training with cross-entropy on discrete tokens consistently outperforms regression to original pose vectors.
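A sketch of the MUM objective under the masking scheme above; the dictionary-of-parts interface and Bernoulli mask sampling are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

PARTS = ("body", "lhand", "rhand")

def sample_masks(B: int, T: int, p: float = 0.5) -> dict:
    # Independent 50% masking per part, as described above (Bernoulli sampling assumed).
    return {part: torch.rand(B, T) < p for part in PARTS}

def mum_loss(logits: dict, pseudo_labels: dict, mask: dict) -> torch.Tensor:
    """Masked Unit Modeling: cross-entropy on quantized indices of masked units.

    logits:        {part: (B, T, K)} predictions from the transformer head
    pseudo_labels: {part: (B, T)}    codebook indices from the coupling tokenizer
    mask:          {part: (B, T)}    True where the part was masked
    """
    loss = 0.0
    for part in PARTS:                              # loss decomposes per part
        m = mask[part]                              # assumes >= 1 masked unit per part
        loss = loss + F.cross_entropy(
            logits[part][m],                        # predictions at masked units only
            pseudo_labels[part][m],
        )
    return loss
```

Note that the targets are the tokenizer's discrete indices, not pose coordinates, matching the observation that discrete reconstruction outperforms regression.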
4. Fine-Tuning and Multi-Modal Adaptation
After pre-training, the decoder is replaced with a multi-layer perceptron (MLP) prediction head. The transformer encoder is then fine-tuned on labeled SLR datasets, with the output layer supervised for gloss (sign) classification. For greater expressivity, especially in word-level or continuous SLR, fusion with an RGB network is possible (the "Ours(+RGB)" variant); late fusion combines the softmax outputs of the pose and RGB streams. All downstream tasks are supervised with cross-entropy.
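A minimal sketch of the late-fusion step; the fusion weight `w` is an assumption, as the source does not specify how the two streams are weighted:

```python
import torch
import torch.nn.functional as F

def late_fusion(pose_logits: torch.Tensor, rgb_logits: torch.Tensor,
                w: float = 0.5) -> torch.Tensor:
    """Combine softmax outputs of the pose and RGB streams ("Ours(+RGB)" setup)."""
    probs = (w * F.softmax(pose_logits, dim=-1)
             + (1 - w) * F.softmax(rgb_logits, dim=-1))
    return probs.argmax(dim=-1)                    # predicted gloss per sample
```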
Ablation results indicate that masking both hand and body channels during pre-training yields the strongest downstream accuracy. Additional gains are realized by scaling pre-training to larger pose datasets.
5. Empirical Performance and Analysis
SLRT-based models, especially those employing coupling tokenization and masked semantic pre-training, set new state-of-the-art results on all major isolated SLR benchmarks. The table below reports top-1 accuracy (%) on the MSASL splits:
| Method | MSASL100 | MSASL200 | MSASL1000 |
|---|---|---|---|
| ST-GCN | 59.84 | 52.91 | 36.03 |
| SignBERT | 76.09 | 70.64 | 49.54 |
| BEST (Ours) | 80.98 | 76.60 | 58.82 |
| Ours(+RGB) | 89.56 | 86.83 | 71.21 |
Similar gains are observed in WLASL, NMFs-CSL, and SLR500. The coupled tokenizer is empirically superior to both model-free clustering (e.g., K-means) and uncoupled codebooks. Standard regression-based pre-training underperforms masked discrete reconstruction. Downstream performance scales with pre-training data, with gains saturating beyond a certain sequence volume or dataset complexity.
6. Connection to Broader Research and Methodological Significance
SLRT represents a fusion of BERT-style unsupervised learning, context-sensitive temporal modeling, and domain-specific semantic structuring for embodied sign language data. It aligns with trends in using discrete variational autoencoders for non-text input quantization and with an architectural bias toward modular, interpretable stages: pose estimation, semantic quantization, contextual encoder, and task-specific output.
The advances validate that:
- Pose-based recognition driven by discrete, learned representations can match or outperform RGB-stream approaches, especially when skeletal data is robust.
- Pretext tasks tailored to domain signal structure (i.e., pose quantization + masking) are essential—direct transfer of NLP objectives is suboptimal due to semantic granularity mismatch.
- Coupling between hand and body pose semantics, explicitly modeled, enables more accurate tokenization and downstream contextualization than independent partitioning.
7. Limitations and Future Directions
While SLRT advances SLR state-of-the-art, its performance may be affected by pose extraction noise, limited availability of high-accuracy hand pose detectors in the wild, and potential information bottlenecks induced by codebook quantization. Current instantiations primarily target isolated SLR—expansion to continuous SLR or end-to-end sign language translation requires integrating temporal segmentation and possibly joint recognition-translation training objectives.
Scaling SLRT for multilingual or continuous sign language, incorporating adaptive tokenization, or integrating complementary modalities (such as facial keypoints or audio cues) constitute active research directions. Moreover, further benchmarking in the presence of adversarial pose occlusions or domain adaptation scenarios is necessary to fully characterize generalization across signer, environment, and capture device.