Romanian Isolated Sign Language Recognition
- RoISLR is the task of automatically classifying isolated signs in Romanian Sign Language, benchmarked on the standardized RoCoISLR dataset.
- Benchmarks employ advanced video architectures, including transformer-based models and one-shot PoseFormer pipelines, with the strongest model reaching a top-1 accuracy of 34.1%.
- Key challenges, notably severe class imbalance and long-tail distributions, motivate multimodal strategies, data augmentation, and cross-lingual transfer.
Romanian Isolated Sign Language Recognition (RoISLR) focuses on the automatic classification of isolated signs in Romanian Sign Language (LSR) from video data. As a foundational task for building sign language technologies in low-resource settings, it presents unique challenges stemming from resource scarcity, severe class imbalance, signer and appearance variability, and the need for robust methods transferable across languages. The introduction of the RoCoISLR corpus and the adaptation of one-shot learning pipelines have established systematic benchmarks and new methodological baselines for this domain (Rîpanu et al., 16 Nov 2025, Vandendriessche et al., 27 Feb 2025).
1. Dataset Construction and Properties
RoISLR research has historically lacked standardized, large-scale annotated corpora. The RoCoISLR dataset constitutes the first major resource, aggregating 9,141 raw isolated-sign video clips, consolidated to 5,892 canonical glosses after rigorous cleaning and deduplication. Data were sourced from three open-access collections: DLMG (6,744 videos), PeSemne (1,191 videos), and "Miscellaneous" (1,206 videos). All videos present a single signer against a monochrome background, without overlays or watermarks.
Key dataset processing decisions include:
- Gloss label normalization via variant merging, filtering complex multi-hyphen labels, and high-threshold near-duplicate removal.
- Preprocessing for action recognition backbones: resizing videos to 224×224 pixels (336×336 for UniFormer V2), sampling at 25 fps, temporally truncating or padding to 64 frames, normalizing with ImageNet statistics, and providing JSON/TXT metadata for class mappings and splits (a minimal preprocessing sketch follows this list).
- Class frequency distribution exhibits a severe long-tail. Of 5,892 unique glosses, 67% have only one instance, ∼20% appear twice, and only a handful have more than five samples. For benchmarking, only glosses with at least two samples (1,926 classes) are used, with n – 1 videos for training and one for testing per class.
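A minimal sketch of the temporal truncation/padding and normalization steps described above, assuming decoded RGB frames arrive as a `(T, H, W, 3)` uint8 array; the function name and the zero-padding choice are illustrative, not taken from the RoCoISLR release.

```python
import numpy as np

# ImageNet normalization statistics (RGB), as used in the paper's preprocessing.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_clip(frames: np.ndarray, num_frames: int = 64) -> np.ndarray:
    """Truncate/pad a clip to `num_frames` and normalize with ImageNet stats.

    `frames` is assumed to be (T, H, W, 3) uint8, already resized to
    224x224 (336x336 for UniFormer V2) and sampled at 25 fps.
    """
    t = frames.shape[0]
    if t >= num_frames:
        frames = frames[:num_frames]            # temporal truncation
    else:
        pad = np.zeros((num_frames - t, *frames.shape[1:]), dtype=frames.dtype)
        frames = np.concatenate([frames, pad])  # zero-pad short clips
    clip = frames.astype(np.float32) / 255.0
    return (clip - IMAGENET_MEAN) / IMAGENET_STD  # broadcasts over (T, H, W, 3)
```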
2. Modeling Approaches and Training Protocols
RoCoISLR benchmarking employs seven state-of-the-art video architectures, all fine-tuned in MMAction2 from pretrained weights (ImageNet, Kinetics-400/710 as applicable; a hypothetical inference sketch follows the list):
- I3D: 3D convolutional ResNet-50, inflating 2D filters for spatio-temporal modeling.
- SlowFast: Dual-pathway; ResNet-101 (slow) for context and ResNet-50 (fast) for motion.
- Swin Transformer: Hierarchical transformer using shifted 3D local windows for spatial-temporal attention.
- TimeSformer: Pure transformer with separated spatial and temporal self-attention.
- UniFormer V2: Combines CNN stems with video ViT-style blocks.
- VideoMAE V2: ViT trained via masked autoencoding on video patches.
- PoseConv3D: 3D CNN operating on heatmap volumes built from 133 skeleton keypoints (body, face, hands, feet) extracted with MMPose.
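For orientation, a hypothetical inference call through MMAction2's high-level Python API; the config and checkpoint paths below are placeholders rather than released files, and the exact result fields vary across MMAction2 versions.

```python
# Hypothetical usage of MMAction2's high-level API; paths are placeholders.
from mmaction.apis import init_recognizer, inference_recognizer

config = 'configs/recognition/swin/swin_rocoislr.py'  # hypothetical config
checkpoint = 'work_dirs/swin_rocoislr/best.pth'       # hypothetical weights

model = init_recognizer(config, checkpoint, device='cuda:0')
result = inference_recognizer(model, 'demo_sign.mp4')
# The result object exposes per-class scores; the field name differs across
# MMAction2 versions (e.g., `pred_score` in recent 1.x releases).
print(result)
```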
Training hyperparameters encompass: 125 fine-tuning epochs, model-specific batch sizes (8–32 samples per GPU), learning rates (1e–3 for RGB models, 5e–4 for PoseConv3D) decayed at epochs 60 and 100, and cross-entropy loss. Augmentation strategies include random flip, cropping, color jittering, erasing, and multi-scale cropping for transformer-based models. The classification loss is standard cross-entropy:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c,$$

where $\hat{y}_c = \mathrm{softmax}(z)_c$ is the predicted probability for class $c$ and $y_c$ is the one-hot ground-truth label.
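The reported schedule translates directly into a standard training loop. Below is a minimal PyTorch sketch, assuming a model and data loader already exist; the SGD optimizer, momentum, and decay factor of 0.1 are illustrative assumptions, not reported values.

```python
import torch
from torch import nn, optim

model = nn.Linear(512, 1926)  # stand-in for a video backbone with a 1,926-way head
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # 5e-4 for PoseConv3D
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 100], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(125):                      # 125 fine-tuning epochs
    # for clips, labels in train_loader:      # augmentations applied in the loader
    #     loss = criterion(model(clips), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()                          # LR drops at epochs 60 and 100
```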
3. One-Shot and Cross-Lingual ISLR Pipelines
A complementary direction employs transferable, keypoint-based one-shot recognition with PoseFormer. The pipeline consists of:
- PoseFormer architecture: processes 2D keypoints per frame (extracted with MediaPipe Holistic), applying temporal convolutions, MLP frame embeddings, a further convolution, and multi-head self-attention; time pooling aggregates the per-frame features into a single vector that serves as the sign embedding.
- One-shot retrieval framework: for a support set of dictionary videos, compute and store PoseFormer embeddings; at query time, embed the query video, compute scaled dot-product similarity against the stored embeddings, and take the nearest neighbor as the predicted gloss (see the sketch after this list).
- Pretraining: Cross-entropy supervised, using large external datasets (ASL Citizen, VGT Corpus), stripping away appearance features for cross-lingual generalizability.
- Quantitative performance: for target vocabularies of size $N$, cross-lingual transfer is feasible. For $N \approx 100$, Recall@1 ≈ 0.6; for $N \approx 1{,}000$, Recall@1 ≈ 0.4; MRR for large (∼10,000-sign) sets is ≈ 0.5. This approach enables rapid adaptation to RoISLR via a single annotated exemplar per sign, with further improvement from few-shot domain adaptation (Vandendriessche et al., 27 Feb 2025).
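A minimal sketch of the retrieval step referenced above, assuming a PoseFormer-style `embed` callable that maps a keypoint sequence to a d-dimensional tensor; the $\sqrt{d}$ scaling mirrors standard scaled dot-product attention and may differ from the paper's exact similarity.

```python
import torch

def build_support_index(embed, support_videos):
    """Embed one dictionary exemplar per gloss and stack into an (N, d) index."""
    with torch.no_grad():
        return torch.stack([embed(v) for v in support_videos])

def predict_gloss(embed, query_video, support_index):
    """Scaled dot-product similarity against the support set; the nearest
    neighbor's row index is the predicted gloss."""
    with torch.no_grad():
        q = embed(query_video)                 # (d,) query embedding
        d = q.shape[0]
        sims = support_index @ q / d ** 0.5    # (N,) similarity scores
        return int(sims.argmax())
```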
4. Evaluation Metrics and Results
Top-k accuracy is the principal evaluation metric:

$$\text{Top-}k\ \text{accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\, y_i \in \operatorname{top}_k(\hat{p}_i) \,\right],$$

where $\hat{p}_i$ are the predicted class scores for sample $i$ and $\operatorname{top}_k$ returns the $k$ highest-scoring classes.
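Equivalently, as a function over predicted score matrices (a straightforward NumPy sketch):

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (N, C) per-class scores; labels: (N,) integer gloss indices.
    """
    topk = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))
```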
RoCoISLR benchmarks expose the following comparative results:
| Model | RoCoISLR Top-1 (%) | RoCoISLR Top-5 (%) | WLASL2000 Top-1 (%) | WLASL2000 Top-5 (%) |
|---|---|---|---|---|
| I3D | 24.0 | 32.3 | 32.5 | 57.3 |
| SlowFast | 23.8 | 26.0 | — | — |
| Swin Transformer | 34.1 | 40.7 | — | — |
| TimeSformer | 30.2 | 34.8 | — | — |
| UniFormer V2 | 20.7 | 32.8 | — | — |
| VideoMAE V2 | 23.4 | 32.1 | — | — |
| PoseConv3D | 25.7 | 30.7 | — | — |
The best transformer-based method, Swin Transformer, reaches 34.1% Top-1, outperforming the convolutional baselines (I3D, SlowFast, PoseConv3D) by roughly 10 percentage points, although not every transformer variant does so (UniFormer V2 and VideoMAE V2 trail I3D). Evaluation on WLASL2000, which offers higher per-class sample counts, yields markedly higher accuracy for the same I3D backbone, underscoring the impact of data scarcity and long-tail effects on RoCoISLR.
5. Challenges: Long-Tail Distributions and Low-Resource Regime
RoCoISLR exemplifies the "long-tail" class frequency phenomenon—a small number of frequent signs and a high number of rare or singleton glosses—leading to substantial degradation in generalization, feature space coverage, and bias toward overrepresented classes. 67% of glosses occur only once, and ∼20% only twice; the joint training/test protocol (n – 1:1 split for n ≥ 2) further amplifies data scarcity per class. This distribution is characteristic of low-resource, real‐world sign language corpora (Rîpanu et al., 16 Nov 2025).
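The protocol is straightforward to reproduce. Below is a sketch of the split construction, assuming an annotation list of `(video_path, gloss)` pairs; the shuffling seed and the choice of which clip to hold out are assumptions for illustration.

```python
from collections import defaultdict
import random

def make_splits(samples, seed=0):
    """RoCoISLR-style protocol sketch: keep glosses with >= 2 clips, hold out
    one clip per gloss for testing, and train on the remaining n - 1.

    `samples` is an assumed list of (video_path, gloss) pairs.
    """
    by_gloss = defaultdict(list)
    for path, gloss in samples:
        by_gloss[gloss].append(path)

    rng = random.Random(seed)
    train, test = [], []
    for gloss, clips in by_gloss.items():
        if len(clips) < 2:          # singleton glosses are excluded from benchmarking
            continue
        rng.shuffle(clips)
        test.append((clips[0], gloss))
        train.extend((c, gloss) for c in clips[1:])
    return train, test
```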
Recognized mitigation strategies include:
- Resampling: Over-sampling tail classes, under-sampling head classes.
- Loss adjustment: class-balanced weighting (e.g., weights $w_c \propto 1/n_c$ for a class with $n_c$ training samples), effective-number-of-samples weighting, or focal losses (see the sketch after this list).
- Few-shot/meta-learning: Prototypical networks, MAML, or similar approaches to improve tail-class recognition.
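As referenced in the loss-adjustment item above, a sketch of effective-number-of-samples weighting (Cui et al., 2019) plugged into the cross-entropy criterion from Section 2; the value of $\beta$ and the normalization are conventional choices, not settings reported for RoCoISLR.

```python
import numpy as np
import torch
from torch import nn

def class_balanced_weights(counts: np.ndarray, beta: float = 0.999) -> torch.Tensor:
    """Effective-number-of-samples weights: w_c = (1 - beta) / (1 - beta^{n_c}),
    rescaled so the weights sum to the number of classes."""
    effective_num = 1.0 - np.power(beta, counts)
    weights = (1.0 - beta) / effective_num
    weights = weights / weights.sum() * len(counts)
    return torch.tensor(weights, dtype=torch.float32)

# Usage with the cross-entropy criterion from Section 2:
# counts = np.array([...])  # per-gloss sample counts from the training split
# criterion = nn.CrossEntropyLoss(weight=class_balanced_weights(counts))
```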
A plausible implication is that the combination of these methods, potentially augmented with synthetic data or multimodal fusion, will be necessary to approach parity with well-resourced ISLR systems.
6. Recommendations and Future Research Directions
To address RoISLR-specific challenges and advance the field, the following directions have been outlined (Rîpanu et al., 16 Nov 2025, Vandendriessche et al., 27 Feb 2025):
- Signer and context diversification: Increase heterogeneity across age, handedness, and backgrounds to reduce overfitting and improve generalization.
- Multimodal pipelines: Integrate RGB, skeleton, and lip-reading streams to capture full spatial-temporal signatures of signs.
- Domain adaptation: Align data distributions across sources and corpora for robust transfer.
- End-to-end pipelines: Explore sign recognition coupled to gloss-to-text translation using LLMs.
- Synthetic augmentation: Employ GANs or graphics engines to enhance tail-class frequencies.
- Cross-lingual transfer and one-shot frameworks: Leverage pre-trained keypoint embedders like PoseFormer for immediate deployment on new sign languages with minimal annotation cost, and fine-tune on a small labeled set for higher accuracy.
7. System Implementation and Computational Considerations
Resource analysis for both RoCoISLR pipelines and one-shot PoseFormer-based systems includes:
- Pretraining and fine-tuning: on datasets of up to 80k videos, training takes 12–24 hours on 4×NVIDIA V100 GPUs (32 GB).
- Keypoint extraction: 30 fps on contemporary CPUs/GPUs.
- Inference: embedding a video (100–150 frames) takes 30–50 ms on GPU and ≈200 ms on CPU. Dense retrieval ($N = 10^4$) is feasibly realized via matrix multiplication or FAISS-based ANN search, with sub-10 ms latency for moderate vocabulary sizes (see the sketch after this list).
- Scalability: embedding storage for $N = 10^5$ ($d = 160$) requires ~64 MB; PoseFormer combined with ANN search scales to $N \approx 10^6$ entries.
- Expected performance for Romanian: Recall@1 ≈ 0.50–0.60 for N ≈ 100, declining gradually below 0.4 as dictionary size increases to N ≈ 1,000–10,000, with MRR ≈ 0.5 for large dictionaries (Vandendriessche et al., 27 Feb 2025).
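A minimal FAISS sketch of this retrieval path; random arrays stand in for PoseFormer embeddings, and `IndexFlatIP` performs exact dot-product search (approximate index types such as IVF or HNSW can be substituted at larger N).

```python
import faiss
import numpy as np

d, N = 160, 100_000                       # embedding dim and dictionary size from above
embeddings = np.random.rand(N, d).astype('float32')  # stand-in PoseFormer embeddings
# float32 storage: 100,000 * 160 * 4 bytes ~= 64 MB, matching the estimate above

index = faiss.IndexFlatIP(d)              # exact inner-product (dot-product) search
index.add(embeddings)

query = np.random.rand(1, d).astype('float32')
scores, ids = index.search(query, 5)      # top-5 nearest glosses for the query
```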
In summary, RoISLR represents a critical testbed for advancing ISLR research in underrepresented sign languages. Current benchmarks and methodologies establish a foundation for systematic evaluation, expose the limitations of conventional deep architectures in the few-shot regime, and motivate adoption of cross-lingual, keypoint-based, and transformer-centric approaches for robust, scalable, and accessible sign language technologies.