WLASL: Word-Level ASL Video Dataset
- WLASL is a large-scale benchmark for isolated sign language recognition, offering over 21,000 videos spanning roughly 2,000 unique ASL glosses.
- The dataset uses a rigorous pre-processing pipeline including frame sampling, resizing, and augmentation to ensure consistent video quality and comparability.
- WLASL enables evaluation of both holistic visual and pose-based models, with Transformer approaches achieving up to 75.58% Top-1 accuracy on the WLASL100 subset.
The Word-Level American Sign Language (WLASL) video dataset is a large-scale benchmark for isolated sign language recognition, specifically targeting the task of identifying American Sign Language (ASL) lexical items (glosses) from video. Designed to address the limitations of prior datasets characterized by small vocabularies and restricted variability, WLASL provides over 21,000 videos spanning approximately 2,000 unique ASL words or glosses, performed by multiple signers in unconstrained environments. This makes it the largest publicly available resource for research in word-level sign language recognition and model benchmarking, facilitating the development and comparison of novel machine learning approaches for this task (Li et al., 2019).
1. Dataset Composition and Scope
WLASL is structured into several subsets based on vocabulary size, with the most comprehensive, WLASL2000, encompassing around 2,000 words and over 21,000 video samples. Each clip represents a distinct instance of an ASL sign produced by different signers in varied settings, including both indoor and outdoor scenes. The frequently used subset, WLASL100, consists of the 100 most common glosses and contains 2,038 video samples, with individual glosses represented by 18 to 40 samples and a median per-gloss count of approximately 20 (Brettmann et al., 10 Apr 2025).
| Subset | No. of Glosses | No. of Videos | Videos per Gloss (range) |
|---|---|---|---|
| WLASL100 | 100 | 2,038 | 18–40 |
| WLASL2000 | ~2,000 | >21,000 | Variable |
The dataset's breadth in both vocabulary and visual context—arising from its collection across multiple signers and unconstrained backgrounds—supports robust evaluation of recognition techniques and generalization to real-world sign language recognition (SLR) scenarios.
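As a rough illustration of how the subsets are organized, the sketch below counts glosses and videos from the dataset's JSON metadata. The file name `WLASL_v0.3.json` and the `gloss`/`instances` field names are assumptions about the distributed metadata, not specifications stated in this section, and the "top-100 glosses" filter is only one plausible way to reproduce a WLASL100-style subset.

```python
import json
from collections import Counter

# Load the gloss-level metadata (file name and field names are assumptions).
with open("WLASL_v0.3.json") as f:
    entries = json.load(f)

# Hypothetical WLASL100-style subset: the 100 glosses with the most video instances.
entries.sort(key=lambda e: len(e["instances"]), reverse=True)
subset = entries[:100]

videos_per_gloss = Counter({e["gloss"]: len(e["instances"]) for e in subset})
print("glosses:", len(subset))
print("videos:", sum(videos_per_gloss.values()))
print("videos per gloss (min/max):",
      min(videos_per_gloss.values()), max(videos_per_gloss.values()))
```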
2. Video Specifications and Pre-processing Pipeline
Original video lengths in WLASL span from 12 to 203 frames (mean ≈62 frames), with source material acquired at varying frame rates. All clips undergo a standardized pre-processing sequence to mitigate variability and facilitate reproducible benchmarking (Brettmann et al., 10 Apr 2025):
- Frame Sampling: Both consecutive sampling (random consecutive F frames) and even sampling (F evenly spaced frames) are utilized, where F is typically 16 or 64. Clips shorter than F frames are padded by duplicating terminal frames.
- Resizing: Video frames are rescaled such that the smaller spatial dimension is at least 226 pixels; larger sides are capped at 256 pixels.
- Color Normalization: Conversion from BGR to RGB color space.
- Data Augmentation (Train Only): Applying a random 224×224 crop and horizontal flip with probability 0.5 across all frames within a clip.
- Testing Augmentation: Center-crop of 224×224 with no flips for evaluation.
The pre-processing pipeline directly follows protocols established in Li et al. 2020, ensuring continuity and comparability in reported results.
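A minimal sketch of this pre-processing sequence is given below, assuming OpenCV-style BGR frames of shape (H, W, 3). Function names, the choice of sampling mode per split, and the exact resize heuristic (here, shorter side scaled to 226 px; the 256 px cap on the longer side is omitted) are illustrative assumptions rather than the reference implementation.

```python
import random
import cv2
import numpy as np

def sample_frames(frames, F=64, even=False):
    # Pad clips shorter than F frames by duplicating the terminal frame.
    if len(frames) < F:
        frames = list(frames) + [frames[-1]] * (F - len(frames))
    if even:
        idx = np.linspace(0, len(frames) - 1, num=F).astype(int)  # even sampling
    else:
        start = random.randint(0, len(frames) - F)                # consecutive sampling
        idx = np.arange(start, start + F)
    return [frames[i] for i in idx]

def preprocess_clip(frames_bgr, train=True, F=64, even=False):
    """Sketch of a WLASL-style clip pipeline: sample, resize, BGR->RGB, crop/flip."""
    frames = sample_frames(frames_bgr, F=F, even=even)
    out = []
    for f in frames:
        h, w = f.shape[:2]
        scale = 226.0 / min(h, w)                     # shorter side -> 226 px (heuristic assumed)
        f = cv2.resize(f, (int(w * scale), int(h * scale)))
        f = cv2.cvtColor(f, cv2.COLOR_BGR2RGB)        # colour normalization
        out.append(f)
    h, w = out[0].shape[:2]
    if train:
        top, left = random.randint(0, h - 224), random.randint(0, w - 224)  # random 224x224 crop
        flip = random.random() < 0.5                                        # horizontal flip, p = 0.5
    else:
        top, left, flip = (h - 224) // 2, (w - 224) // 2, False             # center crop, no flip
    clip = np.stack([f[top:top + 224, left:left + 224] for f in out])
    if flip:
        clip = clip[:, :, ::-1]                       # flip the width axis consistently across frames
    return clip                                       # shape: (F, 224, 224, 3)
```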
3. Annotation Protocols and Dataset Splits
Each video is annotated with its corresponding gloss as defined in the WLASL lexicon, i.e., each instance is explicitly assigned an English word label. Subsequent studies do not report inter-annotator agreement or explicit quality-control procedures, deferring to the original dataset publication for curation specifics (Brettmann et al., 10 Apr 2025).
Dataset splits adhere to a 4:1:1 ratio—train, validation, and test—paralleling the protocol of Li et al. 2020. For WLASL100, this yields approximately 1,359 training, 340 validation, and 340 testing samples. For system comparison, it is standard to merge train and validation partitions for model fitting and retain the test set for both validation and final evaluation. No cross-validation or explicit signer-independent evaluations are described in the referenced studies.
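A minimal sketch of such a gloss-stratified 4:1:1 split is shown below; the helper name and the per-gloss rounding strategy are illustrative assumptions, and the splits published with WLASL should be used when results need to be comparable.

```python
import random
from collections import defaultdict

def split_4_1_1(samples, seed=0):
    """Gloss-stratified 4:1:1 train/val/test split over (video_id, gloss) pairs."""
    rng = random.Random(seed)
    by_gloss = defaultdict(list)
    for vid, gloss in samples:
        by_gloss[gloss].append(vid)

    train, val, test = [], [], []
    for gloss, vids in by_gloss.items():
        rng.shuffle(vids)
        n = len(vids)
        n_train = round(n * 4 / 6)   # 4 parts training
        n_val = round(n * 1 / 6)     # 1 part validation
        train += [(v, gloss) for v in vids[:n_train]]
        val   += [(v, gloss) for v in vids[n_train:n_train + n_val]]
        test  += [(v, gloss) for v in vids[n_train + n_val:]]   # remaining 1 part test
    return train, val, test
```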
4. Evaluation Metrics and Baselines
Performance in WLASL-based experiments is quantified using the Top-K accuracy metric for K = 1, 5, and 10, formalized as

$$\text{Top-}K\ \text{accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\, y_i \in \hat{Y}_i^{(K)} \,\right],$$

where $N$ is the number of test samples, $y_i$ the ground-truth gloss of sample $i$, and $\hat{Y}_i^{(K)}$ the set of $K$ classes assigned the highest predicted probability. For example, VideoMAE achieves a Top-1 recognition accuracy of 75.58% on WLASL100, outperforming the best classical I3D+CNN baseline of 65.89% (Brettmann et al., 10 Apr 2025). The adopted training objective for deep models is the multi-class cross-entropy loss:

$$\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c,$$

where $y_c$ and $\hat{y}_c$ denote the target indicator and predicted probability for class $c$, respectively.
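A minimal NumPy sketch of the Top-K accuracy computation follows; the function and variable names are illustrative, not part of any published evaluation code.

```python
import numpy as np

def top_k_accuracy(probs, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    probs:  (N, C) array of predicted class probabilities or logits.
    labels: (N,) array of ground-truth class indices.
    """
    # Indices of the k largest scores per sample (order within the top-k is irrelevant).
    topk = np.argpartition(probs, -k, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# Report Top-1/5/10 as in the WLASL evaluation protocol (inputs are hypothetical).
# for k in (1, 5, 10):
#     print(f"Top-{k}: {top_k_accuracy(probs, labels, k):.2%}")
```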
Two baseline modeling paradigms were introduced in the original WLASL publication: (i) holistic visual appearance-based models, and (ii) 2D human pose-based approaches. A further advance introduced a Pose-based Temporal Graph Convolutional Network (Pose-TGCN) to simultaneously capture spatial and temporal dependencies in signer kinematics (Li et al., 2019).
5. Challenges in Word-Level ASL Recognition
Analysis of WLASL highlights several domain-specific challenges:
- Signer Variability: Differences in hand size, signing speed, and personal style introduce intra-class variance.
- Environmental Diversity: The dataset comprises unconstrained camera viewpoints, lighting changes, and background clutter.
- Data Scarcity: The limited number of samples per class (typically 18–40) increases overfitting risk, especially for deep models.
- Lexical Similarity: Semantically or kinematically similar signs (e.g., “think” vs. “know”) require fine-grained spatiotemporal model acuity.
Transformer-based models, such as ViViT and VideoMAE, address some of these obstacles. Global self-attention allows robust exploitation of spatiotemporal dependencies and selective focus on discriminative regions (primarily hands and upper body), reducing susceptibility to irrelevant background information. The tube-based masking and high-ratio reconstruction strategies of VideoMAE further enhance robustness to missing data and natural video artifacts (Brettmann et al., 10 Apr 2025).
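As a rough illustration of the tube-masking idea (hiding the same spatial patches in every frame at a high masking ratio), the sketch below generates a VideoMAE-style boolean mask. The patch counts and the 90% ratio are illustrative assumptions, not the exact configuration used in the cited experiments.

```python
import numpy as np

def tube_mask(num_temporal_slices, num_spatial_patches, mask_ratio=0.9, seed=0):
    """VideoMAE-style tube mask: the same spatial patches are hidden in every temporal slice.

    Returns a boolean array of shape (num_temporal_slices, num_spatial_patches),
    where True marks a masked token.
    """
    rng = np.random.default_rng(seed)
    n_masked = int(num_spatial_patches * mask_ratio)
    masked = rng.choice(num_spatial_patches, size=n_masked, replace=False)
    spatial_mask = np.zeros(num_spatial_patches, dtype=bool)
    spatial_mask[masked] = True
    # Repeating the same spatial pattern across time yields the "tubes".
    return np.tile(spatial_mask, (num_temporal_slices, 1))

# e.g. 8 temporal token slices x 14*14 spatial patches, 90% masked
mask = tube_mask(8, 14 * 14, mask_ratio=0.9)
```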
6. Role of WLASL in Research and Benchmarking
WLASL serves as a standard benchmark for isolated word-level ASL recognition, enabling rigorous comparison across diverse recognition paradigms. Fine-tuning Transformer architectures (TimeSformer, VideoMAE) on WLASL100 has demonstrated that models can achieve high recognition accuracy (up to 75.58% Top-1) despite modest dataset sizes. Consistent split and pre-processing protocols facilitate reproducible evaluation and historical comparability, solidifying WLASL's status as the de facto benchmark for isolated ASL word recognition research (Brettmann et al., 10 Apr 2025).
A notable implication is the centrality of WLASL for driving advances in video-based SLR algorithms, fostering both methodological progress and broad accessibility of evaluation data within the research community.