Toronto NeuroFace Dataset Overview
- The Toronto NeuroFace dataset is a clinically annotated collection of facial videos from patients with ALS or stroke and from healthy controls, intended for neurological research.
- It enables quantitative, non-invasive assessment of orofacial motor dysfunction by capturing standardized tasks such as repetitive phonation and static postures.
- The dataset employs precise manual annotation of 68 facial landmarks and rigorous preprocessing, supporting deep-learning models for dysarthria and motor impairment analysis.
The Toronto NeuroFace dataset is a publicly available, clinically acquired resource of annotated facial video data from patients with neurological disorders—primarily amyotrophic lateral sclerosis (ALS) and stroke—alongside healthy controls. It is the first open-access, video-based collection of facial landmark annotations sourced from neurological patients with the explicit aim of supporting deep-learning research for quantitative, non-invasive assessment of orofacial motor dysfunction, particularly in relation to dysarthria and related disorders (Migliorelli et al., 2023, Gomes et al., 2023).
1. Dataset Composition and Demographic Overview
The Toronto NeuroFace dataset comprises 36 individuals, distributed across three groups:
| Cohort | Number of Subjects | Gender (M/F) |
|---|---|---|
| ALS | 11 | 4 / 7 |
| Stroke | 14 | 10 / 4 |
| Healthy Control | 11 | 7 / 4 |
- Age: Explicit ages are undisclosed; healthy participants were selected to match the age profile of pathological groups.
- Total annotated images: 3,306 frames, partitioned as 1,015 healthy, 920 ALS, and 1,371 stroke.
- Overall gender distribution: 21 males, 15 females.
This cohort composition reflects an emphasis on matching control and disease demographics, but sample sizes, especially for ALS and control cases, are modest (Migliorelli et al., 2023).
2. Acquisition Protocols and Experimental Tasks
- Recording device: Intel® RealSense RGB camera, face distance set at 30–60 cm.
- Sampling: 30 frames per second; image resolution of 640×480, recorded under uniform clinical laboratory lighting with minimal background noise.
- Orofacial tasks: Standardized, clinically relevant tasks include:
  - Static facial postures (maximal mouth opening, lip protrusion, lip stretching)
  - Repetitive phonation (diadochokinetic sequence “pa-ta-ka”, rapid repetition of the syllable /pa/)
  - Additional tasks in ALS work: “kiss” (lip puckering), “blow” (imitating blowing out a candle), lip spread, natural rest, and the “BBP” sentence (“Buy Bobby a puppy”) (Gomes et al., 2023).
- Environment: Data collected in a controlled clinical laboratory to optimize lighting and minimize confounds such as variable head pose or occlusions.
Each video session was manually segmented into repetition-level clips according to the task structure, particularly for tasks requiring multiple iterations such as rapid syllable repetition and sentence articulation (Gomes et al., 2023).
3. Annotation Standards and Metadata
- Landmark annotation: 68 two-dimensional facial landmarks per frame, following the dlib-style convention referenced in the literature:
  - 17 jawline, 10 eyebrow, 9 nose, 12 eye, and 20 lip/mouth landmarks.
- Bounding box: Manually annotated, tightly enclosing the face in each image.
- Annotation method: Fully manual, frame-by-frame expert annotation for all 3,306 facial frames (Migliorelli et al., 2023).
- Metadata per frame/image: Includes subject identifier, categorical pathology label (ALS, stroke, or control), and task label.
- Coordinate system: Ground-truth (x, y) pixel locations on original RGB frames; depth or 3D data are not provided.
- Task breakdown—ALS identification subset: 921 patient/session clips comprising the following tasks and instance counts:
| Task | ALS Repetitions | Control Repetitions |
|---|---|---|
| SPREAD | 55 | 59 |
| KISS | 59 | 57 |
| OPEN | 54 | 55 |
| BLOW | 31 | 39 |
| BBP | 95 | 111 |
| PA | 100 | 110 |
| PATAKA | 88 | 108 |
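The 68-landmark grouping listed in this section (17 jawline, 10 eyebrow, 9 nose, 12 eye, 20 lip/mouth) can be written as index ranges. The sketch below assumes the common iBUG/dlib 68-point ordering with 0-based indices; whether the released annotation files use exactly this ordering is an assumption that should be checked against the data. The names `LANDMARK_REGIONS` and `select_region` are hypothetical helpers, not part of the dataset.

```python
# Index ranges (0-based, end-exclusive) under the common iBUG/dlib 68-point ordering.
# Assumption: the NeuroFace annotations follow this ordering; verify before use.
LANDMARK_REGIONS = {
    "jaw": range(0, 17),
    "eyebrows": range(17, 27),
    "nose": range(27, 36),
    "eyes": range(36, 48),
    "mouth": range(48, 68),
}


def select_region(landmarks, region):
    """Return the (x, y) points of one region from a 68-point landmark list."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    return [landmarks[i] for i in LANDMARK_REGIONS[region]]


# Region sizes implied by the ranges above.
region_sizes = {name: len(idx) for name, idx in LANDMARK_REGIONS.items()}

# Usage with placeholder coordinates (not real annotations).
dummy_landmarks = [(float(i), float(i)) for i in range(68)]
mouth_points = select_region(dummy_landmarks, "mouth")
```

Keeping the ranges in one mapping makes per-region evaluation and landmark subsetting reuse the same definition.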
4. Preprocessing, Data Splits, and Model Input Conventions
- Frame selection: For cyclic or repetitive movements, three frames per repetition are sampled (onset, peak, and midpoint) to maximize intra-individual kinematic diversity (Migliorelli et al., 2023).
- Data split: 32 subjects (cohort-balanced) are assigned to the train/validation set, and 4 gender-balanced individuals (2 ALS, 2 stroke) form an independent test set.
- Augmentation in neural network training: Random horizontal flipping (probability 0.5) and random brightness scaling by a factor sampled uniformly from [0.8, 1.2]. No upstream alignment or Procrustes normalization is applied; region proposals and alignment are computed dynamically during inference.
- ALS/control identification preprocessing: OpenFace 2.0 is used for automatic face detection, head pose estimation, and cropping of the face to 200×200 pixels with grayscale conversion. Landmark extraction is performed with FAN (Face Alignment Network), after which 26 anatomically relevant landmarks (lips and jawline) are retained for graph construction (Gomes et al., 2023).
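The held-out test set described above is defined at the subject level, so every frame of a test subject must stay out of training and validation. A minimal sketch of such a split follows; the record fields (`subject`, `label`, `task`) and the subject identifiers are hypothetical placeholders, not the dataset's actual file layout or the authors' chosen test subjects.

```python
def split_by_subject(frames, test_subjects):
    """Partition frame records so that no subject appears on both sides."""
    test_subjects = set(test_subjects)
    train_val = [f for f in frames if f["subject"] not in test_subjects]
    test = [f for f in frames if f["subject"] in test_subjects]
    return train_val, test


# Hypothetical frame records; real metadata fields may be named differently.
frames = [
    {"subject": "ALS_01", "label": "ALS", "task": "PA"},
    {"subject": "ALS_01", "label": "ALS", "task": "OPEN"},
    {"subject": "STROKE_03", "label": "stroke", "task": "PA"},
    {"subject": "HC_02", "label": "control", "task": "PA"},
    {"subject": "HC_02", "label": "control", "task": "OPEN"},
]

train_val, test = split_by_subject(frames, test_subjects=["ALS_01"])

# Subject sets must be disjoint, otherwise test results are optimistic.
train_subjects = {f["subject"] for f in train_val}
held_out_subjects = {f["subject"] for f in test}
```

Splitting on subject identifiers rather than on frames avoids leakage between near-identical frames of the same person.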
5. Evaluation Metrics and Statistical Performance
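- Primary metric: Normalized Mean Error (NME), quantifying average landmark localization error as a percentage of the bounding-box diagonal; computed for image $k$ as:

$$\mathrm{NME}_k = \frac{100}{N}\sum_{i=1}^{N}\frac{\lVert \mathbf{p}_i^{k} - \hat{\mathbf{p}}_i^{k} \rVert_2}{d_k}$$

where $\mathbf{p}_i^{k}$ and $\hat{\mathbf{p}}_i^{k}$ are the true and predicted coordinates of landmark $i$ for image $k$, $d_k$ is the annotated bounding-box diagonal, and $N$ is the number of landmarks evaluated ($N = 68$ for the full set).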
- Reported landmark localization results: Best model (“facial-landmark Mask RCNN”):
| Region | NME (%) |
|---|---|
| All (68) | 1.79 |
| Jaw (17) | 2.62 |
| Eyebrows (10) | 0.02 |
| Nose (9) | 1.55 |
| Eyes (12) | 1.03 |
| Mouth (20) | 1.49 |
Ablation studies (without pretraining or with vanilla Mask RCNN backbone) exhibited higher NMEs (2.7–13.6%). Variance and statistical significance estimates are not provided (Migliorelli et al., 2023).
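As an illustration of the metric, the sketch below computes NME for one image as the mean Euclidean landmark error divided by the bounding-box diagonal, expressed in percent. The bounding-box format `(x_min, y_min, x_max, y_max)` and the simple averaging of per-image values in `mean_nme_percent` are assumptions made for this example; the source does not specify how per-image values are aggregated.

```python
import math


def nme_percent(true_pts, pred_pts, bbox):
    """NME of one image, as a percentage of the bounding-box diagonal.

    bbox is assumed to be (x_min, y_min, x_max, y_max) in pixels.
    """
    if len(true_pts) != len(pred_pts) or not true_pts:
        raise ValueError("landmark lists must be non-empty and equally long")
    x_min, y_min, x_max, y_max = bbox
    diagonal = math.hypot(x_max - x_min, y_max - y_min)
    # Sum of Euclidean distances between matching landmarks.
    total_error = sum(math.dist(p, q) for p, q in zip(true_pts, pred_pts))
    return 100.0 * total_error / (len(true_pts) * diagonal)


def mean_nme_percent(samples):
    """Average per-image NME over (true_pts, pred_pts, bbox) triples."""
    return sum(nme_percent(*s) for s in samples) / len(samples)


# Toy example: a 30x40 box has diagonal 50; errors are 5 px and 0 px.
example = nme_percent(
    true_pts=[(0.0, 0.0), (10.0, 10.0)],
    pred_pts=[(3.0, 4.0), (10.0, 10.0)],
    bbox=(0.0, 0.0, 30.0, 40.0),
)
```

Per-region values such as those in the table can be obtained by passing only the landmarks of that region to the same function.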
- ALS identification protocol: A leave-one-subject-out (LOSO) framework holds out one subject at a time for testing, takes a majority vote over the frame-level predictions within each repetition, and uses separate splits for per-clip and per-subject evaluation (Gomes et al., 2023).
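The leave-one-subject-out protocol with majority voting can be sketched as follows. This is a simplified illustration, not the authors' code: the two-level vote (frames within a clip, then clips within a subject), the tie-breaking rule, and the label strings are assumptions introduced for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Most common label; ties go to the label encountered first."""
    return Counter(predictions).most_common(1)[0][0]


def loso_splits(subject_ids):
    """Yield (held_out_subject, training_subjects) pairs, one per subject."""
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        yield held_out, [s for s in subjects if s != held_out]


def subject_decision(frame_preds_by_clip):
    """Vote over frames within each clip, then over clips for the subject."""
    clip_labels = {
        clip: majority_vote(preds) for clip, preds in frame_preds_by_clip.items()
    }
    return clip_labels, majority_vote(list(clip_labels.values()))


# Hypothetical frame-level predictions for one held-out subject.
frame_preds = {
    "PA_rep1": ["ALS", "ALS", "HC"],
    "PA_rep2": ["HC", "HC", "ALS"],
    "KISS_rep1": ["ALS", "ALS", "ALS"],
}
clip_labels, subject_label = subject_decision(frame_preds)
splits = list(loso_splits(["s1", "s2", "s1", "s3"]))
```

In a full experiment, a model would be trained on the training subjects of each split and `subject_decision` applied to its predictions for the held-out subject.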
6. Applications in Automated Dysarthria and ALS Analysis
The Toronto NeuroFace dataset is foundational for:
- Benchmarking facial landmark detection models in neurological cohorts, supporting development of CNN (Mask RCNN-based) systems for telemonitoring of dysarthria (Migliorelli et al., 2023).
- Enabling research into computational phenotyping of ALS: Geometry-based “Facial Point Graph” approaches leverage the landmark data to construct Delaunay- and hub-augmented graphs processed by Graph Attention Networks, distinguishing ALS from healthy control based solely on facial motion during clinical tasks (Gomes et al., 2023).
Applied pipelines extract facial action information, in particular lip and jaw kinematics, which have been shown to be affected in bulbar-onset ALS, and support both frame-based and subject-wise discrimination. Each frame typically passes through landmark extraction, landmark selection, graph embedding, and GAT-based processing, followed by a final majority-vote classification.
7. Limitations, Biases, and Prospective Extensions
- Sample size: The number of ALS and control subjects remains limited, constraining statistical power and generalizability.
- Acquisition constraints: All data originate from a single device in a homogeneous, well-lit clinical setting; as such, real-world domain shifts including lighting variability, occlusions, and pose diversity are not represented.
- Task repertoire: Only a predefined set of speech and non-speech gestures is included; other orofacial behaviors are absent.
- Age/exclusion: Pediatric subjects and elderly individuals (>80 years) are not included; demographic scope is thus truncated.
- Manual annotation: Manual labeling, while presumed accurate, is labor-intensive, and inter-rater agreement has not been reported.
- Metadata gaps: Disease-specific variables such as ALS onset type, duration, and severity gradients are not included in the annotated release (Migliorelli et al., 2023, Gomes et al., 2023).
A plausible implication is that expansion to broader populations, more variable environmental conditions, and incorporation of diverse task types will be necessary to support robust, home-based, and longitudinal neurofunctional assessment.