NTHU-DDD: Drowsy Driving Video Dataset
- NTHU-DDD is a comprehensive video dataset capturing driver behavior under diverse illumination, eyewear, and head pose conditions in both simulated and real driving environments.
- It features detailed, frame-level annotations for facial regions such as eyes, mouth, and head pose, enabling the development and benchmarking of spatiotemporal deep learning models.
- It serves as a benchmark for driver monitoring systems research, with canonical splits and evaluation protocols that support methods fusing optical flow, patch-wise features, and temporal modeling.
The National Tsing Hua University Drowsy Driving Dataset (NTHU-DDD) is a multi-condition, temporally annotated video corpus designed to advance research in vision-based driver drowsiness detection. Collected in controlled cockpit and simulated driving environments, NTHU-DDD provides diverse illumination, eyewear, and head pose conditions, offering granular frame-level and region-specific labels. It has facilitated the development of spatiotemporal deep learning architectures and has become a widely used benchmark in the driver monitoring systems (DMS) research community.
1. Dataset Composition and Collection Protocol
NTHU-DDD consists of video recordings acquired from volunteer drivers (commonly cited figures range from 18 to 36 participants) under systematic variations of driving scenario, environmental illumination, and accessory usage. Each session produces continuous RGB video at 30 frames per second. Recordings cover both real-vehicle and simulated driving situations, capturing the driver’s face and upper body with a fixed, frontal, dashboard-mounted camera.
Each recording condition, defined by the combination of illumination (day/night) and eyewear status (bare face, prescription glasses, sunglasses), is encoded as a five-bit one-hot vector:
| Condition | Encoding |
|---|---|
| Day, bare face | 10000 |
| Day, prescription | 01000 |
| Day, sunglasses | 00001 |
| Night, bare face | 00010 |
| Night, prescription | 00100 |
Individual clips further sample driver behavior across diverse physical states: normal, talking/laughing, yawning, sleepy, nodding, and looking aside. The dataset comprises approximately 9.5–10.5 hours of footage; the training partition alone contains 722,223 frames and the evaluation set holds 173,259 frames. In the canonical split, 18 video sets are used for training, 1 is held out for validation, and 3 for testing, with a further 20 evaluation videos reserved for benchmarking (Shen et al., 2020, Yu et al., 2019, Lakhani, 2022).
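As a minimal illustration of how these condition codes can be consumed in a data pipeline, the mapping below simply mirrors the table above; the scenario names are placeholders, since the dataset ships as raw videos and any such lookup is user-defined:

```python
# Hypothetical lookup that mirrors the condition table above. The scenario
# names are illustrative; the official release does not define this mapping.
CONDITION_CODES = {
    "day_bareface":       [1, 0, 0, 0, 0],
    "day_prescription":   [0, 1, 0, 0, 0],
    "day_sunglasses":     [0, 0, 0, 0, 1],
    "night_bareface":     [0, 0, 0, 1, 0],
    "night_prescription": [0, 0, 1, 0, 0],
}

def condition_vector(scenario: str) -> list[int]:
    """Return the five-bit one-hot code for a recording scenario."""
    return CONDITION_CODES[scenario]
```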
2. Annotation Scheme and Labeling Granularity
NTHU-DDD is distinguished by its multi-level annotation protocol. Each video frame receives both a global fatigue status label—{Alert, Drowsy}—and fine-grained meta-labels for critical facial regions:
| Region | States |
|---|---|
| Eyes | {Stillness, Sleepy-eyes/closed} |
| Mouth | {Stillness, Yawning, Talking/Laughing} |
| Head pose | {Stillness, Nodding, Looking Aside} |
Frame-level annotations serve as ground truth for both standalone image-level and clip-wise temporal models. For methods operating on clips (e.g., 5-frame or 30-frame sequences), a positive “drowsy” label is assigned when a temporal overlap threshold is met, for example at least 3 of 5 frames carrying the drowsy label.
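Written out, this clip-level aggregation takes the following form (a reconstruction from the description above, not a formula quoted from the cited papers), for a clip of T frames with binary per-frame labels:

```latex
% y_t in {0,1}: per-frame drowsiness label; tau: overlap threshold
% (e.g., tau = 3/5 for 5-frame clips).
Y_{\text{clip}} =
\begin{cases}
\text{Drowsy}, & \frac{1}{T}\sum_{t=1}^{T} y_t \ge \tau \\
\text{Alert},  & \text{otherwise}
\end{cases}
```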
No explicit segment-level annotation is present beyond these windowed aggregations (Yu et al., 2019, Shen et al., 2020).
3. Preprocessing and Feature Extraction Pipelines
Preprocessing relies on robust facial region localization: cascaded MTCNNs are employed to identify the driver’s face, eyes, and mouth. Patches are extracted and resized to 112×112 (eyes, mouth) and 224×224 (head pose). To mitigate illumination variability, including glare, shadows, and occlusions caused by eyewear, Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied.
This step enhances critical features under challenging lighting. Optical flow is computed between consecutive frames to capture temporal dynamics, and both RGB patches and flow maps serve as input to two-stream 3D CNN-based architectures. In some studies, preprocessed clips are passed through a lightweight CNN to yield 1,024-dimensional representations prior to transformer-based attention modeling (Shen et al., 2020, Lakhani, 2022).
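A minimal preprocessing sketch along these lines is shown below, using OpenCV for CLAHE (applied here to the luminance channel) and dense Farneback optical flow; the facial patches are assumed to come from an upstream MTCNN detector, which is omitted:

```python
import cv2
import numpy as np

def enhance_patch(patch_bgr: np.ndarray, size: int = 112) -> np.ndarray:
    """Resize a facial patch and apply CLAHE to its luminance channel.

    `patch_bgr` is assumed to be an eye/mouth crop returned by an upstream
    face and landmark detector such as MTCNN (not shown here).
    """
    patch = cv2.resize(patch_bgr, (size, size))
    lab = cv2.cvtColor(patch, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

def optical_flow(prev_bgr: np.ndarray, curr_bgr: np.ndarray) -> np.ndarray:
    """Dense Farneback flow between consecutive frames, shape (H, W, 2)."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```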
4. Sampling, Data Partitioning, and Evaluation Metrics
NTHU-DDD supports both sparse and dense temporal sampling for clip extraction. For a 3-second window at 30 fps (90 frames), dense sampling keeps every 3rd frame (30 frames per clip), while sparse sampling keeps every 10th frame (roughly 10 per clip). Dense sampling reliably yields higher clip-level detection performance (e.g., 92.8% vs. 88.7% accuracy) and is standard in recent pipelines (Shen et al., 2020).
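The two schemes reduce to a choice of frame stride; the sketch below assumes 30 fps input and 0-based frame indexing:

```python
def clip_indices(start: int, window_s: float = 3.0, fps: int = 30,
                 dense: bool = True) -> list[int]:
    """Frame indices for one clip beginning at frame `start`.

    Dense sampling keeps every 3rd frame of the 3-second window;
    sparse sampling keeps every 10th frame.
    """
    stride = 3 if dense else 10
    window = int(window_s * fps)  # 90 frames for a 3 s window at 30 fps
    return list(range(start, start + window, stride))

# Dense sampling yields 30 indices per clip; sparse yields roughly 10,
# depending on how the window boundary is handled.
print(len(clip_indices(0, dense=True)), len(clip_indices(0, dense=False)))  # 30 9
```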
Dataset split strategies include:
- 18 subject/camera sets for training (80%), with held-out sets for validation and testing (Shen et al., 2020)
- Evaluation folds structured to avoid subject leakage (a subject-wise split sketch follows this list)
- No k-fold cross-validation reported in canonical usage
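A subject-wise split of the kind described above can be obtained with scikit-learn's GroupShuffleSplit; the clip and subject arrays below are placeholders, since the dataset's own folder layout defines the actual grouping:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder clip table: one entry per clip, grouped by subject ID so that
# no subject appears in both the training and held-out partitions.
clips = np.arange(100)                  # indices of 100 hypothetical clips
subjects = np.repeat(np.arange(20), 5)  # 20 subjects, 5 clips each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, heldout_idx = next(splitter.split(clips, groups=subjects))

# Subject-disjoint by construction.
assert set(subjects[train_idx]).isdisjoint(subjects[heldout_idx])
```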
Overall system performance is reported with standard classification metrics, chiefly accuracy and F1 score.
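For reference, the standard binary-classification definitions (general formulas, not quoted from the cited works) are, with TP, TN, FP, and FN denoting the usual confusion-matrix counts:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```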
Additional analysis includes per-scenario F1 scores, particularly to assess robustness across illumination and eyewear strata (Yu et al., 2019).
5. Benchmarking and Model Performance
NTHU-DDD underpins rigorous comparisons of spatiotemporal models for drowsiness detection. Notable methods and results include:
| Method | Accuracy (%) | Notes |
|---|---|---|
| MSTN (LSTM) | 85.52 | Spatiotemporal recurrent baseline |
| DDD (ConvCGRNN) | 84.81 | Convolutional GRU architecture |
| MCNN + LSTM | 90.05 | Multi-granularity, hybrid model |
| DB-LSTM | 93.60 | Dual-branch LSTM |
| Two-Stream Multi-Feature | 94.46 | Multi-patch 3D CNN + fusion, CLAHE |
Ablation studies reveal the effect of temporal modeling (+3.0 pp from optical flow), patch-wise feature fusion (+2.5 pp), CLAHE (+1.1 pp), and pre-training for individual sub-nets (+0.6 pp). Single-frame CNN baselines lag at 87.3% accuracy (Shen et al., 2020).
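The two-stream, patch-fusion design referenced in the table and ablations can be sketched as follows; this is a minimal PyTorch illustration of late fusion between an RGB-patch branch and an optical-flow branch, not a reproduction of the architecture in Shen et al. (2020):

```python
import torch
import torch.nn as nn

class Branch3D(nn.Module):
    """Small 3D CNN over a clip of patches: (B, C, T, H, W) -> feature vector."""
    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class TwoStreamDrowsiness(nn.Module):
    """Late fusion of an RGB branch (3 channels) and a flow branch (2 channels)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.rgb = Branch3D(in_channels=3)
        self.flow = Branch3D(in_channels=2)
        self.classifier = nn.Linear(2 * 128, num_classes)

    def forward(self, rgb_clip, flow_clip):
        fused = torch.cat([self.rgb(rgb_clip), self.flow(flow_clip)], dim=1)
        return self.classifier(fused)

# Example: a 30-frame clip of 112x112 patches and a matching flow stack
# (the last flow map is repeated to pad to 30 frames in this sketch).
model = TwoStreamDrowsiness()
logits = model(torch.randn(1, 3, 30, 112, 112), torch.randn(1, 2, 30, 112, 112))
```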
Transformer models (e.g., Video Swin Transformer), although conceptually potent for spatiotemporal reasoning, underperform on NTHU-DDD due to overfitting: training accuracy ~67%, test accuracy ~44%. Authors attribute these results to limited data, lack of augmentation, and excessive parameterization for dataset scale (Lakhani, 2022).
6. Applicability, Limitations, and Research Directions
NTHU-DDD’s strengths reside in its comprehensive, physiologically meaningful annotation scheme and broad coverage of challenging real-world and simulated driving conditions. Its detailed labeling of facial regions enables model architectures to leverage fine-grained behavioral cues, advancing the design of fusion pipelines that aggregate patch-wise spatiotemporal features.
Key limitations include constrained dataset size relative to the capacity of modern vision transformers, lack of side- or multi-view camera modalities, and the binary global label granularity (drowsy/non-drowsy) for most downstream tasks. Overfitting in high-capacity models and the need for aggressive augmentation or multi-dataset fusion are recurrent points in recent literature. Recommendations include utilization of advanced transformer designs, cross-dataset merging, and further expansion of both subject diversity and behavioral scenarios to support generalization (Lakhani, 2022, Shen et al., 2020).
7. Summary and Impact on Driver Monitoring Research
NTHU-DDD is established as a standard benchmark for driver drowsiness detection via computer vision, informing methodological advances in temporal action detection, attention-based fusion, and condition-adaptive representation learning. Its adoption has guided performance reporting across LSTM-based, CNN + optical flow, and recent attention-based architectures, making it instrumental in the progression toward robust, real-world deployable driver monitoring systems (Shen et al., 2020, Yu et al., 2019, Lakhani, 2022).