
NTHU-DDD: Driver Drowsiness Dataset

Updated 23 November 2025
  • The NTHU-DDD dataset is a publicly available video collection designed for benchmarking driver drowsiness detection under both alert and simulated drowsy driving scenarios.
  • The dataset includes multi-scenario video data with controlled variations in lighting, pose, and behavior, enabling rigorous evaluation of supervised and unsupervised learning methods.
  • It provides detailed frame-level and clip-level annotations, with preprocessing pipelines based on face detection and landmark extraction, supporting deep learning model evaluation for road safety.

The National Tsing Hua University Drowsy Driving Dataset (NTHU-DDD) is a publicly available video dataset designed to benchmark the detection of driver drowsiness under diverse, quasi-realistic driving conditions. It features frontal, in-vehicle camera streams capturing fine-grained variations in driver vigilance, pose, and behavior, systematically labeled for supervised and unsupervised machine learning scenarios. NTHU-DDD supports rigorous evaluation of spatial, temporal, and spatiotemporal deep learning models focused on active road safety and driver state monitoring.

1. Dataset Structure and Data Collection

NTHU-DDD comprises multi-scenario video data acquired from volunteer participants simulating both alert and drowsy driving sessions in controlled settings. Recording protocols span:

  • Participants: Up to 36 volunteer drivers of varied demographics; in standard usage, canonical splits reserve 18 subjects for training/validation and 4–20 subjects for held-out testing (Tüfekci et al., 2022, Lyu et al., 2018).
  • Session Scenarios: Each participant is filmed under five primary conditions: daytime with and without glasses, nighttime (near-infrared) with and without glasses, and sunglasses. Scenarios cover both normal (“non-drowsy”) driving and simulated drowsy states (actors are instructed to feign fatigue via yawning, slow blinking, and head-nodding); a minimal session-indexing sketch follows this list.
  • Camera Setup: All sessions employ a single dashboard- or windshield-mounted camera, typically recording at 30 frames per second. Both visible-light and infrared (for night) modalities are present, but many studies use only one (e.g., only RGB or only IR) (Tüfekci et al., 2022, Shen et al., 2020).
  • Data Scale: 9.5–10.5 hours of annotated footage (for the 18- or 36-subject settings). Original data are segmented into temporally dense, overlapping fixed-length clips (e.g., 30 or 48 frames per clip), yielding tens of thousands of samples (Lakhani, 2022).
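
For concreteness, the sketch below enumerates the kind of session index implied by the structure above. The scenario strings, state names, and subject numbering are illustrative assumptions rather than the dataset's official naming; only the five-condition, two-state layout is taken from the description.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative scenario and state names; the official NTHU-DDD folder and
# label naming may differ, so treat these strings as placeholders.
SCENARIOS = ["day_bareface", "day_glasses", "sunglasses",
             "night_bareface", "night_glasses"]
STATES = ["nondrowsy", "drowsy"]   # drowsy behaviour is actor-simulated
FPS = 30                           # nominal recording rate

@dataclass(frozen=True)
class Session:
    subject_id: int   # e.g. 1..18 for the canonical training subjects
    scenario: str     # one of SCENARIOS
    state: str        # one of STATES

def build_session_index(subject_ids):
    """Enumerate one record per (subject, scenario, state) recording."""
    return [Session(s, sc, st)
            for s, sc, st in product(subject_ids, SCENARIOS, STATES)]

if __name__ == "__main__":
    index = build_session_index(range(1, 19))   # 18 training subjects
    print(len(index), "sessions")               # 18 * 5 * 2 = 180
```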

2. Annotation Scheme and Class Definitions

NTHU-DDD provides dense frame- and clip-level annotation to support both classification and temporal localization tasks:

  • Primary Class Labels: The canonical two-class split is drowsy (1) vs. non-drowsy (0) at the frame or clip level. Drowsy behaviors are acted rather than verified by physiological sensors such as EEG; labels are assigned from observable cues such as eyelid closure, blink frequency, and head pose (Lakhani, 2022).
  • Auxiliary Labels: Per-frame annotations cover eye status (normal/sleepy), head pose (normal/nodding/looking-aside), and mouth action (normal/yawn/talking). This enables fine-grained, action- or feature-level inference (Lyu et al., 2018, Shen et al., 2020).
  • Labeling Protocol: The documented works emphasize manual annotation for reliability; clip-level labels are derived by aggregating frame labels against a tunable threshold, e.g., a 48-frame clip counts as normal when at least half of its frames are normal (Tüfekci et al., 2022). A minimal aggregation sketch follows this list.
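
A minimal sketch of the clip-level aggregation described above, assuming binary per-frame labels (0 = non-drowsy, 1 = drowsy) and the 48-frame window quoted from (Tüfekci et al., 2022); the window length, stride, and normal-fraction threshold are tunable parameters rather than fixed dataset conventions.

```python
import numpy as np

def aggregate_clip_labels(frame_labels, window=48, stride=48, normal_ratio=0.5):
    """Collapse per-frame labels (0 = non-drowsy, 1 = drowsy) into clip labels.

    A clip is labelled non-drowsy (0) when at least `normal_ratio` of its
    frames are normal, otherwise drowsy (1). The threshold is a free parameter.
    """
    frame_labels = np.asarray(frame_labels)
    clip_labels = []
    for start in range(0, len(frame_labels) - window + 1, stride):
        clip = frame_labels[start:start + window]
        frac_normal = np.mean(clip == 0)
        clip_labels.append(0 if frac_normal >= normal_ratio else 1)
    return np.array(clip_labels)

# Example: a 96-frame run that is alert for the first half, drowsy afterwards.
labels = np.array([0] * 48 + [1] * 48)
print(aggregate_clip_labels(labels))   # -> [0 1]
```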

3. Data Organization, Splitting, and Preprocessing

The dataset is partitioned to facilitate both within- and subject-independent evaluation protocols:

  • Standard Splits: Core usage involves subject-independent partitioning, for instance 12/6/4 drivers for train/validation/test with no subject overlap, enabling strict subject independence in evaluation (Tüfekci et al., 2022). Other studies employ simple random clip-level splits without subject holdout, which increases overfitting risk (Lakhani, 2022).
  • Clip Extraction: Uniformly windowed temporal sampling is applied (e.g., 30 or 48 consecutive frames per clip, a stride of 23, and sampling every second or third frame for temporal redundancy control) (Lakhani, 2022, Tüfekci et al., 2022); a combined splitting, windowing, and preprocessing sketch follows the variant table below.
  • Preprocessing Steps: Face detection (MTCNN or OpenCV), facial landmark extraction, and patch cropping (global face, eyes, mouth, etc.). Image normalization includes resizing (224×224 or 64×64), per-pixel or per-channel normalization, and optional contrast enhancement via CLAHE (Tüfekci et al., 2022, Shen et al., 2020, Lyu et al., 2018).
  • Augmentation: Many published works apply no augmentation (rotation, jitter, flipping); its absence is explicitly cited as a limitation and a direction for future work (Lakhani, 2022).

| Variant  | # Subjects | Recording Type  | Frame Rate | Typical Split         |
|----------|------------|-----------------|------------|-----------------------|
| Standard | 18         | RGB (day/night) | 30 fps     | 14/1/3 train/val/test |
| Extended | 36         | IR (day/night)  | 30 fps     | 12/6/4 train/val/test |
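
A combined sketch of the protocol above: a subject-independent split, overlapping clip windows with temporal subsampling, and a CLAHE-plus-resize preprocessing step via OpenCV. The split sizes, window length, stride, and frame step mirror the examples quoted above; the face bounding box is assumed to come from an external detector (MTCNN or an OpenCV cascade), which is not reimplemented here.

```python
import numpy as np
import cv2

def subject_independent_split(subject_ids, n_train=12, n_val=6, seed=0):
    """Partition drivers so no subject appears in more than one split."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(list(subject_ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def clip_windows(n_frames, window=48, stride=23, frame_step=2):
    """Yield per-clip frame indices: overlapping windows, subsampled in time."""
    for start in range(0, n_frames - window * frame_step + 1, stride):
        yield list(range(start, start + window * frame_step, frame_step))

def preprocess_frame(frame_bgr, bbox, size=224, use_clahe=True):
    """Crop the detected face region, optionally apply CLAHE, and resize.

    `bbox` = (x, y, w, h) is assumed to come from an external face detector
    (e.g., MTCNN or an OpenCV cascade), which is not included here.
    """
    x, y, w, h = bbox
    face = frame_bgr[y:y + h, x:x + w]
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    if use_clahe:
        gray = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)
    face = cv2.resize(gray, (size, size))
    return face.astype(np.float32) / 255.0   # per-pixel normalisation to [0, 1]

if __name__ == "__main__":
    train, val, test = subject_independent_split(range(1, 23))
    print(len(train), len(val), len(test))              # 12 / 6 / 4 drivers
    print(sum(1 for _ in clip_windows(n_frames=900)))   # clips in a 30 s video at 30 fps
    frame = np.zeros((480, 640, 3), dtype=np.uint8)     # stand-in for a decoded video frame
    crop = preprocess_frame(frame, bbox=(200, 100, 160, 160))
    print(crop.shape, crop.dtype)                       # (224, 224) float32
```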

4. Model Usage and Evaluation Protocols

NTHU-DDD is employed to benchmark a range of architectures, from CNN-LSTM hybrids to spatiotemporal transformers and unsupervised anomaly-detection pipelines:

  • Framewise and Clipwise Training: Models operate either on individual frames (CNN, MCNN) or short temporal clips (LSTM, transformers, 3D-CNNs). Sequence models take as input CNN-based feature maps or raw image crops (Lyu et al., 2018, Lakhani, 2022).
  • Patch/Feature Granularity: Multi-feature networks extract separate patches for eyes, mouth, and head (in various spatial resolutions), supporting parallel feature streams (RGB and optical flow) and patch-level attention (Shen et al., 2020).
  • Transformer Architectures: Spatiotemporal attention models (e.g., the Video Swin Transformer) employ tubelet embeddings, dividing input clips into 2×4×4 patches over 3 channels that are embedded as 96-dimensional tokens for shifted-window self-attention. No temporal down-sampling is applied prior to attention (Lakhani, 2022).
  • Anomaly Detection: LSTM autoencoders trained only on normal-class clips reconstruct the input sequence; clips with elevated reconstruction loss are classified as drowsy. Confidence thresholds are tunable per evaluation (Tüfekci et al., 2022).
  • Metrics: Principal metrics include overall classification accuracy, ROC AUC for detection, and runtime throughput in frames per second (fps). Loss functions are cross-entropy for discriminative models and L2 reconstruction error for autoencoder-based solutions; a scoring-and-metrics sketch follows the table below.

| Method            | Feature Model      | Temporal Model | Accuracy / AUC  |
|-------------------|--------------------|----------------|-----------------|
| MCNN + LSTM       | Multi-patch CNN    | 3-layer LSTM   | 90.05%          |
| Two-Stream        | RGB + Optical Flow | 3-net fusion   | 94.46%          |
| Video Transformer | CNN → Transformer  | Spatiotemporal | 44% (test acc.) |
| LSTM Autoencoder  | ResNet-34 features | LSTM-AE        | AUC 0.8740      |
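
To make the anomaly-detection protocol and the reported metrics concrete, the sketch below scores clips by L2 reconstruction error and computes ROC AUC plus thresholded accuracy. The trained autoencoder is abstracted behind a `reconstruct` callable and the feature dimensions are invented for the toy example; only the scoring and evaluation logic reflects the description above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def anomaly_scores(clips, reconstruct):
    """Mean L2 reconstruction error per clip; higher means more anomalous.

    `reconstruct` stands in for a trained autoencoder (e.g., an LSTM-AE over
    per-frame CNN features) and is an assumed callable, not defined here.
    """
    return np.array([np.mean((c - reconstruct(c)) ** 2) for c in clips])

def evaluate(scores, labels, threshold):
    """labels: 1 = drowsy (anomalous), 0 = non-drowsy (normal)."""
    auc = roc_auc_score(labels, scores)
    acc = float(np.mean((scores >= threshold).astype(int) == labels))
    return auc, acc

# Toy example: a model trained only on normal clips reconstructs them well and
# drowsy clips poorly; that gap is faked here with two scaling factors purely
# to make the scores separable.
rng = np.random.default_rng(0)
normal = [rng.normal(size=(48, 128)) for _ in range(10)]   # 48 frames x 128-d features
drowsy = [rng.normal(size=(48, 128)) for _ in range(10)]
scores = np.concatenate([anomaly_scores(normal, lambda c: 0.9 * c),
                         anomaly_scores(drowsy, lambda c: 0.3 * c)])
labels = np.array([0] * 10 + [1] * 10)
auc, acc = evaluate(scores, labels, threshold=float(np.median(scores)))
print(f"AUC={auc:.3f}  accuracy={acc:.3f}")
```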

5. Reported Results and Performance Benchmarks

NTHU-DDD has enabled state-of-the-art benchmarks in both supervised and unsupervised drowsiness detection:

  • Best-Case Supervised Performance: MCNN+LSTM achieves 90.05% frame-level accuracy, and a multi-feature two-stream network reports 94.46% accuracy on the official evaluation set (four-subject holdout), with throughput of 37–60 fps (Lyu et al., 2018, Shen et al., 2020).
  • Transformers: Vision transformer models achieve as high as 67% training accuracy but only 44% on the held-out test set, revealing significant overfitting attributable to the absence of subject-independent splits and data augmentation (Lakhani, 2022).
  • Anomaly Detection: Unsupervised LSTM autoencoders, using strictly subject-independent splits, report a ROC AUC of 0.8740 and 81.6% accuracy at a 50/50 normal-anomalous clip split (Tüfekci et al., 2022).
  • Impact of Preprocessing: Contrast enhancement (via CLAHE) and feature patch extraction substantially mitigate the impact of varying lighting and occlusion (e.g., sunglasses), and are shown to improve model AUC by over 5% (Tüfekci et al., 2022, Shen et al., 2020).

6. Limitations, Best Practices, and Recommendations

Multiple studies identify and address current dataset and protocol constraints:

  • Simulated vs. Real Drowsiness: All drowsy states are actor-simulated; there is no EEG, PERCLOS, or other physiological ground truth. This limits the applicability of conclusions to genuine, real-world drowsiness.
  • Insufficient Data Scale: The dataset size (9.5–10.5 hours, ≤ 36 subjects) is suboptimal for training high-capacity transformers and deep sequence models without overfitting (Lakhani, 2022).
  • Train/Test Leakage: Clip-level splitting (without holding out subjects) enables memorization of subject-specific facial features; strict subject-independent evaluation is encouraged (Tüfekci et al., 2022, Lakhani, 2022).
  • Absence of Augmentation: Limited augmentation reduces the diversity of training data, impeding generalization to unseen settings (Lakhani, 2022).
  • Label Ambiguity: The original frame-level labels reflect long-term drowsiness context rather than instantaneous state, which can introduce temporal imprecision around transition points. Alternative datasets (e.g., FI-DDD) employ instant labeling with tighter temporal thresholds (Lyu et al., 2018).

Suggested improvements include increasing the number of subjects (≥ 50), recording genuine drowsiness under more diverse road conditions, collecting aligned physiological ground truth, and merging with other open datasets to cover additional risk behaviors such as drunk or aggressive driving.

7. Research Impact and Use Cases

NTHU-DDD has established itself as a primary benchmark for multimodal and multigranularity driver state detection research. Key research outcomes include:

  • Deep Feature Development: Encouraged architectural advances in multi-patch encoding, temporal-action recognition, and spatiotemporal self-attention approaches for behavior detection under adverse conditions.
  • Temporal Context Modeling: Provided testbeds for long-horizon memory networks (LSTM, transformers), optical flow fusion, and anomaly detection models addressing the temporal ambiguity inherent to driver state transitions.
  • Standardized Evaluation: Subject-independent splits with per-feature annotation facilitate rigorous, reproducible comparison across methods and encourage improved reporting on generalization.
  • Broader Relevance: The dataset’s structure and benchmark protocols have influenced the design and evaluation criteria in both academic and industrial driver monitoring and automotive safety research. A plausible implication is that, by addressing current dataset limitations, next-generation driver state corpora will further bridge the gap between laboratory accuracy and real-world reliability.

References:

  • Lakhani, 2022
  • Lyu et al., 2018
  • Shen et al., 2020
  • Tüfekci et al., 2022
