NTHU-DDD: Driver Drowsiness Detection Dataset
- NTHU-DDD is a comprehensive dataset comprising controlled RGB video and EEG recordings that capture behavioral and physiological signs of driver drowsiness.
- It provides multi-modal annotations including binary drowsiness labels, head, eye, and mouth cues, alongside EEG vigilance markers for robust analysis.
- The dataset underpins the development of real-time detection systems by supporting diverse environmental conditions, occlusions, and temporal labeling protocols.
The NTHU Driver Drowsiness Detection Dataset (NTHU-DDD) is a suite of landmark datasets developed by National Tsing Hua University's Computer Vision Lab (Taiwan) for the development and benchmarking of driver drowsiness detection algorithms using computer vision and, in a distinct release, EEG recordings. Widely used in the research community, NTHU-DDD provides controlled laboratory video data and, in a complementary dataset, high-density EEG data capturing both behavioral and physiological signatures of fatigue and vigilance declines during simulated driving. The video-based variants have become a de facto benchmark for spatiotemporal modeling of driver facial dynamics under a range of challenging conditions and occlusions, forming the basis of comparative evaluations for numerous state-of-the-art deep learning frameworks.
1. Dataset Design and Variants
NTHU-DDD encompasses several releases differing in sensor modality, scope, and protocols:
- RGB Video Dataset: The main NTHU-DDD video corpus, as described in multiple sources (Lakhani, 2022, Yu et al., 2019, Lyu et al., 2018, Zaman et al., 16 Nov 2025, Tüfekci et al., 2022, Shen et al., 2020), consists of video recordings of subjects engaged in simulated driving under controlled laboratory conditions with staged drowsy and non-drowsy behaviors and a wide range of environmental and occlusion scenarios.
- EEG Dataset: An independent but related dataset provides 32-channel scalp electroencephalography (EEG) recordings from 27 subjects during a 90-minute sustained-attention highway driving simulation, with precisely timestamped behavioral markers of vigilance (Cao et al., 2018).
Video Dataset Collection Protocol
- Data from 18–36 subjects (number depending on dataset release and experiment) of diverse ethnic backgrounds. Gender/age breakdowns are variably specified—e.g., (Lyu et al., 2018) reports 10M/8F in training, 2M/2F in evaluation.
- Each subject controls a simulated steering setup (steering wheel, pedals) in a PC-based driving game, seated in a laboratory environment.
- Recording uses a dashboard-mounted, frontal RGB (and sometimes infrared) camera; original releases report a resolution of 640×480 (when specified) at 30 fps (Yu et al., 2019).
- Environmental and occlusion factors are varied: illumination (day/night), eyewear (none, glasses, sunglasses), and background (lab wall). Each subject is recorded under all combinations.
- Subjects are instructed to enact normal/alert and drowsy states, including specific facial cues: normal driving, slow blinking, yawning, deliberate head nodding (simulated micro-sleep), and laughter (negative control).
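The scenario grid implied by this protocol can be enumerated programmatically. The sketch below is illustrative: the axis names are assumptions, and the exclusion of the night/sunglasses combination is inferred from the 5-scenario count reported in the split table later in this article, not stated explicitly here.

```python
from itertools import product

# Scenario axes from the collection protocol; naming is illustrative.
# Assumption: the night/sunglasses combination is absent, which matches
# the "5 scenarios" figure in the official split table.
ILLUMINATION = ("day", "night")
EYEWEAR = ("none", "glasses", "sunglasses")

scenarios = [
    (il, ew)
    for il, ew in product(ILLUMINATION, EYEWEAR)
    if not (il == "night" and ew == "sunglasses")
]

print(scenarios)
print(len(scenarios))  # 5 scenario conditions per subject
```

Each subject is then recorded enacting the scripted behaviors under every one of these conditions.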
EEG Dataset Protocol
- Participants operate a motion-platform driving simulator in near-360° VR for 90 continuous minutes of night-time driving. Periodic lane-departure perturbations trigger behavioral corrections, all time-locked to EEG and response events (Cao et al., 2018).
- Inclusion criteria: 22–28 years, healthy, free of neurological/psychiatric disorders, normal sleep regimen.
2. Ground-Truth Annotations and Labeling Structure
Video Dataset
Labels and Channels
- Binary labels: "drowsy" vs. "non-drowsy" at the frame or image level (Lakhani, 2022, Zaman et al., 16 Nov 2025).
- Additional annotation channels: head pose (normal/look aside/nodding), mouth state (normal/yawn/talk-laugh), eye state (normal/sleepy) (Yu et al., 2019, Lyu et al., 2018).
- Per-frame class labels are manually annotated, sometimes leveraging "long-term memory"—i.e., labels for drowsy states may smear across several seconds of video (Lyu et al., 2018), leading to temporal imprecision.
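The annotation channels above can be sketched as a small record type; the field names and value sets here are illustrative assumptions, not the official label schema.

```python
from dataclasses import dataclass

# Hypothetical value sets for the per-frame annotation channels
# described above (head pose, mouth state, eye state, binary drowsiness).
HEAD_STATES = ("normal", "look_aside", "nodding")
MOUTH_STATES = ("normal", "yawn", "talk_laugh")
EYE_STATES = ("normal", "sleepy")

@dataclass(frozen=True)
class FrameAnnotation:
    drowsy: bool  # binary frame-level drowsiness label
    head: str     # one of HEAD_STATES
    mouth: str    # one of MOUTH_STATES
    eye: str      # one of EYE_STATES

    def __post_init__(self):
        assert self.head in HEAD_STATES
        assert self.mouth in MOUTH_STATES
        assert self.eye in EYE_STATES

ann = FrameAnnotation(drowsy=True, head="nodding", mouth="normal", eye="sleepy")
print(ann.drowsy)  # True
```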
Clip-Level Label Aggregation
- Many protocols assign labels to temporal clips using majority voting. E.g., a 5-frame clip is "drowsy" if at least 3/5 frames are labeled drowsy (Yu et al., 2019):
  Y = 1 if y₁ + y₂ + y₃ + y₄ + y₅ ≥ 3, else Y = 0, where yᵢ ∈ {0, 1} is the label of frame i and Y is the clip-level label.
- Not all releases document precise annotation protocols; details such as PERCLOS thresholds or rater instructions are typically unspecified (Lakhani, 2022, Zaman et al., 16 Nov 2025).
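The majority-voting aggregation described above can be sketched as follows; the clip length and threshold follow the 5-frame/3-vote example, and the function names are hypothetical.

```python
def clip_label(frame_labels, threshold=3):
    """Majority-vote clip label: drowsy (1) iff at least `threshold`
    of the frame-level labels are drowsy. A minimal sketch of the
    aggregation rule, not the dataset's official tooling."""
    if not frame_labels:
        raise ValueError("empty clip")
    return int(sum(frame_labels) >= threshold)

def sliding_clips(labels, clip_len=5, stride=1):
    """Yield clip-level labels over a sliding window of frame labels."""
    for start in range(0, len(labels) - clip_len + 1, stride):
        yield clip_label(labels[start:start + clip_len])

frames = [0, 1, 1, 0, 1, 1, 1, 0]
print(list(sliding_clips(frames)))  # [1, 1, 1, 1]
```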
EEG Dataset
- Automatic trial/event segmentation by behavioral triggers: deviation onset, response onset, response offset.
- Vigilance ground truth is inferred behaviorally via reaction time (RT) to lane departure, defined as the interval from deviation onset to response onset.
  Extended RT is used as a proxy for driver inattention or drowsiness (Cao et al., 2018).
- No concurrent subjective drowsiness ratings; EEG-derived vigilance indices are calculated post hoc (e.g., theta/alpha ratio).
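A minimal sketch of RT-based vigilance labeling from the timestamped deviation/response events described above. The baseline-multiple threshold and its factor (2.5) are illustrative conventions, not the dataset's documented rule.

```python
# Assumed inputs: per-trial deviation-onset and response-onset timestamps
# (seconds). Trials with RT far above an "alert" baseline are treated as
# low-vigilance; the 2.5x multiplier is an illustrative choice.
def reaction_times(deviation_onsets, response_onsets):
    """RT per trial: response onset minus deviation onset (seconds)."""
    return [r - d for d, r in zip(deviation_onsets, response_onsets)]

def vigilance_labels(rts, alert_rt, factor=2.5):
    """Label a trial 'drowsy' when its RT exceeds factor x alert baseline."""
    return ["drowsy" if rt > factor * alert_rt else "alert" for rt in rts]

dev = [10.0, 42.0, 95.0]
resp = [10.6, 43.1, 98.4]
rts = reaction_times(dev, resp)
print(vigilance_labels(rts, alert_rt=0.6))  # ['alert', 'alert', 'drowsy']
```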
3. Data Organization, Splits, and Quantitative Characteristics
Video Corpus Organization
| Attribute | Details |
|---|---|
| Training subjects | 18 |
| Evaluation subjects | 4 |
| Test subjects | 14 (labels not public) |
| Scenarios | Day/night × No/Glasses/Sunglasses |
| Videos/train | 360 (18×5 scenarios×4 videos) |
| Videos/eval | 20 (4×5×1) |
| Average frames/video | ~2000 (train), ~8600 (eval) |
| Total frames/train | 722,223 |
| Format | AVI, 640×480, 30 fps (when specified) |
| Per-frame labels | Drowsy, head, eye, mouth, illum/glass |
| Data split protocol | Official train/eval/test structure |
This summary table is drawn from (Yu et al., 2019, Lyu et al., 2018, Shen et al., 2020, Tüfekci et al., 2022). Where not specified, listed attributes are drawn from the closest available public description.
EEG Corpus Organization
- 27 subjects, 62 × 90-min sessions, 32 EEG channels + vehicle position, sample rate 500 Hz (Cao et al., 2018).
- File formats: raw .cnt, preprocessed .set (EEGLAB), separate event logs.
4. Preprocessing and Modeling Pipelines
Image-Based Approaches
- Typical steps: face detection/tracking (OpenCV, MTCNN), landmark alignment (Ren et al.), multi-granularity facial patch extraction (eyes, mouth, global face), resizing/normalization (Lyu et al., 2018, Shen et al., 2020).
- Illumination normalization: CLAHE applied to each patch to address lighting variation (Shen et al., 2020, Tüfekci et al., 2022).
- Clip generation: Sliding windows (e.g., 5/30/48 frames); majority voting for ground-truth label.
- Data augmentation: Some studies apply random flips, per-channel normalization; others use raw RGB with minimal augmentation (Lakhani, 2022, Zaman et al., 16 Nov 2025).
- For temporal modeling: LSTM/GRU sequencers, 3D CNNs, and transformer-based methods process per-frame embeddings (Lyu et al., 2018, Shen et al., 2020, Lakhani, 2022).
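The image path above can be sketched end-to-end. Global histogram equalization stands in for CLAHE here (real pipelines typically use OpenCV's CLAHE); the crop box, shapes, and clip length are illustrative assumptions.

```python
import numpy as np

# Sketch of the per-frame preprocessing path: crop a facial patch, apply
# histogram equalization (a simplified stand-in for CLAHE), normalize to
# [0, 1], and stack consecutive frames into clips for temporal models.

def equalize(patch):
    """Global histogram equalization on a uint8 grayscale patch."""
    hist = np.bincount(patch.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1)
    return (cdf[patch] * 255).astype(np.uint8)

def preprocess_frame(frame, box):
    """Crop an (x, y, w, h) patch, equalize, scale to [0, 1]."""
    x, y, w, h = box
    patch = equalize(frame[y:y + h, x:x + w])
    return patch.astype(np.float32) / 255.0

def make_clips(frames, box, clip_len=5):
    """Stack preprocessed frames into (N, clip_len, H, W) clips."""
    proc = [preprocess_frame(f, box) for f in frames]
    return np.stack([proc[i:i + clip_len]
                     for i in range(len(proc) - clip_len + 1)])

rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(8, 480, 640), dtype=np.uint8)
clips = make_clips(video, box=(200, 100, 128, 128))
print(clips.shape)  # (4, 5, 128, 128)
```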
EEG-Based Approaches
- Zero-phase FIR bandpass filtering (1–50 Hz), down-sampling, ICA-based artifact rejection, epoch extraction time-locked to events (Cao et al., 2018).
- Feature extraction: alpha/beta/theta band-power, spectral ratios, and vigilance indices.
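The spectral feature step can be sketched with a plain periodogram; the band edges follow common EEG convention, and the filtering/ICA stages are omitted for brevity.

```python
import numpy as np

# Band power via a numpy-rFFT periodogram and a theta/alpha ratio, one of
# the vigilance indices mentioned above. Band edges are conventional
# (theta 4-8 Hz, alpha 8-13 Hz), not taken from the dataset descriptor.

FS = 500  # sampling rate (Hz), per the corpus description

def band_power(signal, fs, lo, hi):
    """Mean periodogram power of `signal` in the [lo, hi) Hz band."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    mask = (freqs >= lo) & (freqs < hi)
    return psd[mask].mean()

def theta_alpha_ratio(signal, fs=FS):
    """Theta over alpha power, a common drowsiness index."""
    return band_power(signal, fs, 4, 8) / band_power(signal, fs, 8, 13)

# Synthetic 10 s epoch: strong 6 Hz (theta) plus weaker 10 Hz (alpha).
t = np.arange(0, 10, 1.0 / FS)
epoch = 2.0 * np.sin(2 * np.pi * 6 * t) + 0.5 * np.sin(2 * np.pi * 10 * t)
print(theta_alpha_ratio(epoch) > 1.0)  # theta dominates -> True
```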
5. Benchmark Results and Comparative Analysis
Reported Results on NTHU-DDD (Video)
| Method/Model | Accuracy / AUC | Notable Preprocessing/Architecture | Citation |
|---|---|---|---|
| Multi-granularity CNN+LSTM (MCNN) | 90.05% | Multi-scale facial patches + LSTM | (Lyu et al., 2018) |
| Two-stream Multi-feature Attention | 94.46% | CLAHE + per-patch 3D CNNs + SE Attention | (Shen et al., 2020) |
| DCNN + OpenCV | 99.6% | 96×96 frame resize, raw face, no alignment | (Zaman et al., 16 Nov 2025) |
| LSTM Autoencoder (AUC) | 0.8740 | ResNet-34 + CLAHE + sequence anomaly detection | (Tüfekci et al., 2022) |
| Vision Transformer (Swin, Video) | 44% | CNN feature projection + Swin, no augmentation | (Lakhani, 2022) |
- Notably, higher model capacity or novel architectures (e.g., transformers) do not guarantee improved accuracy; overfitting can occur when data volume is limited, as observed for the Swin Transformer on the drowsiness subtask (Lakhani, 2022).
- Methodological variations in face alignment, patch extraction, and temporal context modeling are decisive for robustness under occlusion and illumination variation (Shen et al., 2020, Lyu et al., 2018).
Reported Results on NTHU-DDD (EEG)
- Studies using the EEG corpus focus on classification/regression of vigilance states, using event-related potentials, spectral powers, and cross-channel coherence; standardized accuracy metrics are less commonly reported (Cao et al., 2018).
6. Known Limitations, Variability, and Best Practices
- Labeling coarseness: Long-term memory annotation introduces temporal lag in labels, hindering evaluation of rapid state changes (e.g., blinking vs. prolonged closure) (Lyu et al., 2018).
- Limited demography/documentation: Most releases report minimal participant demographic data; no detailed camera lens/FOV/frame-rate metadata (Lakhani, 2022, Zaman et al., 16 Nov 2025, Yu et al., 2019).
- Protocol variance: The number of subjects, amount of footage per condition, and the number and types of annotated behaviors vary between publications using "NTHU-DDD." Researchers should therefore specify the exact release and split protocol adopted in benchmarking to ensure comparability.
- Realism of drowsiness cues: Drowsiness is simulated, not spontaneous, and labels depend on staged behaviors, potentially limiting ecological validity (Zaman et al., 16 Nov 2025, Lakhani, 2022).
- Evaluation protocol: Absence of an official cross-validation split for most releases; subject-independent splits are strongly advised to prevent overestimating generalization (Zaman et al., 16 Nov 2025).
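A subject-independent split as recommended above can be sketched in a few lines; the subject IDs and holdout fraction here are illustrative.

```python
import random

# Hold out whole subjects so that no person appears in both train and
# test, avoiding identity leakage that inflates accuracy estimates.

def subject_split(samples, test_fraction=0.2, seed=0):
    """samples: list of (subject_id, item). Returns (train, test)
    with disjoint subject sets."""
    subjects = sorted({sid for sid, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_ids = set(subjects[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test

# 10 hypothetical subjects, 3 clips each.
data = [(sid, f"clip_{sid}_{i}") for sid in range(10) for i in range(3)]
train, test = subject_split(data)
print(len(train), len(test))  # 24 6
```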
7. Applications and Impact in Drowsiness Detection Research
NTHU-DDD's multi-condition, multi-attribute structure has enabled:
- Supervised and semi-supervised classification of driver drowsiness, driving extensive progress in spatial–temporal facial analysis (Shen et al., 2020, Lyu et al., 2018, Tüfekci et al., 2022).
- Analysis of robustness to illumination and occlusions via per-patch and attention-based architectures (Shen et al., 2020).
- Benchmarking anomaly-detection frameworks in which "normal"-only training regimes flag drowsiness as a distributional outlier (Tüfekci et al., 2022).
- Development and validation of low-latency, real-time drowsiness detection systems suitable for automotive embedded platforms (Zaman et al., 16 Nov 2025).
- In the EEG corpus, data-driven modeling of cognitive fatigue, online vigilance tracking, and neuroergonomics (Cao et al., 2018).
Researchers are advised to consult the specific data descriptor and publication that matches their targeted modality and evaluation protocol, given the diversity of annotation pipelines within the broader NTHU-DDD family. Continued harmonization of ground-truth assignment, transparency in protocol reporting, and release of more naturalistic or spontaneous drowsiness footage are open areas for dataset development.