NTHU Driver Drowsiness Detection Dataset

Updated 19 March 2026
  • NTHU-DDD is a comprehensive video dataset with precise frame and clip-level annotations for driver drowsiness detection in simulated environments.
  • It features over 895K frames collected across diverse scenarios with detailed labels for driver state, head pose, and scene context.
  • The dataset serves as a benchmark for deep learning models, enabling robust evaluations with reported accuracies up to 99.6% using advanced techniques.

The NTHU Driver Drowsiness Detection Dataset (NTHU-DDD) is a large-scale, annotated video corpus curated to enable the development and evaluation of computer vision systems for detecting drowsiness in drivers. Collected by the Computer Vision Laboratory at National Tsing Hua University, it is widely used in academic research on fatigue detection, particularly leveraging deep learning frameworks. NTHU-DDD is characterized by controlled acquisition conditions, rich metadata regarding driver state and scene context, and precise ground-truth labeling at both frame and clip levels, making it a benchmark for visual drowsiness detection approaches (Yu et al., 2019, Shen et al., 2020, Zaman et al., 16 Nov 2025).

1. Dataset Composition

NTHU-DDD comprises video sequences of 36 volunteer drivers of mixed ethnic backgrounds. Partitioning is performed at the subject level: 18 drivers form the training set, 4 the evaluation set, and 14 the test set, which is unreleased and therefore omitted from some published work (Yu et al., 2019). In total, the training set contains 360 videos (18 subjects × 5 scenarios × 4 clips per scenario), amounting to 722,223 frames, or roughly 6.7 hours of footage at 30 fps. The evaluation set provides 20 videos (4 subjects × 5 scenarios × 1 video per scenario) and 173,259 frames (about 1.6 hours at 30 fps). Each clip is estimated to last between 30 and 40 seconds. Another prominent study instead uses 66,521 frames, with a class balance of 36,030 labeled as “drowsy” and 30,491 as “not drowsy” (Zaman et al., 16 Nov 2025).
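
As a quick sanity check, the published video counts, frame totals, and durations quoted above are mutually consistent; the arithmetic below reproduces them (a minimal Python sketch, assuming the 30 fps frame rate stated in the text):

```python
# Reproduce the published NTHU-DDD split sizes and durations (assumes 30 fps).
FPS = 30

train_videos = 18 * 5 * 4   # 18 subjects x 5 scenarios x 4 clips  -> 360 videos
eval_videos = 4 * 5 * 1     # 4 subjects x 5 scenarios x 1 video   -> 20 videos

train_frames, eval_frames = 722_223, 173_259
total_frames = train_frames + eval_frames            # 895,482 frames

train_hours = train_frames / FPS / 3600              # ~6.7 hours
eval_hours = eval_frames / FPS / 3600                # ~1.6 hours

print(train_videos, eval_videos, total_frames)
print(f"train ~ {train_hours:.1f} h, eval ~ {eval_hours:.1f} h")
```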

2. Recording Protocol and Scenarios

Acquisitions were performed in an indoor driving simulator in which the subject sat in a car seat and operated a steering wheel and pedal controls. Video was recorded in both day and night sessions using a fixed-position RGB camera at 640×480 pixels (AVI format). Nighttime sessions additionally used active infrared (IR) illumination, although the resolution and frame rate of the IR footage are not specified. Subjects were exposed to five environmental scenarios defined by combinations of illumination (day/night) and eyewear (bare face, regular glasses, sunglasses):

  1. Daytime, bare face
  2. Daytime, regular glasses
  3. Daytime, sunglasses
  4. Nighttime, bare face
  5. Nighttime, glasses (under IR illumination) (Yu et al., 2019)

Within each scenario, subjects performed a range of behaviors:

  • Normal facial expression (neutral)
  • Talking/laughing
  • Yawning
  • Slow blink rate
  • Head nodding
  • Looking to left/right
  • Falling asleep

This controlled variation provides a balanced set of challenging cases for algorithmic analysis.

3. Annotation Framework and Label Taxonomy

Ground-truth annotation is specified on a per-frame basis at 1 fps. Each frame receives a binary “drowsy” (e.g., eyes closed, yawning, nodding) or “non-drowsy” (alert) label. Scene-condition labels are encoded as one-hot vectors, adding contextual granularity:

  • Glasses & illumination: 5-way ([10000], [01000], [00100], [00010], [00001])
  • Head pose: 3-way ([100]: normal, [010]: look-aside, [001]: nodding)
  • Mouth activity: 3-way ([100]: normal, [010]: talking/laughing, [001]: yawning)
  • Eye state: 2-way ([10]: sleepy/closed, [01]: normal/open) (Yu et al., 2019)

For clip-level tasks, videos are segmented into non-overlapping sub-clips of T=5 consecutive frames, and the drowsiness class for each clip is assigned by majority vote (“temporal IoU > 50%”; at least 3/5 frames must agree). Labeling for auxiliary states (eye, mouth, head, scene) is also offered at the frame and clip level (Shen et al., 2020).
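
The label taxonomy and the clip-level majority vote are straightforward to express in code. The sketch below illustrates both, assuming a binary per-frame drowsiness label and the one-hot scene encoding described above; the scenario keys and function names are illustrative, not the dataset's official schema:

```python
import numpy as np

# Illustrative scene-condition codes matching the 5-way one-hot scheme above;
# the string keys are descriptive, not the dataset's official identifiers.
SCENE = {"day_bareface": 0, "day_glasses": 1, "day_sunglasses": 2,
         "night_bareface": 3, "night_glasses": 4}

def one_hot(index: int, size: int) -> np.ndarray:
    """Encode a categorical label as a one-hot vector, e.g. (2, 5) -> [0,0,1,0,0]."""
    v = np.zeros(size, dtype=np.int8)
    v[index] = 1
    return v

def clip_labels(frame_labels: list[int], clip_len: int = 5) -> list[int]:
    """Assign a binary drowsiness label to each non-overlapping clip of
    `clip_len` frames by majority vote (at least 3 of 5 frames must agree)."""
    out = []
    for start in range(0, len(frame_labels) - clip_len + 1, clip_len):
        window = frame_labels[start:start + clip_len]
        out.append(int(sum(window) > clip_len // 2))
    return out

print(one_hot(SCENE["day_sunglasses"], 5))           # [0 0 1 0 0]
print(clip_labels([1, 1, 0, 1, 0, 0, 0, 1, 0, 0]))   # [1, 0]
```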

4. Data Preprocessing and Augmentation

All frames destined for deep learning models undergo spatial resizing to standardized input shapes (224×224 for generic face detection tasks; other studies use 96×96, 112×112, or 224×224 for cropped patches). Bilinear interpolation is performed via OpenCV. For training, horizontal flipping and multi-scale pyramid Gaussian blurring are applied as augmentations. Evaluation and inference pipelines omit data augmentation (Yu et al., 2019, Zaman et al., 16 Nov 2025). Some works introduce illumination normalization using CLAHE (Contrast Limited Adaptive Histogram Equalization), especially to mitigate variable lighting in day/night scenarios (Shen et al., 2020). Patch localization (for extracting eyes and mouth regions) leverages MTCNN face landmarking.
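
A minimal preprocessing sketch along these lines is shown below, using OpenCV for CLAHE illumination normalization, bilinear resizing, and training-time flipping and blurring. The specific parameter values (CLAHE clip limit and tile grid, blur kernel, flip probability) are assumptions for illustration rather than settings reported in the cited works:

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray, size: int = 224, train: bool = False) -> np.ndarray:
    """Illustrative NTHU-DDD frame preprocessing: CLAHE, bilinear resize,
    and optional training-time augmentation. Parameter choices are assumptions."""
    # CLAHE on the lightness channel to mitigate day/night illumination variance.
    lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    frame = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

    # Bilinear resize to the network input shape.
    frame = cv2.resize(frame, (size, size), interpolation=cv2.INTER_LINEAR)

    if train:
        # Training-time augmentation: random horizontal flip and Gaussian blur.
        if np.random.rand() < 0.5:
            frame = cv2.flip(frame, 1)
        if np.random.rand() < 0.5:
            frame = cv2.GaussianBlur(frame, (5, 5), sigmaX=0)
    return frame
```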

5. Train-Test Protocols and Evaluation Metrics

Subject-wise train-test splits are standard: training, validation, and test sets contain disjoint sets of subjects to prevent identity leakage. One protocol uses a 14/1/3 split over 18 subjects for train/val/test (Shen et al., 2020). Clips are typically sampled at 30 fps. Multiple works report only using the training and evaluation sets, leaving out the unreleased test set from the original corpus (Yu et al., 2019). Metrics for evaluation include:

  • Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
  • Precision: $\frac{TP}{TP + FP}$
  • Recall: $\frac{TP}{TP + FN}$
  • F1-score: $\frac{2\,\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$

These are computed at clip or frame granularity. For drowsiness detection, precision, detection rate (recall), and F-measure are standard (Yu et al., 2019, Zaman et al., 16 Nov 2025).
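
These metrics follow directly from the binary confusion counts; the sketch below mirrors the formulas above for frame- or clip-level predictions (plain Python, no external dependencies):

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall (detection rate), and F1 for binary
    drowsy (1) / not-drowsy (0) predictions at frame or clip granularity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example at clip granularity: accuracy 0.6, precision/recall/F1 ~0.67.
print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```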

6. Key Attributes, Usage, and Access

Salient attributes include:

  • Mixed demographic composition (exact gender/age split not specified)
  • Controlled simulator rig and scenario variation (illumination, eyewear)
  • Frame-level binary state labeling plus auxiliary labels (head, mouth, eyes, scene)
  • Tri-modal scenario coverage: ambient daylight, daytime with sunglasses, and nighttime under IR illumination
  • Simulated motion cues via steering and naturalistic driver behaviors

The dataset is widely considered a reference benchmark among visual drowsiness detection corpora. Access, documentation, and download are managed through the CVLab at National Tsing Hua University (https://cv.cs.nthu.edu.tw/), with licensing and DOI details controlled at the originating site (Zaman et al., 16 Nov 2025). File format conventions, directory structure, and annotation tool specifics are not described in all literature; users should consult primary documentation for up-to-date access terms.

7. Impact and Benchmarks in the Literature

NTHU-DDD is the central corpus for evaluating a variety of deep learning-based detection architectures, including 3D CNNs, two-stream action recognition frameworks, and patch-based detectors. Benchmark results report a range of performance depending on the model:

  • Condition-adaptive 3D-CNN: superior to general spatio-temporal representations (Yu et al., 2019)
  • Two-stream multi-feature networks (patch-based, 3D convolutions, optical flow, CLAHE): up to 94.46% test accuracy (Shen et al., 2020)
  • DCNN with minimal preprocessing (resized to 96×96): up to 99.6% accuracy, F1-score 1.00 (Zaman et al., 16 Nov 2025)

Ablation studies demonstrate improvements with illumination normalization, multi-region patch fusion, and pretraining of subnetworks. A consistent trend is that subject-wise splits and multi-scenario evaluation remain critical for robust generalization. The dataset serves as a controlled, reproducible testbed for comparative studies on drowsiness detection models.


Key attributes at a glance:

  • Subjects: 36 (train: 18, eval: 4, test: 14)
  • Total frames (train + eval): 895,482
  • Scenarios per subject: 5 (illumination × eyewear)
  • Camera configuration: RGB @ 640×480 (AVI); IR illumination at night
  • Labeling granularity: frame-level drowsy/not-drowsy plus scene/pose/action labels
  • Access: https://cv.cs.nthu.edu.tw/
