In-Cabin Driver Monitoring Systems
- In-cabin Driver Monitoring Systems are integrated sensor suites that continuously track driver physiology and behavior to ensure safety.
- They utilize a range of modalities, including visible-light and IR cameras, radar, and wearables, combined through fusion strategies for robust state estimation.
- Advanced algorithms, from deep learning to classical methods, enable real-time detection of drowsiness, distraction, and other critical driver states.
An in-cabin Driver Monitoring System (DMS) is an integrated suite of sensing, signal processing, and algorithmic classification modules designed to ascertain and continually track the physical, cognitive, and behavioral state of vehicle occupants, with particular emphasis on detecting conditions such as distraction, drowsiness, cognitive overload, and seatbelt compliance. These systems address stringent requirements for real-time safety intervention, accuracy, robustness to environmental and behavioral variance, low false-alarm rates, and minimal computational and power footprint in production vehicles. DMS research encompasses sensor technologies (visible-light, IR, and neuromorphic cameras, radar, wearables), deep learning and classical machine learning, multi-modal fusion strategies, benchmarking datasets, and system-level design that moves beyond detection toward trusted driver-vehicle interaction and future agentic or robotic augmentation.
1. Key Sensed Driver States, Indicators, and Measurement Modalities
In-cabin DMS research organizes monitored states into five principal “substates,” each with associated physiological and behavioral indicators, mapped to sensor modalities (Halin et al., 2021):
- Drowsiness: PERCLOS (percentage of time the eyelids are ≥70% closed), blink frequency, mean blink duration, EEG theta power, HR/HRV metrics (RMSSD, SDNN), breathing rate, head pose dynamics; measured via IR/RGB camera, PPG/ECG, radar, thermal imaging (a computation sketch for PERCLOS and HRV appears below).
- Mental Workload: HR/HRV (LF/HF ratio), EEG α/θ modulation, pupil diameter, gaze dispersion, SDLP (standard deviation of lane position); acquired from wearables, cameras, and vehicle CAN-bus.
- Distraction: Hand positions, gaze direction and entropy, EOR (eyes-off-road) duration, auditory distraction via pupillary and EEG measures; multi-modal acquisition from cameras, microphones, IMUs.
- Emotions (“Stress/Anger”): Facial expressions (AU detection), vocal prosody, HR/HRV, EDA, erratic braking/steering; sensors include cameras, cockpit microphones, wearables.
- Influence (Alcohol/Drugs): HR, facial temperature (thermal IR), BAC sensors, gaze instability, erratic driving behavior.
Tables I and II in (Halin et al., 2021) detail an exhaustive mapping between state, indicator, and sensor (EEG, ECG, PPG, radar, visible/NIR/thermal/neuromorphic cameras, CAN-bus, wearables).
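As a minimal illustration of how two of these indicators can be computed from raw signals, the sketch below derives PERCLOS from a normalized eye-openness time series and RMSSD/SDNN from inter-beat (RR) intervals. The variable names, sampling rate, and 70% closure threshold are illustrative assumptions rather than values prescribed by the cited work.

```python
import numpy as np

def perclos(eye_openness, closed_thresh=0.3):
    """Fraction of samples in which the eye is considered >=70% closed.

    eye_openness: normalized openness values in [0, 1] (1 = fully open).
    closed_thresh: openness below this counts as ">=70% closed"
                   (0.3 openness ~ 70% closure; illustrative threshold).
    """
    eye_openness = np.asarray(eye_openness, dtype=float)
    return float(np.mean(eye_openness < closed_thresh))

def hrv_metrics(rr_intervals_ms):
    """RMSSD and SDNN computed from successive RR intervals (milliseconds)."""
    rr = np.asarray(rr_intervals_ms, dtype=float)
    sdnn = float(np.std(rr, ddof=1))                    # overall variability
    rmssd = float(np.sqrt(np.mean(np.diff(rr) ** 2)))   # beat-to-beat variability
    return {"SDNN_ms": sdnn, "RMSSD_ms": rmssd}

# Toy usage on synthetic values (not real driver data):
openness = np.clip(np.random.normal(0.8, 0.25, size=1800), 0, 1)  # ~60 s at 30 Hz
rr = np.random.normal(850, 40, size=120)                          # ~2 min of beats
print(perclos(openness), hrv_metrics(rr))
```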
2. Sensing Technologies and Data Acquisition
Modern DMS utilize a synergistic array of sensors (Farooq et al., 2023, Kielty et al., 2023, Tavakoli et al., 2021, Hariharan et al., 2023):
- Optical Sensors
- Visible CMOS: high-res facial detection, gesture, color cues (Farooq et al., 2023).
- Near-Infrared (NIR): robust to low-light, enables pupil tracking, blink, gaze estimation (Kielty et al., 2023, Hariharan et al., 2023).
- Long-Wave IR (LWIR): facial thermal gradients for fatigue/stress.
- Neuromorphic Event Cameras: sub-millisecond brightness-change event streams; superior for blink/yawn/seatbelt detection with >120 dB dynamic range (Kielty et al., 2023, Ryan et al., 2020).
- Depth/ToF: 3D face, body pose, hand localization; resilient to occlusion and illumination changes (Katrolia et al., 2021, Feld et al., 2020).
- Wearables & Radar
- Smartwatches: multi-modal IMU, PPG, HR, ambient light/audio (Tavakoli et al., 2021, Sini et al., 2023).
- Contactless radar: HR/RR via ballistocardiography (Sini et al., 2023).
- Infrastructure
- CAN-bus: steering angle, accelerator/brake, SDLP (Feld et al., 2020, Wang et al., 2024).
- Synchronization and Calibration
- Intrinsics/extrinsics, hardware sync for cross-modal alignment (Farooq et al., 2023, Feld et al., 2020, Katrolia et al., 2021).
- Time-alignment via MQTT brokers, NoSQL stores, and event-driven middleware (Sini et al., 2023); a minimal alignment sketch appears below.
This sensor constellation enables redundancy, coverage in challenging lighting (night, glare), and multimodal robustness to individual or environmental failure cases.
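Because these streams arrive at different rates over different transports, a common software fallback is nearest-timestamp alignment onto a reference clock (e.g., the camera frame clock). The sketch below, referenced in the synchronization bullet above, is a broker-agnostic illustration; the stream names, rates, and skew tolerance are assumptions for the example.

```python
import numpy as np

def align_to_reference(ref_ts, stream_ts, stream_vals, max_skew=0.05):
    """Map each reference timestamp to the nearest sample of another stream.

    ref_ts, stream_ts: sorted timestamps in seconds.
    max_skew: samples farther than this (s) from the reference become NaN.
    """
    ref_ts = np.asarray(ref_ts, dtype=float)
    stream_ts = np.asarray(stream_ts, dtype=float)
    stream_vals = np.asarray(stream_vals, dtype=float)
    idx = np.clip(np.searchsorted(stream_ts, ref_ts), 1, len(stream_ts) - 1)
    left, right = stream_ts[idx - 1], stream_ts[idx]
    nearest = np.where(np.abs(ref_ts - left) <= np.abs(right - ref_ts), idx - 1, idx)
    aligned = stream_vals[nearest]
    aligned[np.abs(stream_ts[nearest] - ref_ts) > max_skew] = np.nan
    return aligned

# Example: attach a 1 Hz wearable heart-rate stream to 30 Hz camera frames.
cam_ts = np.arange(0, 10, 1 / 30)            # reference clock: camera frames
hr_ts = np.arange(0, 10, 1.0) + 0.02         # wearable samples, slight offset
hr = 70 + np.random.randn(len(hr_ts))
hr_on_frames = align_to_reference(cam_ts, hr_ts, hr, max_skew=0.6)
```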
3. Feature Extraction, Multi-Modal Fusion, and System Architecture
DMS pipelines extract high-level features and employ fusion strategies to improve specificity and reliability (Lin et al., 2024, Farooq et al., 2023, Tavakoli et al., 2021, Sini et al., 2023, Kielty et al., 2023, Hu, 2022):
- Feature Extraction
- Vision: CNN/RNN/LSTM models for facial landmarks, EAR, PERCLOS, gaze classification, hand/action recognition.
- Neuromorphic: event accumulation, MobileNetV2 backbone, self-attention, recurrent head for temporal aggregation (Kielty et al., 2023).
- Wearables/Radar: engineered time/frequency features, HRV metrics, Random Forest classifiers (Tavakoli et al., 2021).
- Fusion-specific: time-surface generation for events, joint feature concatenation (Farooq et al., 2023), cross-modality channel-shifting (Lin et al., 2024).
- Fusion Strategies
- Early fusion: concatenation of normalized feature vectors (Sini et al., 2023, Farooq et al., 2023) (e.g., [F_vis; F_NIR; F_LWIR; F_depth]).
- Late fusion: weighted sum of classifier scores or voting (Farooq et al., 2023, Katrolia et al., 2021).
- Context-aware fusion: Bayesian, Markov, or attention-based multi-branch architectures (Riya et al., 2024, Farooq et al., 2023).
- Dual Feature Shift (DFS): channel-reindexing for zero-FLOP cross-modality and temporal shifts, shared ResNet stages for efficiency (Lin et al., 2024).
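To make the early- and late-fusion options above concrete, the sketch below contrasts feature-level concatenation of normalized per-modality vectors with a weighted combination of per-modality classifier scores. Modality names, feature sizes, and weights are placeholders chosen for illustration, not values from the cited systems.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Per-modality feature vectors (placeholder dimensionalities).
f_vis, f_nir, f_lwir, f_depth = (np.random.randn(d) for d in (128, 128, 64, 64))

# Early fusion: concatenate normalized features and feed a single classifier.
f_early = np.concatenate([l2_normalize(f) for f in (f_vis, f_nir, f_lwir, f_depth)])

# Late fusion: each modality has its own classifier; combine the class scores.
n_classes = 5
per_modality_scores = [softmax(np.random.randn(n_classes)) for _ in range(4)]
weights = np.array([0.4, 0.3, 0.15, 0.15])     # e.g., trust RGB/NIR branches more
fused_scores = sum(w * s for w, s in zip(weights, per_modality_scores))
predicted_class = int(np.argmax(fused_scores))
```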
Such architectures permit real-time, energy-efficient, and robust multi-class behavior/action recognition and physiological state estimation under diverse conditions.
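One plausible reading of the zero-FLOP shift in DFS is that cross-modality (and temporal) information exchange is implemented purely by re-indexing channels between parallel feature tensors before a shared backbone stage, so no multiply-accumulates are added. The PyTorch sketch below illustrates that mechanism only; the tensor shapes and the fraction of shifted channels are assumptions, not the published configuration.

```python
import torch

def cross_modality_shift(feat_a, feat_b, shift_ratio=0.125):
    """Swap a small slice of channels between two modality feature maps.

    feat_a, feat_b: tensors of shape (batch, channels, H, W).
    The swap is pure indexing/copying, so the exchange itself adds no
    multiply-accumulates before the shared stage that follows.
    """
    c = feat_a.shape[1]
    k = max(1, int(c * shift_ratio))
    a, b = feat_a.clone(), feat_b.clone()
    a[:, :k], b[:, :k] = feat_b[:, :k], feat_a[:, :k]
    return a, b

# Example: exchange channels between RGB and IR feature maps.
rgb = torch.randn(2, 64, 56, 56)
ir = torch.randn(2, 64, 56, 56)
rgb_shifted, ir_shifted = cross_modality_shift(rgb, ir)
```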
4. Algorithms, Model Architectures, and Training Protocols
DMS employ classical ML, deep learning, vision-language models, and mixture-of-experts approaches, supported by significant advances in model architecture and dataset development (Hu, 2022, Kielty et al., 2023, Riya et al., 2024, Cañas et al., 15 Mar 2025, Ortega et al., 2020):
- Deep Learning and Hybrid Models
- CNN/LSTM/GRU: for temporal modeling of sequential vision/event/wearable signals; e.g., bi-LSTM recurrent heads for event-based seatbelt recognition (Kielty et al., 2023), Conv3D+LSTM for intent anticipation (Rong et al., 2020).
- Mixture-of-Experts (VDMoE): spatial/temporal experts, prior-inclusive regularization, task-conditioned gating, pretrained embeddings for rPPG and facial cues in multi-task state estimation (drowsiness, cognitive load, HR, RR) (Wang et al., 2024).
- Vision-language models (VLMs): zero/few-shot prompting, CLIP-style embeddings, cross-modal alignment for gaze/distraction recognition; latency is prohibitive for real-time use without model compression (Cañas et al., 15 Mar 2025).
- Classical Approaches
- Random Forest, SVM, HMM: useful for wearable-IMU features, PPG signal classification, and classical time-series prediction of drowsiness and intent (Tavakoli et al., 2021, Sini et al., 2023, Halin et al., 2021).
- Seatbelt-State Recognition
- Event-based CNN with self-attention and bi-directional LSTM (Kielty et al., 2023), feature-based IR/fisheye pipeline (local predictor, global assembler, curve modeling) (Hu, 2022).
- Training and Annotation
- Model surgery (quantization, op replacement), adversarial domain adaptation, cross-entropy/focal loss, multi-task objectives (Hariharan et al., 2023, Katrolia et al., 2021, Tavakoli et al., 2021, Wang et al., 2024).
- Rich benchmarks: DMD (Ortega et al., 2020, Cañas et al., 29 Apr 2025), TICaM (Katrolia et al., 2021), Drive&Act (Lin et al., 2024), synthetic data generation via ESIM/v2e (Kielty et al., 2023, Ryan et al., 2020).
Reported metrics include precision, recall, F1, Top-1 accuracy, latency (often <40 ms/frame for vision modules, sub-ms for event-based), and error measures specific to task (seatbelt, blink, yawn, gaze zone).
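As a rough sketch of the CNN-plus-recurrent pattern used for event-based seatbelt and blink/yawn recognition, the PyTorch model below runs a small convolutional encoder over each accumulated event frame and a bi-directional LSTM over the resulting sequence; a focal-loss helper is included because focal loss appears among the training objectives above. Layer sizes, sequence length, and class count are illustrative assumptions, not the published configurations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventSeqClassifier(nn.Module):
    """Per-frame CNN encoder followed by a bi-directional LSTM head."""

    def __init__(self, n_classes=2, feat_dim=128, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.lstm(feats)
        return self.head(seq[:, -1])             # classify from the final step

def focal_loss(logits, targets, gamma=2.0):
    """Cross-entropy re-weighted by (1 - p_t)^gamma to emphasize hard samples."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)
    return ((1 - p_t) ** gamma * ce).mean()

# Toy forward/backward pass on random "accumulated event frames".
model = EventSeqClassifier()
frames = torch.randn(4, 8, 1, 64, 64)            # 4 clips of 8 frames each
loss = focal_loss(model(frames), torch.randint(0, 2, (4,)))
loss.backward()
```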
5. System Integration, Deployment, and Real-World Constraints
Deployment requirements and architectural solutions include embedded hardware optimization, pipeline latency management, sensor placement, and live-system trustworthiness (Ahsani et al., 26 Dec 2025, Hariharan et al., 2023, Kielty et al., 2023, Cañas et al., 29 Apr 2025):
- Hardware and Latency
- Embedded SoCs: NVIDIA Jetson, TI-TDA4VM, Raspberry Pi, Coral Edge TPU; optimized for INT8 quantized inference, DMA/FPGA acceleration, per-frame latency <60 ms for full pipelines, >30 FPS operational speed (Ahsani et al., 26 Dec 2025, Hariharan et al., 2023).
- Placement and Lighting Robustness
- Rear-view mirror or dashboard mounting for multi-modality coverage; IR illumination and event cameras for resilience to sunlight/night (Kielty et al., 2023, Hariharan et al., 2023).
- Fusion and Decision Modules
- Central DMS ECU aggregates seatbelt, gaze, blink, pose, drowsiness; event-based seatbelt and blink modules operate at >30 Hz for real-time state change detection (Kielty et al., 2023).
- Dual-camera pipelines for occlusion-aware fallback; RGB primary, IR backup during persistent occlusion/low light; region-based gaze, ID, occlusion via EfficientNet/MobileNet features (Cañas et al., 29 Apr 2025).
- Human-Centered and Agentic Intelligence
- Behavioral signals (drowsiness, distraction, engagement) routed to higher-order decision modules for personalized interventions, handover readiness in SAE Level 3/4 vehicles; privacy-first, on-device inference to limit raw video exposure (Ahsani et al., 26 Dec 2025).
- Robustness and Regulation
- Multi-modal redundancy (wearables+radar+cameras) and fused classifier outputs mitigate single-sensor dropout; compliance with EuroNCAP and EU 2019/2144 real-time warning/alert mandates (Sini et al., 2023, Cañas et al., 29 Apr 2025).
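The occlusion-aware dual-camera fallback above reduces, at the control level, to a small per-frame decision routine: prefer the RGB stream, switch to IR when occlusion or low light persists for several frames, and flag any missed real-time deadline. The sketch below is a simplified illustration of that logic only; the thresholds, persistence window, and 30 FPS budget are assumptions for the example.

```python
import time

class CameraFallback:
    """Select RGB or IR per frame based on persistent occlusion/low light."""

    def __init__(self, persist_frames=15, frame_budget_s=1 / 30):
        self.persist_frames = persist_frames     # ~0.5 s at 30 FPS
        self.frame_budget_s = frame_budget_s
        self.bad_rgb_streak = 0
        self.active = "RGB"

    def select(self, rgb_occluded, rgb_low_light):
        bad = rgb_occluded or rgb_low_light
        self.bad_rgb_streak = self.bad_rgb_streak + 1 if bad else 0
        if self.bad_rgb_streak >= self.persist_frames:
            self.active = "IR"                   # sustained RGB failure: fall back
        elif self.bad_rgb_streak == 0:
            self.active = "RGB"                  # RGB recovered: switch back
        return self.active

    def within_budget(self, start_time):
        return (time.monotonic() - start_time) <= self.frame_budget_s

# Per-frame loop (sensor reads and model inference omitted):
fallback = CameraFallback()
t0 = time.monotonic()
camera = fallback.select(rgb_occluded=True, rgb_low_light=False)
deadline_ok = fallback.within_budget(t0)
```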
6. Benchmarks, Datasets, and Quantitative Performance
Comprehensive, open datasets underpin DMS algorithm development, alongside comparative analysis and benchmarking (Ortega et al., 2020, Katrolia et al., 2021, Lin et al., 2024):
- DMD Dataset: 41 h, 37 drivers, RGB/IR/depth, 3 synchronized views (face, hands, body), 93 classes, extensible VCD annotation; enables ≥90 % single-modal accuracy and 93.7 % accuracy with real-time multi-modal fusion (Ortega et al., 2020).
- TICaM Dataset: Real/synthetic sequences, RGB/depth/IR, 8 in-cabin scenarios (driver, passenger, child/infant seats), 20 activities; multi-task segmentation, detection, pose (Katrolia et al., 2021).
- Drive&Act: 9.6 million frames, RGB/IR/depth, 83 fine-grained action classes, vehicle-cabin multi-view (Lin et al., 2024).
- Benchmarks and Evaluation:
- Event-based seatbelt (F1: 0.989 sim, 0.944 real (Kielty et al., 2023)), yawn detection (F1: 95.3 % (Kielty et al., 2023)), occlusion-aware gaze region (86.3 % RGB, 76.8 % IR (Cañas et al., 29 Apr 2025)), wearable activity recognition (F1: 94.55 % (Tavakoli et al., 2021)).
- DFS multi-modality backbone (Top-1 Acc: 77.61 %, latency 28 ms (Lin et al., 2024)).
- Agentic DMS fusion reduces false negatives by 30 %, maintains sub-150 ms end-to-end latency (Riya et al., 2024).
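When reproducing figures such as the F1 and Top-1 scores above, it helps to pin down the metric definitions explicitly. The short sketch below computes binary precision/recall/F1 and Top-1 accuracy from prediction arrays; it is a generic harness, not the evaluation code of any cited benchmark.

```python
import numpy as np

def binary_prf1(y_true, y_pred):
    """Precision, recall, and F1 for a binary task with labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

def top1_accuracy(y_true, scores):
    """Top-1 accuracy from per-class score rows of shape (n_samples, n_classes)."""
    return float(np.mean(np.argmax(scores, axis=1) == np.asarray(y_true)))

print(binary_prf1([1, 0, 1, 1], [1, 0, 0, 1]))   # precision 1.0, recall ~0.67, F1 0.8
```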
7. Limitations, Open Challenges, and Future Directions
Persistent research gaps and improvement targets include sensor coverage, model generalization, privacy, and explainability (Halin et al., 2021, Cañas et al., 15 Mar 2025, Wang et al., 2024, Lin et al., 2024, Kielty et al., 2023):
- Data Diversity and Domain Gap
- Synthetic vs. real events yield a domain shift (seatbelt F1 drops from 0.989 to 0.944; yawn F1 from 95.3% to 90.4% (Kielty et al., 2023)).
- Small subject pools; recommended expansion across drivers, seating, occlusion, lighting (Kielty et al., 2023).
- Sensor Limitations
- Wearables rely on proper compliance; radar/camera affected by movement and illumination (Tavakoli et al., 2021, Sini et al., 2023).
- IR accuracy degrades under lighting extremes and from the lack of IR-pretrained backbones; occlusion-detection coverage remains limited (Cañas et al., 29 Apr 2025).
- Fusion and Algorithmic Complexity
- Asynchronous modalities and calibration drift complicate fusion; lightweight transformer or spiking approaches are needed for real-time embedded inference (Farooq et al., 2023).
- Semi-supervised and continual learning, dataset expansion, standardized annotation (Katrolia et al., 2021, Ortega et al., 2020).
- Privacy and Ethics
- Camera and wearable monitoring raise GDPR and data-security concerns; recommended mitigations include encryption, edge-only inference, and user consent flows (Halin et al., 2021, Sini et al., 2023, Ahsani et al., 26 Dec 2025).
- Explainability and Certification
- Deep-learning methods need interpretable outputs for regulatory approval; EuroNCAP and SAE guidelines require trusted DMS with graceful degradation (Cañas et al., 29 Apr 2025).
- Emergent Directions
- Multimodal transformer/attention architectures, RL-based adaptive alerting, model compression, cross-vehicle standards; agentic/robotic DMS for real-time intervention and user profiling (Riya et al., 2024, Lin et al., 2024, Halin et al., 2021).
By integrating redundant, multimodal sensing with advanced feature extraction, fusion, and algorithmic reasoning, in-cabin DMS research is positioned to deliver reliable, real-time driver state estimation and safety monitoring for both current and next-generation vehicles.