On-Device Yawning Monitoring
- On-device yawning monitoring employs computer vision, neuromorphic sensing, and acoustic methods to detect yawning events in real time on edge devices.
- It integrates resource-efficient deep learning models, multi-modal sensor fusion, and temporal smoothing to achieve high accuracy (>94%) and low latency (<20 ms per decision).
- This approach is crucial for applications in driver safety and healthcare, enabling privacy-preserving fatigue monitoring while meeting stringent computational constraints.
On-device yawning monitoring refers to the real-time detection, classification, and temporal localization of yawning behaviors using only the computational resources available on edge devices such as smartphones, embedded boards (e.g., Jetson Nano), or dedicated automotive/healthcare platforms. These pipelines leverage computer vision, neuromorphic sensing, or acoustic Doppler techniques to extract yawn events from video or audio streams, quantify their duration, and integrate the results into broader fatigue monitoring or safety systems. Rapid advances in resource-efficient deep learning, robust annotation, and sensor fusion are enabling high-accuracy (>94%) yawning detection with real-time throughput and low power overhead.
1. Sensing Modalities and Problem Definition
On-device yawning monitoring can be grouped into three main categories by input modality:
- Video-based Optical Approaches: Standard RGB cameras or webcams capture continuous facial imagery. Deep neural networks process mouth and face regions to infer the open/closed state or yawn occurrence directly from pixel data (Bačić et al., 2020, Mujtaba et al., 12 Dec 2025, Faraji et al., 2021).
- Event-based (Neuromorphic) Sensing: Event cameras (or video-to-event simulators) report pixel-wise log-intensity changes, emitting “spikes” only when significant motion or appearance shifts occur. Yawning is detected as a dynamic spatio-temporal pattern in the event stream (Kielty et al., 2023).
- Acoustic Doppler Methods: Smartphones emit inaudible tones (e.g., 20 kHz) via their speakers and analyze reflected signals for tiny frequency shifts caused by jaw movement during yawning, using Fast Fourier Transforms (FFT) and recurrent neural networks to distinguish yawn events (Xie et al., 30 Mar 2025).
All frameworks aim for real-time (≥30 FPS video/audio equivalent) detection on commodity devices, minimal latency (≤20–35 ms per decision), and high classification or event-detection accuracy.
2. Datasets, Annotation Strategies, and Data Quality
Dataset construction is critical for robust on-device yawning monitoring. Most vision pipelines rely on diverse video datasets with fine-grained event-level labels:
- Manual and Semi-automated Labeling: Frame-level labels significantly reduce annotation noise compared to coarse video-level labels, as shown in YawDD+, which used a lightweight CNN classifier for initial mouth-state labeling and human-in-the-loop correction. Precision of automated pre-labeling was ≈80%, with manual effort reduced by 80%, producing a final corpus of 124,201 frames (24,840 yawn, 99,361 non-yawn) (Mujtaba et al., 12 Dec 2025); a minimal sketch of such a pre-labeling pass follows this list.
- Synthetic and Simulated Event Data: For neuromorphic approaches, video-to-event simulators (e.g., v2e) create event streams from RGB data, enabling model development in the absence of widespread real event camera datasets (Kielty et al., 2023).
- Audio Framing: In Doppler systems, labels are defined per 0.25 s segment of audio, mapping phase-drift features and aligning windowed FFT outputs with yawning/non-yawning events (Xie et al., 30 Mar 2025).
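A minimal sketch of such a CNN-assisted pre-labeling pass, assuming a generic frame-level classifier exposing a `predict()` method over cropped mouth frames; the confidence threshold and review-queue logic are illustrative, not the actual YawDD+ tooling.

```python
import numpy as np

def prelabel_frames(model, frames, conf_threshold=0.9):
    """Propose frame-level mouth-state labels and queue uncertain frames for review.

    frames: float32 array of shape (N, H, W, 3) with cropped mouth regions.
    model:  any classifier exposing predict() -> (N, 2) softmax scores
            over {non-yawn, yawn}; the 0.9 threshold is an assumption.
    """
    probs = np.asarray(model.predict(frames))   # (N, 2) softmax scores
    labels = probs.argmax(axis=1)               # 0 = non-yawn, 1 = yawn
    confidence = probs.max(axis=1)

    auto_labels = {}       # frame index -> accepted label
    review_queue = []      # frame indices for human-in-the-loop correction
    for i, (lab, conf) in enumerate(zip(labels, confidence)):
        if conf >= conf_threshold:
            auto_labels[i] = int(lab)
        else:
            review_queue.append(i)
    return auto_labels, review_queue
```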
Accurate, low-noise, frame- or event-level ground truth translates directly into more accurate and stable detection networks, as evidenced by gains of +6 percentage points in accuracy and +5 mAP for models trained on YawDD+ rather than on noisy, video-level labels (Mujtaba et al., 12 Dec 2025).
3. Architectures and Real-Time On-Device Pipelines
Video-based Classification and Detection
- Mouth Region Detection and Classification: Slim CNNs such as MobileNet-V1 (input: 64×64) or MNasNet-A1 (input: 224×224) classify mouth open/closed or yawn/no-yawn per frame, often using softmax outputs with thresholding (τ=0.5) for binary decisions (Bačić et al., 2020, Mujtaba et al., 12 Dec 2025). Model quantization to 8-bit reduces memory footprint by up to 4×, with on-device RAM consumption as low as 1–4 MB per model (Bačić et al., 2020, Mujtaba et al., 12 Dec 2025).
- Integrated Face Detection and Localization: Detectors such as YOLOv8/YOLOv11 leverage multi-scale detection heads for precise mouth/yawn region localization, with face mesh or landmark models extracting spatial context (Mujtaba et al., 12 Dec 2025, Faraji et al., 2021).
- Optical Flow and Landmark Features: Some pipelines incorporate head movement, eyelid closure, and multi-part facial landmarks for more comprehensive fatigue assessment (Bačić et al., 2020, Faraji et al., 2021).
- Temporal Smoothing and Event Extraction: Yawn time series are generated using moving averages and minimum contiguous open durations (e.g., L_min = 3 frames), extracting onset/offset events and supporting temporal smoothing to suppress spurious activations (Bačić et al., 2020, Mujtaba et al., 12 Dec 2025).
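A minimal sketch of this thresholding, moving-average smoothing, and minimum-duration event extraction, assuming per-frame yawn probabilities from a classifier such as those above; τ = 0.5 and L_min = 3 follow the values in the text, while the smoothing window length and function names are illustrative.

```python
import numpy as np

def extract_yawn_events(probs, tau=0.5, window=5, l_min=3):
    """Turn per-frame yawn probabilities into (onset, offset) frame indices."""
    probs = np.asarray(probs, dtype=np.float32)
    # Moving-average smoothing suppresses single-frame spurious activations.
    kernel = np.ones(window, dtype=np.float32) / window
    smoothed = np.convolve(probs, kernel, mode="same")
    active = smoothed >= tau                 # binary yawn decision (tau = 0.5)

    events, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i                        # candidate onset
        elif not flag and start is not None:
            if i - start >= l_min:           # keep runs of at least L_min frames
                events.append((start, i))    # (onset, offset) frame indices
            start = None
    if start is not None and len(active) - start >= l_min:
        events.append((start, len(active)))
    return events

# A 1 s burst of high scores at 30 FPS yields a single yawn event:
scores = np.concatenate([np.full(30, 0.1), np.full(30, 0.9), np.full(30, 0.1)])
print(extract_yawn_events(scores))           # [(30, 60)]
```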
Event-based (Neuromorphic) Pipelines
- Event Accumulation and CNN-RNN Backbone: A sequence of 100 event-frames (256×256, ΔT=0.1 s windows) is processed by MobileNetV2, followed by self-attention and a bi-LSTM head, aggregating both spatial and temporal yawning patterns over a 10 s window (Kielty et al., 2023).
- Model Footprint: 4M parameters (~16 MB FP32, 4 MB quantized), compatible with embedded GPUs (Kielty et al., 2023).
- Temporal Context: The bi-LSTM head aggregates feature trajectories, enabling accurate event-level segmentation in long driving video sequences.
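A structural sketch of this event-frame CNN-RNN pipeline in PyTorch, assuming event frames rendered as 3-channel 256×256 images; the projection width, attention heads, and temporal pooling are illustrative choices rather than the exact published configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class EventYawnNet(nn.Module):
    """Per-frame CNN backbone + self-attention + bi-LSTM over an event-frame clip."""

    def __init__(self, feat_dim=256, lstm_hidden=128, num_classes=2):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features   # (N, 1280, h, w)
        self.pool = nn.AdaptiveAvgPool2d(1)                    # -> (N, 1280, 1, 1)
        self.project = nn.Linear(1280, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, lstm_hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, clip):
        # clip: (B, T, 3, 256, 256), e.g. T = 100 event frames covering 10 s.
        b, t, c, h, w = clip.shape
        x = self.backbone(clip.reshape(b * t, c, h, w))
        x = self.pool(x).flatten(1)                # (B*T, 1280) per-frame features
        x = self.project(x).reshape(b, t, -1)      # (B, T, feat_dim)
        x, _ = self.attn(x, x, x)                  # temporal self-attention
        x, _ = self.lstm(x)                        # (B, T, 2*lstm_hidden)
        return self.head(x.mean(dim=1))            # clip-level yawn logits

# Usage: logits = EventYawnNet()(torch.randn(1, 100, 3, 256, 256))
```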
Acoustic Doppler-based Monitoring
- Signal Processing Chain: Audio captured at 44.1 kHz is band-pass filtered and downsampled by n=8, focusing on the 19.8–20.2 kHz band and aliasing it to a 1.8–2.2 kHz baseband for efficient FFT computation (2,048-point) (Xie et al., 30 Mar 2025).
- Feature Engineering: Discriminative phase coefficients from the aliased band form per-frame input to a two-layer LSTM with softmax output, labeling each 0.25 s frame as yawn/non-yawn in near real time (Xie et al., 30 Mar 2025).
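A sketch of this band-pass, intentional-aliasing, and per-frame FFT front end using NumPy/SciPy; the filter order and the choice of phase-angle features over the folded band are simplifying assumptions about the published chain, which the two-layer LSTM would then consume frame by frame.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 44_100          # microphone sampling rate (Hz)
DECIM = 8            # downsampling factor from the text
FS_LOW = FS / DECIM  # 5512.5 Hz after decimation
NFFT = 2_048         # FFT length from the text
FRAME_S = 0.25       # per-frame label granularity (s)

# Band-pass around the ~20 kHz probe tone, then subsample *without* an
# anti-aliasing filter so the 19.8-20.2 kHz band folds down to ~1.8-2.2 kHz.
_sos = butter(4, [19_800, 20_200], btype="bandpass", fs=FS, output="sos")

def doppler_features(audio):
    """Phase features per 0.25 s frame from a mono 44.1 kHz recording."""
    narrow = sosfilt(_sos, np.asarray(audio, dtype=np.float64))
    aliased = narrow[::DECIM]                       # intentional aliasing

    frame_len = int(FRAME_S * FS_LOW)               # ~1378 samples per frame
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS_LOW)
    band = (freqs >= 1_800) & (freqs <= 2_200)      # folded probe-tone band

    feats = []
    for start in range(0, len(aliased) - frame_len + 1, frame_len):
        spectrum = np.fft.rfft(aliased[start:start + frame_len], n=NFFT)
        feats.append(np.angle(spectrum[band]))      # per-frame phase vector
    return np.stack(feats) if feats else np.empty((0, int(band.sum())))
```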
4. Quantitative Performance, Robustness, and Latency
High accuracy and minimal latency are regularly achieved across sensing modalities:
| Reference | Method/Model | Accuracy / F1 / mAP | FPS or Latency | Platform |
|---|---|---|---|---|
| (Mujtaba et al., 12 Dec 2025) | MNasNet, YOLOv11 | 99.34% acc, 95.7 mAP | 59.8 FPS (MN), 28 FPS (Y11) | Jetson Nano, FP16, 1.2 GB RAM |
| (Kielty et al., 2023) | MobileNetV2+Attention+LSTM | 95.3% F1 (test) | 0.044 s per 1 s of input | Desktop GPU / Embedded GPU |
| (Bačić et al., 2020) | MobileNetV1/ResNet | 95.0–97.2% acc | 5–15 ms/inference | ARM Cortex-A53 / i5 / 940M GPU |
| (Xie et al., 30 Mar 2025) | 2×LSTM (Acoustic) | 94.1% acc | <20 ms/frame | ARM Cortex-A53, <50 mW |
| (Faraji et al., 2021) | YOLOv3+LSTM | 91.7% acc | 33 ms/frame | GTX 1080Ti/i7, up to 30 FPS |
These metrics are achieved even under challenging operating conditions: various head poses (yaw ±30°, pitch ±15°), daylight/shade, glasses, and low-light (drop in mAP <5%) (Faraji et al., 2021). Acoustic systems demonstrate high specificity, with a false positive rate of 2.9% and false negative rate of 3.0% for yawning (Xie et al., 30 Mar 2025).
Smoothing and Temporal Fusion
Temporal smoothing (e.g., requiring ≥2 consecutive yawn detections before alerting) and frame/event sequence buffering are implemented to minimize jitter and false positives; integrating additional cues (eye closure, head pose) further enhances robustness (Mujtaba et al., 12 Dec 2025).
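A decision-level sketch of such debouncing and cue fusion; the consecutive-detection rule follows the text, while the eye-closure ratio and its threshold are illustrative additions.

```python
from collections import deque

class FatigueAlertGate:
    """Raise an alert only after sustained yawning, optionally reinforced by a
    PERCLOS-like eye-closure ratio. Thresholds are illustrative, not published."""

    def __init__(self, min_consecutive=2, eye_closure_thresh=0.4, history=30):
        self.min_consecutive = min_consecutive
        self.eye_closure_thresh = eye_closure_thresh
        self.eye_history = deque(maxlen=history)   # recent eye-closed flags
        self.consecutive_yawns = 0

    def update(self, yawn_detected, eye_closed=None):
        """Feed one per-window decision; returns True when an alert should fire."""
        self.consecutive_yawns = self.consecutive_yawns + 1 if yawn_detected else 0
        if eye_closed is not None:
            self.eye_history.append(bool(eye_closed))

        yawn_alert = self.consecutive_yawns >= self.min_consecutive
        closure_ratio = (sum(self.eye_history) / len(self.eye_history)
                         if self.eye_history else 0.0)
        # Fire on sustained yawning, or on a single yawn combined with
        # prolonged eye closure (secondary cue).
        return yawn_alert or (yawn_detected and closure_ratio > self.eye_closure_thresh)
```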
5. Computational Constraints, On-Device Optimizations, and Resource Usage
Efficient on-device yawning monitoring requires careful hardware adaptation:
- Model Compression: Quantization to 8-bit fixed-point, weight pruning (30–50% sparsity), and knowledge distillation enable model sizes as low as 1–4 MB for CNNs and 15 MB for large detectors (Bačić et al., 2020, Mujtaba et al., 12 Dec 2025, Kielty et al., 2023); a post-training compression sketch follows this list.
- Optimized Inference: Use of NEON/SSE/FP16 vector instructions, TensorRT/ONNX engines, and asynchronous multi-threaded pipelines allows real-time throughput (≥30 FPS) on devices such as Jetson Nano and ARM Cortex-A53, with power consumption in the 0.3–10 W range depending on workload (Bačić et al., 2020, Mujtaba et al., 12 Dec 2025).
- Memory and Power Footprint: System and model RAM remains under 1.2 GB for video pipelines, and all-acoustic processing on smartphones draws under 50 mW (Mujtaba et al., 12 Dec 2025, Xie et al., 30 Mar 2025).
- Latency: End-to-end per-frame or per-event inference times are consistently below hard real-time thresholds: <20 ms (acoustic), 12–15 ms (CNN video), 0.44 s per 10 s window (neuromorphic) (Bačić et al., 2020, Kielty et al., 2023, Xie et al., 30 Mar 2025).
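A post-training compression sketch in PyTorch, combining dynamic 8-bit quantization of recurrent and fully connected layers with magnitude pruning of convolutions; the cited pipelines deploy through TensorRT/TFLite/ONNX engines, so this only illustrates the size-reduction idea, with 40% sparsity picked from the 30–50% range reported above.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def compress_for_edge(model):
    """Post-training compression of an already-trained detector (illustrative)."""
    # 1) Dynamic 8-bit quantization: LSTM/Linear weights are stored as int8
    #    and dequantized on the fly, roughly a 4x size reduction for those layers.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear, nn.LSTM}, dtype=torch.qint8
    )
    # 2) Unstructured L1 magnitude pruning of conv layers to ~40% sparsity
    #    (within the 30-50% range reported in the text).
    for module in quantized.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=0.4)
            prune.remove(module, "weight")   # bake the pruning mask into weights
    return quantized
```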
Edge Adaptations
On mobile and deeply embedded hardware, successful strategies include transitioning to ultra-compact models via hardware-aware NAS, downscaling input resolution, employing temporal CNNs or GRUs in lieu of LSTMs, and deploying on NPUs, DSPs, or TPUs (Bačić et al., 2020, Faraji et al., 2021, Mujtaba et al., 12 Dec 2025).
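As one concrete example of the GRU-for-LSTM substitution, the sketch below replaces a bidirectional LSTM head with a unidirectional GRU of the same width, which cuts the recurrent parameter count by more than half and, being causal, avoids buffering future frames; the sizes are illustrative.

```python
import torch.nn as nn

class GRUTemporalHead(nn.Module):
    """Lightweight causal temporal head, usable in place of a bi-LSTM head."""

    def __init__(self, feat_dim=256, hidden=128, num_classes=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        out, _ = self.gru(feats)
        return self.classifier(out[:, -1])    # logits from the last time step
```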
6. Challenges, Limitations, and Integration with Broader Monitoring
Salient challenges documented in current research include:
- Non-Yawn Mouth Movements: Speech, coughing, or eating can be misclassified as yawns. Temporal smoothing and fusing with eye closure or head pose can mitigate this (Mujtaba et al., 12 Dec 2025).
- Sensor-specific Artifacts: Open-car windows and airflow can introduce Doppler artifacts in acoustic pipelines. Cepstral filtering or narrow-band noise blanking may be required (Xie et al., 30 Mar 2025).
- Anatomical Variability: Differences in jaw size and yawning style necessitate optional on-device personalization, such as quick LSTM fine-tuning on user-specific “silent yawns” recorded at setup (Xie et al., 30 Mar 2025); a fine-tuning sketch follows this list.
- Dataset-Model Mismatch: Neuromorphic models may be limited by reliance on simulated rather than true event camera data, suggesting the need for acquisition of native DVS datasets (Kielty et al., 2023).
- Resource Constraints: Extreme memory or power restrictions may require further compression, use of simplified models (YOLOv3-tiny, MobileNetV3), or knowledge distillation (Bačić et al., 2020, Faraji et al., 2021, Mujtaba et al., 12 Dec 2025).
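A sketch of the on-device personalization idea noted under anatomical variability: only the classifier head of a pretrained acoustic model is fine-tuned on a few user-recorded calibration samples. The `classifier` attribute, step count, and learning rate are assumptions, not the published calibration procedure.

```python
import torch
import torch.nn as nn

def personalize(model, calib_features, calib_labels, steps=20, lr=1e-3):
    """Quickly adapt a pretrained yawn model to one user on-device.

    calib_features: (N, T, F) float tensor of per-frame Doppler features.
    calib_labels:   (N,) long tensor, 1 = yawn, 0 = non-yawn.
    Assumes `model.classifier` is the final nn.Linear and model() returns logits.
    """
    for p in model.parameters():
        p.requires_grad = False                  # freeze the backbone
    for p in model.classifier.parameters():
        p.requires_grad = True                   # adapt only the head

    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(steps):                       # a few quick gradient steps
        optimizer.zero_grad()
        loss = loss_fn(model(calib_features), calib_labels)
        loss.backward()
        optimizer.step()
    model.eval()
    return model
```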
7. Applications and Future Directions
On-device yawning monitoring is central to several application domains:
- Driver and Operator Monitoring: Detection of fatigue by tracking yawn events with millisecond latency, integrated into DMS (Driver Monitoring Systems) and vehicle safety platforms (Mujtaba et al., 12 Dec 2025, Kielty et al., 2023, Faraji et al., 2021, Xie et al., 30 Mar 2025).
- Elderly Care and Ambient Health: Wearable or ambient camera-based systems extract drowsiness-related time series for unobtrusive monitoring of elderly individuals (Bačić et al., 2020).
- Cross-modal Monitoring: Combination of video, event, and acoustic modalities offers robustness across operating environments and anatomical variability.
- Continual and Personalized Learning: Future directions include on-device model adaptation via federated learning or rapid fine-tuning, sensor and network co-design, and multi-modal fatigue assessment pipelines combining yawn, blink, and pose events (Mujtaba et al., 12 Dec 2025, Kielty et al., 2023, Xie et al., 30 Mar 2025).
Across all these areas, the trend is toward higher-precision frame-level annotation, more compact and hardware-efficient models, and tighter integration with edge AI accelerators—enabling safe, private, and effective monitoring at the point of use, without reliance on cloud resources.