UT OSANet: Multimodal OSA Diagnosis
- The paper presents a hybrid U-Net and transformer architecture that enables high-resolution, event-level detection of OSA using multiple physiological signals.
- It introduces random masked modality combination training to enhance robustness against missing data, ensuring reliability in diverse clinical and home settings.
- UT OSANet achieves sensitivities up to 0.93 and macro F₁ scores up to 0.95 across five independent polysomnography datasets.
UT OSANet is a multimodal deep learning framework developed for high-resolution, event-level diagnosis and classification of obstructive sleep apnea (OSA). It is designed to simultaneously process multiple physiological signal modalities—including electroencephalography (EEG), airflow, and oxygen saturation (SpO₂)—and comprehensively identify respiratory events, including apnea, hypopnea, oxygen desaturation, and arousal. The system incorporates random masked modality combination training to achieve robustness to missing data, enabling deployment across diverse home, clinical, and research settings. Validation of UT OSANet was performed using 9,021 polysomnography (PSG) recordings from five independent datasets, indicating sensitivities up to 0.93 and macro F₁ scores up to 0.95 depending on the scenario (Wang et al., 20 Nov 2025).
1. Model Architecture
UT OSANet’s architecture consists of a cross-modality U-Net backbone integrated with a transformer encoder, culminating in a fusion and event-proposal prediction head.
- Input Representation: The model processes segments $X \in \mathbb{R}^{M \times T}$, where $M$ is the number of modalities (EEG, airflow, SpO₂; $M = 3$) and $T$ is the number of time samples per segment. A binary mask $b \in \{0,1\}^M$ indicates which modality channels are present in the input.
- U-Net Backbone: Each modality branch passes through stacked 1D convolutional blocks (Conv1d → BatchNorm → ReLU → MaxPool). Feature maps from the different modalities are concatenated and mixed by convolution at each level. Skip connections retain fine temporal details for downstream reconstruction. The decoder mirrors the encoder, generating a fused temporal feature tensor $V_{\mathrm{U\mbox{-}Net}}\in\mathbb{R}^{D\times T'}$.
- Transformer Encoder: The U-Net output with positional encoding is supplied to a multi-head self-attention transformer stack, enabling modeling of long-range temporal dependencies and cross-modal event couplings. The transformer output is $V_{\mathrm{Trans}} \in \mathbb{R}^{D \times T'}$.
- Fusion & Prediction Head: The concatenation $V_{\mathrm{cat}} = [V_{\mathrm{U\mbox{-}Net}}; V_{\mathrm{Trans}}]$ is linearly projected and passed through a sigmoid function to produce class probability maps $P \in [0,1]^{K \times T'}$. Class-specific temporal windows where $P_k(t)$ exceeds a decision threshold are identified as event proposals, and Non-Maximum Suppression removes redundant detections.
The network output is a set of predicted event windows giving onset, duration, and class probabilities for {apnea, hypopnea, desaturation, arousal}.
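The shape flow through the backbone, transformer, and fusion head can be sketched with lightweight NumPy stand-ins. This is an illustrative toy, not the trained model: random weights replace the learned Conv1d/attention blocks, a single pooling level stands in for the full encoder–decoder, and the dimensions (`T = 256`, `D = 16`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

M, T, D, C = 3, 256, 16, 4   # modalities, samples, feature dim, event classes

def encoder(x):
    """Stand-in for one per-modality Conv1d -> BN -> ReLU -> MaxPool stack:
    a random linear lift to D channels followed by stride-4 max-pooling."""
    W = rng.normal(size=(D, 1)) * 0.1
    h = np.maximum(W @ x[None, :], 0.0)      # (D, T), ReLU
    return h.reshape(D, -1, 4).max(axis=2)   # (D, T/4), max-pool

x = rng.normal(size=(M, T))                  # one 3-modality segment
feats = [encoder(x[m]) for m in range(M)]
v_unet = np.mean(feats, axis=0)              # (D, T'), cross-modality mixing

# Single-head self-attention over time as a stand-in for the transformer stack.
Q = Kmat = V = v_unet.T                      # (T', D)
attn = softmax(Q @ Kmat.T / np.sqrt(D), axis=-1)
v_trans = (attn @ V).T                       # (D, T')

# Fusion head: concatenate, project to C classes, sigmoid probability maps.
v_cat = np.concatenate([v_unet, v_trans], axis=0)   # (2D, T')
W_out = rng.normal(size=(C, 2 * D)) * 0.1
probs = sigmoid(W_out @ v_cat)                      # (C, T') = (4, 64)
```

Thresholding `probs` per class and suppressing overlaps would then yield the event proposals described above.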
2. Random Masked Modality Combination Training
UT OSANet is trained using a random masked modality combination strategy ("modality dropout"), enhancing its robustness to missing or degraded channels in practical deployments.
- Mask Generation: For each modality $m$, a Bernoulli mask $b_m \sim \mathrm{Bernoulli}(1 - p_m)$ is sampled independently at each batch, where $p_m$ is the drop probability for that modality.
- Application: Input signals are masked as $\tilde{X}_m = b_m \cdot X_m$, effectively zeroing out entire modality streams as determined by $b$.
- Training Rationale: This procedure forces the model to learn feature representations that remain discriminative under arbitrary missing signal conditions, allowing the model to operate with varying sets of available modalities (e.g., EEG-only in home environments).
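A minimal sketch of this modality-dropout masking follows. The drop probability `p_drop = 0.3` is illustrative (the per-modality sampling rates are not specified here), and the resampling guard that keeps at least one modality alive is an added safeguard, not a stated part of the method.

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_modalities(x, p_drop=0.3, rng=rng):
    """Zero out whole modality streams, each dropped with probability
    p_drop; resample until at least one modality survives."""
    while True:
        b = rng.binomial(1, 1.0 - p_drop, size=x.shape[0])  # Bernoulli keep-mask
        if b.any():
            break
    return x * b[:, None], b

x = rng.normal(size=(3, 100))        # (modalities, time): EEG, airflow, SpO2
x_masked, b = mask_modalities(x)     # each row is either intact or all-zero
```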
3. Loss Functions and Optimization
Learning is driven by a combination of class-weighted binary cross-entropy and a temporal continuity regularizer:
- Weighted Binary Cross-Entropy: $\mathcal{L}_{\mathrm{BCE}} = -\sum_{k=1}^{K} \frac{w_k}{T'} \sum_{t=1}^{T'} \left[ y_{k,t}\log p_{k,t} + (1 - y_{k,t})\log(1 - p_{k,t}) \right]$, where $w_k$ is the class weight, $y_{k,t}$ the ground-truth label, and $p_{k,t}$ the predicted probability.
- Temporal Smoothing Regularizer: $\mathcal{L}_{\mathrm{smooth}} = \sum_{k=1}^{K} \sum_{t=1}^{T'-1} \left( p_{k,t+1} - p_{k,t} \right)^2$, penalizing abrupt frame-to-frame changes in the probability maps.
- Total Objective: $\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \lambda\, \mathcal{L}_{\mathrm{smooth}}$, where $\lambda$ balances the two terms.
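The combined objective can be sketched in NumPy as follows; the class weights `w` and the balance term `lam` are assumptions for illustration, not the paper's values.

```python
import numpy as np

def weighted_bce(p, y, w, eps=1e-7):
    """Class-weighted binary cross-entropy over (K, T') probability maps."""
    p = np.clip(p, eps, 1 - eps)
    per_class = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean(axis=1)
    return float((w * per_class).sum())

def temporal_smoothness(p):
    """Penalize abrupt frame-to-frame jumps in the probability maps."""
    return float((np.diff(p, axis=1) ** 2).sum())

def total_loss(p, y, w, lam=0.1):
    """L = L_BCE + lam * L_smooth (lam is an assumed balance term)."""
    return weighted_bce(p, y, w) + lam * temporal_smoothness(p)

p = np.full((2, 5), 0.5)                 # flat 0.5 predictions, 2 classes
y = np.stack([np.ones(5), np.zeros(5)])  # class 0 all-positive, class 1 all-negative
w = np.array([1.0, 2.0])                 # assumed class weights
loss = total_loss(p, y, w)               # = 3 ln 2; smoothness term is zero
```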
Optimization uses the Adam optimizer with a cosine-annealed learning-rate schedule over 50 epochs, together with weight decay, gradient clipping (max-norm = 5), dropout (0.2–0.3), batch normalization, and residual links.
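The cosine-annealing schedule has the standard closed form below; the initial and final learning rates passed in are placeholders, since the specific values are not given here.

```python
import math

def cosine_lr(epoch, lr_init, lr_min, total_epochs=50):
    """Cosine-annealed learning rate: lr_init at epoch 0, lr_min at the end."""
    t = min(epoch, total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * t))

# Monotone decay from the (placeholder) initial to the final rate.
schedule = [cosine_lr(e, 1e-3, 1e-5) for e in range(51)]
```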
4. Datasets, Preprocessing, and Experimental Protocol
UT OSANet was trained and evaluated on 9,021 PSGs from five independent cohorts:
| Dataset | N | Application |
|---|---|---|
| MROS | 2,900 | Multi-scenario eval/training |
| SHHS | 4,940 | Multi-scenario eval/training |
| MESA | 940 | Multi-scenario eval/training |
| CFS | 660 | Multi-scenario eval/training |
| HOME-PAP | 131 | External test set |
Signal preprocessing included 100 Hz resampling, Z-score normalization, EEG bandpass (0.5–45 Hz), and removal of noisy recordings. For MROS/SHHS/MESA/CFS: 75% was used for training, 15% for validation, and 10% for testing. HOME-PAP was reserved strictly as a held-out test cohort.
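Two of the preprocessing steps (100 Hz resampling, Z-score normalization) can be sketched as below. Linear interpolation stands in for a proper anti-aliased resampler, and the 0.5–45 Hz EEG bandpass is omitted since it needs a filter-design library.

```python
import numpy as np

def resample_to(x, fs_in, fs_out=100.0):
    """Resample to fs_out Hz by linear interpolation (a stand-in for a
    proper anti-aliased resampler)."""
    n_out = int(round(len(x) * fs_out / fs_in))
    t_in = np.arange(len(x)) / fs_in
    t_out = np.arange(n_out) / fs_out
    return np.interp(t_out, t_in, x)

def zscore(x, eps=1e-8):
    """Per-recording Z-score normalization."""
    return (x - x.mean()) / (x.std() + eps)

raw = np.sin(2 * np.pi * np.arange(0, 10, 1 / 128.0))  # 10 s of 1 Hz sine @ 128 Hz
sig = zscore(resample_to(raw, 128.0))                  # 1000 samples @ 100 Hz
```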
5. Event-Level Prediction and Evaluation Metrics
The system performs event-level annotation using sliding temporal windows and matches predictions against ground truth using an intersection-over-union (IoU) criterion, with an IoU threshold applied during training and Non-Maximum Suppression applied at inference. Evaluation metrics include:
- Sensitivity (Recall): $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$
- Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$
- F₁ Score: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}$
- Macro-F₁: $\mathrm{Macro\mbox{-}F}_1 = \frac{1}{K}\sum_{k=1}^K F_{1,k}$
t-SNE visualizations show clustering of detected event epochs and normal epochs, with some overlap in related event types (e.g., apnea–desaturation).
6. Scenario-Based Performance
UT OSANet was validated under three practical scenarios: home-based screening, multi-modal clinical assessment, and research-grade event detection.
- Home-Based Moderate-to-Severe Screening (EEG-only, AHI<15 vs. ≥15):
Accuracies were 97% (MROS), 88% (SHHS), 94% (MESA), 88% (CFS), 77% (HOME-PAP). Macro-F₁ ranged from 0.76 to 0.97.
- Clinical Severity Assessment (EEG + airflow + SpO₂, 4-class AHI):
Accuracies: 90% (MROS), 84% (SHHS), 81% (MESA), 100% for severe (CFS), 76% (HOME-PAP). Macro-F₁ up to 0.90. AHI estimation up to 0.883 (MROS).
- Research-Grade Event Detection (All modalities, event-wise):
At IoU = 0.2, per-class F₁ was approximately 0.82 (apnea), 0.86 (hypopnea), 0.82 (desaturation), and 0.85 (arousal) across datasets.
Per-class event detection sensitivity spanned 0.78–0.88 (apnea), 0.78–0.89 (hypopnea), 0.79–0.89 (desaturation), and 0.81–0.90 (arousal).
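The AHI-based severity classes used in the clinical scenario follow directly from detected event counts. The cutoffs below are the standard clinical ones (<5 normal, 5–15 mild, 15–30 moderate, ≥30 severe); the paper's exact class boundaries are assumed to match.

```python
def ahi(n_apneas, n_hypopneas, total_sleep_hours):
    """Apnea-Hypopnea Index: respiratory events per hour of sleep."""
    return (n_apneas + n_hypopneas) / total_sleep_hours

def severity(ahi_value):
    """Standard clinical 4-class cutoffs: <5 normal, 5-15 mild,
    15-30 moderate, >=30 severe."""
    if ahi_value < 5:
        return "normal"
    if ahi_value < 15:
        return "mild"
    if ahi_value < 30:
        return "moderate"
    return "severe"

label = severity(ahi(60, 80, 7.0))   # AHI = 20.0 -> "moderate"
```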
7. Implications and Context
UT OSANet embodies a hybrid U-Net plus transformer paradigm, leveraging random modality masking to enable robust performance in settings with variable data quality and sensor availability. Its event-level annotation granularity supports both research and applied clinical workflows, and its flexible multimodal implementation allows deployment across home-based screening environments and hospital-grade PSG systems. The model’s capacity to learn cross-modal representations and operate under partial channel information demonstrates substantial utility for real-world OSA diagnosis pipelines, as indicated by its validated performance across five large-scale, heterogeneous datasets (Wang et al., 20 Nov 2025).
A plausible implication is that event-level annotation frameworks such as UT OSANet could facilitate enhanced mechanistic studies of the relationship between discrete respiratory events and longer-term health outcomes in sleep medicine.