
UT OSANet: Multimodal OSA Diagnosis

Updated 27 November 2025
  • The paper presents a hybrid U-Net and transformer architecture that enables high-resolution, event-level detection of OSA using multiple physiological signals.
  • It introduces random masked modality combination training to enhance robustness against missing data, ensuring reliability in diverse clinical and home settings.
  • UT OSANet achieves impressive performance with sensitivities up to 0.93 and macro F₁ scores up to 0.95 across five independent polysomnography datasets.

UT OSANet is a multimodal deep learning framework developed for high-resolution, event-level diagnosis and classification of obstructive sleep apnea (OSA). It is designed to simultaneously process multiple physiological signal modalities, including electroencephalography (EEG), airflow, and oxygen saturation (SpO₂), and to comprehensively identify respiratory events, including apnea, hypopnea, oxygen desaturation, and arousal. The system incorporates random masked modality combination training to achieve robustness to missing data, enabling deployment across diverse home, clinical, and research settings. UT OSANet was validated on 9,021 polysomnography (PSG) recordings from five independent datasets, yielding sensitivities up to 0.93 and macro F₁ scores up to 0.95 depending on the scenario (Wang et al., 20 Nov 2025).

1. Model Architecture

UT OSANet’s architecture consists of a cross-modality U-Net backbone integrated with a transformer encoder, culminating in a fusion and event-proposal prediction head.

  • Input Representation: The model processes segments $\mathbf{x} \in \mathbb{R}^{C\times T}$, where $C$ is the number of modalities (EEG, airflow, SpO₂) and $T$ is the number of time samples (here, $T = 250\,\mathrm{s} \times 100\,\mathrm{Hz}$). A binary mask $\mathbf{m} \in \{0,1\}^C$ indicates which modality channels are present in the input.
  • U-Net Backbone: Each modality branch passes through stacked 1D convolutional blocks (Conv1d → BatchNorm → ReLU → MaxPool). Feature maps from the different modalities are concatenated and mixed by a $1\times 1$ convolution at each level. Skip connections retain fine temporal details for downstream reconstruction. The decoder mirrors the encoder, generating a fused temporal feature tensor $V_{\mathrm{U\mbox{-}Net}} \in \mathbb{R}^{D\times T'}$.
  • Transformer Encoder: The U-Net output with positional encoding is supplied to a multi-head self-attention transformer stack, enabling modeling of long-range temporal dependencies and cross-modal event couplings. The transformer output is $V_{\mathrm{Trans}} \in \mathbb{R}^{D'\times T'}$.
  • Fusion & Prediction Head: The concatenation $V_{\mathrm{cat}} = [V_{\mathrm{U\mbox{-}Net}}; V_{\mathrm{Trans}}]$ is linearly projected and passed through a sigmoid function to produce class probability maps $\mathbf{y} \in \mathbb{R}^{4\times T'}$. Class-specific temporal windows where $y_k(t) > \tau$ are identified as event proposals, and Non-Maximum Suppression removes redundant detections.

The network output is a set of predicted event windows $(p_j, y_j)$ giving onset, duration, and class probabilities for {apnea, hypopnea, desaturation, arousal}.
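
The following is a minimal PyTorch sketch of this hybrid pipeline. Layer widths, depths, and transformer hyperparameters are illustrative assumptions, and the skip connections, decoder, and positional encoding are omitted; it is not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Conv1d -> BatchNorm -> ReLU -> MaxPool, as in the encoder description
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=7, padding=3),
        nn.BatchNorm1d(out_ch), nn.ReLU(), nn.MaxPool1d(4))

class UTOSANetSketch(nn.Module):
    """Simplified U-Net + transformer hybrid (illustrative sizes only)."""

    def __init__(self, n_modalities=3, d_model=64, n_classes=4):
        super().__init__()
        # One encoder branch per modality (EEG, airflow, SpO2)
        self.branches = nn.ModuleList([
            nn.Sequential(conv_block(1, 32), conv_block(32, d_model))
            for _ in range(n_modalities)])
        # 1x1 convolution mixes the concatenated modality feature maps
        self.mix = nn.Conv1d(n_modalities * d_model, d_model, kernel_size=1)
        # Transformer encoder captures long-range temporal dependencies
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Fusion head: concatenate U-Net and transformer features, project, sigmoid
        self.head = nn.Conv1d(2 * d_model, n_classes, kernel_size=1)

    def forward(self, x, mask):
        # x: (B, C, T) raw signals; mask: (B, C) binary modality-availability mask
        x = x * mask.unsqueeze(-1)                   # zero out absent modalities
        feats = [b(x[:, c:c + 1]) for c, b in enumerate(self.branches)]
        v_unet = self.mix(torch.cat(feats, dim=1))   # (B, D, T')
        v_trans = self.transformer(
            v_unet.transpose(1, 2)).transpose(1, 2)  # (B, D, T')
        v_cat = torch.cat([v_unet, v_trans], dim=1)  # (B, 2D, T')
        return torch.sigmoid(self.head(v_cat))       # (B, K, T') probability maps

# Example: 250 s at 100 Hz, three modalities, SpO2 missing in the first segment
x = torch.randn(2, 3, 25_000)
mask = torch.tensor([[1., 1., 0.], [1., 1., 1.]])
probs = UTOSANetSketch()(x, mask)   # shape (2, 4, 1562)
```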

2. Random Masked Modality Combination Training

UT OSANet is trained using a random masked modality combination strategy ("modality dropout"), enhancing its robustness to missing or degraded channels in practical deployments.

  • Mask Generation: For each modality $c$, a Bernoulli mask is sampled $m_c \sim \mathrm{Bernoulli}(1-\rho)$, with $\rho = 0.3$, independently for each batch.
  • Application: Input signals are masked as $\tilde{\mathbf{x}} = \mathbf{m} \otimes \mathbf{x}$, effectively zeroing out entire modality streams as determined by $\mathbf{m}$.
  • Training Rationale: This procedure forces the model to learn feature representations that remain discriminative under arbitrary missing signal conditions, allowing the model to operate with varying sets of available modalities (e.g., EEG-only in home environments).
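
A minimal sketch of this masking step, assuming batched (B, C, T) tensors; here the mask is drawn per sample rather than once per batch, a minor simplification:

```python
import torch

def random_modality_mask(x, rho=0.3):
    """Randomly zero out entire modality channels ("modality dropout").

    x: (B, C, T) batch of multimodal segments (EEG, airflow, SpO2).
    Returns the masked batch and the (B, C) binary mask that was applied.
    """
    # Each modality is kept with probability 1 - rho
    keep = torch.bernoulli(torch.full(x.shape[:2], 1.0 - rho, device=x.device))
    return x * keep.unsqueeze(-1), keep

# Example: mask a training batch before the forward pass
x = torch.randn(8, 3, 25_000)
x_masked, mask = random_modality_mask(x)
```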

3. Loss Functions and Optimization

Learning is driven by a combination of class-weighted binary cross-entropy and a temporal continuity regularizer:

  • Weighted Binary Cross-Entropy:

$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} w_k\left[y_{i,k}\log\sigma(\hat y_{i,k}) + (1-y_{i,k})\log\bigl(1-\sigma(\hat y_{i,k})\bigr)\right]$

  • Temporal Smoothing Regularizer:

$\mathcal{L}_{\mathrm{smooth}} = \frac{1}{N(T'-1)}\sum_{i=1}^{N} \sum_{t=1}^{T'-1} \left\| \hat y_{i,t+1} - \hat y_{i,t} \right\|_1$

  • Total Objective:

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{BCE}} + \lambda\,\mathcal{L}_{\mathrm{smooth}}, \quad \lambda = 0.1$
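
A sketch of this objective in PyTorch, assuming the BCE term is additionally averaged over the time axis and the smoothness term over the class axis; the paper's exact reduction is not specified here:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, class_weights, lam=0.1):
    """Class-weighted BCE plus temporal smoothing regularizer.

    logits:        (N, K, T') raw prediction-head outputs (pre-sigmoid)
    targets:       (N, K, T') binary event labels
    class_weights: (K,) per-class weights w_k
    """
    w = class_weights.view(1, -1, 1)
    # Weighted binary cross-entropy, averaged over batch, class, and time
    bce = F.binary_cross_entropy_with_logits(logits, targets,
                                             weight=w, reduction="mean")
    # L1 difference between consecutive time steps penalizes jittery maps
    probs = torch.sigmoid(logits)
    smooth = (probs[..., 1:] - probs[..., :-1]).abs().mean()
    return bce + lam * smooth

# Example call with random tensors and illustrative class weights
loss = total_loss(torch.randn(4, 4, 1562),
                  torch.randint(0, 2, (4, 4, 1562)).float(),
                  torch.tensor([1.0, 2.0, 1.5, 1.5]))
```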

Optimization uses the Adam optimizer (initial learning rate $1\mathrm{e}{-4}$, cosine-annealed to $1\mathrm{e}{-6}$ over 50 epochs, $\beta_1 = 0.9$, $\beta_2 = 0.999$), with weight decay $1\mathrm{e}{-5}$, gradient clipping (max-norm 5), dropout (0.2–0.3), batch normalization, and residual links.
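
The corresponding PyTorch setup, sketched with a stand-in model and dummy data in place of UT OSANet and the PSG batches:

```python
import torch

model = torch.nn.Linear(10, 4)   # stand-in for UT OSANet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-5)
# Cosine annealing from 1e-4 down to 1e-6 over 50 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50,
                                                       eta_min=1e-6)

x, y = torch.randn(32, 10), torch.rand(32, 4)        # dummy batch
for epoch in range(50):
    loss = torch.nn.functional.binary_cross_entropy_with_logits(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # max-norm 5
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```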

4. Datasets, Preprocessing, and Experimental Protocol

UT OSANet was trained and evaluated on 9,021 PSGs from five independent cohorts:

Dataset    N (PSGs)   Application
MROS       2,900      Multi-scenario evaluation/training
SHHS       4,940      Multi-scenario evaluation/training
MESA       940        Multi-scenario evaluation/training
CFS        660        Multi-scenario evaluation/training
HOME-PAP   131        External test set

Signal preprocessing included resampling to 100 Hz, Z-score normalization, EEG bandpass filtering (0.5–45 Hz), and removal of noisy recordings. For MROS, SHHS, MESA, and CFS, 75% of recordings were used for training, 15% for validation, and 10% for testing. HOME-PAP was reserved strictly as a held-out test cohort.
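
A sketch of this per-channel preprocessing using NumPy/SciPy; the filter order and resampling routine are illustrative choices rather than details from the paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def preprocess_channel(sig, fs_in, is_eeg=False, fs_out=100):
    """Resample to 100 Hz, optionally bandpass EEG at 0.5-45 Hz, then z-score."""
    sig = resample_poly(sig, fs_out, fs_in)            # resample to 100 Hz
    if is_eeg:
        b, a = butter(4, [0.5, 45.0], btype="bandpass", fs=fs_out)
        sig = filtfilt(b, a, sig)                      # 0.5-45 Hz bandpass
    return (sig - sig.mean()) / (sig.std() + 1e-8)     # per-recording z-score

# Example: one minute of 256 Hz EEG resampled to 100 Hz
eeg = preprocess_channel(np.random.randn(256 * 60), fs_in=256, is_eeg=True)
```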

5. Event-Level Prediction and Evaluation Metrics

The system performs event-level annotation using sliding temporal windows and matches these against ground truth using an intersection-over-union (IoU) threshold ($\mathrm{IoU} \ge 0.2$ for training; Non-Maximum Suppression with $\mathrm{IoU} > 0.5$ at inference). Evaluation metrics include:

  • Sensitivity (Recall):

$\mathrm{Sens}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k}$

  • Precision:

$\mathrm{Prec}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k}$

  • F₁ Score:

$F_{1,k} = 2\,\frac{\mathrm{Prec}_k \times \mathrm{Sens}_k}{\mathrm{Prec}_k + \mathrm{Sens}_k}$

  • Macro-F₁:

$\mathrm{Macro\mbox{-}F}_1 = \frac{1}{K}\sum_{k=1}^K F_{1,k}$
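
A sketch combining the IoU-based matching described above with these metrics, for a single event class; greedy one-to-one matching is an assumption, as the paper's exact matching rule is not reproduced here:

```python
def iou_1d(a, b):
    """Temporal IoU of two (onset, offset) windows in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def event_metrics(pred, truth, iou_thresh=0.2):
    """Sensitivity, precision, and F1 for one event class at a given IoU."""
    matched, tp = set(), 0
    for p in pred:
        hits = [j for j, t in enumerate(truth)
                if j not in matched and iou_1d(p, t) >= iou_thresh]
        if hits:                      # count a true positive, consume the match
            matched.add(hits[0])
            tp += 1
    fp, fn = len(pred) - tp, len(truth) - tp
    sens = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return sens, prec, f1

# Macro-F1 is the unweighted mean of per-class F1 over the K = 4 event classes
toy = {"apnea":        ([(10, 32)], [(11, 30)]),
       "hypopnea":     ([(40, 55)], [(41, 56)]),
       "desaturation": ([(60, 75)], [(62, 74)]),
       "arousal":      ([(90, 98)], [(89, 99)])}
macro_f1 = sum(event_metrics(p, t)[2] for p, t in toy.values()) / len(toy)
```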

t-SNE visualizations show that detected event epochs and normal epochs form distinct clusters, with some overlap between related event types (e.g., apnea–desaturation).

6. Scenario-Based Performance

UT OSANet was validated under three practical scenarios: home-based screening, multi-modal clinical assessment, and research-grade event detection.

  • Home-Based Moderate-to-Severe Screening (EEG-only, AHI<15 vs. ≥15):

Accuracies were 97% (MROS), 88% (SHHS), 94% (MESA), 88% (CFS), 77% (HOME-PAP). Macro-F₁ ranged from 0.76 to 0.97.

  • Clinical Severity Assessment (EEG + airflow + SpO₂, 4-class AHI):

Accuracies: 90% (MROS), 84% (SHHS), 81% (MESA), 100% for severe (CFS), 76% (HOME-PAP). Macro-F₁ up to 0.90. AHI estimation $R^2$ up to 0.883 (MROS).

  • Research-Grade Event Detection (All modalities, event-wise):

At $\mathrm{IoU} = 0.2$, F₁ for apnea ≈ 0.82, hypopnea ≈ 0.86, desaturation ≈ 0.82, arousal ≈ 0.85. Overall F₁ scores spanned $[0.83,\,0.87]$ across datasets.

Per-class event detection sensitivity spanned: apnea (0.78–0.88), hypopnea (0.78–0.89), desaturation (0.79–0.89), arousal (0.81–0.90).
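
The screening and severity scenarios reduce detected events to an apnea–hypopnea index (AHI). A sketch of that reduction, assuming the conventional 5/15/30 events-per-hour cut-offs; the paper's exact binning is not stated here:

```python
def ahi_from_events(events, total_sleep_hours):
    """Apnea-hypopnea index: (apneas + hypopneas) per hour of sleep."""
    n = sum(1 for e in events if e["type"] in ("apnea", "hypopnea"))
    return n / total_sleep_hours

def severity_class(ahi):
    """Conventional 4-class severity bins (an assumption, not from the paper)."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"

# Example: 63 detected apneas/hypopneas over 7 h of sleep -> AHI = 9.0 -> "mild"
events = [{"type": "apnea"}] * 30 + [{"type": "hypopnea"}] * 33
print(severity_class(ahi_from_events(events, total_sleep_hours=7.0)))
```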

7. Implications and Context

UT OSANet embodies a hybrid U-Net plus transformer paradigm, leveraging random modality masking to enable robust performance in settings with variable data quality and sensor availability. Its event-level annotation granularity supports both research and applied clinical workflows, and its flexible multimodal implementation allows deployment across home-based screening environments and hospital-grade PSG systems. The model’s capacity to learn cross-modal representations and operate under partial channel information demonstrates substantial utility for real-world OSA diagnosis pipelines, as indicated by its validated performance across five large-scale, heterogeneous datasets (Wang et al., 20 Nov 2025).

A plausible implication is that event-level annotation frameworks such as UT OSANet could facilitate enhanced mechanistic studies of the relationship between discrete respiratory events and longer-term health outcomes in sleep medicine.
