
Stanford Sleep Bench: SSRL for PSG Data

Updated 17 December 2025
  • Stanford Sleep Bench is a large-scale, openly shared resource with over 17,000 full-night PSG recordings that supports systematic evaluation of self-supervised representation learning methods.
  • The platform covers diverse downstream tasks—from sleep staging to disease prediction—enabling rigorous benchmarking and reproducibility in clinical sleep research.
  • It benchmarks contrastive and generative (masked and denoising autoencoder) pretraining approaches combined with temporal attention pooling, with contrastive methods performing notably well in age estimation and mortality prediction.

Stanford Sleep Bench is a large-scale standardized resource and systematic evaluation platform for developing and benchmarking self-supervised representation learning (SSRL) methods on polysomnography (PSG) data. Polysomnography, the gold standard in sleep analysis, is inherently multimodal, generating high-volume time-series clinical data well-suited for foundation model pre-training. Stanford Sleep Bench addresses two principal barriers in the field: the lack of an openly shared, task-diverse dataset and the absence of a comprehensive evaluation of SSRL approaches across canonical and clinically significant sleep-related downstream tasks. It comprises over 17,000 full-night PSG recordings and includes a suite of benchmark tasks—ranging from sleep staging and apnea diagnosis to age estimation and disease/mortality prediction—enabling systematic comparison of pretraining paradigms and promoting reproducibility. The platform is accompanied by open-source pretrained weights, pipelines, benchmarks, and evaluation code (Kjaer et al., 10 Dec 2025).

1. Dataset Composition and Preprocessing

The Stanford Sleep Bench dataset consists of 17,467 full-night PSG recordings from 12,794 unique subjects, aggregating approximately 163,650 hours of multimodal physiological signals. The randomized subject-level split comprises 12,952 recordings for pre-training/training (≈121,365 hours), 1,500 for validation (≈13,979 hours), and 3,015 for testing (≈28,306 hours).

Patient demographics are broad, with an overall mean age of 43.3 ± 19.9 years and BMI of 27.9 ± 7.6 kg/m². Sex distribution is 60% male, 40% female. The sample is predominantly White (≈60%), with Asian, Black, Hispanic/Latino, and other groups also represented.

The PSG data captures 16 channels across four modalities:

  • Brain activity (BAS): 8 channels (C3-M2, C4-M1, O1-M2, O2-M1, FP1-M2, FP2-M1, E1-M2, E2-M1)
  • Respiratory (RESP): 5 channels (thoracic effort, abdominal effort, nasal airflow, oral airflow, oxygen saturation—SpO₂)
  • Cardiac (EKG): 1 bipolar lead
  • Muscle (EMG): 2 channels (chin and leg EMG)

Signal preprocessing includes uniform resampling of all channels to 128 Hz, using a zero-phase Butterworth low-pass filter at the target Nyquist frequency (64 Hz) to prevent aliasing.
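The described pipeline maps directly onto standard SciPy primitives. The following is a minimal sketch; the eighth-order filter and polyphase resampling are assumptions, while the 128 Hz target and the zero-phase Butterworth anti-alias filter come from the description above:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

TARGET_FS = 128  # Hz, the benchmark's common sampling rate

def resample_channel(x: np.ndarray, orig_fs: int, target_fs: int = TARGET_FS) -> np.ndarray:
    """Resample one channel to target_fs with zero-phase anti-alias filtering."""
    if orig_fs == target_fs:
        return x
    if orig_fs > target_fs:
        # Zero-phase (forward-backward) Butterworth low-pass at the target
        # Nyquist frequency (64 Hz) to suppress aliasing before decimation.
        sos = butter(8, target_fs / 2, btype="low", fs=orig_fs, output="sos")
        x = sosfiltfilt(sos, x)
    return resample_poly(x, target_fs, orig_fs)

# Example: a synthetic 200 Hz EKG trace resampled to 128 Hz.
ekg = np.random.randn(200 * 30)              # 30 s of signal at 200 Hz
assert resample_channel(ekg, 200).shape == (128 * 30,)
```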

2. Downstream Benchmark Tasks

Stanford Sleep Bench implements a diverse evaluation suite encompassing both canonical and complex clinical prediction tasks:

  1. Sleep-stage classification: Five-class epoch-level annotation (Wake, N1, N2, N3, REM) over 5-second epochs.
  2. Sleep apnea diagnosis: Binary (apnea/no apnea; AHI ≥ 15 events/hour) at the subject level.
  3. Age estimation: Regression on chronological age.
  4. Mortality prediction: Subject-level, all-cause, time-to-event modeling.
  5. Clinical disease prediction: Time-to-event windowed prediction for twelve major diseases using Cox proportional hazards modeling: myocardial infarction, heart failure, atrial fibrillation/flutter, general atherosclerosis, angina, hypertension, hypotension, pulmonary heart disease, ischemic heart disease, chronic kidney disease, type 2 diabetes, and dementia.

3. SSRL Methods and Pre-training Paradigm

All SSRL approaches share a "patch + transformer" backbone: 5-second modality-specific patches are embedded by lightweight CNNs into 128-dimensional feature vectors, which a temporal transformer then processes; the final modality-level embeddings are concatenated into a 512-dimensional representation.
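A minimal PyTorch sketch of such a backbone follows. The CNN depth, kernel sizes, transformer depth, and head count are illustrative assumptions; only the 5-second patches, 128-dimensional per-modality embeddings, temporal transformer, and 512-dimensional concatenation come from the description above:

```python
import torch
import torch.nn as nn

PATCH_LEN = 5 * 128   # 5-second patches at 128 Hz
EMB_DIM = 128         # per-modality embedding size

class PatchEmbedder(nn.Module):
    """Lightweight 1-D CNN mapping one modality's 5 s patch to a 128-d vector."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, stride=2, padding=3), nn.GELU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3), nn.GELU(),
            nn.Conv1d(64, EMB_DIM, kernel_size=7, stride=2, padding=3), nn.GELU(),
            nn.AdaptiveAvgPool1d(1),
        )
    def forward(self, x):                      # x: (B*T, C, PATCH_LEN)
        return self.net(x).squeeze(-1)         # -> (B*T, EMB_DIM)

class ModalityBackbone(nn.Module):
    """Patch CNN followed by a temporal transformer over the patch sequence."""
    def __init__(self, in_channels: int, depth: int = 4, heads: int = 4):
        super().__init__()
        self.embed = PatchEmbedder(in_channels)
        layer = nn.TransformerEncoderLayer(EMB_DIM, heads, dim_feedforward=4 * EMB_DIM,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, depth)
    def forward(self, x):                      # x: (B, T, C, PATCH_LEN)
        B, T, C, L = x.shape
        tokens = self.embed(x.reshape(B * T, C, L)).reshape(B, T, EMB_DIM)
        return self.temporal(tokens)           # -> (B, T, EMB_DIM)

# Four modality backbones (BAS: 8 ch, RESP: 5, EKG: 1, EMG: 2); their outputs
# are concatenated along the feature axis into a 4 * 128 = 512-d representation.
backbones = nn.ModuleList(ModalityBackbone(c) for c in (8, 5, 1, 2))
x = [torch.randn(2, 12, c, PATCH_LEN) for c in (8, 5, 1, 2)]  # 12 patches = 1 min
z = torch.cat([bb(xi) for bb, xi in zip(backbones, x)], dim=-1)  # (2, 12, 512)
```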

Baseline Representations

  • Time-domain baseline: A single channel per modality is downsampled to 25.6 Hz (128 samples per 5-second patch); the four modalities are concatenated to 512 dimensions.
  • Frequency-domain baseline: Discrete Fourier transform (DFT, excluding the Nyquist bin), with within-modality channel averaging and extraction of 64 uniformly spaced amplitude+phase pairs (128 values per modality), concatenated to form 512-dimensional vectors. Both baselines are sketched below.
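A minimal NumPy sketch of both baselines; the exact decimation scheme and frequency-bin selection are assumptions, while the dimensions follow the arithmetic above:

```python
import numpy as np

FS, PATCH_S = 128, 5
N = FS * PATCH_S  # 640 samples per 5-second patch

def time_baseline(patch_1ch: np.ndarray) -> np.ndarray:
    """One channel per modality, decimated to 25.6 Hz -> 128 samples."""
    return patch_1ch[::5]                         # 640 / 5 = 128 samples

def freq_baseline(patch: np.ndarray) -> np.ndarray:
    """Channel-averaged DFT, 64 uniformly spaced amplitude+phase pairs."""
    spec = np.fft.rfft(patch, axis=-1)[..., :-1]  # drop the Nyquist bin
    spec = spec.mean(axis=0)                      # average within modality
    idx = np.linspace(0, len(spec) - 1, 64).astype(int)
    return np.concatenate([np.abs(spec)[idx], np.angle(spec)[idx]])  # 128-d

# Concatenating four modalities yields 4 * 128 = 512-d baseline vectors.
bas = np.random.randn(8, N)                       # e.g. 8 BAS channels
print(time_baseline(bas[0]).shape, freq_baseline(bas).shape)  # (128,) (128,)
```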

Generative Pre-training

  • Masked Autoencoder (MAE): Implements both time and frequency-domain reconstruction losses.
    • Time: Reconstruct all or only masked time-domain patches.
    • Freq: Reconstruct all or only masked frequency-domain patches; the loss combines amplitude and phase MSE after log transformation (a loss sketch follows this list):
      • Amplitude: $\mathcal{L}_{\mathrm{amp}} = \frac{1}{|\mathcal{K}|} \sum_{k\in\mathcal{K}} \left(A_X(k) - A_{\hat{X}}(k)\right)^2$
      • Phase: $\mathcal{L}_{\mathrm{phase}} = \frac{1}{|\mathcal{K}|} \sum_{k\in\mathcal{K}} \min_{\delta\in\{0,\pm 2\pi\}} \left(\angle X_k - (\angle \hat{X}_k + \delta)\right)^2$
  • Denoising Autoencoder (DAE): Inputs are corrupted by Gaussian noise (per-channel scale selected uniformly within [0.01, 0.3] × max amplitude; 50% of channels per segment), and the model reconstructs the clean signal in the time or frequency domain using the MAE decoder/loss setup.
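A minimal PyTorch sketch of the frequency-domain reconstruction loss defined above. The log1p amplitude transform and the bin-level mask granularity are assumptions; the phase term implements the min over δ ∈ {0, ±2π} exactly as written:

```python
import torch

def freq_mae_loss(x: torch.Tensor, x_hat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Amplitude + wrapped-phase MSE over frequency bins in the set K.

    x, x_hat: real-valued signals, shape (..., n_samples)
    mask:     boolean over rFFT bins (Nyquist dropped), True for bins in K
    """
    X, X_hat = torch.fft.rfft(x), torch.fft.rfft(x_hat)
    X, X_hat = X[..., :-1], X_hat[..., :-1]        # exclude the Nyquist bin

    # Amplitude MSE after a log transform (log1p is an assumption here).
    amp = (torch.log1p(X.abs()) - torch.log1p(X_hat.abs())) ** 2

    # Phase MSE with wrap-around handling: take the smallest squared error
    # over shifts delta in {0, +2*pi, -2*pi}, as in the formula above.
    d = X.angle() - X_hat.angle()
    shifts = (0.0, 2 * torch.pi, -2 * torch.pi)
    phase = torch.stack([(d - s) ** 2 for s in shifts]).min(dim=0).values

    m = mask.float()
    return ((amp + phase) * m).sum() / m.sum().clamp(min=1)

# Toy usage on a batch of 5 s patches (640 samples at 128 Hz), all bins masked.
x = torch.randn(4, 640)
loss = freq_mae_loss(x, x + 0.1 * torch.randn_like(x), torch.ones(320, dtype=torch.bool))
```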

Contrastive Learning

  • CL-Pairwise: Pulls together 300-second mean-pooled embeddings of time-aligned modalities (i, j) within a sample, pushes apart other batch samples; loss:

$$\mathcal{L}^{\mathrm{pair}}_{i,j,k} = -\log \frac{\exp(\mathrm{sim}(x_k^i, x_k^j)/\tau)}{\sum_{m=1}^{N} \exp(\mathrm{sim}(x_k^i, x_m^j)/\tau)}$$

  • CL-LOO (leave-one-out): Each modality is aligned to the mean embedding of all other modalities, with an analogous loss calculation. Both objectives are sketched below.
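A minimal PyTorch sketch of both contrastive objectives. The temperature value and the use of cosine similarity for sim(·,·) are assumptions:

```python
import torch
import torch.nn.functional as F

def pairwise_nce(z_i: torch.Tensor, z_j: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE between two modalities' 300 s mean-pooled segment embeddings.

    z_i, z_j: (N, D); row k of z_i matches row k of z_j, other rows are negatives.
    """
    z_i, z_j = F.normalize(z_i, dim=-1), F.normalize(z_j, dim=-1)
    logits = z_i @ z_j.T / tau                     # sim(x_k^i, x_m^j) / tau
    targets = torch.arange(len(z_i), device=z_i.device)
    return F.cross_entropy(logits, targets)        # -log softmax on the diagonal

def loo_nce(z_mods: list, tau: float = 0.1) -> torch.Tensor:
    """CL-LOO: align each modality to the mean embedding of all the others."""
    total = torch.zeros(())
    for i, z in enumerate(z_mods):
        rest = torch.stack([zm for j, zm in enumerate(z_mods) if j != i]).mean(0)
        total = total + pairwise_nce(z, rest, tau)
    return total / len(z_mods)

# Toy usage: four modalities, batch of 8 segments, 512-d embeddings.
mods = [torch.randn(8, 512) for _ in range(4)]
print(pairwise_nce(mods[0], mods[1]).item(), loo_nce(mods).item())
```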

All models are pre-trained with Adam optimizer (learning rate 1e-3, batch size 256), using early stopping on a 100-recording validation set. On two A100 GPUs, pre-training typically completes in 15–35 hours, with downstream convergence observed within four epochs.

4. Fine-tuning and Evaluation Procedures

Downstream task architectures are tailored:

  • Patch-level (Sleep staging): The SSRL backbone is frozen. A two-layer transformer (4 heads) and a two-layer bidirectional LSTM are added, ending with a softmax classifier.
  • Full-night tasks (apnea, age, disease/mortality): The backbone is frozen. A single transformer layer (8 heads) is applied over hourly embeddings, followed by attention pooling and a regression or classification head (sketched below).
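A minimal PyTorch sketch of the full-night head: one transformer layer over hourly embeddings, learned-query attention pooling, and a linear output head. The exact pooling parameterization is an assumption:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learned-query attention pooling over a sequence of embeddings."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim) / dim ** 0.5)
        self.key = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, T, dim)
        scores = self.key(h) @ self.query                 # (B, T)
        weights = scores.softmax(dim=1).unsqueeze(-1)     # (B, T, 1)
        return (weights * h).sum(dim=1)                   # (B, dim)

class FullNightHead(nn.Module):
    """Frozen-backbone head: one transformer layer, attention pooling, output."""
    def __init__(self, dim: int = 512, n_out: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.pool = AttentionPool(dim)
        self.head = nn.Linear(dim, n_out)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, hours, dim)
        return self.head(self.pool(self.encoder(h)))

# Toy usage: batch of 4 nights, 8 hourly 512-d embeddings each -> (4, 1).
print(FullNightHead()(torch.randn(4, 8, 512)).shape)
```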

Losses and key metrics:

  • Sleep staging, apnea diagnosis: Cross-entropy loss; AUROC reporting.
  • Age estimation: Softplus output, MSE (age scaled 0–1), reporting MAE (years).
  • Disease/mortality: Cox proportional hazards loss, reporting Harrell’s C-index (a minimal loss sketch follows this list).
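For the survival tasks, a standard negative Cox partial log-likelihood can serve as the training loss. Below is a minimal PyTorch sketch; the sorting-plus-logcumsumexp formulation is a common implementation choice, not necessarily the authors':

```python
import torch

def cox_ph_loss(risk: torch.Tensor, time: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
    """Negative Cox partial log-likelihood.

    risk:  (N,) predicted log hazard ratios
    time:  (N,) follow-up times
    event: (N,) 1.0 if the event was observed, 0.0 if censored
    """
    order = torch.argsort(time, descending=True)   # longest follow-up first
    risk, event = risk[order], event[order]
    # After sorting, the risk set for subject i (everyone still at risk at
    # time_i) occupies positions 0..i, so a running logsumexp gives each
    # denominator log sum_{j in risk set} exp(risk_j).
    log_denom = torch.logcumsumexp(risk, dim=0)
    return -((risk - log_denom) * event).sum() / event.sum().clamp(min=1.0)

# Toy usage: 4 subjects, 2 observed events.
risk = torch.tensor([0.3, -0.1, 1.2, 0.0])
time = torch.tensor([5.0, 3.0, 8.0, 2.0])
event = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(cox_ph_loss(risk, time, event).item())
```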

Summary of Quantitative Results

| Task (Metric) | Best Method(s) | Top Absolute Result |
|---|---|---|
| Sleep staging (AUROC) | CL-LOO | 0.823 |
| Apnea diagnosis (AUROC) | MAE (Freq, all) | 0.830 |
| Age estimation (MAE) | CL-LOO | 6.20 years |
| Disease + mortality (C-index) | CL-LOO / CL-Pairwise | 0.74 (avg.) |

For sleep staging and apnea diagnosis, frequency-domain and contrastive methods are comparably strong (AUROC ≈ 0.80–0.83). For age and disease/mortality, contrastive approaches (CL-LOO/CL-Pairwise) outperform generative objectives by 4–7% in C-index and converge faster during pre-training. Frequency-domain baselines substantially outperform time-domain baselines for apnea diagnosis (AUROC 0.81 vs. 0.56).

In few-shot conditions, contrastive approaches require less data to attain near-maximal performance: CL-LOO and CL-Pairwise reach ≈95% maximum sleep staging AUROC with only 64 training subjects, whereas generative SSRL needs more data.

5. Comparative Insights and Interpretation

For tasks requiring short-term local physiological patterns (e.g., sleep staging, apnea), both frequency-domain generative (MAE/DAE) and contrastive learning approaches yield comparably strong performance. For complex integrative tasks (age, mortality, disease prediction), contrastive methods that leverage cross-modal relationships and temporal attention pooling outperform generative ones. This suggests that contrastive objectives, which are explicitly discriminative and enforce multimodal and timewise coherence, better capture the global, long-range physiological dependencies essential for clinically relevant downstream tasks than generative losses, which prioritize short-term reconstruction fidelity.

6. Limitations and Prospective Directions

Key limitations identified include the non-exhaustive exploration of architecture and hyperparameter search spaces, and the restriction to a subset of SSRL paradigms—hybrid or channel-specific objectives are plausible avenues for future study. The generalizability of results beyond the primary cohort (e.g., to diverse clinical populations outside SHHS) also requires further study. A plausible implication is that cross-site validation and adaptation of foundation models will be necessary for translational robustness.

7. Reproducibility and Open Science Resources

All data (PSG recordings), pretrained SSRL model weights, standardized splits, and code for pretraining, fine-tuning, and evaluation are provided openly for reproducibility.

This open-source resource enables rigorous benchmarking and ablation studies, and catalyzes the development of multimodal foundation models for broad sleep science applications, supporting standardized progress on tasks including sleep staging, apnea detection, age-related biomarker discovery, and disease risk prediction (Kjaer et al., 10 Dec 2025).
