Multilevel Sleep Captioning Pipeline

Updated 6 March 2026
  • The paper introduces a novel pipeline that generates detailed captions at channel, event, and global levels from multimodal PSG data.
  • It utilizes the ReCoCa encoder–decoder with interleaved temporal and channel attention to achieve robust cross-modal alignment and generative accuracy.
  • The approach demonstrates high performance in sleep staging, event localization, and signal reconstruction across large-scale clinical datasets.

A multilevel sleep captioning pipeline is a systematic approach for generating natural language descriptions of multimodal sleep physiology data at multiple semantic scales. Developed within the SleepLM framework, this pipeline aligns polysomnography (PSG) signals with free-form textual representations, supporting granular annotation, efficient cross-modal learning, and advanced capabilities in sleep analysis and human-computer interaction (Xu et al., 27 Feb 2026).

1. Dataset Construction and Multilevel Captions

The pipeline leverages a large-scale, heterogeneous PSG corpus sourced from five National Sleep Research Resource (NSRR) cohorts (SHHS, MrOS, CFS, CCSHS, WSC). The dataset comprises roughly 100,000 hours of PSG from over 10,000 unique subjects, spanning more than 12,000 nights. Each PSG record employs a standardized 12-channel montage integrating:

  • Brain (EEG/EOG): C3–A2, C4–A1, E1–A2, E2–A1
  • Respiratory: Thoracic effort, abdominal effort, airflow
  • Cardiac: Single-lead ECG, heart rate, SpO₂
  • Somatic: Chin EMG, body position
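
The modality grouping above can be written down as a small lookup structure. This is a minimal sketch: the channel identifier strings and the helper `channel_index` are illustrative assumptions, not names taken from the SleepLM code or from NSRR files, whose labels vary by cohort.

```python
# Modality grouping of the 12-channel montage described in the text.
# Identifier strings are hypothetical; real recordings use cohort-specific labels.
MONTAGE = {
    "Brain": ["C3-A2", "C4-A1", "E1-A2", "E2-A1"],
    "Respiratory": ["thoracic_effort", "abdominal_effort", "airflow"],
    "Cardiac": ["ecg", "heart_rate", "spo2"],
    "Somatic": ["chin_emg", "body_position"],
}

# Fixed channel order, usable as the row order of a (channels x samples) epoch.
CHANNEL_ORDER = [ch for chans in MONTAGE.values() for ch in chans]

# Reverse lookup from channel name to its modality group.
CHANNEL_TO_MODALITY = {ch: mod for mod, chans in MONTAGE.items() for ch in chans}


def channel_index(name: str) -> int:
    """Return the row index of a channel in the fixed montage order."""
    return CHANNEL_ORDER.index(name)
```

Keeping one fixed order means a missing channel can be represented by a zero row at a known position rather than by a change in array shape.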

The captioning framework operates on three primary semantic levels:

  1. Channel-level ("low-level") captions:
    • Extracted per 30 s epoch, using clinical features such as EEG band powers, heart-rate statistics, respiratory rate/variability, and EMG burst metrics
    • Rendered into modality-specific natural language templates for independent description
  2. Local-level ("event") captions:
    • Capture transient annotated events (e.g., heart-rate accelerations, oxygen desaturations, arousals, apneas), detected by trend or peak algorithms
    • Each event includes precise onset/offset (e.g., "central apnea from 5.2 s to 13.7 s")
  3. Global-level ("epoch summary") captions:
    • Per-epoch semantic summaries: categorical sleep stage (Wake, N1, N2, N3, REM) and autonomic descriptors like "elevated sympathetic tone"
    • Some descriptors depend on withheld signals (e.g., mean SpO₂) to promote inference under partial observation
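
As a rough illustration of how the three caption levels could be rendered from structured values, the sketch below uses simple string templates. The template wording, the `Event` class, and the function names are all assumptions made for this example; only the event phrasing "central apnea from 5.2 s to 13.7 s" comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # e.g. "central apnea"
    onset_s: float   # seconds from the start of the 30 s epoch
    offset_s: float


def channel_caption(modality: str, features: dict) -> str:
    """Channel-level caption; features maps a name to a (value, unit) pair."""
    parts = [f"{name} {value:.1f} {unit}" for name, (value, unit) in features.items()]
    return f"{modality}: " + ", ".join(parts) + "."


def event_caption(ev: Event) -> str:
    """Local-level caption carrying explicit onset and offset times."""
    return f"{ev.kind} from {ev.onset_s:.1f} s to {ev.offset_s:.1f} s"


def global_caption(stage: str, descriptors: list) -> str:
    """Global-level caption: sleep stage plus optional autonomic descriptors."""
    text = f"Sleep stage {stage}"
    if descriptors:
        text += "; " + ", ".join(descriptors)
    return text + "."
```

For example, `event_caption(Event("central apnea", 5.2, 13.7))` reproduces the phrasing quoted in the list above.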

Full-night summaries are derived by aggregating sliding 30 s epoch-level captions, enabling calculation of key indices such as Apnea–Hypopnea Index (AHI), Wake After Sleep Onset (WASO), and Sleep Efficiency.
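
The indices named above follow standard definitions, which the sketch below applies to a list of per-epoch stage labels. It assumes that time in bed equals the recording length, that any non-Wake label counts as sleep, and that the number of apneas plus hypopneas is supplied separately; the function name `night_summary` is hypothetical, and the source does not spell out its own aggregation rules.

```python
EPOCH_S = 30  # epoch length in seconds


def night_summary(stages, n_resp_events):
    """Compute AHI, WASO and sleep efficiency from per-epoch stage labels.

    stages: labels such as "Wake", "N1", "N2", "N3", "REM", in recording order.
    n_resp_events: total count of apneas plus hypopneas over the night.
    """
    asleep = [s != "Wake" for s in stages]
    total_sleep_h = sum(asleep) * EPOCH_S / 3600.0

    # WASO: wake epochs that occur after the first sleep epoch.
    onset = asleep.index(True) if any(asleep) else None
    wake_after_onset = 0 if onset is None else sum(1 for a in asleep[onset:] if not a)

    return {
        # AHI: respiratory events per hour of sleep (undefined without sleep).
        "AHI": n_resp_events / total_sleep_h if total_sleep_h > 0 else None,
        "WASO_min": wake_after_onset * EPOCH_S / 60.0,
        # Sleep efficiency: share of recorded epochs scored as sleep.
        "sleep_efficiency_pct": 100.0 * sum(asleep) / len(stages) if stages else None,
    }
```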

2. Raw-Signal Preprocessing and Epoch Structuring

Raw PSG signals undergo a sequence of preprocessing operations:

  • Segmentation: Continuous recordings are split into non-overlapping 30 s epochs (1920 samples at 64 Hz).
  • Channel Consistency: All data are zero-padded to maintain a fixed 12-channel structure, accommodating missing modalities.
  • Normalization and Resampling: Signals are resampled to 64 Hz and z-score normalized per night. Respiratory belts receive region-dependent normalization.
  • Artifact Rejection: Epochs identified as extreme noise or non-wear periods, often at the start or end of recordings, are trimmed.

This canonicalization permits robust multichannel modeling and consistent downstream captioning.
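
A simplified version of these steps is sketched below using only the Python standard library. It assumes the signals have already been resampled to 64 Hz and that all available channels have equal length; artifact trimming and the region-dependent belt normalization are omitted because the text does not specify them. The names `zscore` and `to_epochs` are illustrative.

```python
import statistics

FS = 64                # sampling rate in Hz after resampling
EPOCH_LEN = 30 * FS    # 1920 samples per 30 s epoch


def zscore(signal):
    """Z-score a whole-night signal; a constant signal maps to zeros."""
    mu = statistics.fmean(signal)
    sd = statistics.pstdev(signal)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in signal]


def to_epochs(night, channel_order):
    """Split one night into fixed-shape, non-overlapping epochs.

    night: dict mapping channel name to a list of samples at 64 Hz.
           Channels that were not recorded are simply absent.
    channel_order: fixed list of channel names defining the row order.
    Returns a list of epochs, each a list of len(channel_order) rows
    holding EPOCH_LEN samples. A trailing partial epoch is dropped.
    """
    normalized = {name: zscore(sig) for name, sig in night.items()}
    n_samples = min(len(sig) for sig in normalized.values())
    epochs = []
    for e in range(n_samples // EPOCH_LEN):
        lo, hi = e * EPOCH_LEN, (e + 1) * EPOCH_LEN
        rows = [
            normalized[name][lo:hi] if name in normalized else [0.0] * EPOCH_LEN
            for name in channel_order  # missing channels become zero rows
        ]
        epochs.append(rows)
    return epochs
```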

3. ReCoCa Cross-Modal Encoder–Decoder Architecture

The pipeline’s core is the ReCoCa (signal-language encoder–decoder) network, designed for cross-modal alignment and generative capability:

  • Channel-Specific Patch Embedding: Each channel is partitioned into short time "patches," and each patch is passed through a per-channel linear embedding. This preserves modality semantics prior to multimodal mixing.

  • Interleaved Temporal and Channel Attention:
    • Temporal self-attention (with sinusoidal rotary positional encoding, RoPE) enables temporal context modeling.
    • Channel self-attention (using channel-wise learnable RoPE) models cross-modality dependencies, reflecting montage topology.
  • Pooling and Embedding Generation: Post-attention, a CLS pooling mechanism yields a single sleep embedding $s \in \mathbb{R}^d$. Ground-truth captions are processed via a standard transformer text encoder, producing a pooled text embedding $v \in \mathbb{R}^d$.

  • Decoding Heads:
    • Signal Reconstruction Decoder: A lightweight transformer reconstructs signal patches $\hat{x}$ from $s$.
    • Modality-Conditioned Text Decoder: An autoregressive transformer conditioned on a learnable modality token $[m]$ (one of {Brain, Respiratory, Cardiac, Somatic}), generating targeted captions for each modality.

During training, each epoch randomly samples one caption level (channel, event, or global), and channel captions are further conditioned on a modality token. This hierarchical supervision enforces semantic alignment at multiple resolutions.
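
The per-epoch sampling of a caption level can be sketched as follows. Uniform sampling over levels and over modalities is an assumption, as is returning no modality for event and global captions; the text does not state the sampling distribution or how the modality token is set for those levels. The function name and the layout of `captions` are hypothetical.

```python
import random

LEVELS = ("channel", "event", "global")
MODALITIES = ("Brain", "Respiratory", "Cardiac", "Somatic")


def sample_target(captions, rng):
    """Pick one caption level for an epoch, plus a modality if channel-level.

    captions: {"channel": {modality: text}, "event": text, "global": text}
    rng: a random.Random instance, passed in for reproducibility.
    Returns (level, modality_or_None, caption_text).
    """
    level = rng.choice(LEVELS)
    if level == "channel":
        # Channel captions are additionally conditioned on a modality token.
        modality = rng.choice(MODALITIES)
        return level, modality, captions["channel"][modality]
    return level, None, captions[level]
```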

4. Unified Pretraining Objective

Let $\{ (x_i, y_i) \}_{i=1}^N$ denote a batch of PSG epochs and paired captions. Model optimization uses a jointly weighted loss:

  1. Contrastive Alignment (Symmetric InfoNCE):

$$L_{contrast} = -\frac{1}{N} \bigg[\sum_{i=1}^N \log \frac{\exp(\mathrm{sim}(s_i,v_i)/\tau)}{\sum_{j\neq i}\exp(\mathrm{sim}(s_i,v_j)/\tau)} + \sum_{i=1}^N \log \frac{\exp(\mathrm{sim}(v_i,s_i)/\tau)}{\sum_{j\neq i}\exp(\mathrm{sim}(v_i,s_j)/\tau)} \bigg]$$

  2. Caption Generation (Autoregressive Cross-Entropy):

$$L_{caption} = - \sum_{t=1}^T \log p_\theta(w_t \mid w_{<t},\,x,\,[m])$$

  3. Signal Reconstruction (MSE):

$$L_{recon} = \frac{1}{N}\sum_{i=1}^N \lVert\,x_i - \hat x_i\rVert_2^2$$

The overall objective is $L = \lambda_{contrast}\,L_{contrast} + \lambda_{caption}\,L_{caption} + \lambda_{recon}\,L_{recon}$, with default hyperparameter values as given in the source paper (Xu et al., 27 Feb 2026).

Training adopts teacher forcing for caption generation and utilizes all batch elements as negatives for “sleep-to-text” and “text-to-sleep” contrastive terms.
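
The three loss terms and their weighted sum can be written out in plain Python as a sanity check. This is a sketch under stated assumptions: cosine similarity is used for sim, the weights and temperature are plain arguments with no defaults, and all function names are illustrative. The contrastive denominator follows the contrastive equation as written (sum over j ≠ i), which requires N ≥ 2; the `include_positive` flag switches to the more common InfoNCE form that also includes the matched pair.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def _directional(queries, keys, tau, include_positive):
    """Sum over i of log(exp(sim_ii / tau) / sum_j exp(sim_ij / tau))."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / tau for k in keys]
        denom = sum(
            math.exp(l) for j, l in enumerate(logits) if include_positive or j != i
        )
        total += logits[i] - math.log(denom)
    return total


def contrastive_loss(S, V, tau, include_positive=False):
    """Symmetric loss over sleep embeddings S and text embeddings V."""
    n = len(S)
    sleep_to_text = _directional(S, V, tau, include_positive)
    text_to_sleep = _directional(V, S, tau, include_positive)
    return -(sleep_to_text + text_to_sleep) / n


def caption_loss(token_logprobs):
    """Negative sum of log-probabilities of the ground-truth tokens."""
    return -sum(token_logprobs)


def recon_loss(X, X_hat):
    """Mean over the batch of squared L2 reconstruction error."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, xh)) for x, xh in zip(X, X_hat)
    ) / len(X)


def total_loss(l_contrast, l_caption, l_recon, lam_contrast, lam_caption, lam_recon):
    """Weighted sum of the three terms."""
    return lam_contrast * l_contrast + lam_caption * l_caption + lam_recon * l_recon
```

Note that with the positive pair excluded from the denominator, the contrastive term is not bounded below by zero, unlike the standard form.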

5. Training Algorithm and Workflow

The model is trained via the following staged operations:

  • For each batch, PSG epochs are encoded (including CLS pooling) and ground-truth captions are independently encoded.
  • Losses are computed for contrastive alignment, autoregressive captioning (with the modality token prepended to the decoder input), and signal reconstruction.
  • Model parameters are updated with combined gradients.
  • Caption generation during evaluation employs greedy or beam decoding.

Negative sampling exclusively utilizes in-batch negatives for contrastive objectives.
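
Greedy decoding at evaluation time can be illustrated with a toy stand-in for the text decoder. The callable `next_scores`, the `<eos>` token, and `toy_model` are all hypothetical; in the actual model the decoder would also condition on the encoded PSG epoch, which this sketch hides inside the callable.

```python
def greedy_decode(next_scores, modality_token, eos="<eos>", max_len=64):
    """Generate a caption by always taking the highest-scoring next token.

    next_scores: callable mapping the current token prefix (a list that
                 starts with the modality token) to a dict of token -> score.
    Returns the generated tokens without the modality token or the EOS token.
    """
    tokens = [modality_token]
    for _ in range(max_len):
        scores = next_scores(tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens[1:]


def toy_model(prefix):
    """Toy decoder that always prefers the next token of a fixed script."""
    script = ["stage", "N2", "<eos>"]
    target = script[len(prefix) - 1]  # prefix[0] is the modality token
    return {tok: (0.0 if tok == target else -5.0) for tok in script}
```

Beam decoding would instead keep several candidate prefixes per step and rank them by cumulative score.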

Training Workflow Table

| Step | Operation | Notes |
|---|---|---|
| Preprocessing | Segment, pad, normalize, trim data | Standardize and canonicalize input epochs |
| Encoding | ReCoCa (multi-attention, pooling) | Extracts $s$; text encoding extracts $v$ |
| Decoding | Transformer decoders for signals and captions | Captioning conditioned on $[m]$; teacher forcing used |
| Optimization | Compute $L$ and update model | All losses jointly weighted |

6. Empirical Performance and Capabilities

Extensive experiments demonstrate the efficacy of the multilevel captioning pipeline across classification, retrieval, transfer, and generative tasks:

  • Zero-shot Classification/Regression (SHHS, MrOS, CFS):
    • Sleep staging: AUC = 85.4%, balanced accuracy (BAcc) = 76.9% (generic LLMs ≈52%)
    • Event localization: IoU = 30.4%, BAcc = 74.3%
    • Heart rate estimation: MAE ≈1.97 bpm, recall = 35.8%
    • SpO₂ estimation: MAE ≈2.24%, recall = 39.1%
    • Channel-stat grounding: sMAPE = 3.15%
  • Cross-modal Retrieval (N=2,000 test pool):
    • Text→Signal R@1 up to 96.1% (SHHS), ≈78.7% (CFS)
    • Signal→Text R@1 up to 96.7% (SHHS), ≈70.0% (CFS)
  • Few-shot Transfer (WSC, 5–50 samples/class):
    • Achieves ~0.90 AUC on 5-class sleep staging with 50 labeled examples (linear probe).
    • Data efficiency surpasses MAE, SimCLR, ViT baselines.
  • Caption Quality:
    • Generated captions are clinically accurate and concise, correctly capturing stage, event type, and temporal boundaries.
    • Outperforms fine-tuned vision-language backbones and general-purpose LLMs on both correctness and localization sensitivity.

The SleepLM pipeline exhibits additional capabilities such as language-guided event localization, targeted insight generation, and zero-shot task generalization (Xu et al., 27 Feb 2026).
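
For readers unfamiliar with the metrics reported above, the sketch below gives common textbook definitions of temporal IoU, balanced accuracy, Recall@1, and sMAPE. These are assumptions about how such metrics are usually computed, not the paper's evaluation code; in particular, sMAPE has several variants and the one with a factor of 2 is used here.

```python
def temporal_iou(pred, true):
    """Intersection over union of two (onset_s, offset_s) intervals."""
    inter = max(0.0, min(pred[1], true[1]) - max(pred[0], true[0]))
    union = (pred[1] - pred[0]) + (true[1] - true[0]) - inter
    return inter / union if union > 0 else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def recall_at_1(sim):
    """Fraction of queries whose top-ranked candidate is the true match.

    sim[i][j] is the similarity of query i to candidate j; the true match
    for query i is assumed to be candidate i.
    """
    hits = sum(
        1 for i, row in enumerate(sim)
        if max(range(len(row)), key=row.__getitem__) == i
    )
    return hits / len(sim)


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        terms.append(2.0 * abs(p - t) / denom if denom > 0 else 0.0)
    return 100.0 * sum(terms) / len(terms)
```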

7. Impact and Significance

The multilevel sleep captioning pipeline integrates multimodal PSG with natural language for sleep research and clinical applications. By generating hierarchical descriptions at channel, event, and global levels, the method supports human-interpretable, adaptable analysis. The unified ReCoCa pretraining objective combines contrastive alignment, caption generation, and signal reconstruction to learn cross-modal representations, and the reported results cover both standard tasks (sleep staging, retrieval) and less conventional ones (language-guided event localization, zero-shot estimation). The large-scale, language-grounded sleep dataset further provides a foundation for future research in data-driven sleep interpretation and cross-modal neural architectures (Xu et al., 27 Feb 2026).

References (1)
