U-Time Model for Sleep Stage Segmentation
- U-Time is a fully convolutional deep learning architecture that segments physiological time-series for sleep stage classification by adapting U-Net with dilated convolutions.
- The model bypasses recurrent networks by using stacked dilated convolutions and multi-scale pooling to aggregate long-range temporal context efficiently.
- Empirical evaluations across diverse EEG datasets show that U-Time achieves high F1 scores with consistent performance and minimal hyperparameter tuning.
U-Time is a fully feed-forward deep learning architecture for physiological time-series segmentation, specifically introduced for automated sleep stage classification. Based on a temporal adaptation of the U-Net convolutional architecture, U-Time directly maps multichannel sequential inputs of arbitrary duration to per-segment class label predictions, achieving state-of-the-art results in a robust, non-recurrent framework. This model addresses key limitations of recurrent neural network-based approaches, such as tuning complexity and lack of robustness across datasets, by leveraging stacked dilated convolutions and multi-scale pooling to model long-range temporal dependencies without recurrence (Perslev et al., 2019).
1. Problem Formulation and Data Representation
Given $C$ channels of input (e.g., EEG, EOG, EMG) sampled at rate $f$, U-Time processes temporal windows covering $T$ consecutive segments, each of duration $i \cdot f$ samples, where $i$ is the chosen segment duration in seconds (e.g., $i = 30$ s at $f = 100$ Hz for standard 30 s sleep staging windows). The raw input is equivalently represented as $C$ channels of a 1D signal of length $T \cdot i \cdot f$.
The mapping $F: \mathbb{R}^{C \times T \cdot i \cdot f} \rightarrow [0, 1]^{T \times K}$ produces class-confidence scores for $K$ stages per segment. Internally, dense segmentation is first performed at the original sampling rate (output: $T \cdot i \cdot f \times K$ scores), then aggregated by non-overlapping mean-pooling over intervals of $i \cdot f$ samples per segment. For segment $t$ and class $k$:

$$z_{t,k} = \frac{1}{i \cdot f} \sum_{j=(t-1)\, i f + 1}^{t \, i f} s_{j,k},$$

where $s_{j,k}$ are the pre-pooled decoder outputs. Softmax normalization produces probabilities for each segment.
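The segment-aggregation step above can be sketched in NumPy; the shapes and function names here are illustrative assumptions, not the authors' code:

```python
import numpy as np

def aggregate_segments(dense_scores, samples_per_segment):
    """dense_scores: (T * i*f, K) per-sample class scores from the decoder."""
    n_samples, n_classes = dense_scores.shape
    T = n_samples // samples_per_segment
    # Non-overlapping mean pooling: one score vector per segment.
    z = dense_scores[: T * samples_per_segment].reshape(
        T, samples_per_segment, n_classes
    ).mean(axis=1)
    # Softmax over classes for each segment (numerically stabilized).
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Example: 35 segments of 3,000 samples (30 s at 100 Hz), K = 5 stages.
rng = np.random.default_rng(0)
probs = aggregate_segments(rng.normal(size=(105_000, 5)), 3_000)
```

Each row of `probs` is a probability distribution over the five sleep stages for one 30 s segment.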
2. Architectural Overview
U-Time is a 1D U-Net variant with an encoder-decoder topology supporting multi-resolution temporal context aggregation via skip connections.
- Encoder (contracting path): Four downsampling levels; each block comprises two dilated convolutions with kernel size 5 (dilation 2 yields an effective kernel of 9), batch normalization and ReLU, followed by max-pooling (window sizes 10, 8, 6, 4 across the four levels). This reduces the temporal length by a cumulative factor of 1,920.
- Bottleneck: Following the deepest encoder level, two additional convolutions (same kernel/dilation) generate the deepest feature map.
- Decoder (expanding path): Four upsampling blocks, each performing nearest-neighbor upsampling (by the aforementioned window sizes, in reverse order), halving the channel count via convolution, then concatenating with the corresponding encoder feature map (skip connection), followed by two further convolutions per block.
- Output: A final pointwise convolution produces a dense score map, which is mean-pooled and softmax-normalized per segment.
For a single-channel input of 105,000 samples (35 segments × 3,000 samples at 100 Hz), the temporal dimension shrinks by the corresponding pooling factor at each of the four encoder levels, for a cumulative 1,920-fold reduction at the bottleneck.
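The per-level temporal lengths can be traced with a short script using the reported pool sizes (10, 8, 6, 4); the zero-padding of the input up to a multiple of the cumulative factor is an assumption made here for divisibility, not a detail confirmed by the paper:

```python
import math

def encoder_lengths(n_samples, pools=(10, 8, 6, 4)):
    """Temporal length at the input and after each encoder pooling stage."""
    factor = math.prod(pools)                        # cumulative factor: 1,920
    length = math.ceil(n_samples / factor) * factor  # pad up for divisibility
    lengths = [length]
    for p in pools:
        length //= p
        lengths.append(length)
    return lengths

# 35 segments x 3,000 samples at 100 Hz:
print(encoder_lengths(105_000))  # [105600, 10560, 1320, 220, 55]
```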
3. Temporal Resolution and Receptive Field
Stacked dilated convolutions and deep pooling yield an extensive temporal receptive field. For kernel size $k$ and dilation $d$, a single convolution covers $r = d(k-1) + 1$ input samples (here $r = 2 \cdot 4 + 1 = 9$). Stacking two such convolutions per encoder block and four levels of pooling allows the model’s receptive field to span about 5.5 minutes at 100 Hz (approximately 33,000 samples). This large receptive field is achieved without recurrent connections, enabling the model to aggregate context over minutes of physiological data.
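A generic receptive-field computation illustrates how dilation and pooling compound; note this counts the encoder path only, a lower bound, since the bottleneck and decoder convolutions extend the field further toward the roughly 5.5 minutes quoted above:

```python
def receptive_field(layers):
    """layers: list of ('conv', kernel, dilation) or ('pool', size) tuples.

    Standard accounting: a convolution grows the receptive field by
    (kernel - 1) * dilation * jump; pooling grows it by (size - 1) * jump
    and multiplies the jump (stride product) by the pool size.
    """
    rf, jump = 1, 1
    for layer in layers:
        if layer[0] == "conv":
            _, k, d = layer
            rf += (k - 1) * d * jump
        else:
            _, size = layer
            rf += (size - 1) * jump
            jump *= size
    return rf

# Two k=5, d=2 convolutions per level, followed by pools of 10, 8, 6, 4:
layers = []
for pool in (10, 8, 6, 4):
    layers += [("conv", 5, 2), ("conv", 5, 2), ("pool", pool)]
rf = receptive_field(layers)  # encoder-only receptive field, in samples
```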
4. Output, Loss Functions, and Optimization
After decoding, per-segment class probabilities are computed by averaging the dense scores in non-overlapping windows:

$$z_{t,k} = \frac{1}{i \cdot f} \sum_{j=(t-1)\, i f + 1}^{t \, i f} s_{j,k},$$

followed by:

$$p_{t,k} = \frac{\exp(z_{t,k})}{\sum_{k'=1}^{K} \exp(z_{t,k'})},$$

where the $s_{j,k}$ are logits from a pointwise convolution.
Loss functions:
- Generalized Dice loss (used to mitigate class imbalance):
$$L_{\mathrm{GDL}} = 1 - 2\, \frac{\sum_{k} w_k \sum_{t} y_{t,k}\, p_{t,k}}{\sum_{k} w_k \sum_{t} \left( y_{t,k} + p_{t,k} \right)}, \qquad w_k = \frac{1}{\left( \sum_{t} y_{t,k} \right)^2}$$
- Cross-entropy loss (alternative):
$$L_{\mathrm{CE}} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K} y_{t,k} \log p_{t,k},$$
where $y_{t,k} \in \{0, 1\}$ is the true one-hot label and $p_{t,k}$ the predicted probability.
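Both losses can be written compactly in NumPy. These are simplified, illustrative formulations; the paper's implementation may differ in details such as smoothing constants:

```python
import numpy as np

def generalized_dice_loss(p, y, eps=1e-8):
    """p, y: (T, K) predicted probabilities and one-hot labels."""
    w = 1.0 / (y.sum(axis=0) ** 2 + eps)          # down-weight frequent classes
    numer = 2.0 * (w * (p * y).sum(axis=0)).sum()
    denom = (w * (p + y).sum(axis=0)).sum()
    return 1.0 - numer / (denom + eps)

def cross_entropy_loss(p, y, eps=1e-12):
    """Mean negative log-likelihood of the true class per segment."""
    return -(y * np.log(p + eps)).sum(axis=1).mean()

# Perfect predictions drive both losses toward zero:
y = np.eye(5)[np.random.default_rng(1).integers(0, 5, size=35)]
print(generalized_dice_loss(y, y), cross_entropy_loss(y, y))
```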
5. Training Strategy and Hyperparameterization
U-Time is trained with the Adam optimizer on batches of randomly sampled windows, each window covering 35 target segments. Class-balanced sampling ensures that every batch contains at least one occurrence of each target class. Early stopping is applied with a patience of 150 epochs. No explicit regularization (dropout or weight decay) is used; the model has ≈1.2 million trainable parameters. No dataset-specific hyperparameter tuning was required: the architecture and training parameters remained unchanged across all datasets.
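A minimal sketch of class-balanced window sampling, assuming a simple cycle-through-classes scheme; the names and the exact strategy are illustrative, not taken from the authors' code:

```python
import numpy as np

def sample_balanced_windows(labels, window_len, batch_size, rng):
    """labels: (N,) per-segment stage labels for one record.

    For each window in the batch, cycle to the next target class and
    sample a window centered on a segment of that class, so every class
    present in the record appears in the batch.
    """
    classes = np.unique(labels)
    starts = []
    for b in range(batch_size):
        c = classes[b % len(classes)]
        centers = np.flatnonzero(labels == c)
        center = rng.choice(centers)
        # Clip so the window stays inside the record.
        start = np.clip(center - window_len // 2, 0, len(labels) - window_len)
        starts.append(int(start))
    return np.asarray(starts)

rng = np.random.default_rng(2)
labels = rng.integers(0, 5, size=1000)          # toy hypnogram, 5 stages
starts = sample_balanced_windows(labels, window_len=35, batch_size=10, rng=rng)
```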
6. Empirical Evaluation and Benchmarking
Extensive evaluation was performed across seven EEG datasets:
- Sleep-EDF-39, Sleep-EDF-153 (R&K, 100 Hz)
- PhysioNet-2018 (AASM, 200 Hz)
- DCSM (AASM, 256 Hz)
- ISRUC (AASM, 200 Hz)
- CAP (R&K, 100–512 Hz)
- SVUH-UCD (R&K, 128 Hz)
The primary performance metric was the global per-class F1 (Dice) score. Example results: on Sleep-EDF-39, U-Time achieved F1 = [Wake: 0.87, N1: 0.52, N2: 0.86, N3: 0.84, REM: 0.84], with a mean of 0.79; on ISRUC, mean ≈ 0.77, commensurate with human inter-rater reliability (≈0.80 for that dataset). Multi-channel variants (EEG+EOG, EEG+EOG+EMG) yielded further gains for challenging classes like REM.
7. Advantages, Limitations, and Extensions
Advantages
- Feed-forward, fully convolutional architecture circumvents tuning instability of RNN-based models.
- Accepts arbitrary-length input: facilitates full-length PSG inference in a single pass.
- Adaptable output resolution at inference (e.g., per-segment or per-sample labeling).
- Large receptive field without recurrence, robust context aggregation via pooling and dilation.
- Consistent hyperparameterization across heterogeneous datasets.
Limitations and Extensions
- Single-channel models cannot exploit modalities such as EOG or EMG; multi-channel extensions are available.
- Assumes continuous, fixed-length windowing; application to irregular or discontinuous data would require adaptation.
- Potential extensions include attention-based fusion, learned (transposed convolution) upsampling, and adversarial domain adaptation.
U-Time thus constitutes a robust, general fully-convolutional solution for physiological time-series segmentation, demonstrating high empirical performance and user-friendly deployment owing to its minimal requirements for architecture/hyperparameter tuning and its scalability to variable input and output granularities (Perslev et al., 2019).