U-Time Model for Sleep Stage Segmentation
- U-Time is a fully convolutional deep learning architecture that segments physiological time-series for sleep stage classification by adapting U-Net with dilated convolutions.
- The model bypasses recurrent networks by using stacked dilated convolutions and multi-scale pooling to aggregate long-range temporal context efficiently.
- Empirical evaluations across diverse EEG datasets show that U-Time achieves high F1 scores with consistent performance and minimal hyperparameter tuning.
U-Time is a fully feed-forward deep learning architecture for physiological time-series segmentation, specifically introduced for automated sleep stage classification. Based on a temporal adaptation of the U-Net convolutional architecture, U-Time directly maps multichannel sequential inputs of arbitrary duration to per-segment class label predictions, achieving state-of-the-art results in a robust, non-recurrent framework. This model addresses key limitations of recurrent neural network-based approaches, such as tuning complexity and lack of robustness across datasets, by leveraging stacked dilated convolutions and multi-scale pooling to model long-range temporal dependencies without recurrence (Perslev et al., 2019).
1. Problem Formulation and Data Representation
Given $C$ channels of input (e.g., EEG, EOG, EMG) sampled at rate $f$, U-Time processes temporal windows covering $T$ consecutive segments, each of duration $i \cdot f$ samples, where $i$ is the chosen segment duration in seconds (e.g., $i = 30$ s at $f = 100$ Hz for standard 30 s sleep staging windows). The raw input is equivalently represented as $C$ channels of a 1D signal of length $T \cdot i \cdot f$.
The mapping $F: \mathbb{R}^{C \times T \cdot i \cdot f} \rightarrow [0, 1]^{T \times K}$ produces class-confidence scores for $K$ stages per segment. Internally, dense segmentation is first performed at the original sampling rate (output: $T \cdot i \cdot f \times K$ scores), then aggregated by non-overlapping mean-pooling over intervals of $i \cdot f$ samples per segment. For segment $t$ and class $k$:

$$z_{t,k} = \frac{1}{i \cdot f} \sum_{j=(t-1)\, i f + 1}^{t \, i f} s_{j,k},$$

where $s_{j,k}$ are the pre-pooled decoder outputs. Softmax normalization produces probabilities for each segment.
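The segment-aggregation step above can be sketched in NumPy; the shapes and function names here are illustrative assumptions, not the authors' code:

```python
import numpy as np

def aggregate_segments(dense_scores, samples_per_segment):
    """dense_scores: (T * i*f, K) per-sample class scores from the decoder."""
    n_samples, n_classes = dense_scores.shape
    T = n_samples // samples_per_segment
    # Non-overlapping mean pooling: one score vector per segment.
    z = dense_scores[: T * samples_per_segment].reshape(
        T, samples_per_segment, n_classes
    ).mean(axis=1)
    # Softmax over classes for each segment (numerically stabilized).
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Example: 35 segments of 3,000 samples (30 s at 100 Hz), K = 5 stages.
rng = np.random.default_rng(0)
probs = aggregate_segments(rng.normal(size=(105_000, 5)), 3_000)
```

Each row of `probs` is a probability distribution over the five sleep stages for one 30 s segment.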
2. Architectural Overview
U-Time is a 1D U-Net variant with an encoder-decoder topology supporting multi-resolution temporal context aggregation via skip connections.
- Encoder (contracting path): Four downsampling levels; each block comprises two dilated convolutions with kernel size 5 (dilation 2 yields an effective kernel of 9), batch normalization and ReLU, followed by max-pooling (window sizes 10, 8, 6, 4 across the four levels). This reduces the temporal length by a cumulative factor of 1,920.
- Bottleneck: Following the deepest encoder level, two additional convolutions (same kernel/dilation) generate the deepest feature map.
- Decoder (expanding path): Four upsampling blocks, each performing nearest-neighbor upsampling (by the aforementioned window sizes, in reverse order), halving the channel count via convolution, then concatenating with the corresponding encoder feature map (skip connection), followed by two further convolutions per block.
- Output: A final pointwise convolution produces a dense score map, which is mean-pooled and softmax-normalized per segment.
For a single-channel input of 105,000 samples (35 segments × 3,000 samples at 100 Hz), the temporal dimension shrinks by the corresponding pooling factor at each of the four encoder levels, for a cumulative 1,920-fold reduction at the bottleneck.
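The per-level temporal lengths can be traced with a short script using the reported pool sizes (10, 8, 6, 4); the zero-padding of the input up to a multiple of the cumulative factor is an assumption made here for divisibility, not a detail confirmed by the paper:

```python
import math

def encoder_lengths(n_samples, pools=(10, 8, 6, 4)):
    """Temporal length at the input and after each encoder pooling stage."""
    factor = math.prod(pools)                        # cumulative factor: 1,920
    length = math.ceil(n_samples / factor) * factor  # pad up for divisibility
    lengths = [length]
    for p in pools:
        length //= p
        lengths.append(length)
    return lengths

# 35 segments x 3,000 samples at 100 Hz:
print(encoder_lengths(105_000))  # [105600, 10560, 1320, 220, 55]
```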
3. Temporal Resolution and Receptive Field
Stacked dilated convolutions and deep pooling yield an extensive temporal receptive field. For kernel size $k$ and dilation $d$, a single convolution covers $r = d(k-1) + 1$ input samples (here $r = 2 \cdot 4 + 1 = 9$). Stacking two such convolutions per encoder block and four levels of pooling allows the model’s receptive field to span about 5.5 minutes at 100 Hz (approximately 33,000 samples). This large receptive field is achieved without recurrent connections, enabling the model to aggregate context over minutes of physiological data.
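A generic receptive-field computation illustrates how dilation and pooling compound; note this counts the encoder path only, a lower bound, since the bottleneck and decoder convolutions extend the field further toward the roughly 5.5 minutes quoted above:

```python
def receptive_field(layers):
    """layers: list of ('conv', kernel, dilation) or ('pool', size) tuples.

    Standard accounting: a convolution grows the receptive field by
    (kernel - 1) * dilation * jump; pooling grows it by (size - 1) * jump
    and multiplies the jump (stride product) by the pool size.
    """
    rf, jump = 1, 1
    for layer in layers:
        if layer[0] == "conv":
            _, k, d = layer
            rf += (k - 1) * d * jump
        else:
            _, size = layer
            rf += (size - 1) * jump
            jump *= size
    return rf

# Two k=5, d=2 convolutions per level, followed by pools of 10, 8, 6, 4:
layers = []
for pool in (10, 8, 6, 4):
    layers += [("conv", 5, 2), ("conv", 5, 2), ("pool", pool)]
rf = receptive_field(layers)  # encoder-only receptive field, in samples
```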
4. Output, Loss Functions, and Optimization
After decoding, per-segment class probabilities are computed by averaging the dense scores in non-overlapping windows:

$$z_{t,k} = \frac{1}{i \cdot f} \sum_{j=(t-1)\, i f + 1}^{t \, i f} s_{j,k},$$

followed by:

$$p_{t,k} = \frac{\exp(z_{t,k})}{\sum_{k'=1}^{K} \exp(z_{t,k'})},$$

where the $s_{j,k}$ are logits from a pointwise convolution.
Loss functions:
- Generalized Dice loss (used to mitigate class imbalance):
$$L_{\mathrm{GDL}} = 1 - 2\, \frac{\sum_{k} w_k \sum_{t} y_{t,k}\, p_{t,k}}{\sum_{k} w_k \sum_{t} \left( y_{t,k} + p_{t,k} \right)}, \qquad w_k = \frac{1}{\left( \sum_{t} y_{t,k} \right)^2}$$
- Cross-entropy loss (alternative):
$$L_{\mathrm{CE}} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K} y_{t,k} \log p_{t,k},$$
where $y_{t,k} \in \{0, 1\}$ is the true one-hot label and $p_{t,k}$ the predicted probability.
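Both losses can be written compactly in NumPy. These are simplified, illustrative formulations; the paper's implementation may differ in details such as smoothing constants:

```python
import numpy as np

def generalized_dice_loss(p, y, eps=1e-8):
    """p, y: (T, K) predicted probabilities and one-hot labels."""
    w = 1.0 / (y.sum(axis=0) ** 2 + eps)          # down-weight frequent classes
    numer = 2.0 * (w * (p * y).sum(axis=0)).sum()
    denom = (w * (p + y).sum(axis=0)).sum()
    return 1.0 - numer / (denom + eps)

def cross_entropy_loss(p, y, eps=1e-12):
    """Mean negative log-likelihood of the true class per segment."""
    return -(y * np.log(p + eps)).sum(axis=1).mean()

# Perfect predictions drive both losses toward zero:
y = np.eye(5)[np.random.default_rng(1).integers(0, 5, size=35)]
print(generalized_dice_loss(y, y), cross_entropy_loss(y, y))
```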
5. Training Strategy and Hyperparameterization
U-Time is trained with the Adam optimizer on batches of randomly sampled windows, each window covering 35 target segments. Class-balanced sampling ensures that every batch contains at least one occurrence of each target class. Early stopping is applied with a patience of 150 epochs. No explicit regularization (dropout or weight decay) is used; the model has ≈1.2 million trainable parameters. No dataset-specific hyperparameter tuning was required: the architecture and training parameters remained unchanged across all datasets.
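A minimal sketch of class-balanced window sampling, assuming a simple cycle-through-classes scheme; the names and the exact strategy are illustrative, not taken from the authors' code:

```python
import numpy as np

def sample_balanced_windows(labels, window_len, batch_size, rng):
    """labels: (N,) per-segment stage labels for one record.

    For each window in the batch, cycle to the next target class and
    sample a window centered on a segment of that class, so every class
    present in the record appears in the batch.
    """
    classes = np.unique(labels)
    starts = []
    for b in range(batch_size):
        c = classes[b % len(classes)]
        centers = np.flatnonzero(labels == c)
        center = rng.choice(centers)
        # Clip so the window stays inside the record.
        start = np.clip(center - window_len // 2, 0, len(labels) - window_len)
        starts.append(int(start))
    return np.asarray(starts)

rng = np.random.default_rng(2)
labels = rng.integers(0, 5, size=1000)          # toy hypnogram, 5 stages
starts = sample_balanced_windows(labels, window_len=35, batch_size=10, rng=rng)
```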
6. Empirical Evaluation and Benchmarking
Extensive evaluation was performed across seven EEG datasets:
- Sleep-EDF-39, Sleep-EDF-153 (R&K, 100 Hz)
- PhysioNet-2018 (AASM, 200 Hz)
- DCSM (AASM, 256 Hz)
- ISRUC (AASM, 200 Hz)
- CAP (R&K, 100–512 Hz)
- SVUH-UCD (R&K, 128 Hz)
The primary performance metric was the global per-class F1 (Dice) score. Example results: on Sleep-EDF-39, U-Time achieved F1 = [Wake: 0.87, N1: 0.52, N2: 0.86, N3: 0.84, REM: 0.84], with a mean of 0.79; on ISRUC, mean ≈ 0.77, commensurate with human inter-rater reliability (≈0.80 for that dataset). Multi-channel variants (EEG+EOG, EEG+EOG+EMG) yielded further gains for challenging classes like REM.
7. Advantages, Limitations, and Extensions
Advantages
- Feed-forward, fully convolutional architecture circumvents tuning instability of RNN-based models.
- Accepts arbitrary-length input: facilitates full-length PSG inference in a single pass.
- Adaptable output resolution at inference (e.g., per-segment or per-sample labeling).
- Large receptive field without recurrence, robust context aggregation via pooling and dilation.
- Consistent hyperparameterization across heterogeneous datasets.
Limitations and Extensions
- Single-channel models cannot exploit modalities such as EOG or EMG; multi-channel extensions are available.
- Assumes continuous, fixed-length windowing; application to irregular or discontinuous data would require adaptation.
- Potential extensions include attention-based fusion, learned (transposed convolution) upsampling, and adversarial domain adaptation.
U-Time thus constitutes a robust, general fully-convolutional solution for physiological time-series segmentation, demonstrating high empirical performance and user-friendly deployment owing to its minimal requirements for architecture/hyperparameter tuning and its scalability to variable input and output granularities (Perslev et al., 2019).