Montreal Archive of Sleep Studies (MASS)

Updated 4 June 2026

MASS is a standardized, multi-center repository of full-night PSG recordings from 200 adult subjects, organized into five subsets, each annotated under either the AASM or the R&K protocol.
The dataset facilitates robust benchmarking of automatic sleep staging algorithms and supports research in transfer learning, inter-individual variability, and cross-protocol harmonization.
Recordings span varied EEG montages and are typically analyzed in 30-second epochs; published deep learning benchmarks on MASS report accuracies of up to 87%.

The Montreal Archive of Sleep Studies (MASS) is a large-scale, multi-center repository of polysomnographic (PSG) recordings acquired from adult volunteers under closely controlled laboratory conditions. MASS houses data from 200 subjects, encompasses multiple PSG montages, and consolidates annotations under both American Academy of Sleep Medicine (AASM) and Rechtschaffen & Kales (R&K) protocols. As a result, it provides a standardized, comprehensive resource enabling robust benchmarking of automatic sleep staging algorithms, transfer learning, and methodological investigations into inter-individual and inter-protocol variability.

1. Dataset Composition and Structure

MASS consists of whole-night PSG recordings from 200 adult participants (97 males, 103 females), aged 18–76 years. Each subject typically contributes a single full-night session (7–9 hours) (Phan et al., 2019, Phan et al., 2018, Phan et al., 2019). The database is partitioned into five subsets (SS1–SS5) as curated by O’Reilly et al. (2014). Annotation protocols vary: SS1 and SS3 employ the AASM rules, while SS2, SS4, and SS5 adopt R&K guidelines (Phan et al., 2019, Phan et al., 2018).

Table 1: Fundamental Characteristics of MASS

Characteristic	Value	Reference
Total subjects	200 (97 M/103 F)	(Phan et al., 2019)
Age range	18–76 years	(Phan et al., 2019)
PSG channels (core)	EEG (C4–A1), EOG (ROC–LOC), EMG (CHIN1–CHIN2)	(Phan et al., 2019)
Sampling rate	256 Hz (downsampled to 100 Hz in many studies)	(Phan et al., 2019)
Scoring protocol by subset	AASM: SS1, SS3; R&K: SS2, SS4, SS5	(Phan et al., 2018)

Recordings include standard EEG leads (notably C4–A1 for most published analyses), bilateral EOG, chin EMG, and, for some protocols, additional electrodes and cardiorespiratory measures. The MASS-SS3 subset, specifically, features an extended montage with 20 EEG electrodes (10–20 system), 2 EOG, 3 chin EMG, and 1 ECG channel (Einizade et al., 2022).

2. Annotation Protocols and Stage Harmonization

Sleep stage scoring within MASS follows two distinct conventions that are harmonized downstream. SS1 and SS3 apply the AASM manual (Iber et al. 2007), providing five-stage annotation: Wake (W), N1, N2, N3, and REM. SS2, SS4, and SS5 follow R&K rules, whose stages S3 and S4 are merged into a single N3 class for compatibility with AASM nomenclature (Phan et al., 2019, Phan et al., 2018, Phan et al., 2019). All epochs are ultimately mapped into the five canonical classes {W, N1, N2, N3, REM}. In studies requiring 30 s epochs, the original 20 s segments (R&K) are typically extended with the adjacent signal on either side to form 30 s windows (Phan et al., 2019, Phan et al., 2019).
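The harmonization described above can be sketched as follows. This is a minimal illustration, not the pipeline of any cited study: it assumes R&K labels are stored as the strings W, S1–S4, and REM (the label strings in the actual MASS annotation files may differ), that 20 s epochs are extended symmetrically, and that the recording edges are zero-padded. The names `RK_TO_FIVE`, `harmonize_rk`, and `expand_epochs` are hypothetical.

```python
import numpy as np

# Hypothetical label strings; S3 and S4 both map to N3.
RK_TO_FIVE = {"W": "W", "S1": "N1", "S2": "N2", "S3": "N3", "S4": "N3", "REM": "REM"}


def harmonize_rk(labels):
    """Map R&K stage labels onto the five classes {W, N1, N2, N3, REM}."""
    return [RK_TO_FIVE[s] for s in labels]


def expand_epochs(signal, fs, epoch_s=20, target_s=30):
    """Return one target_s-long window per original epoch_s-long epoch.

    Each window is centred on its epoch and borrows (target_s - epoch_s) / 2
    seconds from each side. Assumption: the start and end of the recording
    are zero-padded so that the first and last epochs also get full windows.
    """
    signal = np.asarray(signal, dtype=float)
    n_epoch = int(epoch_s * fs)                    # samples per original epoch
    n_pad = int((target_s - epoch_s) * fs) // 2    # samples borrowed per side
    n_epochs = len(signal) // n_epoch              # drop any trailing partial epoch
    padded = np.pad(signal[: n_epochs * n_epoch], n_pad)
    return np.stack(
        [padded[i * n_epoch : i * n_epoch + n_epoch + 2 * n_pad] for i in range(n_epochs)]
    )
```

With `fs=100`, a 60 s signal yields an array of shape `(3, 3000)`: three 30 s windows, one per original 20 s epoch. Labels outside the mapping (for example movement or unscored epochs) raise a `KeyError` here and would need an explicit policy in practice.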

Typical class distributions in the combined dataset are approximately: Wake 10–15%, N1 5–15%, N2 40–50%, N3 10–20%, REM 15–25% (Phan et al., 2019). The MASS-SS3 subdataset, used in some deep-learning benchmarking, is scored strictly per the AASM standard (Einizade et al., 2022).

3. Signal Acquisition and Preprocessing

MASS signals were recorded at a nominal sampling rate of 256 Hz; several published pipelines down-sample them to 100 Hz to harmonize analyses and reduce computational cost (Phan et al., 2018, Phan et al., 2019). Acquisition setups differ by subset, but the C4–A1 EEG derivation is the canonical signal for most algorithmic development. For MASS-SS3, all channels are sampled at 256 Hz (Einizade et al., 2022).

Several signal preprocessing protocols have been reported:

Epoch segmentation: 30 s epochs are standard for both AASM and harmonized R&K records (Phan et al., 2018).
Spectrotemporal transformation: Short-time Fourier Transform (STFT) with 2 s Hamming windows, 50% overlap, and 256-point FFT yields log-power spectrograms of size 129×29 (frequency bins × time frames) for each 30 s epoch at 100 Hz (Phan et al., 2018).
Dimensionality reduction: Learned filter banks (e.g., M=20 filters/channel) emphasize informative frequency subbands (Phan et al., 2018).
For MASS-SS3, each channel is decomposed into nine overlapping subbands spanning 0.5–50 Hz, with Differential Entropy (DE) computed per subband (Einizade et al., 2022):

$DE(f,B) = -\int p(x)\log p(x)dx$

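(In practice this is computed from the variance $\sigma^2$ of each band-limited signal under a Gaussian assumption, for which the integral reduces to $\frac{1}{2}\log(2\pi e\sigma^2)$.)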
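A minimal sketch of the variance-based DE computation, under stated assumptions: the band-pass step is an ideal FFT mask chosen for brevity (the cited study's filter design is not described here), and the band edges are passed in by the caller because the nine subbands are not enumerated in this article. The function names `gaussian_de` and `subband_de` are hypothetical.

```python
import numpy as np


def gaussian_de(x):
    """Differential entropy of x under a Gaussian assumption: 0.5 * ln(2*pi*e*var)."""
    x = np.asarray(x, dtype=float)
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(x))


def subband_de(x, fs, bands):
    """DE for each (low_hz, high_hz) band of a single-channel signal.

    Assumption: band-limiting is done by zeroing FFT bins outside the band,
    which is an idealised stand-in for a real band-pass filter.
    """
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    values = []
    for low, high in bands:
        mask = (freqs >= low) & (freqs < high)
        band_signal = np.fft.irfft(spectrum * mask, n=len(x))
        values.append(gaussian_de(band_signal))
    return np.array(values)


# Illustrative input: a 30 s, 10 Hz unit-amplitude sine sampled at 256 Hz.
fs = 256
t = np.arange(30 * fs) / fs
de = subband_de(np.sin(2 * np.pi * 10 * t), fs, bands=[(8.0, 12.0)])
```

For the unit-amplitude sine the in-band variance is 0.5, so the single DE value equals $\frac{1}{2}\log(\pi e)$. A band containing no signal energy has near-zero variance and a very large negative DE, which real pipelines usually avoid because EEG is broadband.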

Artifact removal is not systematically reported, except for exclusion of subjects/nights with corrupt or missing data during curation. No additional band-pass or notch filters are described in the most cited studies (Phan et al., 2019).
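The spectrogram dimensions quoted for the STFT step can be checked with a short NumPy sketch. It assumes a 30 s epoch at 100 Hz (3000 samples), so a 2 s window is 200 samples, the hop is 100 samples, and each frame is zero-padded to 256 points before the FFT. The function name `log_power_spectrogram` and the small constant `eps` are illustrative choices, not taken from the cited papers.

```python
import numpy as np


def log_power_spectrogram(epoch, fs=100, win_s=2.0, overlap=0.5, n_fft=256, eps=1e-10):
    """Log-power STFT of one epoch, returned as (frequency bins, time frames)."""
    epoch = np.asarray(epoch, dtype=float)
    win = int(win_s * fs)                 # 200 samples at 100 Hz
    hop = int(win * (1.0 - overlap))      # 100 samples for 50% overlap
    window = np.hamming(win)
    n_frames = 1 + (len(epoch) - win) // hop
    frames = np.stack([epoch[i * hop : i * hop + win] * window for i in range(n_frames)])
    # rfft zero-pads each 200-sample frame to n_fft points: n_fft // 2 + 1 bins.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    power = np.abs(spectrum) ** 2
    return np.log(power + eps).T          # eps avoids log(0) on silent frames


# Illustrative input: white noise standing in for one 30 s EEG epoch.
rng = np.random.default_rng(0)
spec = log_power_spectrogram(rng.standard_normal(3000))
```

Here `spec.shape` is `(129, 29)`: 256 // 2 + 1 = 129 frequency bins and 1 + (3000 − 200) / 100 = 29 frames. The same settings applied to a 256 Hz signal would give a 512-sample window, longer than the 256-point FFT, so the quoted size only fits the 100 Hz case.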

4. Dataset Partitions and Evaluation Protocols

Partitioning schemes in MASS analyses are designed to ensure subject independence between training and test sets. Common approaches include:

Cross-subject K-fold: 20-fold (180 train/10 validation/10 test per fold) (Phan et al., 2018, Phan et al., 2019); 16-fold or 10-fold in subsets or transfer learning (Einizade et al., 2022, Phan et al., 2019).
Transfer learning protocols: MASS (200 subjects) is used exclusively as the source (pretraining) domain, with no within-dataset validation, while adaptation is performed on smaller target sets (e.g., Sleep-EDF) using leave-one-subject-out evaluation (Phan et al., 2019).

In the MASS-SS3 subset, a 16-fold cross-subject split is adopted: 15 folds of four subjects each and one fold of two subjects (62 subjects in total), with random fold assignment for validation to balance class distributions (Einizade et al., 2022).
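The subject-independent splits above can be sketched as follows. This is a hedged illustration rather than the exact procedure of any cited study: subjects are shuffled with a fixed seed and cut into consecutive folds, and the validation subjects are drawn at random from the non-test subjects. The names `subject_folds` and `train_val_test` are hypothetical.

```python
import numpy as np


def subject_folds(n_subjects, fold_size, seed=0):
    """Shuffle subject indices and cut them into folds of fold_size.

    The last fold is smaller when fold_size does not divide n_subjects.
    """
    order = np.random.default_rng(seed).permutation(n_subjects)
    return [order[i : i + fold_size] for i in range(0, n_subjects, fold_size)]


def train_val_test(folds, test_idx, n_val, seed=0):
    """Hold out one fold for testing and n_val random subjects for validation."""
    test = folds[test_idx]
    rest = np.concatenate([f for i, f in enumerate(folds) if i != test_idx])
    rest = np.random.default_rng(seed + test_idx).permutation(rest)
    return rest[n_val:], rest[:n_val], test


# 200 subjects in folds of 10 gives the 20-fold, 180/10/10 scheme.
folds_all = subject_folds(200, fold_size=10)
train, val, test = train_val_test(folds_all, test_idx=0, n_val=10)

# 62 subjects in folds of 4 gives 15 folds of four and one fold of two.
folds_ss3 = subject_folds(62, fold_size=4)
```

Splitting on subject indices rather than on epochs is what keeps every epoch of a given subject on one side of the train/test boundary.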

5. Quantitative Characteristics and Performance Benchmarks

MASS presents a marked class imbalance: N2 comprises the majority of epochs, while N1 remains scarce (Phan et al., 2018). Approximate ground-truth counts (inferred from confusion matrices, 200-subject aggregate) are:

Stage	Epochs
W	~30,440
N1	~14,312
N2	~109,157
N3	~30,411
REM	~41,295
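To make the imbalance concrete, the approximate counts in the table can be converted to percentage shares. The snippet below is plain arithmetic on the tabulated values; the variable names are illustrative.

```python
# Approximate epoch counts copied from the table above.
counts = {"W": 30440, "N1": 14312, "N2": 109157, "N3": 30411, "REM": 41295}

total = sum(counts.values())
shares = {stage: 100.0 * n / total for stage, n in counts.items()}
n2_to_n1 = counts["N2"] / counts["N1"]

for stage, pct in shares.items():
    print(f"{stage:>3}: {pct:5.2f}%")
print(f"N2/N1 ratio: {n2_to_n1:.2f}")
```

The counts sum to 225,615 epochs, of which N2 accounts for about 48% and N1 for about 6%, so N2 is roughly 7.6 times as frequent as N1.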

Published per-epoch classification performances on MASS are reported using metrics such as overall accuracy, macro F1-score, and Cohen’s kappa:

"SeqSleepNet+": 3-channel (EEG·EOG·EMG) Acc = 87.0%, MF1 = 83.3%, κ = 0.815 (Phan et al., 2019)
"Joint CNN": Acc = 83.6%, κ = 0.77, MF1 = 77.9% (3 channels) (Phan et al., 2018)
"ProductGraphSleepNet" on MASS-SS3: Acc = 86.7%, F1 = 0.818, κ = 0.802 (Einizade et al., 2022)

Formulas for these metrics are given explicitly in (Phan et al., 2019), for example:

Accuracy:

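$\mathrm{Accuracy} = \frac{\sum_{c}\mathrm{TP}_c}{\sum_{c}(\mathrm{TP}_c + \mathrm{FN}_c)} = \frac{\sum_{c}\mathrm{TP}_c}{N}$

where $N$ is the total number of epochs. Every epoch of true class $c$ is counted exactly once, as either $\mathrm{TP}_c$ or $\mathrm{FN}_c$, so the denominator sums to $N$; false positives do not enter the denominator.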
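All three reported metrics can be derived from a single confusion matrix. The sketch below uses the standard definitions of overall accuracy, macro-averaged F1, and Cohen's kappa, and assumes rows index the true class and columns the predicted class. The function name `staging_metrics` and the 2×2 example matrix are illustrative and are not MASS results.

```python
import numpy as np


def staging_metrics(cm):
    """Return (accuracy, macro F1, Cohen's kappa) from a confusion matrix.

    Assumption: cm[i, j] counts epochs of true class i predicted as class j,
    and every class has at least one true and one predicted epoch.
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    tp = np.diag(cm)
    accuracy = tp.sum() / n
    precision = tp / cm.sum(axis=0)     # column sums: predicted per class
    recall = tp / cm.sum(axis=1)        # row sums: true epochs per class
    f1 = 2.0 * precision * recall / (precision + recall)
    # Chance agreement from the marginal distributions of truth and prediction.
    p_chance = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    kappa = (accuracy - p_chance) / (1.0 - p_chance)
    return accuracy, f1.mean(), kappa


# Toy two-class example.
acc, mf1, kappa = staging_metrics([[8, 2], [1, 9]])
```

For the toy matrix, 17 of 20 epochs lie on the diagonal, giving an accuracy of 0.85; chance agreement is 0.5, so kappa is 0.7. Because macro F1 averages per-class scores without weighting by class size, a rare class such as N1 affects it as much as N2 does.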

Performance degrades for N1 and at stage transitions (∼16.6% of epochs), both of which remain inherently ambiguous. Per-class recall is substantially lower for N1 (e.g., 41.1%) than for N2 (88.5%) or REM (93.3%) (Phan et al., 2018).

6. Applications and Methodological Significance

MASS has become a widely used benchmark for automated sleep staging algorithm development, including deep learning architectures such as convolutional neural networks, sequence-to-sequence models, graph-based methods, and transfer learning frameworks (Phan et al., 2018, Phan et al., 2019, Phan et al., 2019, Einizade et al., 2022). Its size and diversity facilitate training of models with broad generalization and support evaluation of domain adaptation techniques, channel mismatch solutions, and cross-cohort robustness.

In transfer learning settings, MASS is consistently used as the source domain to pretrain models subsequently adapted to smaller datasets, addressing data efficiency and inter-dataset heterogeneity. The harmonization of AASM and R&K protocols within a unified five-stage schema enables cross-protocol comparisons and pooling.

The MASS-SS3 subdataset, with its dense EEG montage and AASM-based scoring, underpins methodological advances in spatio-temporal modeling and interpretable graph neural networks (Einizade et al., 2022).

7. Limitations and Dataset-Specific Challenges

Several limitations are noted in published studies:

Incomplete demographic coverage in published subsets (e.g., SS3 reports do not specify age/gender breakdown) (Einizade et al., 2022).
Lack of artifact rejection or explicit signal denoising in standard pipelines (Phan et al., 2018, Phan et al., 2019).
Persistent class imbalance and stage boundary ambiguity, which motivate stratified training, special aggregation techniques, and contextual modeling (Phan et al., 2018).
Inter-protocol annotation heterogeneity, though mitigated by canonicalization to five stages, could introduce unquantified sources of variance (Phan et al., 2019).

Researchers comparing results across studies should therefore examine the underlying annotation and preprocessing conventions in detail. For fine-grained demographic or clinical stratification, the original release (O’Reilly et al. 2014) should be consulted for detailed metadata and per-stage statistics, which are not routinely reproduced in downstream algorithm papers (Einizade et al., 2022).

References:

(Einizade et al., 2022, Phan et al., 2018, Phan et al., 2019, Phan et al., 2019)