Time-series Pile: Multi-Domain Data Hub
- The Time-series Pile is a comprehensive repository that aggregates millions of time series signals across 13 diverse domains, supporting universal model pre-training.
- It standardizes data with fixed-length windows and reversible instance normalization, enabling robust, domain-agnostic model training.
- This large-scale, heterogeneous dataset drives advances in transfer learning, forecasting, and anomaly detection through rigorous benchmarking.
The Time-series Pile is a large, curated repository of public time series datasets spanning numerous domains (including healthcare, motion, environmental, economic, audio, and synthetic signals), specifically assembled to support the pre-training of foundation models for general-purpose time series analysis. Introduced alongside the MOMENT family of time series foundation models, the Time-series Pile encompasses approximately 13 million individual series totaling over 20 GB of data, addressing the lack of a cohesive, large-scale time series corpus for multi-domain model development (Goswami et al., 6 Feb 2024). This resource underpins contemporary advances in transfer learning, benchmarking, and universal representation learning for time series tasks.
1. Composition and Scope
The Time-series Pile aggregates time series data from 13 diverse domains:
- Healthcare: Electrocardiogram (ECG), Electroencephalogram (EEG), hospital monitoring signals.
- Human activity: Gesture recognition, human body movement signals.
- Natural phenomena: Outlines of biological shapes (fish, flower), river flow rates.
- Speech and audio: Phonetic and speech data.
- Power systems: Household electricity consumption, appliance-level usage.
- Economics: Exchange rates, Bitcoin prices, tourism time series.
- Infrastructure: Traffic velocity, weather records, spacecraft and facility telemetry.
- Synthetic: Artificially generated signals for controlled evaluation.
The dataset totals approximately 13 million unique time series comprising billions of observations. The series vary widely in temporal granularity, length, amplitude, and modality, enabling pre-trained models to generalize across a range of downstream tasks and time series characteristics.
2. Motivation and Distinctiveness
Unlike language or vision, where large standardized corpora such as The Pile and ImageNet exist, time series research historically lacked a single, large, standardized corpus. This limitation hampered efforts to pre-train universal models capable of effective transfer across domains and tasks. The Time-series Pile was created to fill this gap, with careful curation to ensure representational diversity (both in terms of domain and series structure):
- Multi-domain curation enables models to learn representations that are robust to varying sampling rates, missing values, series lengths, and amplitude scales.
- The dataset's heterogeneity reflects the true breadth of real-world time series applications, which is critical for foundation model development and fair benchmarking (Goswami et al., 6 Feb 2024).
3. Structure and Preprocessing
Each univariate time series in the Pile is:
- Standardized in length (e.g., using fixed-length windows, typically 512 steps for model pre-training).
- Preprocessed with reversible instance normalization to facilitate stable, domain-agnostic model training.
- Segmented into patches for model input (e.g., non-overlapping patches of length 8, following the MOMENT architecture).
- Accompanied by metadata specifying domain, origin, and task context.
The dataset is designed for direct compatibility with transformer architectures and masked-reconstruction objectives, as well as classical and deep learning time series pipelines.
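To make this pipeline concrete, the following is a minimal NumPy sketch of the three preprocessing steps described above (fixed-length windowing, reversible instance normalization, and patching into non-overlapping length-8 patches). The function names and the padding convention for short series are illustrative assumptions, not the reference implementation.

```python
import numpy as np

SEQ_LEN = 512   # fixed window length used for pre-training
PATCH_LEN = 8   # non-overlapping patch length (MOMENT-style)

def to_fixed_window(x: np.ndarray, seq_len: int = SEQ_LEN) -> np.ndarray:
    """Left-pad short series with zeros; keep the last seq_len steps of long ones.
    (The zero-padding convention here is an illustrative assumption.)"""
    if len(x) >= seq_len:
        return x[-seq_len:]
    return np.concatenate([np.zeros(seq_len - len(x)), x])

def rev_instance_norm(x: np.ndarray, eps: float = 1e-5):
    """Reversible instance normalization: standardize per series and return
    the statistics needed to invert the transform after modeling."""
    mu, sigma = x.mean(), x.std() + eps
    return (x - mu) / sigma, (mu, sigma)

def denorm(x_norm: np.ndarray, stats) -> np.ndarray:
    mu, sigma = stats
    return x_norm * sigma + mu

def patchify(x: np.ndarray, patch_len: int = PATCH_LEN) -> np.ndarray:
    """Split a length-512 window into 64 non-overlapping patches of length 8."""
    return x.reshape(-1, patch_len)

# Example: prepare one raw univariate series for model input.
raw = np.sin(np.linspace(0, 20, 700)) * 3.0 + 10.0
window = to_fixed_window(raw)              # shape (512,)
normed, stats = rev_instance_norm(window)  # shape (512,), plus (mu, sigma)
patches = patchify(normed)                 # shape (64, 8)
assert np.allclose(denorm(normed, stats), window)  # normalization is reversible
```

The round-trip assertion at the end highlights why the normalization is called reversible: the per-series statistics are retained so that model outputs can be mapped back to the original scale.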
4. Benchmarking and Evaluation Use
The Time-series Pile is not only used for pre-training but also underpins a set of rigorous benchmarks for foundation model evaluation (Goswami et al., 6 Feb 2024). These benchmarks span:
- Long-horizon forecasting (testing autoregressive and sequence-to-sequence model performance over extended horizons), with metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE).
- Short-horizon/zero-shot forecasting (evaluating direct transfer across datasets without target-domain fine-tuning), using sMAPE and other forecasting metrics.
- Classification (using external classifiers on learned features from unsupervised models), with downstream accuracy measured on UCR datasets.
- Anomaly detection and imputation (evaluating model ability to reconstruct or flag outlying segments), assessed using best-F1, VUS-ROC, and MSE/MAE, often on datasets such as TSB-UAD and the UCR anomaly archives.
This experimental design enables assessment of both representation quality and transferability under limited supervision.
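As an illustration of the forecasting metrics named above, here is a small NumPy sketch of MSE, MAE, and one common sMAPE formulation. Conventions for sMAPE vary across benchmarks; the 0-200 scaling and the epsilon-stabilized denominator used here are assumptions, not necessarily the exact variant used in the cited evaluations.

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Squared Error."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def smape(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-8) -> float:
    """Symmetric MAPE on a 0-200 scale; a given benchmark may use a
    different variant (e.g., 0-100 scaling or another denominator)."""
    denom = np.abs(y_true) + np.abs(y_pred) + eps
    return float(200.0 * np.mean(np.abs(y_true - y_pred) / denom))

# Toy forecast vs. ground truth.
y_true = np.array([10.0, 12.0, 13.0, 12.5])
y_pred = np.array([9.5, 12.5, 12.0, 13.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), smape(y_true, y_pred))
```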
5. Impact on Foundation Model Training
The Pile provides the scale and diversity necessary to pre-train large transformer-based encoders using objectives such as masked segment reconstruction:
- Models (e.g., MOMENT-1-large) are trained to reconstruct randomly masked patches of the series, yielding latent representations that encode trends, periodicity, amplitude, and even phase relationships (Goswami et al., 6 Feb 2024).
- Pre-training on the Time-series Pile enables models to outperform strictly task-specific or domain-specific models on transfer tasks, demonstrating both zero-shot and few-shot learning capabilities.
Empirical results show that models pre-trained on the Pile yield better out-of-the-box and fine-tuned performance across a wide variety of real-world datasets.
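The PyTorch sketch below illustrates the masked-patch reconstruction objective at a high level: patches are embedded, a random subset is replaced with a learnable mask token, a transformer encoder processes the sequence, and a linear head reconstructs the patches, with loss computed only on the masked positions. The module sizes, masking ratio, and layer choices are illustrative assumptions and are far smaller than MOMENT's actual configuration; only the 512-step window and length-8 patches (64 patches per window) follow the description above.

```python
import torch
import torch.nn as nn

class MaskedPatchModel(nn.Module):
    """Toy masked-reconstruction model: embed patches, mask some,
    encode with a transformer, and reconstruct the masked patches."""
    def __init__(self, patch_len: int = 8, n_patches: int = 64, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(patch_len, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.pos = nn.Parameter(torch.zeros(n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, patch_len)

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: (B, n_patches, patch_len); mask: (B, n_patches), True = masked
        z = self.embed(patches) + self.pos
        z = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(z), z)
        return self.head(self.encoder(z))

# One illustrative training step on random data.
model = MaskedPatchModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
patches = torch.randn(16, 64, 8)              # a batch of patched 512-step windows
mask = torch.rand(16, 64) < 0.3               # mask ~30% of patches (assumed ratio)
recon = model(patches, mask)
loss = ((recon - patches) ** 2)[mask].mean()  # MSE on masked patches only
loss.backward()
opt.step()
opt.zero_grad()
```

Restricting the loss to masked positions is what forces the encoder to infer trend, periodicity, and amplitude structure from the visible context rather than copying its input.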
6. Access, Reproducibility, and Community Adoption
The full Time-series Pile dataset is openly available on Hugging Face (AutonLab/Timeseries-PILE), facilitating easy download and integration with modern time series, deep learning, and benchmarking tools (Goswami et al., 6 Feb 2024). Accompanying documentation specifies preprocessing steps, window sizes, and the origin of constituent datasets to support transparent evaluation and repeatability.
Openness of both the dataset and associated pre-trained models (e.g., MOMENT-1-large) is intended to accelerate research in universal time series modeling, benchmarking, and cross-domain transfer.
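As a minimal access sketch, the dataset repository can be fetched with the standard `huggingface_hub` client. The `snapshot_download` call below is a generic Hub operation, not a Pile-specific API; the on-disk layout of the constituent files is not specified here, so inspect the downloaded directory before wiring it into a pipeline.

```python
from huggingface_hub import snapshot_download

# Download the full Time-series Pile dataset repository from Hugging Face.
# repo_type="dataset" is required because this is a dataset (not a model) repo.
local_dir = snapshot_download(
    repo_id="AutonLab/Timeseries-PILE",
    repo_type="dataset",
)
print(f"Dataset files downloaded to: {local_dir}")
```

The associated pre-trained checkpoints (e.g., MOMENT-1-large) are likewise hosted on the Hub and can be retrieved the same way with `repo_type="model"`.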
7. Significance and Outlook
The introduction of the Time-series Pile marks a foundational advance in time series machine learning:
- It enables, for the first time, a systematic comparison and scaling of universal time series models using modern self-supervised and transfer learning paradigms.
- The dataset supports research into scaling laws, cross-domain generalization, and robustness benchmarking.
- The resource is expected to evolve as new domains and richer forms of time series data are incorporated, and as foundation models expand in parameter count and capability.
The Pile thus plays a role for time series analytics analogous to that of ImageNet and The Pile in vision and language, respectively, serving as a backbone for innovation in representation learning, transfer, and general-purpose model building.