- The paper presents TSRM, a novel lightweight architecture for time series forecasting and imputation that leverages hierarchical CNNs and self-attention layers.
- It introduces a stackable Encoding Layer that efficiently captures multi-scale temporal patterns through independent 1D convolutions and residual connections.
- Empirical evaluations on multiple benchmark datasets demonstrate that TSRM achieves competitive performance with significantly fewer parameters than complex Transformer models.
Here is a detailed summary of the paper "TSRM: A Lightweight Temporal Feature Encoding Architecture for Time Series Forecasting and Imputation" (arXiv:2504.18878).
The paper introduces the Time Series Representation Model (TSRM), a novel, lightweight temporal feature encoding architecture for multivariate time series forecasting and imputation. The authors target the limitations of existing models, particularly complex Transformer-based architectures, which incur high computational and memory costs on long sequences even though recent empirical findings question their superiority over simpler models for time series tasks. The goal is an architecture that is computationally efficient, achieves competitive performance, and offers better interpretability.
The core of the TSRM architecture is a stackable multilayered structure called the Encoding Layer (EL). Unlike models that rely on static input patching or complex 2D transformations, TSRM uses a series of ELs to learn hierarchical representations of the input time series. The architecture primarily employs a channel-independent approach, processing each feature channel separately with a shared backbone, though a variant, TSRM_IFC, is introduced to capture inter-feature correlations.
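As a minimal illustration of this channel-independent handling (not the authors' code; the helper name and tensor shapes are assumptions), a shared backbone can be applied to every channel by folding the feature axis into the batch:

```python
import torch

def channel_independent_forward(x: torch.Tensor, backbone) -> torch.Tensor:
    """Apply one shared backbone to each feature channel separately.

    x: (B, T, F) multivariate series; backbone maps (B*F, T, 1) -> (B*F, T, 1).
    Illustrative helper, not taken from the paper's published code.
    """
    B, T, F = x.shape
    # Fold the feature axis into the batch so each channel is processed
    # independently, yet by the same (shared) weights.
    x = x.permute(0, 2, 1).reshape(B * F, T, 1)
    y = backbone(x)
    return y.reshape(B, F, T).permute(0, 2, 1)  # back to (B, T, F)
```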
Each EL is composed of three main parts, condensed into a code sketch further below:
- Representation Layer (RL): This layer independently learns representations at different levels of abstraction from the input sequence. It uses multiple independent 1D CNN layers with varying kernel sizes and dilations: smaller kernels capture basic, local features, while larger, dilated kernels capture broader patterns such as trends. The outputs of these CNN layers are concatenated along the sequence dimension to form a multi-scale representation. This design is intended to be more memory-efficient and to use fewer parameters than FFT-based approaches such as TimesNet (Wu et al., 2022).
- Middle Encoding Blocks: Positioned between the RL and the Merge Layer, these blocks are inspired by the Transformer encoder. The first block consists of layer normalization, a GELU activation, multi-head self-attention (either standard vanilla or sparse attention), and dropout. A second block follows with another layer normalization, a GELU activation, a linear layer, and dropout. This linear layer is where the TSRM and TSRM_IFC variants differ: TSRM's linear layer operates independently on each feature's embedding dimension (d), preserving channel independence, whereas TSRM_IFC's linear layer spans all features and dimensions (F×d), allowing it to learn inter-feature correlations.
- Merge Layer (ML): This layer aggregates the representations and restores the original input dimensions. It uses transposed 1D convolution layers to reverse the dimensional changes introduced by the RL's CNNs. The outputs of the transposed convolutions are concatenated and passed through a feed-forward projection to match the original input sequence length. Whether the ML's parameters receive gradient updates during training can be toggled via a hyperparameter.
Residual connections are used within and between the ELs, similar to deep CNN frameworks, to facilitate structured feature extraction and information flow across layers. The overall architecture is shallow and wide rather than deep, contributing to its low complexity.
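The PyTorch sketch below condenses one EL as described above, purely as a simplified reconstruction: the kernel-size/dilation pairs, embedding size, "same"-length padding, and the final linear projection back to the input length are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EncodingLayer(nn.Module):
    """One Encoding Layer (EL): Representation Layer -> encoding blocks -> Merge Layer.

    Operates on a single (channel-independent) feature stream of shape (B, T, d).
    Hyperparameters below are illustrative, not the paper's exact settings.
    """

    def __init__(self, seq_len, d_model=16, n_heads=4, dropout=0.1,
                 kernels=((3, 1), (7, 2), (15, 4))):   # (kernel_size, dilation) pairs
        super().__init__()
        # Representation Layer: independent 1D convs with growing receptive fields,
        # each preserving the sequence length via matched padding.
        self.convs = nn.ModuleList(
            nn.Conv1d(d_model, d_model, k, dilation=dil, padding=dil * (k - 1) // 2)
            for k, dil in kernels
        )
        rep_len = seq_len * len(kernels)   # conv outputs are concatenated along time

        # First encoding block: layer norm, GELU, multi-head self-attention, dropout.
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        # Second encoding block: layer norm, GELU, linear layer, dropout
        # (per-channel here, i.e. the base TSRM variant rather than TSRM_IFC).
        self.norm2 = nn.LayerNorm(d_model)
        self.linear = nn.Linear(d_model, d_model)
        self.drop = nn.Dropout(dropout)

        # Merge Layer: transposed 1D convs mirror the RL convs, then a projection
        # maps the concatenated representation back to the original length T.
        self.deconvs = nn.ModuleList(
            nn.ConvTranspose1d(d_model, d_model, k, dilation=dil,
                               padding=dil * (k - 1) // 2)
            for k, dil in kernels
        )
        self.out_proj = nn.Linear(rep_len, seq_len)

    def forward(self, x):                              # x: (B, T, d)
        residual = x
        # Representation Layer: run each conv, concatenate along the time axis.
        xc = x.transpose(1, 2)                         # (B, d, T) for Conv1d
        h = torch.cat([conv(xc) for conv in self.convs], dim=-1).transpose(1, 2)

        # Attention block with residual connection.
        a = nn.functional.gelu(self.norm1(h))
        a, attn_weights = self.attn(a, a, a, need_weights=True)
        h = h + self.drop(a)

        # Linear block with residual connection.
        h = h + self.drop(self.linear(nn.functional.gelu(self.norm2(h))))

        # Merge Layer: per-branch transposed convs, concatenate, project back to T.
        hc = h.transpose(1, 2)                         # (B, d, K*T)
        chunks = hc.chunk(len(self.deconvs), dim=-1)
        merged = torch.cat([dc(c) for dc, c in zip(self.deconvs, chunks)], dim=-1)
        out = self.out_proj(merged).transpose(1, 2)    # (B, T, d)
        return residual + out, attn_weights            # residual across the whole EL
```

Stacking N such layers (each also returning its attention weights), together with an input embedding and a task-specific output head, would yield the kind of shallow-and-wide backbone the paper describes.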
TSRM is evaluated on two primary time series tasks:
- Long-Term Forecasting: Given a multivariate input sequence of length T, the goal is to predict a future sequence of length H. Experiments were conducted on seven benchmark datasets (ECL, ETTm1, ETTm2, ETTh1, ETTh2, Weather, Exchange) with a fixed input length T=96 and prediction horizons H∈{96,192,336,720}. Performance was measured with MSE and MAE. The results show that TSRM and TSRM_IFC generally match or outperform state-of-the-art models such as iTransformer (Liu et al., 2023), PatchTST (Nie et al., 2022), TimesNet (Wu et al., 2022), and others across most datasets. Notably, TSRM_IFC performed better on datasets like Weather and Exchange, suggesting that capturing inter-feature correlations is important for these datasets.
- Imputation: Given a multivariate input sequence in which a fraction rm of the values is missing, the task is to reconstruct the original sequence (H=T). Experiments were performed on six datasets (ECL, ETTm1, ETTm2, ETTh1, ETTh2, Weather) with missing rates rm∈{12.5%, 25%, 37.5%, 50%}. The imputation loss is a weighted sum of MAE and MSE over both masked and unmasked regions. TSRM achieved strong performance, particularly on the ECL and Weather datasets, and outperformed LightTS (Zhang et al., 2022) and DLinear (Zeng et al., 2022) across all evaluated datasets. While not consistently better than TimesNet (Wu et al., 2022) on all ETT subsets, it remained competitive.
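As a rough sketch of how such a masked-imputation objective can be set up (the mask convention, weighting factors, and equal MAE/MSE mix are assumptions, not the paper's exact loss):

```python
import torch

def imputation_loss(pred, target, mask, w_missing=1.0, w_observed=0.5):
    """Weighted MAE+MSE over masked (missing) and observed positions.

    pred, target: (B, T, F); mask: bool tensor, True where values were removed.
    Weights and the equal MAE/MSE mix are illustrative assumptions.
    """
    def mae_mse(m):
        m = m.float()
        denom = m.sum().clamp(min=1.0)
        diff = pred - target
        return (diff.abs() * m).sum() / denom + (diff.pow(2) * m).sum() / denom

    return w_missing * mae_mse(mask) + w_observed * mae_mse(~mask)

# Usage: remove rm = 25% of the values at random, reconstruct, then score.
x = torch.randn(8, 96, 7)                  # (batch, T, features)
mask = torch.rand_like(x) < 0.25           # positions treated as missing
x_in = x.masked_fill(mask, 0.0)            # model input with the gaps zeroed out
# loss = imputation_loss(model(x_in), x, mask)
```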
A key advantage highlighted is TSRM's low computational complexity and memory footprint. Whereas other SOTA models often carry millions of parameters (median around 6.9M), TSRM typically requires an order of magnitude fewer trainable parameters (median around 0.9M, often only a few hundred thousand), making it significantly more lightweight.
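For context, parameter counts like these are obtained by summing the sizes of all trainable tensors, e.g. with a generic PyTorch helper (standard code, not taken from the paper):

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    """Number of trainable parameters in any PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. count_trainable(tsrm_model) / 1e6  ->  trainable parameters in millions
```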
Ablation studies were conducted to understand the contribution of the architectural components. Varying the number of ELs (N) showed that performance generally improves with increasing N up to a point, and models with ELs significantly outperform a purely linear model (N=0). Experiments modifying the RL (reducing the number of CNN layers or their kernel sizes) and the ML (disabling its learning) confirmed the critical role of the RL's CNN-based representation learning for capturing temporal patterns. Reducing the RL kernels to size 1, which removes structural (convolutional) feature extraction, led to a substantial performance drop.
The architecture is also designed with explainability in mind. By extracting and transforming the attention weights from the ELs, it is possible to visualize which parts of the input time series were most attended to by the model for each feature and each layer. This provides a degree of insight into the model's decision-making process and representation learning.
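One way such a visualization could be produced, assuming each EL exposes its head-averaged attention weights (as in the sketch above) and that the concatenated representation segments are folded back onto the T input steps (an assumption about the transformation), is:

```python
import torch
import matplotlib.pyplot as plt

def plot_attention_over_input(series, attn_weights, seq_len, n_reps):
    """Relate one EL's attention weights back to the original time steps.

    series: (T,) input channel being explained.
    attn_weights: (L, L) head-averaged attention with L = n_reps * seq_len.
    Folding the n_reps concatenated segments back onto the T steps is an
    illustrative assumption about how the weights are transformed.
    """
    received = attn_weights.sum(dim=0)                         # attention each position received
    per_step = received.reshape(n_reps, seq_len).mean(dim=0)   # fold segments back onto T steps
    per_step = per_step / per_step.max()

    fig, (ax_in, ax_att) = plt.subplots(2, 1, sharex=True)
    ax_in.plot(series.detach().numpy())
    ax_in.set_ylabel("input")
    ax_att.bar(range(seq_len), per_step.detach().numpy())
    ax_att.set_ylabel("relative attention")
    ax_att.set_xlabel("time step")
    plt.show()
```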
In conclusion, the paper presents TSRM as an effective and efficient architecture for time series forecasting and imputation. Its hierarchical, CNN-based representation learning combined with self-attention mechanisms achieves competitive performance on benchmark datasets while drastically reducing the number of trainable parameters compared to existing SOTA models. The authors plan to explore applications to other time series tasks, pretraining/fine-tuning strategies, few/zero-shot learning, and foundation models in future work.