CNN-LSTM Model for Time Series Forecasting
- CNN-LSTM models are hybrid architectures that integrate CNN's local feature extraction with LSTM's ability to capture long-term dependencies for robust time series forecasting.
- The model leverages techniques such as sliding window framing, normalization, and multi-layer processing to effectively handle noise and non-stationary data in applications like meteorology and finance.
- Empirical evaluations show that CNN-LSTM architectures consistently outperform standalone models by reducing forecasting errors and enhancing predictive accuracy.
A Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) model for time series forecasting is a hybrid deep learning architecture in which one or more convolutional layers (CNN) act as a local feature extractor over short subsequences of the input, and one or more LSTM layers subsequently learn long-range temporal dependencies or sequence dynamics based on the higher-level feature representations produced by the CNN. This model class, including its common architectural and methodological variants, has emerged as a state-of-the-art strategy for both univariate and multivariate time series prediction under nonlinearity and noise, outperforming traditional approaches in domains ranging from meteorology and demand forecasting to financial modeling and environmental sciences.
1. Foundational Principles
The CNN-LSTM paradigm leverages the complementary strengths of two neural network families:
- CNNs efficiently extract position-invariant, locally coherent motifs or short-term temporal features from time series data by applying 1D (or higher-dimensional for gridded input) convolutional filters across the input sequence. Stacking multiple convolutional layers enables multi-scale feature extraction.
- LSTMs and their bidirectional or stacked variants model long-range temporal dependencies and sequential patterns, maintaining memory of distant events and handling vanishing gradients via recurrent gating mechanisms.
The hybrid structure is designed to (i) denoise or transform the raw input with a convolutional pipeline—often used to reduce dimensionality or highlight spatial/short-term characteristics—and (ii) process this richer feature space with the temporal memory of LSTM, improving predictive performance particularly in highly noisy or non-stationary environments (Shen et al., 11 Dec 2024, A et al., 2023, Hu et al., 2020, Lara-Benítez et al., 2021).
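The division of labor described above can be illustrated with a minimal NumPy sketch: a single 1D convolutional filter sliding over a series responds most strongly where a short local motif occurs, which is exactly the kind of feature map the CNN stage hands to the LSTM. The series, filter, and motif location below are illustrative values, not taken from any cited paper.

```python
import numpy as np

def conv1d_valid(x, w, b=0.0):
    """Slide filter w across series x ('valid' padding), as a CNN front-end does."""
    k = len(w)
    return np.array([np.dot(x[t:t + k], w) + b for t in range(len(x) - k + 1)])

# A toy series containing a sharp up-down motif starting at t = 5
x = np.zeros(12)
x[5], x[6] = 1.0, -1.0
w = np.array([1.0, -1.0])            # filter matched to the motif
resp = conv1d_valid(x, w)
assert int(np.argmax(resp)) == 5     # strongest activation where the motif occurs
```

In a trained CNN-LSTM the filter weights are learned rather than hand-set, and many such filters run in parallel, but the mechanics are the same.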
2. Canonical Architecture and Variations
The canonical workflow accepts a sliding window of the previous $w$ time steps (univariate or multivariate), processes it with layers of 1D CNN, and outputs a reduced or transformed sequence to one or more LSTM layers. Detailed instantiations include:
- Meteorological forecasting: A representative temperature-prediction model employs a dual-layer 1D CNN with 256 and 128 filters (kernel size 2), followed by a max-pooling layer, flattening, and broadcasting of the resulting vector through a RepeatVector layer into three stacked LSTM layers (100 units each, dropout 0.2–0.3), culminating in a bidirectional LSTM (128 units) and an attention or dense output (Shen et al., 11 Dec 2024, Li et al., 14 Sep 2024).
- Financial time series: Forecasting stock prices or demand typically uses a TimeDistributed CNN that processes windows (approximate length 100) per block, three Conv1D layers (e.g., [64, 128, 64] filters, kernel size 3), max-pooling, and two stacked bidirectional LSTM layers (100 units each), followed by a dense regression head (A et al., 2023).
- Multivariate/multi-source fusion: In multivariate settings, gridded data (e.g., spatial fields) can be processed by 2D or time-distributed 2D convolution, with the output feature vectors sequenced into an LSTM block (Pokharel et al., 11 Apr 2024, Hu et al., 2020).
- Decomposition-driven variants: In the VMD-CNN-LSTM framework, Variational Mode Decomposition of the input produces several interpretable frequency components, which are denoised/reconstructed via a CNN, concatenated, and used as LSTM input. This yields further accuracy improvements in regimes with strong periodicity or noise (Zhang et al., 2020).
- Attention-enhanced CNN-LSTM: Attention mechanisms, typically additive (Bahdanau) attention, are layered on top of the LSTM output sequence to focus the prediction on the most informative temporal features, benefiting trend and inflection-point capture for nonstationary data (Shen et al., 11 Dec 2024).
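One plausible Keras instantiation of the meteorological configuration described above is sketched below. The filter counts, kernel size, LSTM units, and dropout rates follow the text (Shen et al., 11 Dec 2024); the window length of 30 and single input channel are assumed example values, and the attention head is replaced here by a plain dense output for brevity.

```python
from tensorflow.keras import layers, models

# Sketch of the dual-Conv1D + stacked-LSTM + BiLSTM architecture; window
# length 30 and one input channel are assumed, not from the cited paper.
model = models.Sequential([
    layers.Input(shape=(30, 1)),                  # (timesteps, channels)
    layers.Conv1D(256, 2, activation="relu"),     # local motif extraction
    layers.Conv1D(128, 2, activation="relu"),
    layers.MaxPooling1D(2),                       # temporal downsampling
    layers.Flatten(),
    layers.RepeatVector(3),                       # broadcast features to a short sequence
    layers.LSTM(100, return_sequences=True, dropout=0.2),
    layers.LSTM(100, return_sequences=True, dropout=0.2),
    layers.LSTM(100, return_sequences=True, dropout=0.3),
    layers.Bidirectional(layers.LSTM(128)),       # final sequence summary
    layers.Dense(1),                              # one-step-ahead regression head
])
model.compile(optimizer="adam", loss="mse")
```

The financial variant in the second bullet would instead wrap the Conv1D stack in a TimeDistributed layer and use bidirectional LSTMs throughout.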
3. Mathematical Formulation and Model Flow
The CNN-LSTM architecture consists of serial (and sometimes branched) composition:
- 1D CNN Block: For input sequence $X = (x_1, \dots, x_T)$, the output of a Conv1D layer at time $t$ for filter $f$ with kernel size $k$ is
$$c_t^{(f)} = \sigma\!\left(\sum_{j=0}^{k-1} w_j^{(f)}\, x_{t+j} + b^{(f)}\right),$$
where $\sigma$ is a nonlinearity such as ReLU.
Multiple stacked layers extract multi-scale features, e.g., patterns over 2, 3, or more time steps.
- Pooling/Flattening: MaxPooling1D or similar reduces temporal dimension, flattening for LSTM input.
- LSTM (or BiLSTM) Block: At each step $t$, the standard LSTM gate equations compute
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),\quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),\quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
with cell and hidden updates
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad h_t = o_t \odot \tanh(c_t).$$
- (Optional) Attention Layer: For the LSTM output sequence $(h_1, \dots, h_T)$, an additive attention layer computes scores $e_t = v^\top \tanh(W_a h_t)$, weights $\alpha_t = \exp(e_t) / \sum_s \exp(e_s)$, and context vector $c = \sum_t \alpha_t h_t$.
- Dense/Regression Output: Final prediction via one or more fully connected layers.
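The LSTM recurrence and the additive attention step above can be sketched directly in NumPy. The dimensions, random weights, and toy sequence below are illustrative assumptions; a real model learns these weights by backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input/forget/candidate/output blocks."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i, f, g, o = z[:n], z[n:2 * n], z[2 * n:3 * n], z[3 * n:]
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # cell update
    h = sigmoid(o) * np.tanh(c)                          # hidden update
    return h, c

def additive_attention(H, W_a, v):
    """Bahdanau-style: e_t = v^T tanh(W_a h_t), softmax weights, weighted sum."""
    e = np.array([v @ np.tanh(W_a @ h) for h in H])
    a = np.exp(e - e.max())
    a /= a.sum()
    return a @ H                                         # context vector

rng = np.random.default_rng(0)
d, n, T = 3, 4, 6                    # input dim, hidden units, sequence length
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
H = []
for t in range(T):                   # run the recurrence over a toy sequence
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
    H.append(h)
H = np.stack(H)
ctx = additive_attention(H, rng.normal(size=(n, n)), rng.normal(size=n))
assert ctx.shape == (n,)
assert np.all(np.abs(H) < 1.0)       # h_t bounded since |sigmoid * tanh| < 1
```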
Table: Common architectural parameters
| Layer | Typical Value Range | Role |
|---|---|---|
| Conv1D | 64–256 filters, k=2–5 | Local motif extraction |
| MaxPooling1D | pool_size=2 | Downsampling |
| LSTM/BiLSTM | 64–128 units | Sequence modeling |
| Dense Output | 1–128 units | Regression or classification |
4. Data Preprocessing, Feature Engineering, and Training Protocol
Effective deployment of a CNN-LSTM forecasting system depends critically on input pipeline design:
- Sliding window framing: Fixed-length windows of previous steps, possibly multi-variate or with engineered time/calendar features, serve as individual samples (Shen et al., 11 Dec 2024, Li et al., 14 Sep 2024, A et al., 2023).
- Normalization: Min-max scaling to $[0, 1]$ or z-score standardization per feature channel; consistent scaling between train/validation/test splits is essential (Shen et al., 11 Dec 2024, Li et al., 14 Sep 2024).
- Missing value handling: Imputation by feature mean or interpolation; removal of erroneous segments (Shen et al., 11 Dec 2024).
- Target construction: The most common setup is one-step-ahead prediction (forecasting the value immediately following the window), but multi-horizon and sequence-to-sequence variants are common in environmental and energy forecasting (Hu et al., 2020, Pokharel et al., 11 Apr 2024).
- Training regimen: Mean squared error (MSE) or mean absolute error (MAE); Adam or NAdam optimizer; early stopping with patience on validation loss; batch size and epoch number dependent on dataset and convergence (Shen et al., 11 Dec 2024, Li et al., 14 Sep 2024).
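The windowing and normalization steps above can be sketched as follows. The series length, window size, and train split are assumed example values; the key detail is that the scaler statistics come from the training split only and are then reused on later data.

```python
import numpy as np

def make_windows(series, w, horizon=1):
    """Frame a 1-D series into (X, y): windows of length w and the value
    `horizon` steps after each window."""
    n = len(series) - w - horizon + 1
    X = np.array([series[i:i + w] for i in range(n)])
    y = np.array([series[i + w + horizon - 1] for i in range(n)])
    return X[..., None], y            # add a channel axis for Conv1D input

def minmax_fit(train):
    """Fit min-max statistics on the training split; reuse on val/test."""
    lo, hi = train.min(), train.max()
    return lambda a: (a - lo) / (hi - lo)

series = np.arange(20, dtype=float)
scale = minmax_fit(series[:15])       # scaler fit on training data only
X, y = make_windows(scale(series), w=4)
assert X.shape == (16, 4, 1)          # 16 samples of 4 steps, 1 channel
assert np.isclose(y[0], scale(series)[4])
```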
5. Empirical Performance and Comparative Benefits
Extensive empirical studies demonstrate the superior accuracy and robustness of CNN-LSTM models over standalone LSTMs, plain CNNs, and classical baselines:
- Temperature prediction (Eastern China): MSE = 1.978, RMSE = 0.811 on held-out data; marked improvement over single-model alternatives (Shen et al., 11 Dec 2024).
- Delhi meteorological series: CNN-LSTM reduces RMSE by ~20–35% compared to LSTM and ARIMA baselines, with MSE = 3.26, RMSE = 1.81 (Li et al., 14 Sep 2024).
- Stock price forecasting: Average improvement in MSE of ≈20% over pure LSTM, with R² ≈ 0.935–0.998 across diverse equity datasets (A et al., 2023, Chakraborty et al., 30 Sep 2024, Ranjbar et al., 20 Oct 2024).
- Multivariate streaming data: Multimodal CNN-LSTM architectures outperform traditional ARIMA and Prophet in both corporate finance and AWS billing scenarios, with gains accounted for by the capacity of convolution to capture inter-series correlation and learned differencing (Hu et al., 2020).
- Hydrological prediction: CNN-LSTM improves Kling–Gupta Efficiency from 0.76 (LSTM) to 0.78 (hybrid) on streamflow benchmarks; basin-level improvements as high as KGE=0.91 in difficult catchments (Pokharel et al., 11 Apr 2024).
- Ablation and error analysis: Addition of CNN-driven local feature extraction consistently yields 2–10% reduction in forecasting error versus LSTM-only networks, particularly in settings of strong short-term motifs or nonstationary noise (Lara-Benítez et al., 2021, Zhang et al., 2020, Ranjbar et al., 20 Oct 2024).
6. Variants, Limitations, and Extensions
Substantial architectural flexibility permits adaptation:
- Structural decomposition: The CNN-LSTM component can be plugged into more complex frameworks such as VMD-based ensemble systems (for periodic or decomposable series), explicit seasonality/event models, or transformer-driven hybrid networks (Zhang et al., 2020, Ranjbar et al., 20 Oct 2024).
- Attention modules: Adoption of self-attention or Transformer blocks further enhances focal forecasting, especially in multi-horizon or long sequence settings (Shen et al., 11 Dec 2024).
- Auxiliary and multivariate inputs: Stacking multiple series and exploiting learned convolutional fusion enables modeling of correlated processes, cross-feature interactions, and spatial information (e.g., gridded meteorological predictors) (Hu et al., 2020, Pokharel et al., 11 Apr 2024, Tzoumpas et al., 2022).
- Limitations: Overfitting risk in very deep/stochastic CNN-LSTM networks; sensitivity to normalization and window parameters; interpretability challenges vs. linear or additive models; possible degradation in settings requiring complex event modeling unless appropriately expanded (Hu et al., 2020).
7. Practical Implementation Guidelines
For research groups and practitioners:
- Input construction: Window length ≈ forecast horizon or slightly larger (1.25×H) is generally optimal (Lara-Benítez et al., 2021), with proper normalization per input type.
- Architecture: Two to four CNN layers (filters=32–256, kernel=2–5), 1–3 LSTM layers (64–128 units), optional bidirectionality or attention head.
- Training: Adam/NAdam optimizer, batch size 32–64, early stopping or regularization.
- Evaluation: MSE, MAE, RMSE, and domain-specific metrics (R², KGE, WAPE, MAPE).
- Hyperparameter search: Learning rate (1e-3 default), number of filters/units, window size, dropout rate.
- Deployment: For high-latency or resource-constrained applications, prune LSTM layers or replace with temporal convolutional networks; employ model quantization or knowledge distillation as required (Li et al., 14 Sep 2024, Lara-Benítez et al., 2021).
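The evaluation metrics listed above are simple to compute from held-out predictions; a compact NumPy reference implementation (with made-up toy values) is:

```python
import numpy as np

def mse(y, p):
    return float(np.mean((y - p) ** 2))

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def rmse(y, p):
    return mse(y, p) ** 0.5

def r2(y, p):
    return 1.0 - np.sum((y - p) ** 2) / np.sum((y - y.mean()) ** 2)

def mape(y, p):
    # Percentage error; undefined where y == 0, so targets must be nonzero.
    return float(np.mean(np.abs((y - p) / y)) * 100)

y = np.array([2.0, 4.0, 6.0])        # toy targets
p = np.array([2.0, 5.0, 6.0])        # toy predictions
assert np.isclose(mse(y, p), 1 / 3)
assert np.isclose(rmse(y, p), (1 / 3) ** 0.5)
```

KGE and WAPE follow the same pattern but are domain-specific; hydrology toolkits typically provide KGE directly.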
The CNN-LSTM model is firmly established as a foundational element in modern deep time series forecasting, offering robust, scalable, and empirically validated solutions across a spectrum of domains with complex spatiotemporal dynamics (Shen et al., 11 Dec 2024, Li et al., 14 Sep 2024, A et al., 2023, Hu et al., 2020, Pokharel et al., 11 Apr 2024, Ranjbar et al., 20 Oct 2024, Lara-Benítez et al., 2021).