AttCLX Model: Hybrid Forecasting

Updated 9 March 2026
  • AttCLX is a hybrid forecasting model that integrates classical time-series techniques, deep learning, and XGBoost regression for robust stock price prediction.
  • It employs ARIMA detrending and a multi-scale CNN with self-attention and BiLSTM decoding to capture local and long-range temporal patterns.
  • The pretrain–fine-tune architecture achieves superior accuracy, outperforming standalone approaches in key metrics like RMSE, MAE, and R².

The Attention-based CNN-LSTM and XGBoost hybrid model (AttCLX) is a multi-stage machine learning architecture designed to predict stock prices by integrating classical time-series techniques, deep neural sequence models, and gradient-boosted tree ensembles in a pretrain–fine-tune pipeline. AttCLX preprocesses market data with ARIMA for detrending, encodes multi-scale and long-range features with a deep attentional CNN-BiLSTM sequence-to-sequence model, and uses XGBoost as a final regressor for robust prediction. AttCLX achieves state-of-the-art single-step daily forecasting accuracy on empirical financial data, outperforming standalone classical and deep learning baselines (Shi et al., 2022).

1. Model Pipeline Overview

AttCLX is structured as a three-stage system:

  1. ARIMA detrending and residual computation: Removes linear trends from the raw price series and generates both differenced and residual series as additional features.
  2. Attentional CNN–BiLSTM sequence encoding: Utilizes a convolutional encoder with an integrated self-attention layer to extract multi-scale and global patterns, followed by a deep bidirectional LSTM decoder for long-range temporal dependencies.
  3. XGBoost regressor fine-tuning: Consumes the neural sequence representation (typically the final hidden state) alongside selected engineered features and outputs the final price prediction.

The interaction of these components is visualized as:

[ARIMA preprocessing → features (raw + diff + residual)]
              ↓
[Attentional CNN → multi-head self-attention → Bi-LSTM decoder]
              ↓
[XGBoost regressor → final prediction]
This integration allows the model to leverage detrended and residual information, rich nonlinear feature extraction, and the flexibility of tree ensembles for error correction.

2. Mathematical Specification of Model Components

2.1 ARIMA Preprocessing

Given a price series $\{s_t\}$, AttCLX applies ARIMA(2,1,0): first, price differences are computed as $x_t = s_t - s_{t-1}$ and modeled via AR(2):

$$x_t = a_0 + a_1 x_{t-1} + a_2 x_{t-2} + w_t,\quad w_t \sim \text{white noise}$$

Residuals,

$$r_t = x_t - (a_0 + a_1 x_{t-1} + a_2 x_{t-2}),$$

are derived and concatenated with the differenced series $x_t$ and the raw market features (open, high, low, close, volume, and amount), yielding an 8-dimensional feature vector at each time step.
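The preprocessing step can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes the ARIMA model is applied to the closing price, it fits the AR(2) coefficients by ordinary least squares rather than the maximum-likelihood routine a statistics package would use, and the bar layout, the synthetic data, and all function names are hypothetical.

```python
import random

def difference(prices):
    """First difference x_t = s_t - s_{t-1}."""
    return [b - a for a, b in zip(prices, prices[1:])]

def solve_linear(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def fit_ar2(x):
    """Least-squares fit of x_t = a0 + a1*x_{t-1} + a2*x_{t-2} + w_t (assumed estimator)."""
    rows = [[1.0, x[t - 1], x[t - 2]] for t in range(2, len(x))]
    targets = [x[t] for t in range(2, len(x))]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    moment = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(3)]
    return solve_linear(gram, moment)

def ar2_residuals(x, coeffs):
    """r_t = x_t - (a0 + a1*x_{t-1} + a2*x_{t-2}), defined from t = 2 onward."""
    a0, a1, a2 = coeffs
    return [x[t] - (a0 + a1 * x[t - 1] + a2 * x[t - 2]) for t in range(2, len(x))]

def build_features(bars):
    """bars: (open, high, low, close, volume, amount) tuples; returns 8-dim rows."""
    closes = [bar[3] for bar in bars]
    x = difference(closes)
    res = ar2_residuals(x, fit_ar2(x))
    # x[i] belongs to bar i+1 and res[j] to x[j+2], so the first complete row is bar 3.
    return [list(bars[p]) + [x[p - 1], res[p - 3]] for p in range(3, len(bars))]

# Synthetic random-walk bars, for illustration only.
rng = random.Random(0)
bars, close = [], 3.0
for _ in range(60):
    close += rng.gauss(0.0, 0.02)
    bars.append((close, close + 0.01, close - 0.01, close, 1000.0, 1000.0 * close))
features = build_features(bars)
print(len(features), len(features[0]))  # 57 8
```

The first three bars are dropped because the difference and the two AR lags are undefined there; a production pipeline would more likely delegate the fit to an established ARIMA implementation.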

2.2 CNN Feature Extraction

Let $X \in \mathbb{R}^{T \times F}$ be the input matrix over a look-back window of $T=20$ and $F=8$ features. Multiple 1D convolutions along the time axis extract temporal motifs:

$$C^{(k)} = \text{ReLU}(X * W^{(k)} + b^{(k)}),\quad W^{(k)} \in \mathbb{R}^{k \times F \times C}$$

for kernel sizes $k \in \{2,3,5\}$, with $C=64$ filters each. Features from all scales are concatenated ($H_\text{cnn} \in \mathbb{R}^{T \times (C \cdot \vert\text{scales}\vert)}$).
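A small sketch of the multi-scale convolution may help make the shapes concrete. It assumes zero padding so that every branch keeps all $T$ time steps (the concatenated shape $T \times (C \cdot |\text{scales}|)$ implies a length-preserving convolution, but the exact padding scheme is an assumption), and it uses random weights in place of trained ones. All function names are illustrative.

```python
import random

def conv1d_relu(x, weight, bias):
    """x: T x F, weight: k x F x C, bias: C. Zero-padded so the output is T x C."""
    n_steps, n_features = len(x), len(x[0])
    k, n_filters = len(weight), len(bias)
    left = (k - 1) // 2  # assumed padding split; even kernels pad more on the right
    zero = [0.0] * n_features
    padded = [zero] * left + x + [zero] * (k - 1 - left)
    out = []
    for t in range(n_steps):
        row = []
        for c in range(n_filters):
            acc = bias[c]
            for j in range(k):
                for f in range(n_features):
                    acc += padded[t + j][f] * weight[j][f][c]
            row.append(max(acc, 0.0))  # ReLU
        out.append(row)
    return out

def multi_scale_cnn(x, branches):
    """Run one convolution per kernel size and concatenate along the channel axis."""
    outputs = [conv1d_relu(x, w, b) for w, b in branches]
    return [sum((o[t] for o in outputs), []) for t in range(len(x))]

def init_branch(rng, k, n_features, n_filters):
    """Random stand-in for trained weights."""
    weight = [[[rng.uniform(-0.1, 0.1) for _ in range(n_filters)]
               for _ in range(n_features)] for _ in range(k)]
    return weight, [0.0] * n_filters

rng = random.Random(0)
T, F, C = 20, 8, 64
x = [[rng.gauss(0.0, 1.0) for _ in range(F)] for _ in range(T)]
branches = [init_branch(rng, k, F, C) for k in (2, 3, 5)]
h_cnn = multi_scale_cnn(x, branches)
print(len(h_cnn), len(h_cnn[0]))  # 20 192
```

With three kernel sizes and 64 filters each, every time step ends up with a 192-dimensional feature vector, which is the dimension $D$ consumed by the attention layer.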

2.3 Self-Attention

The CNN feature output is mapped to $Q$, $K$, $V$ projections:

$$Q = H_\text{cnn} W_Q,\quad K = H_\text{cnn} W_K,\quad V = H_\text{cnn} W_V,\quad W_Q, W_K, W_V \in \mathbb{R}^{D \times d}$$

with $D = C \cdot |\text{scales}|$, $d=64$. For each time $t$, compute:

$$e_{t,i} = \frac{q_t^\top k_i}{\sqrt{d}};\qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^T \exp(e_{t,j})}$$

and attend:

$$c_t = \sum_{i=1}^T \alpha_{t,i} v_i$$

Four attention heads are used. Concatenated context vectors $\{c_1, \ldots, c_T\}$ are passed to the decoder.
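The attention equations translate almost line for line into code. The sketch below implements a single head with random projection matrices and toy dimensions; how the four heads split or share the $d=64$ projection is not specified here, so multi-head concatenation is left out. Function names are illustrative.

```python
import math
import random

def matmul(a, b):
    """Plain nested-list matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the T rows of h."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d = len(w_q[0])
    context, weights = [], []
    for q_t in q:
        # e_{t,i} = q_t . k_i / sqrt(d), then alpha_{t,.} = softmax(e_{t,.})
        scores = [sum(a * b for a, b in zip(q_t, k_i)) / math.sqrt(d) for k_i in k]
        alpha = softmax(scores)
        weights.append(alpha)
        # c_t = sum_i alpha_{t,i} * v_i
        context.append([sum(a * v_i[j] for a, v_i in zip(alpha, v)) for j in range(d)])
    return context, weights

# Toy sizes for readability; the text uses T=20, D=192, d=64.
rng = random.Random(0)
T, D, d = 5, 6, 4

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

h = rand_matrix(T, D)
w_q, w_k, w_v = rand_matrix(D, d), rand_matrix(D, d), rand_matrix(D, d)
context, weights = self_attention(h, w_q, w_k, w_v)
print(len(context), len(context[0]))  # 5 4
```

Each row of `weights` is a probability distribution over the $T$ positions, so every context vector is a convex combination of the value vectors.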

2.4 BiLSTM Decoder

A stack of $L=5$ bidirectional LSTM layers ($H=64$ per direction) processes the sequence $\{c_t\}$. In the cell equations below, $x_t$ denotes the layer input at step $t$ ($c_t$ for the first layer):

\begin{align*}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
\hat{g}_t &= \tanh(W_g [h_{t-1}, x_t] + b_g) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \hat{g}_t \\
h_t &= o_t \odot \tanh(C_t)
\end{align*}

Bidirectional outputs at all time steps are concatenated ($H_\text{dec} \in \mathbb{R}^{T \times 2H}$). Either the full decoded sequence or the final time-step embedding $h_T \in \mathbb{R}^{128}$ is used for downstream regression.
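The cell equations and the bidirectional stacking can be sketched as follows. This is an illustration with toy sizes and random weights, not the trained decoder: dropout is omitted, and the helper names and weight layout (one $H \times (H + \text{input})$ matrix per gate) are assumptions chosen to mirror the equations.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, b, v):
    """Compute W v + b for nested-list W."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def init_lstm(rng, input_size, hidden):
    """Random stand-in weights for the four gates f, i, o, g."""
    p = {}
    for gate in ("f", "i", "o", "g"):
        p["W_" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden + input_size)]
                          for _ in range(hidden)]
        p["b_" + gate] = [0.0] * hidden
    return p

def lstm_step(p, h_prev, c_prev, x_t):
    """One LSTM cell update, following the gate equations term by term."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]
    g = [math.tanh(a) for a in affine(p["W_g"], p["b_g"], z)]
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def run_lstm(p, xs, hidden):
    """Unroll one direction from a zero initial state; returns h_t for every step."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x_t in xs:
        h, c = lstm_step(p, h, c, x_t)
        outputs.append(h)
    return outputs

def bilstm_layer(fwd, bwd, xs, hidden):
    """Concatenate forward and time-reversed backward outputs at each step."""
    forward = run_lstm(fwd, xs, hidden)
    backward = run_lstm(bwd, xs[::-1], hidden)[::-1]
    return [a + b for a, b in zip(forward, backward)]

# Toy sizes; the text uses T=20, H=64, L=5 (so 2H = 128).
rng = random.Random(0)
T, D_IN, H, L = 6, 3, 4, 5
xs = [[rng.gauss(0.0, 1.0) for _ in range(D_IN)] for _ in range(T)]
layers, size = [], D_IN
for _ in range(L):
    layers.append((init_lstm(rng, size, H), init_lstm(rng, size, H)))
    size = 2 * H  # deeper layers consume the concatenated 2H outputs
h_dec = xs
for fwd, bwd in layers:
    h_dec = bilstm_layer(fwd, bwd, h_dec, H)
h_T = h_dec[-1]
print(len(h_dec), len(h_T))  # 6 8
```

Replacing the toy sizes with $T=20$, $H=64$ reproduces the $T \times 128$ output shape quoted above; a practical implementation would use a deep learning framework's LSTM layer instead of hand-written loops.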

2.5 XGBoost Regression

The feature vector for XGBoost consists of the last Bi-LSTM hidden state, optionally augmented with ARIMA residuals and CNN summaries. XGBoost constructs an additive ensemble of $K=100$ regression trees, using squared error loss and regularization:

$$\text{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k),\quad \Omega(f) = \gamma\, |\text{leaves}(f)| + \frac{1}{2}\lambda \sum_w w^2$$

with learning rate $0.1$, max depth $6$, $\lambda=1$, $\gamma=0$.
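To make the objective concrete, the following sketch evaluates it for a toy ensemble. Each tree is represented only by its list of leaf weights, which is a simplification for illustration; the numbers are made up and the function names are hypothetical.

```python
def regularization(leaf_weights, gamma=0.0, lam=1.0):
    """Omega(f) = gamma * (number of leaves) + 0.5 * lambda * sum of squared leaf weights."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_true, y_pred, trees, gamma=0.0, lam=1.0):
    """Squared-error loss over all samples plus the regularizer of every tree."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(regularization(tree, gamma, lam) for tree in trees)

# Toy example: two trees with 2 and 3 leaves, two samples.
trees = [[0.5, -0.5], [0.1, 0.2, -0.3]]
y_true, y_pred = [1.0, 2.0], [0.5, 2.5]

obj_default = objective(y_true, y_pred, trees)          # gamma=0, lambda=1
obj_pruned = objective(y_true, y_pred, trees, gamma=1)  # adds 1 per leaf
print(round(obj_default, 4), round(obj_pruned, 4))  # 0.82 5.82
```

With $\gamma=0$ the leaf count carries no penalty, so only the L2 term on leaf weights regularizes the ensemble. In the `xgboost` Python package these settings correspond to the `gamma` and `reg_lambda` parameters of `XGBRegressor`.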

3. Training Regimens and Hyperparameters

  • ARIMA: $(p,d,q) = (2,1,0)$, selected via ADF test and ACF/PACF analysis on first-differenced series.
  • CNN-Attention-BiLSTM: Look-back window $T=20$, $F=8$; CNN kernels $\{2,3,5\}$, $C=64$ per kernel; 4 attention heads, $d=64$; BiLSTM $L=5$, $H=64$ per direction; dropout $0.3$; batch size $32$; Adam optimizer, learning rate $0.01$; trained for up to $50$ epochs with early stopping ($10$ epochs no improvement).
  • XGBoost: $n_\text{estimators}=100$, learning rate $=0.1$, max_depth $=6$, subsample and colsample_bytree $=1.0$, $\lambda=1$, $\gamma=0$.
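The hyperparameters above can be gathered into one configuration object, which also makes the derived tensor widths easy to check. The dictionary layout and key names are illustrative; the XGBoost keys follow the parameter names of the `xgboost` Python package's `XGBRegressor`, with `reg_lambda` standing for $\lambda$.

```python
CONFIG = {
    "arima": {"order": (2, 1, 0)},
    "encoder": {
        "lookback": 20,
        "n_features": 8,
        "cnn_kernels": (2, 3, 5),
        "cnn_filters": 64,
        "attention_heads": 4,
        "attention_dim": 64,
        "bilstm_layers": 5,
        "bilstm_hidden": 64,
        "dropout": 0.3,
        "batch_size": 32,
        "optimizer": "adam",
        "learning_rate": 0.01,
        "max_epochs": 50,
        "early_stopping_patience": 10,
    },
    "xgboost": {
        "n_estimators": 100,
        "learning_rate": 0.1,
        "max_depth": 6,
        "subsample": 1.0,
        "colsample_bytree": 1.0,
        "reg_lambda": 1.0,
        "gamma": 0.0,
    },
}

enc = CONFIG["encoder"]
# Width of the concatenated multi-scale CNN output (C * number of kernel sizes).
cnn_width = enc["cnn_filters"] * len(enc["cnn_kernels"])
# Width of the BiLSTM embedding handed to XGBoost (2 * H).
embedding_width = 2 * enc["bilstm_hidden"]
print(cnn_width, embedding_width)  # 192 128
```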

The sequence-to-sequence neural network is first pretrained on the stock sequence; the resulting representations are then used to fit the XGBoost regressor.
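Before either stage can be trained, the feature series has to be cut into look-back windows with single-step targets. The sketch below assumes each sample pairs $T=20$ consecutive feature rows with the value on the following day, and that training and test samples are separated chronologically rather than shuffled; both helper names are hypothetical.

```python
def make_windows(features, targets, lookback=20):
    """Pair each window of `lookback` rows with the target one step after the window."""
    windows, labels = [], []
    for end in range(lookback, len(features)):
        windows.append(features[end - lookback:end])
        labels.append(targets[end])
    return windows, labels

def chronological_split(windows, labels, n_train):
    """Keep time order: the first n_train samples train, the rest test."""
    return (windows[:n_train], labels[:n_train]), (windows[n_train:], labels[n_train:])

# Toy series: 25 one-dimensional rows whose target is the row index itself.
features = [[float(i)] for i in range(25)]
targets = [float(i) for i in range(25)]
windows, labels = make_windows(features, targets)
(train_x, train_y), (test_x, test_y) = chronological_split(windows, labels, n_train=4)
print(len(windows), len(train_x), len(test_x))  # 5 4 1
```

The same windows would feed the neural encoder during pretraining and, after encoding, supply the embeddings on which the XGBoost regressor is fitted.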

4. Empirical Evaluation and Comparative Analysis

4.1 Component Ablations

Ablation experiments evaluated combinations of neural pretraining and XGBoost fine-tuning on daily closing price prediction for Bank of China (601988.SH, 2007–2022, split at 2021-06-22):

| Pretraining | Fine-tuning | RMSE | MAE | R² |
|---|---|---|---|---|
| None | None | 0.02734 | 0.02368 | 0.7440 |
| None | XGBoost | 0.01755 | 0.01223 | 0.8241 |
| SL-LSTM | SL-LSTM | 0.02282 | 0.01960 | 0.7943 |
| ML-LSTM | ML-LSTM | 0.01720 | 0.01265 | 0.8235 |
| BiLSTM | BiLSTM | 0.01652 | 0.01201 | 0.8421 |
| BiLSTM | XGBoost | 0.01605 | 0.01187 | 0.8630 |
| CNN-BiLSTM | XGBoost | 0.01529 | 0.01145 | 0.8772 |
| ACNN-BiLSTM | XGBoost | 0.01424 | 0.01126 | 0.8834 |

Results indicate that the integration of AttCLX components progressively reduces prediction error; the attention-augmented CNN-BiLSTM sequence, when coupled with XGBoost, yields the highest accuracy.

4.2 Comparison to State-of-the-Art

AttCLX was compared to classical and recent models:

| Model | RMSE | MAE | MAPE | R² |
|---|---|---|---|---|
| ARIMA | 0.02734 | 0.02368 | 0.02368 | 0.7440 |
| ARIMA-NN (’03) | 0.02608 | 0.02350 | 0.02350 | 0.7504 |
| LSTM-KF (’21) | 0.02381 | 0.02192 | 0.02192 | 0.7625 |
| Transformer-KF | 0.01924 | 0.01525 | 0.01525 | 0.8023 |
| TL-KF | 0.01656 | 0.01372 | 0.01372 | 0.8192 |
| AttCLX | 0.01424 | 0.01126 | 0.01126 | 0.8834 |

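AttCLX outperforms all baselines in RMSE, MAE, MAPE, and R². Relative to the strongest baseline, TL-KF, its RMSE falls from 0.01656 to 0.01424 (about 14% lower) and its MAE from 0.01372 to 0.01126 (about 18% lower); the margins over the Transformer-KF and LSTM-KF models are larger still.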
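For reference, the four reported metrics can be computed as in the sketch below. It uses the standard textbook definitions and expresses MAPE as a fraction rather than a percentage, which is an assumption about the reporting convention; the sample values are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; requires non-zero targets."""
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
scores = {
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
print({name: round(value, 4) for name, value in scores.items()})
```

Lower is better for RMSE, MAE, and MAPE, while R² approaches 1 as the predictions explain more of the variance in the targets.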

5. Design Principles and Architectural Rationale

  • Modular Detrending and Nonlinear Extraction: The separation of linear ARIMA preprocessing from nonlinear neural encoding enables the model to exploit both statistical and deep learning strengths.
  • Multi-scale Temporal Pattern Modeling: CNNs with multi-scale kernels detect short, local, and longer time-lag features critical in stock sequences.
  • Enhanced Context via Attention: Multi-head self-attention layers capture dependencies beyond the receptive field limits of convolutions and LSTMs, incorporating global temporal information.
  • Long-Range Memory with Deep BiLSTM: BiLSTM layers capture both forward and backward dependencies, with a depth sufficient to model complex time series phenomena.
  • Flexible Nonlinear Ensemble via XGBoost: Ensemble regression trees adaptively model any residual nonlinearity or feature interactions not captured by the preceding neural extractors.

A plausible implication is that this hybridization, particularly the use of XGBoost on neural sequence encodings, allows AttCLX to correct systematic neural errors while leveraging the expressive power of tree ensembles.

6. Significance and Application Scope

AttCLX demonstrates practical effectiveness for high-variance, nonlinear, and non-stationary financial time series forecasting. The design is extensible to other time series regimes where both long-range nonlinear dependencies and complex engineered/regression features are relevant. Empirical results suggest robust generalization and error reduction in domains where classical and pure neural approaches are suboptimal (Shi et al., 2022). Source code is available at https://github.com/zshicode/Attention-CLX-stock-prediction.
