Residual Stacked Gaussian Linear Model
- The RSGL model is a deep, residual-based forecasting architecture that employs stacked linear blocks with Gaussian Error Linear Unit (GeLU) nonlinearities to capture complex temporal dependencies.
- It incorporates RevIN normalization, dropout, and skip connections to enhance robustness against non-stationarity and gradient issues in high-dimensional data.
- Empirical results demonstrate significant improvements over shallow models, with error reductions up to 26.5% in financial and epidemiological forecasting benchmarks.
The Residual Stacked Gaussian Linear (RSGL) model is an architecture for multivariate time series forecasting and high-dimensional regression that leverages stacked linear transformations with Gaussian-based nonlinearities and residual connections. Designed to address the limitations of shallow linear models, RSGL improves long-range dependency modeling, robustness to non-stationarity, and generalization to complex datasets, including financial and epidemiological series. Its conception and enhancements over previous architectures are supported by comprehensive mathematical analysis and experimental validation on benchmark and real-world datasets.
1. Architectural Foundations and Model Formulation
The RSGL model extends the Gaussian-based Linear (GLinear) framework, which consists of a pair of fully connected (linear) layers separated by a Gaussian Error Linear Unit (GeLU) activation, with Reversible Instance Normalization (RevIN) preceding and following the core transformations (Ali, 4 Oct 2025). The RSGL model increases architectural depth by stacking four linear blocks, each organized as a residual block:
- Each residual block applies a fully connected linear transformation, a GeLU activation (commonly approximated as $\mathrm{GeLU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,(x + 0.044715\,x^{3})\right]\right)$), and a dropout layer for regularization.
- A skip connection adds the block’s input to its output: $y = x + F(x)$, where $F(\cdot)$ is the block’s nonlinear transformation.
The full pipeline uses:
- RevIN normalization: standardizes temporal inputs for adaptation to distributional shifts.
- Four stacked linear blocks: each with GeLU, dropout, and residual skip.
- RevIN denormalization: restores predictions to their original scale.
This design provides robustness to gradient vanishing/explosion in deeper linear stacks and maintains undistorted input signals through identity mapping when intermediate non-linearities yield near-zero outputs.
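A minimal PyTorch sketch of this pipeline is given below. It assumes each block is Linear → GeLU → Dropout with an additive skip, that the linear mixing acts along the time dimension, and that RevIN keeps per-instance statistics with a learnable affine transform; the layer widths, dropout rate, and final projection head are illustrative rather than taken from the source.

```python
import torch
import torch.nn as nn

class RevIN(nn.Module):
    """Reversible instance normalization: standardize each series, invert after prediction."""
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_channels))   # learnable affine scale
        self.beta = nn.Parameter(torch.zeros(num_channels))   # learnable affine shift

    def forward(self, x, mode):
        # x: (batch, seq_len, channels)
        if mode == "norm":
            self.mean = x.mean(dim=1, keepdim=True)
            self.std = x.std(dim=1, keepdim=True) + self.eps
            return (x - self.mean) / self.std * self.gamma + self.beta
        # "denorm": restore the original scale from the stored statistics
        return (x - self.beta) / self.gamma * self.std + self.mean

class ResidualLinearBlock(nn.Module):
    """Linear -> GeLU -> Dropout with an additive skip connection: y = x + F(x)."""
    def __init__(self, width, dropout=0.1):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(width, width), nn.GELU(), nn.Dropout(dropout))

    def forward(self, x):
        return x + self.body(x)

class RSGL(nn.Module):
    """RevIN norm -> four stacked residual linear blocks over time -> projection -> RevIN denorm."""
    def __init__(self, seq_len, pred_len, num_channels, dropout=0.1):
        super().__init__()
        self.revin = RevIN(num_channels)
        self.blocks = nn.Sequential(*[ResidualLinearBlock(seq_len, dropout) for _ in range(4)])
        self.head = nn.Linear(seq_len, pred_len)

    def forward(self, x):                      # x: (batch, seq_len, channels)
        x = self.revin(x, "norm")
        x = x.transpose(1, 2)                  # mix along the time dimension
        x = self.head(self.blocks(x))
        x = x.transpose(1, 2)                  # (batch, pred_len, channels)
        return self.revin(x, "denorm")

# Usage: forecast 720 steps from a 336-step window of 7 channels.
model = RSGL(seq_len=336, pred_len=720, num_channels=7)
y_hat = model(torch.randn(8, 336, 7))          # -> (8, 720, 7)
```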
2. Methodological Enhancements: Depth, Regularization, and Normalization
RSGL introduces several enhancements over shallow Gaussian linear models (Ali, 4 Oct 2025):
- Increased Depth: Instead of a single hidden layer, four sequential residual blocks enable the network to represent more complex and long-range temporal dependencies. This extends modeling capacity to multi-scale patterns inherent in multivariate time series.
- Residual Connections: By implementing skip connections within each block, RSGL mitigates the degradation effects commonly observed in deep architectures and preserves input characteristics (the identity-path derivation after this list makes the mechanism explicit).
- Dropout Regularization: Dropout after each GeLU activation provides stochastic regularization, which improves generalization and reduces overfitting risk, especially in noisy or limited-data regimes.
- RevIN Layers: Pre- and post-normalization layers ensure resilience against non-stationarity in input distributions, facilitating adaptability across various domains.
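The benefit of the skip connections can be made precise with the standard identity-path argument (a sketch of well-known reasoning, not a derivation from the source): for a stack of $L$ residual blocks with block functions $F_\ell$,

$$y_L = x + \sum_{\ell=1}^{L} F_\ell\!\left(y_{\ell-1}\right), \qquad y_0 = x, \qquad \frac{\partial y_L}{\partial x} \;=\; \prod_{\ell=1}^{L}\left(I + \frac{\partial F_\ell}{\partial y_{\ell-1}}\right),$$

so even when the block Jacobians are small (e.g., when the GeLU outputs are near zero), the identity terms prevent gradients from vanishing and pass the input through undistorted, which is the behavior described in Section 1.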
3. Connections to High-Dimensional Gaussian Linear Models
RSGL finds theoretical grounding in high-dimensional linear regression analysis (Dicker, 2012), where estimators of the residual variance ($\sigma^2$) and the signal strength ($\tau^2 = \|\beta\|^2$) are central diagnostic tools. Importantly:
- The RSGL framework supports unbiased estimation of residual variance and signal-to-noise ratio (SNR) even when the number of predictors $d$ exceeds the number of observations $n$.
- Estimators for $\sigma^2$ and $\tau^2$:
- are valid for dense signals, require no sparsity assumptions, and remain consistent as the ratio $d/n$ diverges or converges to a finite constant.
- Asymptotic normality results provide error bounds for inferential procedures related to SNR estimation.
RSGL’s estimation methodology utilizes statistical properties of the Wishart distribution and random matrix theory, ensuring robust performance in non-sparse, high-dimensional settings.
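As an illustration of how such estimators behave when $d > n$, the following is a minimal numerical sketch of moment-based estimators of $\sigma^2$ and $\tau^2$, assuming an isotropic Gaussian design with rows $x_i \sim N(0, I_d)$; the precise estimators and standardization in Dicker (2012) differ in details, so this is a schematic rather than a reproduction of that paper's procedure.

```python
import numpy as np

def variance_snr_estimates(X, y):
    """Method-of-moments estimates of the residual variance sigma^2 and the signal strength
    tau^2 = ||beta||^2 in y = X beta + eps, assuming rows of X are iid N(0, I_d).
    Usable even when d > n; no sparsity assumption on beta is required."""
    n, d = X.shape
    norm_y2 = y @ y                       # ||y||^2:      E = n (tau^2 + sigma^2)
    norm_Xty2 = np.sum((X.T @ y) ** 2)    # ||X^T y||^2:  E = n (n + d + 1) tau^2 + n d sigma^2
    tau2_hat = (norm_Xty2 - d * norm_y2) / (n * (n + 1))
    sigma2_hat = ((d + n + 1) * norm_y2 - norm_Xty2) / (n * (n + 1))
    return sigma2_hat, tau2_hat

# Simulation check in a d > n regime with a dense (non-sparse) coefficient vector.
rng = np.random.default_rng(0)
n, d, sigma, tau = 500, 2000, 1.0, 2.0
beta = rng.normal(size=d)
beta *= tau / np.linalg.norm(beta)        # scale so that ||beta||^2 = tau^2
X = rng.normal(size=(n, d))
y = X @ beta + sigma * rng.normal(size=n)
sigma2_hat, tau2_hat = variance_snr_estimates(X, y)
print(f"sigma^2 est {sigma2_hat:.2f} (true {sigma**2:.2f}); "
      f"tau^2 est {tau2_hat:.2f} (true {tau**2:.2f}); SNR est {tau2_hat / sigma2_hat:.2f}")
```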
4. Empirical Performance Across Domains
Extensive experiments on benchmark datasets (Electricity, ETTh1, Weather, Traffic, financial time series, and epidemiological data) demonstrate (Ali, 4 Oct 2025):
- Accuracy: RSGL outperforms the original GLinear and several Transformer-based forecasting models (Autoformer, Informer, etc.) in long-horizon prediction tasks. For instance, on ETTh1 at a 720-step horizon, RSGL reduced MSE by 26.5% and MAE by 14.7% relative to GLinear.
- Domain Robustness: While improvements are seen across electricity and traffic datasets, the gains are less pronounced on weather data (potentially due to inherent nonseasonal noise). RSGL shows competitive performance on financial and epidemiological datasets, indicating adaptability to various time series characteristics.
- Limitations: The architecture’s benefit diminishes when the input and prediction window lengths are equal (e.g., both 336 steps), where RSGL merely matches the GLinear baseline. This sensitivity highlights a contextual limitation for configurations with equal-length historical and forecast windows.
5. Extensions to Ensemble Gaussian Process Frameworks
Editor's term: Gaussian Process Stacked Generalisation (GP-SGL)
RSGL concepts connect to ensemble Gaussian process models employing stacked generalisation (Bhatt et al., 2016). In disease risk mapping, several non-linear base learners (including gradient boosted trees, random forests, elastic net, etc.) are combined in a level-1 Gaussian process (GP) model, embedding a spatial (or spatiotemporal) covariance kernel atop the ensemble mean. This hybrid stacking:
- Achieves superior predictive accuracy compared to individual models or unconstrained stacking.
- Offers an explicit mathematical error reduction property:
  $$\mathbb{E}\!\left[\big(y - \hat{f}_{\mathrm{GP}}\big)^{2}\right] \;\le\; \mathbb{E}\!\left[\Big(y - \sum_{k} w_{k}\,\hat{f}_{k}\Big)^{2}\right], \qquad w_{k} \ge 0,\ \sum_{k} w_{k} = 1,$$
  demonstrating that modeling the residuals with a spatial GP lowers prediction error beyond what the constrained (convex) combination of base learners alone achieves.
- Enables efficient implementation via the SPDE approach and GMRF approximations, crucial for large-scale, high-dimensional spatial inference.
Potential extensions include multi-stage stacking designs, dynamical model integration, and feature-weighted stacking schemes for nuanced residual correction.
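A compact sketch of this kind of stacked generalisation is shown below, assuming scikit-learn base learners, non-negative least squares (renormalized to the simplex) as an approximation to the convex weight constraint, and a standard RBF-kernel GP fitted to the stacking residuals over spatial coordinates; Bhatt et al. (2016) instead embed the stacked mean directly as the GP mean function and use SPDE/GMRF machinery, so this code is illustrative only.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_gp_stack(X_feat, coords, y, base_learners):
    """Level 0: out-of-fold predictions from each base learner.
    Level 1: convex weights over those predictions plus a spatial GP on the residuals."""
    # Out-of-fold predictions avoid leaking training targets into the level-1 fit.
    Z = np.column_stack([cross_val_predict(m, X_feat, y, cv=5) for m in base_learners])
    w, _ = nnls(Z, y)                        # non-negative weights...
    w = w / w.sum()                          # ...renormalized to the simplex (approximate convex stacking)
    for m in base_learners:                  # refit every learner on all data for prediction time
        m.fit(X_feat, y)
    resid = y - Z @ w
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(), normalize_y=True)
    gp.fit(coords, resid)                    # spatial covariance over the stacking residuals
    return w, gp

def predict_gp_stack(X_feat, coords, base_learners, w, gp):
    Z = np.column_stack([m.predict(X_feat) for m in base_learners])
    return Z @ w + gp.predict(coords)        # stacked mean plus spatially smoothed residual

# Toy usage with synthetic covariates and 2-D spatial coordinates.
rng = np.random.default_rng(0)
X_feat, coords = rng.normal(size=(300, 5)), rng.uniform(size=(300, 2))
y = X_feat[:, 0] ** 2 + np.sin(6 * coords[:, 0]) + 0.1 * rng.normal(size=300)
learners = [GradientBoostingRegressor(), RandomForestRegressor(n_estimators=100), ElasticNet(alpha=0.1)]
w, gp = fit_gp_stack(X_feat, coords, y, learners)
print("convex weights:", np.round(w, 3))
```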
6. Theoretical and Algorithmic Connections to Residual Component Analysis (RCA)
RCA (Residual Component Analysis) generalizes PPCA by decomposing observed variance into structured (explained) and residual components using a generalized eigenvalue problem (Kalaitzis et al., 2012):
- RCA solves the generalized eigenvalue problem $\mathbf{S}\mathbf{v} = \lambda\,\boldsymbol{\Sigma}\mathbf{v}$ for generalized eigenvectors $\mathbf{v}$ and eigenvalues $\lambda$, where $\mathbf{S}$ is the sample covariance and $\boldsymbol{\Sigma}$ encodes known covariance structure, yielding latent subspaces that capture post-explanation residual variance.
- The dual decomposition into a low-rank and sparse-inverse covariance factor links RSGL’s hierarchical residual modeling principles to broader latent variable frameworks and Gaussian graphical models.
- Iterative EM/RCA hybrid algorithms alternate between latent confounder updates and residual structure estimation, providing interpretable decomposition in complex datasets (protein networks, gene expression, human pose estimation).
A plausible implication is that RSGL's stacked layer approach mirrors iterative extraction of residual structure analogous to RCA steps, particularly when multiple residual components exist at different scales or domains.
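A small sketch of the generalized eigenvalue step that underlies this connection is given below, assuming the known structure is supplied as a covariance matrix Sigma and using SciPy's symmetric generalized eigensolver; the toy data, variable names, and the isotropic choice of Sigma are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def residual_components(Y, Sigma, n_components=2):
    """RCA-style step: directions of maximal sample variance *relative to* a known covariance Sigma.
    Solves S v = lambda Sigma v; generalized eigenvalues well above 1 flag residual structure
    that the known covariance does not explain."""
    S = np.cov(Y, rowvar=False)                 # sample covariance of the observations
    evals, evecs = eigh(S, Sigma)               # symmetric generalized eigenproblem
    order = np.argsort(evals)[::-1]             # largest residual-variance directions first
    return evals[order][:n_components], evecs[:, order][:, :n_components]

# Toy usage: isotropic "explained" covariance plus a hidden rank-1 residual direction.
rng = np.random.default_rng(0)
d, n = 10, 500
Sigma = np.eye(d)                               # assumed known/explained structure
u = rng.normal(size=d); u /= np.linalg.norm(u)  # hidden residual direction
Y = rng.normal(size=(n, d)) + 2.0 * rng.normal(size=(n, 1)) @ u[None, :]
evals, evecs = residual_components(Y, Sigma, n_components=1)
print("top generalized eigenvalue:", round(float(evals[0]), 2),
      "| alignment with hidden direction:", round(abs(float(evecs[:, 0] @ u)), 2))
```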
7. Practical Applications and Generalization
RSGL is applicable in:
- Financial Forecasting: Handling volatile, non-stationary asset series with robust long-range dependency modeling capabilities.
- Epidemiological Forecasting: Estimating disease risk (Influenza-like Illness, malaria prevalence) under data scarcity and complex covariate interactions.
- Benchmark Multivariate Forecasting: Electricity load, weather, and traffic prediction tasks where computational efficiency and data scalability are paramount.
Challenges include sensitivity to input/output window configuration, increased computational requirements with deeper stacking, and potential vulnerability to regime shifts in real-world nonstationary sequences. Further research is suggested to optimize normalization and residual modeling under complex data regimes.
Summary Table: RSGL Model Attributes
| Attribute | RSGL Model Description | Comparative Aspect |
|---|---|---|
| Architecture Depth | Four stacked linear residual blocks with GeLU and dropout | Deeper than GLinear |
| Residual Handling | Block-wise skip connection, implicit identity mapping preservation | Hierarchical residual extraction |
| Domain Adaptation | RevIN normalization for input/output shifts | Improved non-stationarity handling |
| Performance | Superior long-range accuracy on several benchmarks | Competitive with Transformers |
| Limitation | Equal input–output window decreases benefit | Context-dependent |
| Theoretical Basis | High-dimensional variance/SNR estimator; links to RCA decomposition | No sparsity requirement |
The RSGL model provides a computationally lightweight, data-efficient, and technically robust solution for multivariate time series forecasting and regression in high dimensions. Its layered, residual design, supported by empirical and theoretical results, offers a pragmatic alternative to more complex nonlinear architectures, while remaining sensitive to domain, data characteristics, and stacking depth.