
DeepVARwT: Adaptive Shrinkage in VAR Models

Updated 10 February 2026
  • The paper introduces a joint estimation framework that integrates trend and VAR coefficient estimation into a single maximum likelihood optimization to propagate uncertainty.
  • DeepVARwT employs an LSTM network to derive both trend components and preliminary VAR parameters, with causality enforced through the Ansley–Kohn transform.
  • Empirical results demonstrate that the method reduces bias, mean squared error, and forecast errors compared to traditional two-stage approaches.

A locally adaptive shrinkage technique in time series analysis aims to estimate model components—such as trends or dependence structure—while adaptively regularizing or controlling model complexity over time, typically to accommodate nonstationarity, local regime changes, or persistent uncertainties. The DeepVARwT framework ("Deep Learning for a VAR Model with Trend") realizes this philosophy by leveraging deep recurrent architectures to jointly estimate both deterministic trends and the inter-series dependence structure in a vector autoregressive (VAR) system, with explicit enforcement of causality and maximum likelihood as the estimation backbone (Li et al., 2022).

1. Model Structure: VAR with Deterministic Trend

DeepVARwT models a multivariate time series $\mathbf{y}_t \in \mathbb{R}^m$ as a VAR($p$) process around a time-dependent mean:

$$\mathbf{y}_t = \bm{\mu}_t + \sum_{i=1}^{p} A_i\,(\mathbf{y}_{t-i} - \bm{\mu}_{t-i}) + \bm{\varepsilon}_t,$$

where $\bm{\varepsilon}_t \stackrel{iid}{\sim} N(\mathbf{0}, \Sigma)$, the $A_i \in \mathbb{R}^{m \times m}$ are coefficient matrices, and $\bm{\mu}_t$ is a deterministic, possibly nonlinear trend.
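As a concrete illustration, the generative model above can be simulated in a few lines. The trend, coefficients, and covariance below are toy values chosen for the sketch, not the paper's estimates:

```python
import numpy as np

# Illustrative sketch (not the paper's code): simulate a bivariate VAR(1)
# around a deterministic trend, y_t = mu_t + A_1 (y_{t-1} - mu_{t-1}) + eps_t.
rng = np.random.default_rng(0)

T, m = 200, 2
A1 = np.array([[0.5, 0.1],
               [0.0, 0.4]])              # causal: eigenvalues inside unit circle
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])           # innovation covariance
t = np.arange(1, T + 1)
mu = np.stack([0.01 * t, 5 + 2 * np.sin(2 * np.pi * t / T)], axis=1)  # (T, m) trend

eps = rng.multivariate_normal(np.zeros(m), Sigma, size=T)
z = np.zeros((T, m))                     # deviations from the trend
for s in range(1, T):
    z[s] = A1 @ z[s - 1] + eps[s]
y = mu + z                               # observed series, shape (200, 2)
```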

Traditional approaches isolate trend estimation (e.g., via polynomials or kernel smoothing) and then fit the VAR, but this two-stage procedure neglects the uncertainty in trend estimation, which propagates bias into coefficient inference and forecasting, especially in the latter part of the series. DeepVARwT instead poses a joint estimation problem for both the trend field μt\bm\mu_t and VAR parameters (Ai,Σ)(A_i, \Sigma), propagating uncertainty across components in a single maximum likelihood optimization (Li et al., 2022).

2. LSTM Architecture for Joint Trend and VAR Parameter Estimation

DeepVARwT employs a Long Short-Term Memory (LSTM) network to process known time-based features, such as $\mathbf{x}_t = (t, t^2, t^3, 1/t, 1/t^2, 1/t^3)^\top$, generating at each time step $t$ a hidden state $\mathbf{h}_t \in \mathbb{R}^d$ via the standard LSTM recursion:

$$\begin{aligned}
\mathbf{i}_t &= \sigma(W_{xi}\,\mathbf{x}_t + W_{hi}\,\mathbf{h}_{t-1} + b_i) \\
\mathbf{f}_t &= \sigma(W_{xf}\,\mathbf{x}_t + W_{hf}\,\mathbf{h}_{t-1} + b_f) \\
\tilde{\mathbf{c}}_t &= \tanh(W_{xc}\,\mathbf{x}_t + W_{hc}\,\mathbf{h}_{t-1} + b_c) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \\
\mathbf{o}_t &= \sigma(W_{xo}\,\mathbf{x}_t + W_{ho}\,\mathbf{h}_{t-1} + b_o) \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\end{aligned}$$

The LSTM output is linearly mapped to both the trend component and preliminary VAR parameters:

  • $\bm{\mu}_t = W_\mu \mathbf{h}_t + \mathbf{b}_\mu$, with $W_\mu \in \mathbb{R}^{m \times d}$ and $\mathbf{b}_\mu \in \mathbb{R}^m$;
  • a set of candidate VAR($p$) coefficient matrices $\{\widetilde{A}_i\}$ and a Cholesky factor $L$ such that $\Sigma = L L^\top$.

All these mappings are realized using a final fully connected layer of dimension $m^2 p + m(m+1)/2$, partitioned to output the necessary parameters for the trend, VAR coefficients, and noise covariance (Li et al., 2022).
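A minimal sketch of this output partitioning follows; all weights here are random stand-ins and shape bookkeeping is the point, since the exact layer layout in the paper may differ:

```python
import numpy as np

# Hypothetical sketch: a hidden state h_t of dimension d is mapped linearly to
# (i) the m-dimensional trend mu_t and (ii) a parameter vector holding m^2*p
# raw VAR coefficients plus m*(m+1)/2 entries of the Cholesky factor L.
rng = np.random.default_rng(1)
m, p, d = 3, 2, 20

W_mu = rng.normal(size=(m, d)); b_mu = np.zeros(m)
n_var = m * m * p                  # raw VAR coefficient count
n_chol = m * (m + 1) // 2          # lower-triangular entries of L
W_par = rng.normal(size=(n_var + n_chol, d)); b_par = np.zeros(n_var + n_chol)

h_t = rng.normal(size=d)           # stand-in for the LSTM hidden state
mu_t = W_mu @ h_t + b_mu           # trend at time t

theta = W_par @ h_t + b_par
A_raw = theta[:n_var].reshape(p, m, m)        # preliminary A~_1, ..., A~_p
L = np.zeros((m, m))
L[np.tril_indices(m)] = theta[n_var:]         # fill the lower triangle
L[np.diag_indices(m)] = np.exp(np.diag(L))    # positive diagonal for validity
Sigma = L @ L.T                               # positive-definite covariance
```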

3. Likelihood-Based Joint Estimation and Optimization

The Gaussian log-likelihood for observed data $\{\mathbf{y}_t\}_{t=1}^{T}$ under the model is

$$\ell(\theta) = -\frac{1}{2}\left[(T-p)\log|\Sigma| + \sum_{t=p+1}^{T} \mathbf{e}_t^\top \Sigma^{-1}\, \mathbf{e}_t\right], \qquad \mathbf{e}_t = \mathbf{y}_t - \bm{\mu}_t - \sum_{i=1}^{p} A_i\,(\mathbf{y}_{t-i} - \bm{\mu}_{t-i}),$$

where $\theta$ collectively denotes all parameters (LSTM weights, trend/VAR mappings, and the Cholesky factor). The negative log-likelihood $\mathcal{L}(\theta) = -\ell(\theta)$ serves as the loss, and parameters are optimized via back-propagation with AdaGrad. Because trend estimation error and VAR parameter uncertainty are updated jointly within a single objective, the resulting inference is statistically efficient (Li et al., 2022).
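The loss above can be evaluated directly. The NumPy sketch below computes $\mathcal{L}(\theta)$ term by term for given trend and parameters; the values are toy inputs, not fitted estimates:

```python
import numpy as np

# Minimal sketch of the negative log-likelihood (up to an additive constant),
# assuming mu, A_i, and Sigma are already given.
rng = np.random.default_rng(2)
T, m = 100, 2
y = rng.normal(size=(T, m))
mu = np.zeros((T, m))                     # toy trend
A = [0.5 * np.eye(m)]                     # A_1 for a VAR(1)
Sigma = np.eye(m)

def neg_loglik(y, mu, A, Sigma):
    p, T = len(A), y.shape[0]
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = 0.0
    for t in range(p, T):
        e = (y[t] - mu[t]) - sum(A[i] @ (y[t - 1 - i] - mu[t - 1 - i])
                                 for i in range(p))
        quad += e @ Sinv @ e              # e_t' Sigma^{-1} e_t
    return 0.5 * ((T - p) * logdet + quad)

nll = neg_loglik(y, mu, A, Sigma)         # finite, positive for Sigma = I
```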

4. Causality Enforcement via Ansley–Kohn Transform

Causality (stability) of a VAR($p$) requires that all roots of $\det(I - A_1 z - \cdots - A_p z^p)$ lie outside the unit circle. Rather than impose this constraint directly, DeepVARwT maps the preliminary coefficients $\{\widetilde{A}_i\}$ to partial autocorrelation matrices $\{P_i\}$:

$$P_i = B_i^{-1}\,\widetilde{A}_i, \qquad B_i B_i^\top = I + \widetilde{A}_i \widetilde{A}_i^\top.$$

The Ansley–Kohn recursion then recovers coefficient matrices $A_i$ that are guaranteed causal, and these are used in the likelihood evaluation at every gradient step, so training is automatically constrained to stable VAR solutions (Li et al., 2022).
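For the simplest case $p = 1$ the construction reduces to $A_1 = P_1$, and the squashing effect is easy to verify numerically: $P_1$ always has singular values below 1, hence spectral radius below 1. A sketch of this single-lag case (the full multi-lag recursion is more involved and omitted here):

```python
import numpy as np

# Map an arbitrary (possibly explosive) raw matrix to a causal VAR(1)
# coefficient: P = B^{-1} A_raw with B B^T = I + A_raw A_raw^T.
rng = np.random.default_rng(3)
m = 3
A_raw = rng.normal(scale=2.0, size=(m, m))    # unconstrained candidate

B = np.linalg.cholesky(np.eye(m) + A_raw @ A_raw.T)
P = np.linalg.solve(B, A_raw)                 # B^{-1} A_raw

sv_max = np.linalg.svd(P, compute_uv=False).max()   # < 1 by construction
rho = np.abs(np.linalg.eigvals(P)).max()            # <= sv_max < 1 => causal
```

The guarantee follows from $P P^\top = I - B^{-1} B^{-\top}$, which is strictly inside the identity, so every singular value of $P$ is below 1.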

5. Training Protocols and Implementation Details

The model is implemented in PyTorch, with the LSTM (for trend estimation) unrolled across time and the downstream likelihood evaluation integrated as a computational block. Two separate learning rates are used: $\eta_1 \approx 10^{-3}$ for the LSTM/trend weights and $\eta_2 \approx 10^{-2}$ for the VAR/Cholesky parameters. Initialization employs (i) a nonlinear least-squares fit of $\mathbf{y}_t \approx \bm{\mu}_t$ for the trend/LSTM parameters and (ii) an OLS VAR($p$) fit on pre-detrended data for the initial raw $\widetilde{A}_i$ and $L$.
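In PyTorch the two learning rates would typically be realized as optimizer parameter groups; as a dependency-free sketch of the underlying rule, here is the AdaGrad update applied to two toy one-parameter groups with the rates quoted above:

```python
import numpy as np

# Sketch of AdaGrad with two learning rates (toy 1-D quadratics, minimized at
# 0); illustrative only, not the paper's training loop.
def adagrad_step(param, grad, accum, lr, eps=1e-8):
    accum += grad ** 2                          # per-coordinate accumulator
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

trend_w, var_w = np.array([5.0]), np.array([5.0])
acc_t, acc_v = np.zeros(1), np.zeros(1)
for _ in range(200):
    trend_w, acc_t = adagrad_step(trend_w, 2 * trend_w, acc_t, lr=1e-3)
    var_w, acc_v = adagrad_step(var_w, 2 * var_w, acc_v, lr=1e-2)
# var_w moves farther toward the optimum under the larger rate
```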

Typical hyperparameters include a hidden dimension $d = 20$ and VAR order $p \in \{2, 3, 4\}$. Training runs for at most $K \approx 500$ updates with a convergence tolerance of $10^{-5}$. All real-data experiments use sliding-window forecasting: a rolling window of length $T_{\rm train}$ is used to fit the model and produce $h$-step-ahead predictions, after which the window advances for the next forecast cycle (Li et al., 2022).
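The sliding-window protocol can be sketched as follows; `naive_fit` and `naive_forecast` are hypothetical stand-ins for DeepVARwT training and prediction:

```python
import numpy as np

# Schematic sliding-window evaluation: refit on each window of length T_train,
# forecast h steps ahead, then advance the window by one step.
def naive_fit(window):                 # stand-in for model training
    return window.mean(axis=0)

def naive_forecast(model, h):          # stand-in: constant-mean forecast
    return np.tile(model, (h, 1))

y = np.cumsum(np.ones((30, 2)), axis=0)    # toy bivariate series
T_train, h = 20, 3
forecasts = []
for start in range(0, len(y) - T_train - h + 1):
    window = y[start:start + T_train]
    model = naive_fit(window)
    forecasts.append(naive_forecast(model, h))   # each is (h, m)
```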

6. Empirical Evaluation: Simulation and Real-Data Studies

Simulation studies use VAR(2)-plus-trend series ($m = 3$, $T = 800$), with coefficients and noise derived from empirical US stock-return data and $\bm{\mu}_t$ reflecting real-world trends via kernel smoothing. Over 100 Monte Carlo replications, trend error (MAD), coefficient bias, variance, and MSE are tabulated.

Findings include:

  • DeepVARwT yields trend estimates that closely follow ground truth, particularly near local extrema, outperforming high-order polynomial detrending (VARwT).
  • VAR coefficients are estimated with consistently smaller bias and lower total MSE than two-stage approaches (Li et al., 2022).

In real-data studies, DeepVARwT is benchmarked against VARwT (OLS-fitted polynomial trend), DeepAR [Salinas et al., 2020], and DeepState [Rangapuram et al., 2018] in three settings:

  • US macroeconomic data (GDP gap, inflation, Fed funds rate),
  • global temperature anomalies (Northern/Southern hemispheres, tropics),
  • further US macro (inflation, unemployment, T-bill rate).

Metrics for evaluation include Absolute Percentage Error (APE) and the Scaled Interval Score (SIS) for 95% prediction intervals. Across these settings, DeepVARwT achieved:

  • Multi-horizon APE reductions of up to 50% versus VARwT,
  • Sharper and more accurate predictive intervals than DeepAR and DeepState for key series,
  • White and near-Gaussian residuals, stable parameter estimates, and improved forecast sharpness at medium-to-long horizons (Li et al., 2022).
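For reference, the two headline metrics can be implemented as follows, using one common definition of each; the paper's exact scaling of the interval score may differ (SIS divides the interval score by a naive-forecast error to make it scale-free):

```python
import numpy as np

def ape(y, yhat):
    """Absolute percentage error, elementwise."""
    return np.abs(y - yhat) / np.abs(y)

def interval_score(y, lower, upper, alpha=0.05):
    """Interval score for a (1 - alpha) prediction interval; lower is better.
    Width plus penalties of 2/alpha per unit of coverage violation."""
    width = upper - lower
    below = (2 / alpha) * (lower - y) * (y < lower)
    above = (2 / alpha) * (y - upper) * (y > upper)
    return width + below + above

y = np.array([10.0, 12.0, 11.0])
yhat = np.array([9.0, 12.6, 11.0])
errs = ape(y, yhat)                       # approx [0.1, 0.05, 0.0]
lo, hi = yhat - 2.0, yhat + 2.0
score = interval_score(y, lo, hi).mean()  # approx 4.0: all points covered,
                                          # so the score is the interval width
```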

Summary of empirical results:

| Dataset / Series | DeepVARwT Gains vs. Baseline | Metrics Improved |
|---|---|---|
| US macro (GDP gap, inflation, Fed funds) | Up to 50% APE reduction, sharper SIS | APE, SIS |
| Global temperature anomalies | Lowest APE and SIS across all horizons | APE, SIS |
| US macro (inflation, unemployment, T-bill) | Leading APE and SIS for unemployment and T-bill | APE, SIS |

7. Advantages, Limitations, and Extensions

DeepVARwT achieves several benefits:

  • Fully joint trend and VAR coefficient estimation, mitigating underestimation of error due to prior detrending,
  • Flexible, non-polynomial trends via the LSTM backbone,
  • Direct maximum likelihood estimation for statistical efficiency,
  • Rigorous causal enforcement for VAR via the Ansley–Kohn transform.

Known limitations include increased computational cost relative to OLS-based VARwT, especially for long or higher-dimensional series ($m$ large), and the assumption of conditionally Gaussian residuals.

Potential extensions discussed:

  • High-dimensional regularization or structured VAR for scalability,
  • Non-Gaussian models via variational methods or copula-based techniques (such as normalizing flows),
  • Time-varying VAR coefficients through time-indexed $A_i(t)$ output by the LSTM.

DeepVARwT bridges traditional interpretable models and flexible deep architectures, offering improvements in multi-step forecasting and uncertainty quantification over both two-stage detrending and deep learning time-series models that ignore inter-series dependence (Li et al., 2022).
