Papers
Topics
Authors
Recent
Search
2000 character limit reached

SARIMA Simulators for Synthetic Time Series

Updated 7 June 2026
  • SARIMA-based simulators are computational frameworks that use seasonal and nonseasonal ARIMA components to simulate realistic time series with modular, noise-enhanced properties.
  • They integrate classical and modern simulation workflows by combining stable-pole sampling, multi-seasonal blending, and heavy-tailed noise addition for robust counterfactual and zero-shot forecasting.
  • These simulators enable high-throughput generation of synthetic data, optimizing model validations and pretraining in diverse applications ranging from economic policy to meteorological forecasting.

A SARIMA-based simulator is a computational framework that generates synthetic time series data or conditional trajectories by leveraging the Seasonal Autoregressive Integrated Moving Average (SARIMA) model family. These simulators provide both physically motivated and statistically flexible surrogates for real-world time series, supporting use cases such as model validation, scenario analysis, privacy-preserving benchmarking, and, more recently, large-scale simulation for zero-shot forecasting in machine learning. Recent advances have led to highly modular SARIMA-based simulators, which couple classical time series methodology with algorithmic innovations that address stability, multi-seasonality, and non-Gaussian noise characteristics.

1. Formal Structure of the SARIMA Model

The SARIMA (p,d,q)(P,D,Q)s(p, d, q)(P, D, Q)_s model extends the conventional ARIMA framework by integrating seasonal differencing and seasonal autoregressive and moving average terms. Its general operator form is

(1−∑i=1pϕiLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t

where LL denotes the lag operator, ss is the seasonality period, ϕi\phi_i, θi\theta_i, Φj\Phi_j, Θj\Theta_j are the non-seasonal and seasonal AR and MA coefficients, and εt\varepsilon_t is an i.i.d. innovation, often Gaussian but potentially from a heavier-tailed distribution in advanced simulators. Stability and invertibility are imposed by constraining the characteristic roots of the AR and MA polynomials to lie inside the unit circle, with explicit construction via polynomial root sampling in modern simulators such as SarSim0 (Oreshkin et al., 2 Jan 2026).

2. Simulation Workflows and Algorithmic Variants

Classical Conditional SARIMA Simulation

Traditional SARIMA-based simulators, as detailed in "Simulating the Continuation of a Time Series in R" (Sak et al., 2012), operate by:

  • Estimating model parameters (typically by maximum likelihood or information criteria-driven model selection).
  • Conditioning on observed data up to time tt, then recursively simulating (1−∑i=1pÏ•iLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t0 for (1−∑i=1pÏ•iLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t1 from (1−∑i=1pÏ•iLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t2, propagating the model recursion with stored lagged values, and inverting differencing to obtain simulated trajectories.
  • Generating multiple Monte Carlo sample paths to approximate the future distribution, which enables quantile-based uncertainty estimation.

This approach is widely implemented in open-source libraries (e.g., R's arima.condsim/simulate(), Python's statsmodels.[SARIMAX](https://www.emergentmind.com/topics/sarimax-sarima-with-exogenous-variables).simulate()) and forms the foundation of simulation-based counterfactual forecasting (Andrade et al., 10 Mar 2025).

Modern Modular SARIMA Simulators

SarSim0 (Oreshkin et al., 2 Jan 2026) exemplifies recent advances suitable for high-throughput simulation, with a three-stage procedure:

  1. Stable-pole sampling: Randomly sample SARIMA model orders and AR/MA polynomial roots within strict stability regions (all (1−∑i=1pϕiLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t3).
  2. Multi-seasonal superposition: Combine independent SARIMA trajectories (with distinct seasonalities) via additive or multiplicative interactions to yield richer, nontrivial seasonal patterns typical of industrial time series.
  3. Non-Gaussian noise addition: Augment the deterministic output with heavy-tailed rate-based noise (Poisson, generalized Gamma, or log-normal), parametrized by a level-dependent rate (1−∑i=1pϕiLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t4, to embed burstiness and intermittency.

This process supports vectorized generation of billions of synthetic series on modern accelerators, matching the scale required for pretraining foundation models under zero-shot protocols.

3. Order Selection, Model Fitting, and Diagnostics

Order selection is pivotal for building well-calibrated SARIMA-based simulators. Standard workflows involve:

  • Defining candidate ranges for nonseasonal ((1−∑i=1pÏ•iLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t5) and seasonal ((1−∑i=1pÏ•iLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t6) orders, and differencing degrees ((1−∑i=1pÏ•iLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t7), as in (1−∑i=1pÏ•iLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t8 for each AR or MA parameter and (1−∑i=1pÏ•iLi−∑j=1PΦjLjs)(1−L)d(1−Ls)Dyt=(1+∑i=1qθiLi+∑j=1QΘjLjs)εt(1 - \sum_{i=1}^p \phi_i L^i - \sum_{j=1}^P \Phi_j L^{js}) (1-L)^d (1-L^s)^D y_t = (1 + \sum_{i=1}^q \theta_i L^i + \sum_{j=1}^Q \Theta_j L^{js}) \varepsilon_t9 (Andrade et al., 10 Mar 2025).
  • Fitting each candidate model via conditional or exact maximum likelihood, often automated by stepwise selection procedures and penalized likelihood criteria (AIC, BIC, AICc).
  • Holding out a test window for one-step-ahead forecasting MSE.
  • Evaluating model fitness using RMSE, MAPE, residual whiteness tests (Box-Pierce/Ljung), and normality assessments (Kolmogorov–Smirnov).

Residuals must exhibit white-noise characteristics and stationarity for the simulated continuations to be credible. Model mis-specification, detected by residual autocorrelation, non-stationarity, or skewed error distributions, can bias simulations (Sak et al., 2012).

4. Applications and Empirical Performance

SARIMA-based simulators are deployed across diverse domains including economic policy evaluation, meteorological forecasting, and machine learning benchmark generation.

  • In counterfactual economic forecasting, such as in projections for currency circulation under pandemic scenarios, SARIMA-based simulators enable the generation of hypothetical time series (e.g., simulating 2020–2022 without COVID-19 shocks), leveraging Monte Carlo resampling for confidence interval estimation (Andrade et al., 10 Mar 2025).
  • In hybrid statistical–ML pipelines, SARIMA-based simulators act as baseline generators or trend extractors, supporting advanced models such as hybrid SARIMA–LSTM networks for weather forecasting. Here, the SARIMA unit is used to model the robust seasonal component, and an LSTM residual learner is trained on the remaining non-linear error series, substantially reducing long-term forecast error versus either model in isolation (Rajeev et al., 12 Jan 2026).
  • For foundation model pretraining, SarSim0 facilitates the generation of circa 1B unique, leakage-free synthetic time series on-the-fly, enabling strict zero-shot evaluation and yielding "student-beats-teacher" phenomena, where pretrained neural models surpass the ARIMA process used for simulation on established forecasting benchmarks (GiftEval and M-Series) (Oreshkin et al., 2 Jan 2026).

The quantitative impact is summarized in the following table:

Model+Simulator GiftEval sCRPS / MASE M-Series sCRPS / MASE
AutoARIMA 0.912 / 1.074 0.096 / 0.843
NBEATS + SarSim0 0.602 / 0.849 0.096 / 0.869
PatchTST + SarSim0 0.573 / 0.837 0.097 / 0.877
Chronos-Small + SarSim0 0.608 / 0.878 0.100 / 0.896

On GiftEval, SARIMA-based simulation for zero-shot pretraining produces time-series models that decisively outperform the AutoARIMA generator itself, evidence of a robust simulation-based paradigm (Oreshkin et al., 2 Jan 2026).

5. Enhancements: Non-Gaussian Innovations and Exogenous Variables

Recent advancements have extended SARIMA-based simulators to support greater realism:

  • Heavy-tailed and rate-based innovations: Model innovations are drawn from Poisson, generalized Gamma, or log-normal distributions, with rate parameters dependent on the signal level, capturing phenomena such as burstiness and temporal intermittency (Oreshkin et al., 2 Jan 2026). This is critical for domains where Gaussian assumptions do not hold.
  • Exogenous input integration (SARIMAX): When covariates such as macroeconomic indicators, meteorological variables, or policy variables are available, the SARIMA framework is extended to SARIMAX, introducing regression on contemporaneous and lagged exogenous variables. Rigorous pretesting for stationarity and collinearity, as well as careful lag selection, are required for consistent simulation of exogenous impacts (Andrade et al., 10 Mar 2025, Rajeev et al., 12 Jan 2026).
  • Modeling regime shifts or GARCH-style volatility: Suggestions for future SARIMA-based simulation include incorporating regime-switching mechanisms and GARCH residual models to address structural breaks and volatility clustering, though standard workflows remain primarily univariate and assume homoskedasticity (Andrade et al., 10 Mar 2025).

6. Computational Considerations and Implementation

The computational efficiency of SARIMA-based simulators depends on parameterization and algorithmic design:

  • The core SARIMA simulation recursion is LL0 per series, vectorizable across massive batches (LL1) for parallelized generation on accelerators (Oreshkin et al., 2 Jan 2026).
  • In contrast, kernel-based generators (e.g., Gaussian Process with structured kernels) scale as LL2 per series due to covariance matrix factorization, limiting practical throughput.
  • For example, SarSim0 generates a LL3-step series in less than LL4 ms, over LL5 faster than classical kernel-synthesis methods (LL6–LL7 ms) (Oreshkin et al., 2 Jan 2026).
  • Efficient simulation routines for conditional trajectories, as exemplified by arima.condsim in R (Sak et al., 2012), require correct management of lag initialization, innovative seed generation, and sequential inversion of differencing for proper path realization.

Practical implementation is facilitated in both Python (via statsmodels.SARIMAX.simulate) and R (simulate.Arima, arima.condsim), with workflows that allow for simulation, diagnostics, and Monte Carlo uncertainty quantification.

7. Limitations and Prospects

Limitations of SARIMA-based simulators include:

  • Inability to model nonlinear or highly nonstationary transitions unless explicitly augmented (e.g., via LSTM-residuals or regime-switching mechanisms) (Rajeev et al., 12 Jan 2026).
  • Long-term forecasts subject to increasing uncertainty, especially in the presence of structural breaks or exogenous regime shifts (Andrade et al., 10 Mar 2025).
  • For highly multivariate problems, univariate SARIMA-based simulation omits cross-series dependencies unless generalized to the vector autoregressive (VARIMA) context.

Future directions include expanding to multivariate SARIMAX with covariate-driven dynamics, online structural-break detection, and coupling with differentiable simulators for neural time-series model pretraining (Oreshkin et al., 2 Jan 2026, Rajeev et al., 12 Jan 2026).


SARIMA-based simulators provide a principled and computationally tractable foundation for synthetic time series generation, rigorous counterfactual analysis, and enablement of zero-shot machine learning workflows through scalable modularization, advanced noise modeling, and integration with modern deep learning residual frameworks (Sak et al., 2012, Andrade et al., 10 Mar 2025, Oreshkin et al., 2 Jan 2026, Rajeev et al., 12 Jan 2026).

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SARIMA-Based Simulators.