SARIMA Simulators for Synthetic Time Series
- SARIMA-based simulators are computational frameworks that use seasonal and nonseasonal ARIMA components to simulate realistic time series with modular, noise-enhanced properties.
- They integrate classical and modern simulation workflows by combining stable-pole sampling, multi-seasonal blending, and heavy-tailed noise addition for robust counterfactual and zero-shot forecasting.
- These simulators enable high-throughput generation of synthetic data, optimizing model validations and pretraining in diverse applications ranging from economic policy to meteorological forecasting.
A SARIMA-based simulator is a computational framework that generates synthetic time series data or conditional trajectories by leveraging the Seasonal Autoregressive Integrated Moving Average (SARIMA) model family. These simulators provide both physically motivated and statistically flexible surrogates for real-world time series, supporting use cases such as model validation, scenario analysis, privacy-preserving benchmarking, and, more recently, large-scale simulation for zero-shot forecasting in machine learning. Recent advances have led to highly modular SARIMA-based simulators, which couple classical time series methodology with algorithmic innovations that address stability, multi-seasonality, and non-Gaussian noise characteristics.
1. Formal Structure of the SARIMA Model
The SARIMA model extends the conventional ARIMA framework by integrating seasonal differencing and seasonal autoregressive and moving average terms. Its general operator form is
where denotes the lag operator, is the seasonality period, , , , are the non-seasonal and seasonal AR and MA coefficients, and is an i.i.d. innovation, often Gaussian but potentially from a heavier-tailed distribution in advanced simulators. Stability and invertibility are imposed by constraining the characteristic roots of the AR and MA polynomials to lie inside the unit circle, with explicit construction via polynomial root sampling in modern simulators such as SarSim0 (Oreshkin et al., 2 Jan 2026).
2. Simulation Workflows and Algorithmic Variants
Classical Conditional SARIMA Simulation
Traditional SARIMA-based simulators, as detailed in "Simulating the Continuation of a Time Series in R" (Sak et al., 2012), operate by:
- Estimating model parameters (typically by maximum likelihood or information criteria-driven model selection).
- Conditioning on observed data up to time , then recursively simulating 0 for 1 from 2, propagating the model recursion with stored lagged values, and inverting differencing to obtain simulated trajectories.
- Generating multiple Monte Carlo sample paths to approximate the future distribution, which enables quantile-based uncertainty estimation.
This approach is widely implemented in open-source libraries (e.g., R's arima.condsim/simulate(), Python's statsmodels.[SARIMAX](https://www.emergentmind.com/topics/sarimax-sarima-with-exogenous-variables).simulate()) and forms the foundation of simulation-based counterfactual forecasting (Andrade et al., 10 Mar 2025).
Modern Modular SARIMA Simulators
SarSim0 (Oreshkin et al., 2 Jan 2026) exemplifies recent advances suitable for high-throughput simulation, with a three-stage procedure:
- Stable-pole sampling: Randomly sample SARIMA model orders and AR/MA polynomial roots within strict stability regions (all 3).
- Multi-seasonal superposition: Combine independent SARIMA trajectories (with distinct seasonalities) via additive or multiplicative interactions to yield richer, nontrivial seasonal patterns typical of industrial time series.
- Non-Gaussian noise addition: Augment the deterministic output with heavy-tailed rate-based noise (Poisson, generalized Gamma, or log-normal), parametrized by a level-dependent rate 4, to embed burstiness and intermittency.
This process supports vectorized generation of billions of synthetic series on modern accelerators, matching the scale required for pretraining foundation models under zero-shot protocols.
3. Order Selection, Model Fitting, and Diagnostics
Order selection is pivotal for building well-calibrated SARIMA-based simulators. Standard workflows involve:
- Defining candidate ranges for nonseasonal (5) and seasonal (6) orders, and differencing degrees (7), as in 8 for each AR or MA parameter and 9 (Andrade et al., 10 Mar 2025).
- Fitting each candidate model via conditional or exact maximum likelihood, often automated by stepwise selection procedures and penalized likelihood criteria (AIC, BIC, AICc).
- Holding out a test window for one-step-ahead forecasting MSE.
- Evaluating model fitness using RMSE, MAPE, residual whiteness tests (Box-Pierce/Ljung), and normality assessments (Kolmogorov–Smirnov).
Residuals must exhibit white-noise characteristics and stationarity for the simulated continuations to be credible. Model mis-specification, detected by residual autocorrelation, non-stationarity, or skewed error distributions, can bias simulations (Sak et al., 2012).
4. Applications and Empirical Performance
SARIMA-based simulators are deployed across diverse domains including economic policy evaluation, meteorological forecasting, and machine learning benchmark generation.
- In counterfactual economic forecasting, such as in projections for currency circulation under pandemic scenarios, SARIMA-based simulators enable the generation of hypothetical time series (e.g., simulating 2020–2022 without COVID-19 shocks), leveraging Monte Carlo resampling for confidence interval estimation (Andrade et al., 10 Mar 2025).
- In hybrid statistical–ML pipelines, SARIMA-based simulators act as baseline generators or trend extractors, supporting advanced models such as hybrid SARIMA–LSTM networks for weather forecasting. Here, the SARIMA unit is used to model the robust seasonal component, and an LSTM residual learner is trained on the remaining non-linear error series, substantially reducing long-term forecast error versus either model in isolation (Rajeev et al., 12 Jan 2026).
- For foundation model pretraining, SarSim0 facilitates the generation of circa 1B unique, leakage-free synthetic time series on-the-fly, enabling strict zero-shot evaluation and yielding "student-beats-teacher" phenomena, where pretrained neural models surpass the ARIMA process used for simulation on established forecasting benchmarks (GiftEval and M-Series) (Oreshkin et al., 2 Jan 2026).
The quantitative impact is summarized in the following table:
| Model+Simulator | GiftEval sCRPS / MASE | M-Series sCRPS / MASE |
|---|---|---|
| AutoARIMA | 0.912 / 1.074 | 0.096 / 0.843 |
| NBEATS + SarSim0 | 0.602 / 0.849 | 0.096 / 0.869 |
| PatchTST + SarSim0 | 0.573 / 0.837 | 0.097 / 0.877 |
| Chronos-Small + SarSim0 | 0.608 / 0.878 | 0.100 / 0.896 |
On GiftEval, SARIMA-based simulation for zero-shot pretraining produces time-series models that decisively outperform the AutoARIMA generator itself, evidence of a robust simulation-based paradigm (Oreshkin et al., 2 Jan 2026).
5. Enhancements: Non-Gaussian Innovations and Exogenous Variables
Recent advancements have extended SARIMA-based simulators to support greater realism:
- Heavy-tailed and rate-based innovations: Model innovations are drawn from Poisson, generalized Gamma, or log-normal distributions, with rate parameters dependent on the signal level, capturing phenomena such as burstiness and temporal intermittency (Oreshkin et al., 2 Jan 2026). This is critical for domains where Gaussian assumptions do not hold.
- Exogenous input integration (SARIMAX): When covariates such as macroeconomic indicators, meteorological variables, or policy variables are available, the SARIMA framework is extended to SARIMAX, introducing regression on contemporaneous and lagged exogenous variables. Rigorous pretesting for stationarity and collinearity, as well as careful lag selection, are required for consistent simulation of exogenous impacts (Andrade et al., 10 Mar 2025, Rajeev et al., 12 Jan 2026).
- Modeling regime shifts or GARCH-style volatility: Suggestions for future SARIMA-based simulation include incorporating regime-switching mechanisms and GARCH residual models to address structural breaks and volatility clustering, though standard workflows remain primarily univariate and assume homoskedasticity (Andrade et al., 10 Mar 2025).
6. Computational Considerations and Implementation
The computational efficiency of SARIMA-based simulators depends on parameterization and algorithmic design:
- The core SARIMA simulation recursion is 0 per series, vectorizable across massive batches (1) for parallelized generation on accelerators (Oreshkin et al., 2 Jan 2026).
- In contrast, kernel-based generators (e.g., Gaussian Process with structured kernels) scale as 2 per series due to covariance matrix factorization, limiting practical throughput.
- For example, SarSim0 generates a 3-step series in less than 4 ms, over 5 faster than classical kernel-synthesis methods (6–7 ms) (Oreshkin et al., 2 Jan 2026).
- Efficient simulation routines for conditional trajectories, as exemplified by
arima.condsimin R (Sak et al., 2012), require correct management of lag initialization, innovative seed generation, and sequential inversion of differencing for proper path realization.
Practical implementation is facilitated in both Python (via statsmodels.SARIMAX.simulate) and R (simulate.Arima, arima.condsim), with workflows that allow for simulation, diagnostics, and Monte Carlo uncertainty quantification.
7. Limitations and Prospects
Limitations of SARIMA-based simulators include:
- Inability to model nonlinear or highly nonstationary transitions unless explicitly augmented (e.g., via LSTM-residuals or regime-switching mechanisms) (Rajeev et al., 12 Jan 2026).
- Long-term forecasts subject to increasing uncertainty, especially in the presence of structural breaks or exogenous regime shifts (Andrade et al., 10 Mar 2025).
- For highly multivariate problems, univariate SARIMA-based simulation omits cross-series dependencies unless generalized to the vector autoregressive (VARIMA) context.
Future directions include expanding to multivariate SARIMAX with covariate-driven dynamics, online structural-break detection, and coupling with differentiable simulators for neural time-series model pretraining (Oreshkin et al., 2 Jan 2026, Rajeev et al., 12 Jan 2026).
SARIMA-based simulators provide a principled and computationally tractable foundation for synthetic time series generation, rigorous counterfactual analysis, and enablement of zero-shot machine learning workflows through scalable modularization, advanced noise modeling, and integration with modern deep learning residual frameworks (Sak et al., 2012, Andrade et al., 10 Mar 2025, Oreshkin et al., 2 Jan 2026, Rajeev et al., 12 Jan 2026).