SutraNets: Long-Sequence Probabilistic Forecasting
- SutraNets are neural probabilistic models that reframe long-sequence forecasting as multivariate prediction over interleaved sub-series to reduce error accumulation.
- They employ a novel likelihood factorization using dedicated RNN or transformer blocks, enabling parallel training and efficient computation.
- Empirical results on real-world benchmarks show an average ND improvement of about 15% over the C2FAR baseline, preserving forecast coherence at comparable computational cost.
SutraNets are a neural probabilistic forecasting framework designed to address the challenges of long-sequence forecasting in univariate time series. By reorganizing a univariate prediction problem as a multivariate prediction over interleaved, lower-frequency sub-series, SutraNets factorize the likelihood in such a way that error accumulation and long-distance dependency issues associated with traditional autoregressive (AR) models are significantly mitigated. SutraNets have demonstrated improved accuracy and coherence in probabilistic forecasts for long sequences across multiple real-world benchmarks while maintaining computational efficiency comparable to standard AR models (Bergsma et al., 2023).
1. Motivation and Problem Definition
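Long-sequence probabilistic forecasting seeks the joint distribution of future values $y_{T+1:N}$ given observed values $y_{1:T}$ and relevant covariates $x_{1:N}$, formally expressed as:
$$p(y_{T+1:N} \mid y_{1:T}, x_{1:N})$$
Conventional AR models factorize this as
$$p(y_{T+1:N} \mid y_{1:T}, x_{1:N}) = \prod_{t=T+1}^{N} p(y_t \mid y_{1:t-1}, x_{1:N})$$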
During inference, prediction proceeds stepwise, where each prediction conditions on previously sampled (and possibly erroneous) values—resulting in error accumulation. Additionally, recurrent models (e.g., RNNs) suffer from vanishing or diluted signals when tasked with propagating historical information across hundreds or thousands of steps (the "signal-path" problem). Addressing both error snowballing and weakened long-range dependency modeling is critical for reliable long-sequence forecasting (Bergsma et al., 2023).
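The stepwise inference loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_ar_path`, `step_sampler`, and the toy AR(1) conditional `toy_step` are hypothetical names chosen for this example.

```python
import numpy as np

def sample_ar_path(history, horizon, step_sampler, rng):
    """Draw one future path by ancestral sampling from an AR model.

    step_sampler(context, rng) is a hypothetical callable returning one
    sample of the next value given every value seen or sampled so far.
    """
    context = list(history)
    for _ in range(horizon):
        nxt = step_sampler(np.asarray(context), rng)
        context.append(nxt)  # later steps condition on this sampled value
    return np.asarray(context[len(history):])

def toy_step(context, rng, phi=0.9, sigma=0.1):
    # Toy AR(1) conditional (an assumption): Normal(phi * last value, sigma^2).
    return phi * context[-1] + sigma * rng.normal()

rng = np.random.default_rng(0)
path = sample_ar_path([1.0, 0.9], horizon=5, step_sampler=toy_step, rng=rng)
```

Because each sampled value is appended to the context, any error in an early sample is fed into every later step, which is the accumulation effect the text refers to.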
2. Likelihood Factorization via Sub-Series
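SutraNets introduce a novel likelihood factorization. The observed univariate sequence $y_{1:N}$ is partitioned into $K$ interleaved sub-series, each of length $\tilde{N} = N/K$, defined for $k = 1, \ldots, K$ by:
$$y^k_t = y_{(t-1)K + k}, \quad t = 1, \ldots, \tilde{N}$$
Applying the chain rule first across future time then across sub-series index yields:
$$p(y_{T+1:N} \mid y_{1:T}, x_{1:N}) = \prod_{t=\tilde{T}+1}^{\tilde{N}} \prod_{k=1}^{K} p\left(y^k_t \mid y^{1:K}_{1:t-1},\, y^{1:k-1}_t,\, x_{1:N}\right), \quad \tilde{T} = T/K$$
Each factor is parameterized via a dedicated RNN (or transformer block), evolving the hidden state by:
$$h^k_t = \mathrm{RNN}^k\left(h^k_{t-1},\, y^{1:K}_{t-1},\, y^{1:k-1}_t,\, x^k_t\right)$$
and outputting parameter vector $\theta^k_t$ for the distribution of $y^k_t$. This allows each sub-series to focus on predicting $K$-spaced observations, enlarging the effective generative stride (Bergsma et al., 2023).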
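The double product over sub-series time and sub-series index can be made concrete with a short sketch. Everything named here is an assumption for illustration: `sutra_log_likelihood`, the callable `cond_logpdf`, and the toy Gaussian conditional are hypothetical, and the code uses 0-based indexing with the series length and conditioning range both divisible by the number of sub-series.

```python
import numpy as np

def sutra_log_likelihood(y, K, T, cond_logpdf):
    """Sum per-factor log-densities: outer loop over sub-series time,
    inner loop over sub-series index.

    cond_logpdf(value, past, k) is a hypothetical callable giving the
    log-density of `value` for sub-series k given all earlier values.
    """
    y = np.asarray(y, dtype=float)
    N = len(y)
    if N % K != 0 or T % K != 0:
        raise ValueError("this sketch assumes N and T are multiples of K")
    total = 0.0
    for t in range(T // K, N // K):      # future sub-series steps
        for k in range(K):               # sub-series index within a step
            i = t * K + k                # position in the original series
            total += cond_logpdf(y[i], y[:i], k)
    return total

def toy_logpdf(value, past, k, K=2):
    # Toy conditional (an assumption): unit-variance Gaussian centred on
    # the previous value of the same sub-series, i.e. K steps back.
    return -0.5 * np.log(2 * np.pi) - 0.5 * (value - past[-K]) ** 2

ll = sutra_log_likelihood(np.arange(8), K=2, T=4, cond_logpdf=toy_logpdf)
```

In an actual model, each value of `k` would be served by its own network, whereas this sketch reuses a single toy conditional for all of them.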
3. Construction of Interleaved Sub-Series
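The initial sequence $y_{1:N}$ is split into $K$ interleaved sub-series:
- $y^1 = (y_1, y_{K+1}, y_{2K+1}, \ldots)$
- $y^2 = (y_2, y_{K+2}, y_{2K+2}, \ldots)$
- $\ldots$
- $y^K = (y_K, y_{2K}, y_{3K}, \ldots)$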
Selection of the number of sub-series $K$ typically aligns with the primary seasonal period when known (e.g., the 24-step daily cycle in hourly data, or the row periodicity of sequential MNIST). Each sub-series is forecast over a reduced horizon, $1/K$ of the original number of steps, supporting improved sample efficiency and reduced error propagation (Bergsma et al., 2023).
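The interleaved split can be sketched with a NumPy reshape. This is an illustrative sketch, assuming the series length is a multiple of the number of sub-series; the helper names `to_subseries` and `from_subseries` are hypothetical.

```python
import numpy as np

def to_subseries(y, K):
    """Split a series into K interleaved sub-series (one per row)."""
    y = np.asarray(y)
    if len(y) % K != 0:
        raise ValueError("this sketch assumes len(y) is a multiple of K")
    # Row k holds y[k], y[k + K], y[k + 2K], ... (0-based indexing).
    return y.reshape(-1, K).T

def from_subseries(sub):
    """Inverse of to_subseries: re-interleave the rows into one series."""
    return np.asarray(sub).T.reshape(-1)

y = np.arange(1, 13)
sub = to_subseries(y, K=4)
# sub[0] is [1, 5, 9] and sub[3] is [4, 8, 12]
```

Each row has a quarter of the original length here, so a forecast horizon of 12 original steps becomes 3 steps per sub-series.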
4. Generative Orderings and Algorithmic Variants
SutraNets support two principal generative orderings:
- Alternating (Regular-alt/Backfill-alt): At each sub-series time $t$, predictions cycle through $k = 1$ to $K$, conditioning on the current-$t$ values of sub-series already generated and on past values of all sub-series. In "alt" variants, these additional dependencies on $y^{1:k-1}_t$ are included.
- Non-alternating (Regular-non/Backfill-non): Each sub-series is generated in full sequentially, conditioning only on past values.
Each RNN is updated once per sub-series step, so it advances with a stride of $K$ original time steps, compared to $1$ for conventional AR models. During training, all sub-series can be processed in parallel because the true target values are available to every sub-series; this yields a $K$-fold throughput gain on parallel hardware (Bergsma et al., 2023).
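The difference between the alternating and non-alternating orderings can be illustrated by listing the order in which original time positions are generated. This sketch covers only the case where sub-series are visited in their natural order; the backfill orderings are not modelled, and `generation_order` is a hypothetical helper.

```python
def generation_order(N, K, alternating=True):
    """Return 0-based original indices in the order they are generated."""
    if N % K != 0:
        raise ValueError("this sketch assumes N is a multiple of K")
    steps = N // K
    if alternating:
        # One value from each sub-series per sub-series step.
        return [t * K + k for t in range(steps) for k in range(K)]
    # Each sub-series is generated in full before the next one starts.
    return [t * K + k for k in range(K) for t in range(steps)]

print(generation_order(6, 3, alternating=True))   # [0, 1, 2, 3, 4, 5]
print(generation_order(6, 3, alternating=False))  # [0, 3, 1, 4, 2, 5]
```

In the non-alternating case, a value such as index 1 is generated after index 3, so later sub-series can condition on values that lie ahead of them in the original series.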
5. Architectural Characteristics
- Backbone: LSTM (1–4 layers, 64–256 hidden units) or Transformer with local attention.
- Output distribution: Three-level coarse-to-fine discretization (12 bins per level, with Pareto tails), as developed for C2FAR, enabling flexible modeling of continuous and heavy-tailed distributions.
- Input encoding: Sub-series values and covariates undergo min–max normalization and C2F encoding.
- Regularization: Input dropout, inter-layer dropout, weight decay, and early stopping.
- Computational complexity: Each of the $K$ sub-series RNNs runs for only $1/K$ of the sequence steps, so together they require memory and time comparable to a single standard RNN of the same hidden size. Training can execute all $K$ sub-series in parallel (Bergsma et al., 2023).
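The three-level, 12-bin coarse-to-fine discretization listed above can be sketched as a base-12 expansion of a normalized value. This is a simplified illustration, not the C2FAR implementation: it assumes equal-width bins on $[0, 1)$, omits the Pareto tails and the learned distribution over bins, and uses the hypothetical names `c2f_encode` and `c2f_decode`.

```python
def c2f_encode(z, bins=12, levels=3):
    """Encode z in [0, 1) as one bin index per level, coarse to fine."""
    if not 0.0 <= z < 1.0:
        raise ValueError("this sketch only handles values in [0, 1)")
    n_fine = bins ** levels
    flat = min(int(z * n_fine), n_fine - 1)  # index of the finest bin
    indices = []
    for level in range(levels):
        place = bins ** (levels - 1 - level)
        indices.append(flat // place)        # bin chosen at this level
        flat %= place
    return indices

def c2f_decode(indices, bins=12):
    """Return the lower edge of the finest bin selected by `indices`."""
    flat = 0
    for idx in indices:
        flat = flat * bins + idx
    return flat / bins ** len(indices)

print(c2f_encode(0.5))        # [6, 0, 0]
print(c2f_decode([6, 0, 0]))  # 0.5
```

With three levels of 12 bins, the finest resolution is one part in 1728, while the model only ever predicts a 12-way choice at each level.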
6. Reduction of Error Accumulation and Signal Path Length
By jointly predicting every $K$th point, SutraNets induce a $K$-fold enlargement of the generative stride, directly reducing the number of sequential prediction steps, and hence the opportunities for error accumulation, by a factor of $K$. The effective recurrent signal path, i.e., the number of steps a past value must propagate through to inform the current prediction, is likewise reduced by a factor of $K$ relative to standard AR models. Empirically, neither enlarged generative stride nor shortened signal path alone achieves the observed performance gains; both are required in concert (Bergsma et al., 2023).
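A small worked example may help make these reductions concrete. The numbers are illustrative only and are not taken from the paper: one week of hourly data with a hypothetical choice of 6 sub-series, counted per sub-series model.

```python
def steps_per_subseries(horizon, K):
    """Generative steps one sub-series model takes to cover `horizon`."""
    if horizon % K != 0:
        raise ValueError("this sketch assumes horizon is a multiple of K")
    return horizon // K

def within_subseries_lag(lag, K):
    """Recurrent steps between two values `lag` original steps apart
    that fall in the same sub-series."""
    if lag % K != 0:
        raise ValueError("values share a sub-series only if K divides lag")
    return lag // K

print(steps_per_subseries(168, 6))   # 28 steps instead of 168
print(within_subseries_lag(24, 6))   # 4 recurrent steps instead of 24
```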
7. Empirical Performance and Applications
SutraNets were benchmarked on six real-world datasets: 5-minute cloud-VM demand, hourly electricity, hourly traffic, daily Wikipedia hits, and sequential MNIST (original and permuted). Primary evaluation metrics included normalized deviation (ND) of the median forecast and weighted quantile loss (wQL) at nine quantiles. Relative to the C2FAR baseline, backfill-alternating SutraNets achieved mean ND reductions of approximately 15%, with the following summary improvements:
| Dataset | Baseline ND | SutraNet ND |
|---|---|---|
| Azure VM (5-min) | 3.2% | 2.5% |
| Electricity (hourly) | 10.6% | 9.3% |
| Traffic (hourly) | 19.3% | 15.3% |
| Wiki daily | 31.1% | 30.1% |
| MNIST (original) | 67.9% | 64.4% |
| MNIST (permuted) | 100.0% | 72.0% |
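The relative ND reductions implied by the table can be checked with a few lines of arithmetic. The values are copied from the table above; the variable names are chosen for this example.

```python
# (baseline ND, SutraNet ND) in percent, as listed in the table.
results = {
    "Azure VM (5-min)": (3.2, 2.5),
    "Electricity (hourly)": (10.6, 9.3),
    "Traffic (hourly)": (19.3, 15.3),
    "Wiki daily": (31.1, 30.1),
    "MNIST (original)": (67.9, 64.4),
    "MNIST (permuted)": (100.0, 72.0),
}

# Relative reduction per dataset, in percent of the baseline ND.
reductions = {
    name: 100.0 * (base - sutra) / base
    for name, (base, sutra) in results.items()
}
mean_reduction = sum(reductions.values()) / len(reductions)

for name, r in reductions.items():
    print(f"{name}: {r:.1f}%")
print(f"mean: {mean_reduction:.1f}%")  # mean: 15.2%
```

The per-dataset reductions range from about 3% (Wiki daily) to 28% (permuted MNIST), and their unweighted mean matches the approximately 15% figure quoted in the text.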
SutraNets preserved probabilistic coherence and incurred no additional training or inference cost for a fixed model size, while providing a general-purpose wrapper around existing sequence models. This design delivers state-of-the-art performance for long-sequence probabilistic forecasting by interleaving sub-series predictions to mitigate central autoregressive pathologies (Bergsma et al., 2023).