SutraNets: Long-Seq Probabilistic Forecasting

Updated 20 May 2026

SutraNets are neural probabilistic models that reframe long-sequence forecasting as multivariate prediction over interleaved sub-series to reduce error accumulation.
They employ a novel likelihood factorization using dedicated RNN or transformer blocks, enabling parallel training and efficient computation.
Empirical results on real-world benchmarks show ND improvements averaging about 15% over the baseline, with coherent forecasts and comparable computational cost.

SutraNets are a neural probabilistic forecasting framework designed to address the challenges of long-sequence forecasting in univariate time series. By reorganizing a univariate prediction problem as a multivariate prediction over interleaved, lower-frequency sub-series, SutraNets factorize the likelihood in such a way that error accumulation and long-distance dependency issues associated with traditional autoregressive (AR) models are significantly mitigated. SutraNets have demonstrated improved accuracy and coherence in probabilistic forecasts for long sequences across multiple real-world benchmarks while maintaining computational efficiency comparable to standard AR models (Bergsma et al., 2023).

1. Motivation and Problem Definition

Long-sequence probabilistic forecasting seeks the joint distribution of $N$ future values given $T$ observed values and relevant covariates, formally expressed as:

$p(y_{T+1:T+N} \mid y_{1:T}, x_{1:T+N})$

Conventional AR models factorize this as

$\prod_{t=T+1}^{T+N} p(y_t \mid y_{1:t-1}, x_{1:T+N})$

During inference, prediction proceeds stepwise, where each prediction conditions on previously sampled (and possibly erroneous) values—resulting in error accumulation. Additionally, recurrent models (e.g., RNNs) suffer from vanishing or diluted signals when tasked with propagating historical information across hundreds or thousands of steps (the "signal-path" problem). Addressing both error snowballing and weakened long-range dependency modeling is critical for reliable long-sequence forecasting (Bergsma et al., 2023).
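One simple way to see the compounding effect is a toy rollout. The sketch below is not the model from the source: it assumes a Gaussian random-walk conditional (`y_t ~ Normal(y_{t-1}, noise_std)`) purely as a stand-in, and the function name and parameters are hypothetical. It only illustrates that feeding sampled values back in makes the spread of sample paths grow with the horizon.

```python
import numpy as np

def ar_sample_paths(last_value, horizon, num_paths, noise_std=1.0, seed=0):
    """Ancestral sampling from a toy AR model: y_t ~ Normal(y_{t-1}, noise_std).

    The random-walk conditional is an illustrative assumption, not the
    source's model. Each step conditions on the previously *sampled* value.
    """
    rng = np.random.default_rng(seed)
    paths = np.empty((num_paths, horizon))
    prev = np.full(num_paths, float(last_value))
    for step in range(horizon):
        # Sample y_t given the sampled y_{t-1}, then feed it back in.
        prev = rng.normal(loc=prev, scale=noise_std)
        paths[:, step] = prev
    return paths

paths = ar_sample_paths(last_value=0.0, horizon=50, num_paths=2000)
spread = paths.std(axis=0)  # spread across sample paths at each horizon step
print(spread[0], spread[-1])  # the spread at the last step exceeds the first
```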

2. Likelihood Factorization via Sub-Series

SutraNets introduce a novel likelihood factorization. The full univariate sequence is partitioned into $K$ interleaved sub-series, each of length $(T+N)/K$ (assuming $K$ divides both $T$ and $N$), defined for $k=1,\dots,K$ by:

$y^k_t = y_{k + (t-1)K}, \quad t=1, \dots, (T+N)/K$

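Applying the chain rule first across future sub-series time steps and then across the sub-series index yields the following, where $t$ now indexes positions within each sub-series, so the future positions run from $T/K+1$ to $(T+N)/K$:

$p(y_{T+1:T+N} \mid y_{1:T}, x_{1:T+N}) = \prod_{t=T/K+1}^{(T+N)/K} \prod_{k=1}^{K} p\left( y_t^k \mid y_{1:t}^{<k}, y_{1:t-1}^k, y_{1:t-1}^{>k}, x_{1:T+N} \right)$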
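
As a small worked case (the values $K=2$, $T=4$, $N=4$ are chosen here purely for illustration), the sub-series are $y^1 = (y_1, y_3, y_5, y_7)$ and $y^2 = (y_2, y_4, y_6, y_8)$, and the future entries sit at sub-series positions $t = 3, 4$. Omitting covariates for brevity, the factorization expands to:

$p(y_{5:8} \mid y_{1:4}) = p(y_3^1 \mid y_{1:2}^1, y_{1:2}^2)\; p(y_3^2 \mid y_{1:3}^1, y_{1:2}^2)\; p(y_4^1 \mid y_{1:3}^1, y_{1:3}^2)\; p(y_4^2 \mid y_{1:4}^1, y_{1:3}^2)$

Mapping back through $y^k_t = y_{k+(t-1)K}$, these four factors are $p(y_5 \mid y_{1:4})$, $p(y_6 \mid y_{1:5})$, $p(y_7 \mid y_{1:6})$, and $p(y_8 \mid y_{1:7})$, so under this ordering the product is an exact chain-rule decomposition of the joint distribution.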

Each factor $p(y_t^k \mid \cdot)$ is parameterized via a dedicated RNN (or transformer block), which evolves its hidden state by:

$h_t^k = \mathrm{RNN}^k\left(h_{t-1}^k,\; y_t^{<k},\; y_{t-1}^k,\; y_{t-1}^{>k},\; x\right)$

and outputs a parameter vector $\theta_t^k$ for the distribution of $y_t^k$. This allows each sub-series to focus on predicting $K$-spaced observations, enlarging the effective generative stride (Bergsma et al., 2023).

3. Construction of Interleaved Sub-Series

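The initial sequence $y_{1:T+N}$ is split into $K$ interleaved sub-series:

$y^1 = (y_1,\; y_{1+K},\; y_{1+2K},\; \dots)$
$y^2 = (y_2,\; y_{2+K},\; y_{2+2K},\; \dots)$
$\vdots$
$y^K = (y_K,\; y_{2K},\; y_{3K},\; \dots)$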

Selection of $K$ typically aligns with the primary seasonal period when known (e.g., the 24-hour cycle in hourly data, or the row period in sequential MNIST). Each sub-series is forecast over a reduced horizon $N/K$, supporting improved sample efficiency and reduced error propagation (Bergsma et al., 2023).
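
The interleaved split can be sketched with a reshape and a transpose. This is a minimal illustration, assuming NumPy and a series whose length is divisible by $K$; the function names `split_interleaved` and `merge_interleaved` are hypothetical and not taken from the source.

```python
import numpy as np

def split_interleaved(y, K):
    """Split a 1-D series into K interleaved sub-series.

    Row k-1 of the result holds y_k, y_{k+K}, y_{k+2K}, ... (1-based indices),
    matching y^k_t = y_{k+(t-1)K}. Assumes len(y) is divisible by K.
    """
    y = np.asarray(y)
    if len(y) % K != 0:
        raise ValueError("series length must be divisible by K")
    # Reshape to (time, K), then transpose so each row is one sub-series.
    return y.reshape(-1, K).T

def merge_interleaved(sub):
    """Inverse of split_interleaved: weave the K rows back into one series."""
    return np.asarray(sub).T.reshape(-1)

y = np.arange(1, 13)
sub = split_interleaved(y, K=3)
print(sub)
# [[ 1  4  7 10]
#  [ 2  5  8 11]
#  [ 3  6  9 12]]
```

Merging the rows with `merge_interleaved(sub)` recovers the original series, which is how per-sub-series forecasts would be woven back into a single forecast path under this layout.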

4. Generative Orderings and Algorithmic Variants

SutraNets support two principal generative orderings:

Alternating (Regular-alt/Backfill-alt): At each sub-series time $t$, predictions cycle through $k=1$ to $k=K$, conditioning on the current-$t$ values of the sub-series already generated and on past values of all sub-series. In "alt" variants, additional dependencies on $y_t^{<k}$ are included.
Non-alternating (Regular-non/Backfill-non): Each sub-series is generated in full sequentially, conditioning only on past values.
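
The difference between the two visiting orders can be made concrete by listing which original positions are generated when. The sketch below covers only the order of $(k, t)$ visits and assumes the interleaved index map $y^k_t = y_{k+(t-1)K}$; it does not model how the backfill variants arrange values, and all function names are hypothetical.

```python
def alternating_order(K, steps):
    """Cycle through sub-series k = 1..K at each sub-series time t."""
    return [(k, t) for t in range(1, steps + 1) for k in range(1, K + 1)]

def non_alternating_order(K, steps):
    """Generate each sub-series in full before moving to the next."""
    return [(k, t) for k in range(1, K + 1) for t in range(1, steps + 1)]

def original_index(k, t, K):
    """Position in the original series under y^k_t = y_{k+(t-1)K}."""
    return k + (t - 1) * K

K, steps = 3, 2
alt = [original_index(k, t, K) for k, t in alternating_order(K, steps)]
non = [original_index(k, t, K) for k, t in non_alternating_order(K, steps)]
print(alt)  # [1, 2, 3, 4, 5, 6]
print(non)  # [1, 4, 2, 5, 3, 6]
```

Under this index map the alternating order visits positions in their natural time order, while the non-alternating order finishes one sub-series before starting the next.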

Each RNN is updated per sub-series per step, yielding a computational stride of $K$ compared to $1$ for conventional AR models. During training, parallelism is enabled since all sub-series have access to true target values; this yields a $K$-fold throughput gain under parallel hardware (Bergsma et al., 2023).

5. Architectural Characteristics

Backbone: LSTM (1–4 layers, 64–256 hidden units) or Transformer with local attention.
Output distribution: Three-level coarse-to-fine discretization (12 bins per level, with Pareto tails), as developed for C2FAR, enabling flexible modeling of continuous and heavy-tailed distributions.
Input encoding: Sub-series values and covariates undergo min–max normalization and C2F encoding.
Regularization: Input dropout, inter-layer dropout, weight decay, and early stopping.
Computational complexity: Each sub-series RNN step incurs $O(H^2)$ operations for hidden size $H$ and is run for $(T+N)/K$ steps, so the $K$ sub-series together aggregate to memory and time comparable to a standard RNN of hidden size $H$. Training can execute all $K$ sub-series in parallel (Bergsma et al., 2023).
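
The coarse-to-fine idea behind the output distribution and input encoding can be sketched as nested binning. This is only an illustration of the binning step, assuming equal-width bins and clipping at the range boundaries; the Pareto tails and the learned distribution over bins described above are not modelled, and the function names are hypothetical.

```python
import numpy as np

def c2f_encode(value, lo, hi, bins=12, levels=3):
    """Encode a value as one bin index per level, coarse to fine.

    The value is min-max normalized to [0, 1) using (lo, hi); each level
    then splits the current bin into `bins` equal parts. Values outside
    [lo, hi) are clipped here, whereas the source describes Pareto tails.
    """
    u = (value - lo) / (hi - lo)
    u = min(max(u, 0.0), np.nextafter(1.0, 0.0))  # clip into [0, 1)
    indices = []
    for _ in range(levels):
        u *= bins
        idx = int(u)   # which of the `bins` sub-intervals contains the value
        indices.append(idx)
        u -= idx       # position within that sub-interval
    return indices

def c2f_decode(indices, lo, hi, bins=12):
    """Return the midpoint of the finest bin addressed by `indices`."""
    width, left = 1.0, 0.0
    for idx in indices:
        width /= bins
        left += idx * width
    return lo + (left + width / 2) * (hi - lo)

code = c2f_encode(0.5, lo=0.0, hi=1.0)
print(code)  # [6, 0, 0]
```

With three levels of 12 bins, the finest resolution is $1/12^3 = 1/1728$ of the normalized range, obtained from only $3 \times 12$ output categories rather than 1728.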

6. Reduction of Error Accumulation and Signal Path Length

By jointly predicting every $K$th point, SutraNets induce a $K$-fold enlargement of the generative stride, directly reducing the number of sequential prediction steps, and hence error accumulation, by a factor of $K$. The effective recurrent signal path (the number of steps a past value must propagate through to inform the current prediction) is likewise shortened by a factor of $K$ relative to standard AR models. Empirically, neither enlarged generative stride nor shortened signal path alone achieves the observed performance gains; both are required in concert (Bergsma et al., 2023).
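
The step-count arithmetic behind this argument is simple enough to tabulate. The sketch below assumes $K$ divides the horizon; the horizon of 168 and the values of $K$ are hypothetical examples, not settings reported in the source.

```python
def step_counts(horizon, K=1):
    """Return (sequential steps per sub-series model, total model updates).

    K = 1 corresponds to a conventional AR model. Assumes K divides horizon.
    """
    if horizon % K != 0:
        raise ValueError("horizon must be divisible by K")
    per_model = horizon // K   # steps each sub-series model unrolls
    total = per_model * K      # updates summed over all K models
    return per_model, total

N = 168  # hypothetical horizon: one week of hourly values
for K in (1, 6, 24):
    per_model, total = step_counts(N, K)
    print(K, per_model, total)
# 1 168 168
# 6 28 168
# 24 7 168
```

The total number of updates stays at $N$, which is consistent with the comparable overall cost, while the number of steps each model unrolls drops by a factor of $K$.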

7. Empirical Performance and Applications

SutraNets were benchmarked on six real-world datasets, each with its own choice of $K$: 5-minutely cloud-VM demand, hourly electricity, hourly traffic, daily Wikipedia hits, and sequential MNIST (original and permuted). Primary evaluation metrics included normalized deviation (ND) of the median forecast and weighted quantile loss (wQL) at nine quantiles. Relative to the C2FAR baseline, backfill-alternating SutraNets reduced ND by approximately 15% on average, as summarized below:

Dataset	Baseline ND	SutraNet ND
Azure VM (5-min)	3.2%	2.5%
Electricity (hourly)	10.6%	9.3%
Traffic (hourly)	19.3%	15.3%
Wiki daily	31.1%	30.1%
MNIST (original)	67.9%	64.4%
MNIST (permuted)	100.0%	72.0%
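
For reference, the two metrics can be sketched as follows. These are the commonly used definitions (ND as total absolute error over total absolute target, wQL as twice the pinball loss with the same normalization); the exact normalization used in the source is an assumption here, and the toy targets are made up for illustration. The last lines recompute the mean relative ND reduction from the table values above.

```python
import numpy as np

def normalized_deviation(y_true, y_pred):
    """ND: total absolute error divided by total absolute target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

def weighted_quantile_loss(y_true, q_pred, q):
    """wQL at level q: 2 * pinball loss, normalized by total absolute target."""
    y_true, q_pred = np.asarray(y_true, float), np.asarray(q_pred, float)
    diff = y_true - q_pred
    pinball = np.maximum(q * diff, (q - 1) * diff)
    return 2 * pinball.sum() / np.abs(y_true).sum()

# Toy data (illustrative only): at q = 0.5 the two metrics coincide.
y_true, median = [10, 20, 30], [12, 18, 33]
nd = normalized_deviation(y_true, median)
wql_median = weighted_quantile_loss(y_true, median, q=0.5)

# Relative ND reduction per dataset, using the table values (in percent).
baseline = np.array([3.2, 10.6, 19.3, 31.1, 67.9, 100.0])
sutranet = np.array([2.5, 9.3, 15.3, 30.1, 64.4, 72.0])
rel = (baseline - sutranet) / baseline
print(round(float(rel.mean() * 100), 1))  # 15.2
```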

SutraNets preserved probabilistic coherence and incurred no additional training or inference cost for a fixed model size, while providing a general-purpose wrapper around existing sequence models. This design delivers state-of-the-art performance for long-sequence probabilistic forecasting by interleaving sub-series predictions to mitigate central autoregressive pathologies (Bergsma et al., 2023).

References

Bergsma et al. (2023). SutraNets: Sub-series Autoregressive Networks for Long-Sequence, Probabilistic Forecasting.