Cisco Time Series Model (TSM)

Updated 2 December 2025
  • Cisco Time Series Model is a univariate zero-shot forecasting architecture that ingests paired coarse- and fine-resolution time series to capture both long-range trends and short-range variations.
  • The model integrates a TimesFM decoder backbone with a unique multiresolution design, enabling accurate forecasting through simultaneous processing of 1-hour coarse and 1-minute fine contexts.
  • Trained on over 300 billion data points, TSM achieves state-of-the-art zero-shot performance on observability benchmarks while remaining competitive on general-purpose benchmarks, demonstrating practical impact in real-time monitoring and capacity planning.

The Cisco Time Series Model (TSM) is a univariate zero-shot forecasting architecture developed as an extension of the TimesFM decoder-only backbone, augmented to accept multiresolution input. Designed for large-scale forecasting tasks in observability and general time series domains, TSM introduces a general architectural innovation allowing the simultaneous ingestion of paired coarse- and fine-resolution contexts. The model is trained on over 300 billion unique data points, with more than half sourced from proprietary observability datasets, and achieves state-of-the-art zero-shot forecasting performance within the Cisco observability stack while maintaining competitive accuracy on general-purpose benchmarks such as GIFT-Eval (Gou et al., 25 Nov 2025).

1. Architectural Foundation

TSM builds upon the TimesFM decoder-only backbone. In the base TimesFM framework, each univariate time series is divided into non-overlapping "patches" of length $p = 32$, each embedded using a small residual network. These patch embeddings form the token sequence input to a deep decoder-only stack of transformer layers, concluding in un-embedding layers that produce a fixed-length forecast.

The core innovation of TSM is the incorporation of multiresolution context. Rather than relying on a single fine-resolution (e.g., 1-minute) context window, TSM accepts two parallel sequences:

  • A coarse-resolution context $x_c \in \mathbb{R}^N$ ($N = 512$ points at 1-hour granularity),
  • A fine-resolution context $x_f \in \mathbb{R}^N$ ($N = 512$ points at 1-minute granularity).

The ratio $K = 60$ relates the two resolutions; $N$ coarse points span the same interval as $N \cdot K$ fine points. The model function is:

$$F: \mathbb{R}^N \times \mathbb{R}^N \rightarrow \mathbb{R}^L$$

where $L = 128$ is the fine-resolution forecast horizon.
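
To make the interface concrete, the sketch below (in Python, with hypothetical names) illustrates the shapes involved in $F$: two length-$N$ contexts in, a length-$L$ fine-resolution forecast out. The placeholder forecaster is not Cisco's implementation.

```python
import numpy as np

# Shapes only: `tsm_forecast` is a hypothetical stand-in for the model
# function F, returning a naive last-value forecast rather than a real prediction.
N, K, L = 512, 60, 128            # context length, coarse/fine ratio, horizon

def tsm_forecast(x_c: np.ndarray, x_f: np.ndarray) -> np.ndarray:
    """F: R^N x R^N -> R^L (placeholder)."""
    assert x_c.shape == (N,) and x_f.shape == (N,)
    return np.full(L, x_f[-1])

x_coarse = np.random.randn(N)     # 512 points at 1-hour granularity (~21 days)
x_fine = np.random.randn(N)       # 512 points at 1-minute granularity (~8.5 hours)
print(tsm_forecast(x_coarse, x_fine).shape)   # (128,)
```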

Input Preprocessing and Embedding

Both contexts undergo normalization: the initial 32 points of $x_c$ and $x_f$ yield respective means $\mu_c, \mu_f$ and standard deviations $\sigma_c, \sigma_f$. Each context is standardized:

$$x'_c = \frac{x_c - \mu_c}{\sigma_c}, \quad x'_f = \frac{x_f - \mu_f}{\sigma_f}$$

Each normalized context is partitioned into $M = N/p = 16$ patches, producing $32$ patches overall.
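
A minimal sketch of this normalization and patching step, assuming the statistics are computed over the first patch ($p = 32$ points) of each context:

```python
import numpy as np

# Normalize with first-patch statistics, then split into 16 non-overlapping patches.
p, N = 32, 512

def normalize_and_patch(x: np.ndarray, eps: float = 1e-8):
    mu, sigma = x[:p].mean(), x[:p].std()
    x_norm = (x - mu) / (sigma + eps)        # standardize with first-patch stats
    patches = x_norm.reshape(N // p, p)      # 16 patches of length 32
    return patches, mu, sigma

patches_c, mu_c, sigma_c = normalize_and_patch(np.random.randn(N))
patches_f, mu_f, sigma_f = normalize_and_patch(np.random.randn(N))
print(patches_c.shape, patches_f.shape)      # (16, 32) each -> 32 patches overall
```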

Patch embedding is accomplished via:

$$g_{in}(u) = W_o \, \mathrm{SiLU}(W_h u) + W_r u \in \mathbb{R}^d, \quad d = 1280$$

yielding sequence tokens $h_i$.
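
A toy NumPy rendering of this residual embedding block; only the patch length $p = 32$ and output width $d = 1280$ come from the text, while the hidden width and random weights are illustrative assumptions:

```python
import numpy as np

# Residual patch embedding: W_o SiLU(W_h u) + W_r u, mapping a length-32 patch to R^d.
p, d, hidden = 32, 1280, 1280     # hidden width is an assumption
rng = np.random.default_rng(0)
W_h = rng.standard_normal((hidden, p)) * 0.02
W_o = rng.standard_normal((d, hidden)) * 0.02
W_r = rng.standard_normal((d, p)) * 0.02

def silu(z):
    return z / (1.0 + np.exp(-z))

def g_in(u: np.ndarray) -> np.ndarray:
    """Embed one patch as a d-dimensional token."""
    return W_o @ silu(W_h @ u) + W_r @ u

print(g_in(rng.standard_normal(p)).shape)   # (1280,)
```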

Special Tokens and Resolution Embeddings

A learned Special Token (ST) delimits the coarse and fine resolutions in the final input sequence:

$$[h_1, \ldots, h_{16}, \mathrm{ST}, h_{17}, \ldots, h_{32}]$$

Two learned resolution embeddings $r_{\mathrm{coarse}}, r_{\mathrm{fine}} \in \mathbb{R}^d$ are added to their respective regions. The resulting 33-token sequence is input to a stack of 50 transformer decoder layers, mirroring the TimesFM processing pipeline.
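
The assembly of the 33-token decoder input can be sketched as follows; the randomly initialized special token and resolution embeddings stand in for learned parameters, and leaving the special token without a resolution embedding is an assumption:

```python
import numpy as np

# Build the decoder input: coarse patch tokens, learned special token, fine patch tokens.
M, d = 16, 1280
rng = np.random.default_rng(0)
h_coarse = rng.standard_normal((M, d))        # h_1 .. h_16 from the coarse context
h_fine = rng.standard_normal((M, d))          # h_17 .. h_32 from the fine context
special_token = rng.standard_normal((1, d))   # learned ST (random stand-in here)
r_coarse = rng.standard_normal(d)
r_fine = rng.standard_normal(d)

sequence = np.concatenate(
    [h_coarse + r_coarse, special_token, h_fine + r_fine], axis=0
)
print(sequence.shape)   # (33, 1280), fed to the 50-layer decoder stack
```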

Output Quantities

The output is a mean prediction $\hat{y} \in \mathbb{R}^L$ together with quantile forecasts $\hat{y}^{(q)}$ for $q \in \{0.1, 0.2, \ldots, 0.9\}$, produced via a final un-embedding residual block.
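
A simplified sketch of such an output head, using a single linear un-embedding in place of the residual block; the weight shapes are illustrative assumptions:

```python
import numpy as np

# Map the final token state to a mean forecast plus nine quantile forecasts.
d, L = 1280, 128
quantiles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rng = np.random.default_rng(0)
W_out = rng.standard_normal((L * (1 + len(quantiles)), d)) * 0.02

final_state = rng.standard_normal(d)                  # last decoder hidden state
out = (W_out @ final_state).reshape(1 + len(quantiles), L)
y_mean, y_quantiles = out[0], out[1:]                 # shapes (128,) and (9, 128)
```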

Autoregressive Multiresolution Decoding

Forecasting proceeds autoregressively: predicted fine-resolution values $\hat{y}_1, \ldots, \hat{y}_L$ are appended to $x_f$. The coarse context is updated by aggregating consecutive blocks of $K$ fine predictions:

$$\text{new } x_c \leftarrow \Big[\, x_c[2{:}\ldots],\;\; \tfrac{1}{K} \sum_{j=1}^{K} \hat{y}_j,\; \ldots \Big]$$

ensuring that future decode steps see paired coarse and fine contexts.
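
The decoding loop might look like the following sketch. Since the horizon $L = 128$ is not a multiple of $K = 60$, the carry buffer for incomplete coarse blocks is an assumption for illustration, not necessarily the paper's exact procedure:

```python
import numpy as np

# Autoregressive multiresolution decoding: slide the fine context by L predictions,
# and roll complete blocks of K fine predictions into the coarse context.
N, K, L = 512, 60, 128

def decode_step(x_c, x_f):
    """Placeholder for one forward pass (naive last-value forecast here)."""
    return np.full(L, x_f[-1])

def forecast(x_c, x_f, total_steps):
    preds, carry = [], np.empty(0)
    while len(preds) < total_steps:
        y_hat = decode_step(x_c, x_f)
        preds.extend(y_hat)
        x_f = np.concatenate([x_f, y_hat])[-N:]          # slide the fine context
        carry = np.concatenate([carry, y_hat])
        n_blocks = len(carry) // K
        if n_blocks:                                     # aggregate complete K-blocks
            new_c = carry[:n_blocks * K].reshape(n_blocks, K).mean(axis=1)
            carry = carry[n_blocks * K:]
            x_c = np.concatenate([x_c, new_c])[-N:]      # slide the coarse context
    return np.array(preds[:total_steps])

y = forecast(np.random.randn(N), np.random.randn(N), total_steps=256)
print(y.shape)   # (256,)
```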

2. Training Regime and Data Pipeline

TSM is trained in the zero-shot forecasting regime. For each context pair $(x_c, x_f)$ and ground-truth horizon $y \in \mathbb{R}^L$, the model produces a point forecast and quantile forecasts. The loss function is a weighted sum of mean squared error (MSE) and quantile regression losses:

$$L_{MSE} = \left\| \hat{y} - \frac{y - \mu_f}{\sigma_f} \right\|^2$$

$$L_q = \sum_{i=1}^{L} \max\!\big( q\,(y_i - \hat{y}_i^{(q)}),\; (q-1)\,(y_i - \hat{y}_i^{(q)}) \big)$$

$$L = L_{MSE} + \sum_{q \in \{0.1, \ldots, 0.9\}} L_q$$
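
A compact sketch of this objective, assuming the quantile (pinball) losses are computed in the same normalized space as the MSE term:

```python
import numpy as np

# Combined objective: MSE on the normalized target plus pinball losses over nine quantiles.
QUANTILES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def tsm_loss(y, y_hat, y_hat_q, mu_f, sigma_f):
    """y: (L,) raw target; y_hat: (L,) point forecast; y_hat_q: (9, L) quantile forecasts."""
    y_norm = (y - mu_f) / sigma_f
    loss = np.sum((y_hat - y_norm) ** 2)                       # L_MSE
    for q, y_q in zip(QUANTILES, y_hat_q):
        err = y_norm - y_q
        loss += np.sum(np.maximum(q * err, (q - 1) * err))     # pinball loss L_q
    return loss

L = 128
example = tsm_loss(np.random.randn(L), np.zeros(L), np.zeros((9, L)), mu_f=0.0, sigma_f=1.0)
```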

Data Sources

Training spans 20 epochs over >300 billion unique data points, sourced as follows:

| Data Subset | Contribution (%) | Description |
| --- | --- | --- |
| 1-min observability (Splunk) | 35 | ~400M series, 13 months |
| 5-min observability | 16.5 | Observability metrics at coarser granularity |
| GIFT-Eval (public) | 29.5 | 4.5M series, 230B points |
| Chronos datasets | 4.5 | 0.9M series, 85B points |
| Synthetic (KernelSynth) | 14.5 | Artificially generated series |

Windows of length (512, 512) for fine and coarse streams are extracted using a sliding window, filtered for missingness, flat spots, spectral entropy, and abrupt steps. SimHash-based and distance-based statistical deduplication ensure diversity and avoid domination by repetitive series.
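
An illustrative (not production) version of such window filtering, with assumed thresholds standing in for the paper's missingness, flat-spot, and abrupt-step tests (the spectral-entropy filter is omitted for brevity):

```python
import numpy as np

# Sliding-window extraction with simple quality filters; thresholds are assumptions.
def valid_window(w: np.ndarray,
                 max_missing_frac: float = 0.05,
                 min_unique: int = 8,
                 max_step_sigma: float = 10.0) -> bool:
    if np.isnan(w).mean() > max_missing_frac:                        # too much missing data
        return False
    w = w[~np.isnan(w)]
    if np.unique(np.round(w, 6)).size < min_unique:                  # flat spot / near-constant
        return False
    if np.abs(np.diff(w)).max() > max_step_sigma * (w.std() + 1e-8):  # abrupt step
        return False
    return True

def sliding_windows(series: np.ndarray, length: int = 512, stride: int = 128):
    for start in range(0, len(series) - length + 1, stride):
        w = series[start:start + length]
        if valid_window(w):
            yield w

windows = list(sliding_windows(np.random.randn(4096)))
```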

3. Multiresolution Design and Long-Context Forecasting

Traditional single-resolution models with context length $C$ directly observe only $C$ fine steps; at 1-minute resolution, forecasting over $W$ hours of history requires $C = 60W$ tokens. TSM's paired context instead provides $N$ coarse points (1 hour each), spanning the same interval as $N \cdot K$ fine points, alongside $N$ fine points (1 minute each). With only $1025$ tokens ($512 + 1 + 512$), the model observes $512$ hours of coarse history and $512$ minutes of fine history simultaneously.
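
As a quick back-of-the-envelope check of this coverage claim:

```python
# Coverage comparison using the numbers from the text.
N, K = 512, 60
single_res_tokens_needed = N * K    # 30720 one-minute points to cover 512 hours
tsm_inputs = N + 1 + N              # 1025: coarse context + special token + fine context
print(single_res_tokens_needed, tsm_inputs)   # 30720 vs 1025
```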

This structure allows the model to directly attend across all tokens—coarse-to-fine and within each resolution—enabling fusion of long-range low-frequency (trend, seasonality) and short-range high-frequency (intraday, noise) patterns. Empirically, this yields enhanced long-range forecasting performance without sacrificing local accuracy.

4. Empirical Evaluation

TSM was assessed on both observability and general time series benchmarks:

Observability Benchmarks

On out-of-domain, in-the-future splits at 1-minute granularity (context: 512 fine + 512 coarse, horizon: 128), metrics normalized by a last-value baseline (Naive) demonstrate consistent improvement:

| Metric | Cisco TSM | TimesFM-2.5 (512) | Chronos-2 (512) | Toto-1.0 (512) | AutoARIMA (512) |
| --- | --- | --- | --- | --- | --- |
| MSE | 0.8524 | 0.8838 | 0.8816 | 0.8836 | 4.0520 |
| MAE | 0.4788 | 0.6265 | 0.6023 | 0.6055 | 0.8545 |
| MASE | 0.4569 | 0.7290 | 0.7056 | 0.6834 | 0.9381 |
| sMAPE | 0.7758 | 0.8297 | 0.7811 | 0.7741† | 1.3316 |
| MSIS | 0.1207 | 0.1732 | 0.1773 | 0.2032 | 0.2562 |
| CRPS | 0.4126 | 0.5089 | 0.4878 | 0.4932 | 0.7444 |

(† best among single-resolution baselines.)

When single-resolution baselines are given a 1024-point fine-resolution context, TSM still leads or matches on these metrics. Similar results are observed for 5-minute resolution series.
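
The baseline normalization used above can be sketched as follows, assuming each raw metric is divided by the value achieved by a last-value (Naive) forecast on the same window; the exact aggregation across series is an assumption here:

```python
import numpy as np

# Naive-normalized MAE: ratio of model error to last-value-baseline error.
def naive_normalized_mae(y_true: np.ndarray, y_pred: np.ndarray, last_context_value: float) -> float:
    naive_pred = np.full_like(y_true, last_context_value)
    mae_model = np.mean(np.abs(y_true - y_pred))
    mae_naive = np.mean(np.abs(y_true - naive_pred))
    return mae_model / (mae_naive + 1e-12)
```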

General Forecasting Benchmark: GIFT-Eval

On non-leaking GIFT-Eval, for windows longer than 512 points (normalized by SeasonalNaive), TSM closely matches or slightly lags TimesFM-2.5 on global averages (MAE 0.6980 vs. 0.6635) but achieves higher performance on long-context subsets, indicating that multiresolution adaptation does not impair general-purpose forecasting.

Qualitative Analyses

  • For series exhibiting strong diurnal or weekly seasonality, coarse (1-hour) context captures patterns unreachable by 512-minute windows.
  • In series with noise or regime shifts, extended historical context helps filter transient spikes and identify underlying trends.

5. Limitations and Prospective Extensions

TSM currently fixes two input resolutions and employs a single special token to demarcate the coarse/fine boundary. More flexible formulations—such as variable-length contexts, more than two resolutions, or dynamic token placement—may yield further gains. The architecture is restricted to univariate modeling; multivariate temporal dependencies remain unmodeled. For extremely abrupt or chaotic time series, even long context is insufficient for effective prediction.

Continued pre-training (CPT) of TimesFM on mixed data with multiresolution tokens accelerates convergence and improves observability performance, without adverse impact on general benchmarks. Ablation studies indicate that simply concatenating the two contexts, without the resolution embeddings and special token, still achieves reasonable but slower learning.

6. Applications Within Cisco Observability

Within Cisco's observability stack, TSM is deployed for several forecasting workflows:

  • Real-time inference for infrastructure and application metrics at high (minute) and low (hour) granularities, supporting anomaly detection.
  • Capacity planning tasks, leveraging coarse context to project growth and seasonality over multi-week horizons.
  • As a zero-shot forecasting engine integrated into Splunk’s Observability Cloud, TSM forecasts novel series without per-series retraining.

Summary: The Cisco Time Series Model extends the TimesFM decoder-only backbone through a lightweight multiresolution scheme—comprising a special token, dedicated resolution embeddings, and multiresolution autoregressive updates. Trained on over 300 billion points (more than half from proprietary observability data), TSM provides a scalable, zero-shot forecasting backbone for observability scenarios while maintaining strong results on diverse, publicly available forecasting benchmarks (Gou et al., 25 Nov 2025).
