
Cisco Time Series Model (TSM) Overview

Updated 9 February 2026
  • Cisco TSM is a univariate zero-shot forecaster built on a decoder-only transformer with multiresolution input handling, integrating coarse- and fine-grained contexts.
  • The model uses a special delimiter token and resolution embeddings to combine the two resolutions, and was trained on over 300 billion data points that include large volumes of observability telemetry.
  • TSM is trained with curated data pipelines and modern optimization techniques, and is evaluated on both observability and general-purpose benchmarks, showing strong predictive accuracy and robust generalization.

The Cisco Time Series Model (TSM) is a univariate zero-shot forecaster designed for long-context, multiresolution time series forecasting, with a particular emphasis on observability datasets. TSM achieves its capabilities by introducing a general architectural innovation—multiresolution input handling—into a decoder-only transformer backbone (TimesFM), thereby enabling simultaneous ingestion and processing of coarse-grained (“history”) and fine-grained (“detail”) time series contexts. Trained on over 300 billion data points, including extensive high-resolution telemetry, TSM delivers state-of-the-art predictive accuracy in observability domains and maintains competitive performance on standard general-purpose benchmarks (Gou et al., 25 Nov 2025).

1. Architectural Principles and Model Structure

TSM is fundamentally built upon the TimesFM decoder-only transformer, which tokenizes a univariate time series into non-overlapping patches of fixed input length $P_{\mathrm{in}}$. Each patch $u \in \mathbb{R}^{P_{\mathrm{in}}}$ is embedded using a residual mapping

$$g_{\mathrm{in}}(u) = W_o\,\phi(W_h u) + W_r u$$

with learned weights $W_o, W_h, W_r$ and nonlinearity $\phi$, yielding a latent token in $\mathbb{R}^d$. The stacked token sequence is processed by $L$ causal transformer layers; outputs are mapped back to the data domain with $g_{\mathrm{out}}$.

The core innovation is the multiresolution extension: TSM concurrently processes a "coarse" view $x_c \in \mathbb{R}^{N_c}$ and a "fine" view $x_f \in \mathbb{R}^{N_f}$, predicting the next $P_{\mathrm{out}}$ fine-grained points. The coarse/fine granularity is fixed via the ratio

$$K = \frac{\text{coarse-resolution step size}}{\text{fine-resolution step size}}$$

(e.g., $K = 60$, corresponding to 1-hour coarse and 1-minute fine resolutions).

Preprocessing applies zero-mean, unit-variance normalization to each view:

$$\tilde{x}_c = \frac{x_c - \mu_c}{\sigma_c}, \qquad \tilde{x}_f = \frac{x_f - \mu_f}{\sigma_f}$$

Tokenization then splits each normalized stream into $M = N_c / P_{\mathrm{in}}$ patches, for a joint total of $2M$ tokens.
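
The normalization and patching step can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code; the function name and the values $N_c = N_f = 512$, $P_{\mathrm{in}} = 32$ are assumptions for the example:

```python
import numpy as np

def normalize_and_patch(x: np.ndarray, patch_len: int):
    """Zero-mean unit-variance normalize a context window, then split it
    into non-overlapping patches of length patch_len."""
    mu, sigma = x.mean(), x.std()
    x_tilde = (x - mu) / sigma
    # M = len(x) // patch_len non-overlapping patches.
    m = len(x) // patch_len
    patches = x_tilde[: m * patch_len].reshape(m, patch_len)
    return patches, mu, sigma

# Coarse and fine views are normalized and patched independently,
# yielding 2M tokens jointly.
x_c = np.random.randn(512)  # coarse context, N_c = 512 (illustrative)
x_f = np.random.randn(512)  # fine context, N_f = 512 (illustrative)
p_c, mu_c, sd_c = normalize_and_patch(x_c, 32)
p_f, mu_f, sd_f = normalize_and_patch(x_f, 32)
print(p_c.shape, p_f.shape)  # (16, 32) each -> 2M = 32 tokens total
```

The stored means and standard deviations are retained so forecasts can later be de-normalized back to the original scale (see Section 3).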

TSM introduces two architectural components:

  • Special Token (ST): a learnable vector $\text{ST} \in \mathbb{R}^d$ inserted as a delimiter between the coarse and fine streams, controlling attention flow.
  • Resolution Embeddings (RE): a learnable lookup $\text{RE} : \{0,1\} \rightarrow \mathbb{R}^d$ added to each token to encode its resolution type. With the sequence ordered as

$$[h_1, \dots, h_M, \text{ST}, h_{M+1}, \dots, h_{2M}]$$

each token is updated by

$$h_i \leftarrow h_i + \text{RE}(z_i), \qquad z_i = \begin{cases} 1 & i \leq M \\ 0 & i > M \text{ or } i = \text{ST} \end{cases}$$
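
The sequence assembly can be illustrated as follows. This is a minimal sketch with random stand-ins for the learned embeddings; in the model, the ST vector and RE table are trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 64, 16  # embedding dim and patches per stream (illustrative)

h_coarse = rng.normal(size=(M, d))  # embedded coarse patches h_1..h_M
h_fine = rng.normal(size=(M, d))    # embedded fine patches h_{M+1}..h_{2M}
st = rng.normal(size=(1, d))        # special token ST (learned in practice)
re_table = rng.normal(size=(2, d))  # resolution embeddings RE(0), RE(1)

# z_i = 1 for coarse tokens (i <= M), 0 for fine tokens and the ST.
z = np.array([1] * M + [0] + [0] * M)
seq = np.concatenate([h_coarse, st, h_fine], axis=0)
seq = seq + re_table[z]  # h_i <- h_i + RE(z_i)
print(seq.shape)  # (2M + 1, d) = (33, 64)
```

The resulting sequence of $2M + 1$ tokens is what the causal transformer layers attend over.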

During multiresolution autoregressive decoding, predicted fine-resolution outputs $\{\hat y_i\}$ are appended to the fine context and aggregated to update the coarse context for the next step:

$$\bar y_k = \frac{1}{K} \sum_{j=1}^{K} \hat y_{(k-1)K + j}, \qquad k = 1, \dots, \left\lfloor \frac{L}{K} \right\rfloor$$

This allows TSM to incorporate its own predictions into both views for subsequent predictions.
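
A sketch of the context-update rule, assuming NumPy and illustrative sizes (the function name is hypothetical):

```python
import numpy as np

def update_contexts(x_f, x_c, y_hat, K=60):
    """Append fine-resolution predictions to the fine context, and fold
    K-point averages of those predictions into the coarse context."""
    x_f = np.concatenate([x_f, y_hat])
    # bar{y}_k: mean of each complete block of K consecutive predictions.
    n_blocks = len(y_hat) // K
    if n_blocks:
        y_bar = y_hat[: n_blocks * K].reshape(n_blocks, K).mean(axis=1)
        x_c = np.concatenate([x_c, y_bar])
    return x_f, x_c

x_f = np.zeros(512)       # fine context (illustrative)
x_c = np.zeros(512)       # coarse context (illustrative)
y_hat = np.ones(128)      # one decoding step of fine predictions
x_f, x_c = update_contexts(x_f, x_c, y_hat, K=60)
print(len(x_f), len(x_c))  # 640, 514: two complete 60-point blocks folded in
```

Incomplete trailing blocks (here the last 8 of 128 points) are not yet aggregated; they would complete a coarse point on a later decoding step.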

2. Training Data Composition and Preprocessing

TSM was trained on a corpus of approximately 300 billion unique data points, composed as follows:

Data Source                     Percentage   Notes
1-min observability series      35%          ≈400M, 13 months
5-min observability roll-ups    16.5%        —
GIFT-Eval public corpus         29.5%        4.5M series, 230B pts
Chronos public corpus           4.5%         0.9M series, 85B pts
Synthetic via KernelSynth       14.5%        —

Preprocessing includes last-value extrapolation for short gaps and dropping excessively gappy series, differencing cumulative counters, and extracting sliding windows for context/horizon pairs with strict temporal splits to avoid “peeking.” Fine-resolution windows undergo further filtering: flat spots or insufficient unique values are dropped, as are those with high horizon/context deviation or low “learnability” (approximated via spectral entropy). Diversity is promoted through proportional source mixing, history-length balancing, and medoid-based deduplication via SimHash clustering.
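
A few of these steps can be sketched concretely. The thresholds, function name, and drop criteria below are illustrative assumptions, not the paper's actual values:

```python
import numpy as np

def preprocess_series(x: np.ndarray, is_counter: bool, max_gap: int = 3):
    """Fill short gaps by last-value extrapolation, drop overly gappy
    series, and difference cumulative counters."""
    x = x.astype(float).copy()
    if np.isnan(x).mean() > 0.5:        # illustrative gappiness threshold
        return None                      # drop excessively gappy series
    # Last-value extrapolation for short gaps; drop on long gaps.
    last = x[0] if not np.isnan(x[0]) else 0.0
    run = 0
    for i in range(len(x)):
        if np.isnan(x[i]):
            run += 1
            if run > max_gap:
                return None              # gap too long: drop the series
            x[i] = last
        else:
            last, run = x[i], 0
    if is_counter:
        x = np.diff(x)                   # cumulative counter -> increments
    return x

series = np.array([1.0, 2.0, np.nan, 4.0, 10.0])
print(preprocess_series(series, is_counter=True))  # diffs of [1, 2, 2, 4, 10]
```

The remaining steps (spectral-entropy learnability filtering, SimHash-based deduplication) involve more machinery and are omitted here.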

Supervision targets both point forecasts (means) and quantiles. For each horizon point $y_i$, with predicted mean $\hat y_i$ and quantiles $\hat y_i^{(q)}$ for $q \in \{0.1, \dots, 0.9\}$:

  • Mean squared error loss:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{L} \sum_{i=1}^{L} (\hat y_i - y_i)^2$$

  • Quantile loss:

$$\mathcal{L}_{\mathrm{Quantile}}^{(q)} = \frac{1}{L} \sum_{i=1}^{L} \begin{cases} q\,(y_i - \hat y_i^{(q)}) & \text{if } y_i \geq \hat y_i^{(q)} \\ (1 - q)\,(\hat y_i^{(q)} - y_i) & \text{otherwise} \end{cases}$$

  • Total training loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \sum_{q \in \{0.1, \dots, 0.9\}} \mathcal{L}_{\mathrm{Quantile}}^{(q)}$$
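
The combined objective can be sketched as follows (the function name is illustrative; the quantile term is the standard pinball loss):

```python
import numpy as np

QUANTILES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def tsm_loss(y, y_mean, y_quantiles):
    """MSE on the mean forecast plus pinball loss summed over quantiles.

    y:           (L,) true horizon values
    y_mean:      (L,) predicted means
    y_quantiles: (Q, L) predicted quantiles, rows ordered as QUANTILES
    """
    mse = np.mean((y_mean - y) ** 2)
    quantile_total = 0.0
    for q, y_q in zip(QUANTILES, y_quantiles):
        diff = y - y_q
        # q * (y - y_q) if y >= y_q, else (1 - q) * (y_q - y)
        quantile_total += np.mean(np.where(diff >= 0, q * diff, (q - 1) * diff))
    return mse + quantile_total

y = np.array([1.0, 2.0])
y_q = np.tile(y, (9, 1))          # perfect quantile forecasts
print(tsm_loss(y, y, y_q))        # 0.0: perfect forecasts incur no loss
```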

The model has 500 million parameters. Training employed parameter-group-specific optimizers (AdamW and Muon), cosine learning-rate annealing, gradient clipping, and large-batch distributed training (batch size 65,536 on 64×H200 GPUs). Early stopping selected the checkpoint with the best validation loss, typically reached around epoch 5–10 of 20.

3. Zero-Shot Forecasting Mechanism

TSM is a zero-shot forecaster: it can be applied to any new univariate series without fine-tuning. Inference extracts the most recent context window at both coarse and fine resolutions, normalizes each by its own mean and standard deviation, and de-normalizes the output with the fine-view statistics:

$$\hat{y}_{1:L} = \mu_f + \sigma_f\, F\!\left( \frac{x_c - \mu_c}{\sigma_c},\; \frac{x_f - \mu_f}{\sigma_f} \right)$$

where $F$ is the trained TSM mapping. Multistep autoregression proceeds as described above; no recalibration or adaptation to the new series is performed at inference. A plausible implication is that TSM's generalization largely derives from the breadth and balancing of its training distribution.
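
A sketch of the inference wrapper, with `tsm_forward` standing in for the trained mapping $F$ (both names are hypothetical):

```python
import numpy as np

def zero_shot_forecast(x_c, x_f, tsm_forward):
    """Normalize both views, run the trained model, and de-normalize the
    fine-resolution forecast with the fine view's statistics."""
    mu_c, sd_c = x_c.mean(), x_c.std()
    mu_f, sd_f = x_f.mean(), x_f.std()
    y_norm = tsm_forward((x_c - mu_c) / sd_c, (x_f - mu_f) / sd_f)
    return mu_f + sd_f * y_norm

# With a stand-in "model" that simply echoes the last four normalized fine
# values, de-normalization recovers them on the original scale.
fake_model = lambda xc, xf: xf[-4:]
x_f = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
x_c = np.array([5.0, 15.0, 25.0])
print(zero_shot_forecast(x_c, x_f, fake_model))  # [14. 16. 18. 20.]
```

Because the de-normalization exactly inverts the fine-view normalization, the model operates entirely in standardized space while forecasts stay on the series' native scale.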

4. Quantitative and Qualitative Evaluation

4.1 Observability Benchmarks

TSM was evaluated on 1-min and 5-min observability series, benchmarked against a Naive baseline and TimesFM-2.5. For context length 512 (horizon not specified):

Metric   TSM      TimesFM-2.5
MSE      0.8524   0.8838
MAE      0.4788   0.6265
MASE     0.4569   0.7290
sMAPE    0.7758   0.8297
MSIS     0.1207   0.1732
CRPS     0.4126   0.5089

These results confirm that TSM delivers superior performance across all tabulated metrics. Gains persist in 5-min data and at longer context lengths (e.g., 1024), where TSM continues to outperform or match baselines.

4.2 General-Purpose (GIFT-Eval) Benchmarks

On GIFT-Eval, with public “leaked” datasets removed, TSM (512+512 context) achieves:

  • MSE: 0.5423 (TSM) vs. 0.5111 (TimesFM-2.5, 1024 context)
  • MAE: 0.6980 vs. 0.6635
  • MASE: 0.7365 vs. 0.6828
  • sMAPE: 1.1053 vs. 0.9416
  • MSIS: 0.5649 vs. 0.5230
  • CRPS: 0.5508 vs. 0.5247

TSM remains highly competitive on general-purpose tasks, showing only slight degradation relative to general-purpose specialist models, while clearly excelling in the observability domain.

4.3 Ablation Studies

Ablation experiments assessed the architectural components at two training scales (10B and 35B samples). Four variants were compared: plain concatenation (CONCAT), resolution embeddings only (RE), special token only (ST), and RE+ST. At the 10B scale, normalized MAE on 1-min observability data:

  • CONCAT: 0.5144
  • RE+ST: 0.5246
  • ST only: 1.3361
  • RE only: 0.5227

At larger scale (35B), architectures with both RE and ST converge faster and match or slightly exceed the alternatives. ST-only and RE-only formulations showed instability at scale.

5. Qualitative Analysis and Representative Scenarios

TSM’s multiresolution approach reveals enhanced extrapolation ability on long-context scenarios:

  • Long-coarse context with low error (“long-low” quadrant): Access to extended coarse histories allows TSM to resolve seasonal or sawtooth temporal phenomena that are otherwise ambiguous from fine-grained context alone.
  • Periodic series and multi-regime behavior: Longer history yields phase-locked forecasts and correct regime selection in cases where short-term and long-term trends conflict.
  • Sawtooth and complex dynamics: Hour-scale structure, visible only in coarse context, enables correct frequency modeling.
  • Limitations remain for cases with padding-heavy coarse context and abrupt transitions, as demonstrated by residual high error in corresponding case studies.

6. Limitations and Prospects for Extension

Several architectural and procedural limitations are identified:

  • The fixed insertion point of the special token (ST) restricts positional flexibility; extensions to variable-length or hierarchical block insertions could be explored.
  • TSM currently supports only two input resolutions; generalizing to multiple hierarchical resolutions would require multiple STs and REs, forming an area for future investigation.
  • The autoregressive multistep decoding mechanism is susceptible to bias accumulation across long horizons; non-autoregressive or partially parallel decoding may mitigate drift.
  • The quantile-regression output head provides coarse uncertainty estimation; adoption of more expressive probabilistic heads (e.g., mixture models) could improve uncertainty quantification.

In aggregate, TSM demonstrates that decoder-only multiresolution transformers, when trained on a carefully curated, large-scale, diverse corpus, can deliver state-of-the-art, zero-shot forecasts in observability and competitive accuracy in broad forecasting tasks. The architecture, preprocessing strategies, and ablation evidence provide concrete guidelines for adapting transformer-based time series foundation models (TSFMs) to settings with long-range dependencies and multiscale structure (Gou et al., 25 Nov 2025).
