
Cisco Time Series Model (TSM) Overview

Updated 9 February 2026
  • Cisco TSM is a univariate zero-shot forecaster built on a decoder-only transformer with multiresolution input handling, integrating coarse- and fine-grained contexts.
  • The model uses a special delimiter token and resolution embeddings to combine the two resolutions, and was trained on over 300 billion data points that include large volumes of observability telemetry.
  • TSM is trained with curated data pipelines and modern optimization techniques, and is evaluated on both observability and general-purpose benchmarks, showing strong predictive accuracy and robust generalization.

The Cisco Time Series Model (TSM) is a univariate zero-shot forecaster designed for long-context, multiresolution time series forecasting, with a particular emphasis on observability datasets. TSM achieves its capabilities by introducing a general architectural innovation—multiresolution input handling—into a decoder-only transformer backbone (TimesFM), thereby enabling simultaneous ingestion and processing of coarse-grained (“history”) and fine-grained (“detail”) time series contexts. Trained on over 300 billion data points, including extensive high-resolution telemetry, TSM delivers state-of-the-art predictive accuracy in observability domains and maintains competitive performance on standard general-purpose benchmarks (Gou et al., 25 Nov 2025).

1. Architectural Principles and Model Structure

TSM is fundamentally built upon the TimesFM decoder-only transformer, which tokenizes a univariate time series into non-overlapping patches of fixed input length $P_{\mathrm{in}}$. Each patch $u \in \mathbb{R}^{P_{\mathrm{in}}}$ is embedded using a residual mapping

$$g_{\mathrm{in}}(u) = W_o\,\phi(W_h u) + W_r u$$

with learned weights $W_o, W_h, W_r$ and nonlinearity $\phi$, yielding a latent token in $\mathbb{R}^d$. The stacked token sequence is processed by $L$ causal transformer layers; outputs are mapped back to the data domain with $g_{\mathrm{out}}$.

The core innovation is the multiresolution extension: TSM concurrently processes a "coarse" view $x_c \in \mathbb{R}^{N_c}$ and a "fine" view $x_f \in \mathbb{R}^{N_f}$, predicting the next $P_{\mathrm{out}}$ fine-grained points. The coarse/fine granularity is fixed via the ratio

$$K = \frac{\text{coarse-resolution step size}}{\text{fine-resolution step size}}$$

(e.g., $K = 60$, corresponding to 1-hour coarse and 1-minute fine resolutions).

Preprocessing applies zero-mean, unit-variance normalization to each view:

$$\tilde{x}_c = \frac{x_c - \mu_c}{\sigma_c}, \qquad \tilde{x}_f = \frac{x_f - \mu_f}{\sigma_f}$$

Tokenization then splits each normalized stream into $M = N_c / P_{\mathrm{in}}$ patches, for a joint total of $2M$ tokens.
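
The normalization and patching step can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code; the function name and the values $N_c = N_f = 512$, $P_{\mathrm{in}} = 32$ are assumptions for the example:

```python
import numpy as np

def normalize_and_patch(x: np.ndarray, patch_len: int):
    """Zero-mean unit-variance normalize a context window, then split it
    into non-overlapping patches of length patch_len."""
    mu, sigma = x.mean(), x.std()
    x_tilde = (x - mu) / sigma
    # M = len(x) // patch_len non-overlapping patches.
    m = len(x) // patch_len
    patches = x_tilde[: m * patch_len].reshape(m, patch_len)
    return patches, mu, sigma

# Coarse and fine views are normalized and patched independently,
# yielding 2M tokens jointly.
x_c = np.random.randn(512)  # coarse context, N_c = 512 (illustrative)
x_f = np.random.randn(512)  # fine context, N_f = 512 (illustrative)
p_c, mu_c, sd_c = normalize_and_patch(x_c, 32)
p_f, mu_f, sd_f = normalize_and_patch(x_f, 32)
print(p_c.shape, p_f.shape)  # (16, 32) each -> 2M = 32 tokens total
```

The stored means and standard deviations are retained so forecasts can later be de-normalized back to the original scale (see Section 3).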

TSM introduces two architectural components:

  • Special Token (ST): a learnable vector $\text{ST} \in \mathbb{R}^d$ inserted as a delimiter between the coarse and fine streams, controlling attention flow.
  • Resolution Embeddings (RE): a learnable lookup $\text{RE} : \{0,1\} \rightarrow \mathbb{R}^d$ added to each token to encode its resolution type. With the sequence ordered as

$$[h_1, \dots, h_M, \text{ST}, h_{M+1}, \dots, h_{2M}]$$

each token is updated by

$$h_i \leftarrow h_i + \text{RE}(z_i), \qquad z_i = \begin{cases} 1 & i \leq M \\ 0 & i > M \text{ or } i = \text{ST} \end{cases}$$
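
The sequence assembly can be illustrated as follows. This is a minimal sketch with random stand-ins for the learned embeddings; in the model, the ST vector and RE table are trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 64, 16  # embedding dim and patches per stream (illustrative)

h_coarse = rng.normal(size=(M, d))  # embedded coarse patches h_1..h_M
h_fine = rng.normal(size=(M, d))    # embedded fine patches h_{M+1}..h_{2M}
st = rng.normal(size=(1, d))        # special token ST (learned in practice)
re_table = rng.normal(size=(2, d))  # resolution embeddings RE(0), RE(1)

# z_i = 1 for coarse tokens (i <= M), 0 for fine tokens and the ST.
z = np.array([1] * M + [0] + [0] * M)
seq = np.concatenate([h_coarse, st, h_fine], axis=0)
seq = seq + re_table[z]  # h_i <- h_i + RE(z_i)
print(seq.shape)  # (2M + 1, d) = (33, 64)
```

The resulting sequence of $2M + 1$ tokens is what the causal transformer layers attend over.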

During multiresolution autoregressive decoding, predicted fine-resolution outputs $\{\hat y_i\}$ are appended to the fine context and aggregated to update the coarse context for the next step:

$$\bar y_k = \frac{1}{K} \sum_{j=1}^{K} \hat y_{(k-1)K + j}, \qquad k = 1, \dots, \left\lfloor \frac{L}{K} \right\rfloor$$

This allows TSM to incorporate its own predictions into both views for subsequent predictions.
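
A sketch of the context-update rule, assuming NumPy and illustrative sizes (the function name is hypothetical):

```python
import numpy as np

def update_contexts(x_f, x_c, y_hat, K=60):
    """Append fine-resolution predictions to the fine context, and fold
    K-point averages of those predictions into the coarse context."""
    x_f = np.concatenate([x_f, y_hat])
    # bar{y}_k: mean of each complete block of K consecutive predictions.
    n_blocks = len(y_hat) // K
    if n_blocks:
        y_bar = y_hat[: n_blocks * K].reshape(n_blocks, K).mean(axis=1)
        x_c = np.concatenate([x_c, y_bar])
    return x_f, x_c

x_f = np.zeros(512)       # fine context (illustrative)
x_c = np.zeros(512)       # coarse context (illustrative)
y_hat = np.ones(128)      # one decoding step of fine predictions
x_f, x_c = update_contexts(x_f, x_c, y_hat, K=60)
print(len(x_f), len(x_c))  # 640, 514: two complete 60-point blocks folded in
```

Incomplete trailing blocks (here the last 8 of 128 points) are not yet aggregated; they would complete a coarse point on a later decoding step.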

2. Training Data Composition and Preprocessing

TSM was trained on a corpus of approximately 300 billion unique data points, composed as follows:

Data Source                     Percentage   Notes
1-min observability series      35%          ≈400M, 13 months
5-min observability roll-ups    16.5%        —
GIFT-Eval public corpus         29.5%        4.5M series, 230B pts
Chronos public corpus           4.5%         0.9M series, 85B pts
Synthetic via KernelSynth       14.5%        —

Preprocessing includes last-value extrapolation for short gaps and dropping excessively gappy series, differencing cumulative counters, and extracting sliding windows for context/horizon pairs with strict temporal splits to avoid “peeking.” Fine-resolution windows undergo further filtering: flat spots or insufficient unique values are dropped, as are those with high horizon/context deviation or low “learnability” (approximated via spectral entropy). Diversity is promoted through proportional source mixing, history-length balancing, and medoid-based deduplication via SimHash clustering.
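
A few of these steps can be sketched concretely. The thresholds, function name, and drop criteria below are illustrative assumptions, not the paper's actual values:

```python
import numpy as np

def preprocess_series(x: np.ndarray, is_counter: bool, max_gap: int = 3):
    """Fill short gaps by last-value extrapolation, drop overly gappy
    series, and difference cumulative counters."""
    x = x.astype(float).copy()
    if np.isnan(x).mean() > 0.5:        # illustrative gappiness threshold
        return None                      # drop excessively gappy series
    # Last-value extrapolation for short gaps; drop on long gaps.
    last = x[0] if not np.isnan(x[0]) else 0.0
    run = 0
    for i in range(len(x)):
        if np.isnan(x[i]):
            run += 1
            if run > max_gap:
                return None              # gap too long: drop the series
            x[i] = last
        else:
            last, run = x[i], 0
    if is_counter:
        x = np.diff(x)                   # cumulative counter -> increments
    return x

series = np.array([1.0, 2.0, np.nan, 4.0, 10.0])
print(preprocess_series(series, is_counter=True))  # diffs of [1, 2, 2, 4, 10]
```

The remaining steps (spectral-entropy learnability filtering, SimHash-based deduplication) involve more machinery and are omitted here.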

Supervision targets both point forecasts (means) and quantiles. For each horizon point $y_i$, with predicted mean $\hat y_i$ and quantiles $\hat y_i^{(q)}$ for $q \in \{0.1, \dots, 0.9\}$:

  • Mean squared error loss:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{L} \sum_{i=1}^{L} (\hat y_i - y_i)^2$$

  • Quantile loss:

$$\mathcal{L}_{\mathrm{Quantile}}^{(q)} = \frac{1}{L} \sum_{i=1}^{L} \begin{cases} q\,(y_i - \hat y_i^{(q)}) & \text{if } y_i \geq \hat y_i^{(q)} \\ (1 - q)\,(\hat y_i^{(q)} - y_i) & \text{otherwise} \end{cases}$$

  • Total training loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \sum_{q \in \{0.1, \dots, 0.9\}} \mathcal{L}_{\mathrm{Quantile}}^{(q)}$$
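
The combined objective can be sketched as follows (the function name is illustrative; the quantile term is the standard pinball loss):

```python
import numpy as np

QUANTILES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def tsm_loss(y, y_mean, y_quantiles):
    """MSE on the mean forecast plus pinball loss summed over quantiles.

    y:           (L,) true horizon values
    y_mean:      (L,) predicted means
    y_quantiles: (Q, L) predicted quantiles, rows ordered as QUANTILES
    """
    mse = np.mean((y_mean - y) ** 2)
    quantile_total = 0.0
    for q, y_q in zip(QUANTILES, y_quantiles):
        diff = y - y_q
        # q * (y - y_q) if y >= y_q, else (1 - q) * (y_q - y)
        quantile_total += np.mean(np.where(diff >= 0, q * diff, (q - 1) * diff))
    return mse + quantile_total

y = np.array([1.0, 2.0])
y_q = np.tile(y, (9, 1))          # perfect quantile forecasts
print(tsm_loss(y, y, y_q))        # 0.0: perfect forecasts incur no loss
```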

The model has 500 million parameters. Training employed parameter-group-specific optimizers (AdamW and Muon), cosine learning-rate annealing, gradient clipping, and large-batch distributed training (batch size 65,536 on 64×H200 GPUs). Early stopping selected the checkpoint with the best validation loss, typically reached around epoch 5–10 of 20.

3. Zero-Shot Forecasting Mechanism

TSM is a zero-shot forecaster: it can be applied to any new univariate series without fine-tuning. Inference extracts the most recent context window at both coarse and fine resolutions, normalizes each by its own mean and standard deviation, and de-normalizes the output with the fine-view statistics:

$$\hat{y}_{1:L} = \mu_f + \sigma_f\, F\!\left( \frac{x_c - \mu_c}{\sigma_c},\; \frac{x_f - \mu_f}{\sigma_f} \right)$$

where $F$ is the trained TSM mapping. Multistep autoregression proceeds as described above; no recalibration or adaptation to the new series is performed at inference. A plausible implication is that TSM's generalization largely derives from the breadth and balancing of its training distribution.
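
A sketch of the inference wrapper, with `tsm_forward` standing in for the trained mapping $F$ (both names are hypothetical):

```python
import numpy as np

def zero_shot_forecast(x_c, x_f, tsm_forward):
    """Normalize both views, run the trained model, and de-normalize the
    fine-resolution forecast with the fine view's statistics."""
    mu_c, sd_c = x_c.mean(), x_c.std()
    mu_f, sd_f = x_f.mean(), x_f.std()
    y_norm = tsm_forward((x_c - mu_c) / sd_c, (x_f - mu_f) / sd_f)
    return mu_f + sd_f * y_norm

# With a stand-in "model" that simply echoes the last four normalized fine
# values, de-normalization recovers them on the original scale.
fake_model = lambda xc, xf: xf[-4:]
x_f = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
x_c = np.array([5.0, 15.0, 25.0])
print(zero_shot_forecast(x_c, x_f, fake_model))  # [14. 16. 18. 20.]
```

Because the de-normalization exactly inverts the fine-view normalization, the model operates entirely in standardized space while forecasts stay on the series' native scale.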

4. Quantitative and Qualitative Evaluation

4.1 Observability Benchmarks

TSM was evaluated on 1-min and 5-min observability series, benchmarked against a Naive baseline and TimesFM-2.5. For context length 512 (horizon not specified):

Metric   TSM      TimesFM-2.5
MSE      0.8524   0.8838
MAE      0.4788   0.6265
MASE     0.4569   0.7290
sMAPE    0.7758   0.8297
MSIS     0.1207   0.1732
CRPS     0.4126   0.5089

These results confirm that TSM delivers superior performance across all tabulated metrics. Gains persist in 5-min data and at longer context lengths (e.g., 1024), where TSM continues to outperform or match baselines.

4.2 General-Purpose (GIFT-Eval) Benchmarks

On GIFT-Eval, with public “leaked” datasets removed, TSM (512+512 context) achieves:

  • MSE: 0.5423 (TSM) vs. 0.5111 (TimesFM-2.5, 1024 context)
  • MAE: 0.6980 vs. 0.6635
  • MASE: 0.7365 vs. 0.6828
  • sMAPE: 1.1053 vs. 0.9416
  • MSIS: 0.5649 vs. 0.5230
  • CRPS: 0.5508 vs. 0.5247

TSM remains highly competitive on general-purpose tasks, showing only slight degradation relative to general-purpose specialist models, while clearly excelling in the observability domain.

4.3 Ablation Studies

Ablation experiments assessed the architectural components at two training scales (10B and 35B samples). Four variants were compared: plain concatenation (CONCAT), resolution embeddings only (RE), special token only (ST), and RE+ST. At the 10B scale, normalized MAE on 1-min observability data:

  • CONCAT: 0.5144
  • RE+ST: 0.5246
  • ST only: 1.3361
  • RE only: 0.5227

At larger scale (35B), architectures with both RE and ST converge faster and match or slightly exceed the alternatives. ST-only and RE-only formulations showed instability at scale.

5. Qualitative Analysis and Representative Scenarios

TSM’s multiresolution approach reveals enhanced extrapolation ability on long-context scenarios:

  • Long-coarse context with low error (“long-low” quadrant): Access to extended coarse histories allows TSM to resolve seasonal or sawtooth temporal phenomena that are otherwise ambiguous from fine-grained context alone.
  • Periodic series and multi-regime behavior: Longer history yields phase-locked forecasts and correct regime selection in cases where short-term and long-term trends conflict.
  • Sawtooth and complex dynamics: Hour-scale structure, visible only in coarse context, enables correct frequency modeling.
  • Limitations remain for cases with padding-heavy coarse context and abrupt transitions, as demonstrated by residual high error in corresponding case studies.

6. Limitations and Prospects for Extension

Several architectural and procedural limitations are identified:

  • The fixed insertion point of the special token (ST) restricts positional flexibility; extensions to variable-length or hierarchical block insertions could be explored.
  • TSM currently supports only two input resolutions; generalizing to multiple hierarchical resolutions would require multiple STs and REs, forming an area for future investigation.
  • The autoregressive multistep decoding mechanism is susceptible to bias accumulation across long horizons; non-autoregressive or partially parallel decoding may mitigate drift.
  • The quantile-regression output head provides coarse uncertainty estimation; adoption of more expressive probabilistic heads (e.g., mixture models) could improve uncertainty quantification.

In aggregate, TSM demonstrates that decoder-only multiresolution transformers, when trained on a carefully curated, large-scale, diverse corpus, can deliver state-of-the-art, zero-shot forecasts in observability and competitive accuracy in broad forecasting tasks. The architecture, preprocessing strategies, and ablation evidence provide concrete guidelines for adapting transformer-based time series foundation models (TSFMs) to settings with long-range dependencies and multiscale structure (Gou et al., 25 Nov 2025).
