Cisco Time Series Model (TSM) Overview
- Cisco TSM is a univariate zero-shot forecaster that uses a decoder-only transformer with multiresolution input handling, integrating coarse and fine-grained contexts.
- The model uses a learnable special token and resolution embeddings to jointly process the two streams, and is trained on over 300 billion data points dominated by observability telemetry.
- TSM is trained with large-scale distributed optimization and evaluated on both observability and general-purpose benchmarks, demonstrating superior predictive accuracy in observability and robust generalization elsewhere.
The Cisco Time Series Model (TSM) is a univariate zero-shot forecaster designed for long-context, multiresolution time series forecasting, with a particular emphasis on observability datasets. TSM achieves its capabilities by introducing a general architectural innovation—multiresolution input handling—into a decoder-only transformer backbone (TimesFM), thereby enabling simultaneous ingestion and processing of coarse-grained (“history”) and fine-grained (“detail”) time series contexts. Trained on over 300 billion data points, including extensive high-resolution telemetry, TSM delivers state-of-the-art predictive accuracy in observability domains and maintains competitive performance on standard general-purpose benchmarks (Gou et al., 25 Nov 2025).
1. Architectural Principles and Model Structure
TSM is fundamentally built upon the TimesFM decoder-only transformer, which tokenizes a univariate time series into non-overlapping patches of fixed input length $p$. Each patch $x_j \in \mathbb{R}^p$ is embedded using a residual mapping $t_j = W_2\,\sigma(W_1 x_j + b_1) + b_2 + W_r x_j$, with learned parameters $\{W_1, W_2, W_r, b_1, b_2\}$ and nonlinearity $\sigma$, yielding a latent token in $\mathbb{R}^d$. The stacked token sequence is processed by causal transformer layers; outputs are mapped back to the data domain with an analogous output residual block.
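The patching and residual embedding described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `embed_patches` and the exact residual-block form (two-layer MLP plus a linear skip) are assumptions.

```python
import numpy as np

def embed_patches(series, patch_len, W1, b1, W2, b2, W_res, b_res):
    """Embed non-overlapping patches with an assumed residual mapping:
    token = W2 @ relu(W1 @ patch + b1) + b2 + W_res @ patch + b_res."""
    n_patches = len(series) // patch_len
    patches = series[: n_patches * patch_len].reshape(n_patches, patch_len)
    hidden = np.maximum(patches @ W1 + b1, 0.0)            # nonlinearity sigma
    tokens = hidden @ W2 + b2 + (patches @ W_res + b_res)  # residual skip path
    return tokens  # shape (n_patches, d_model)
```

Any trailing partial patch is dropped, matching the fixed-length, non-overlapping tokenization.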
The core innovation is the multiresolution extension: TSM concurrently processes a “coarse” view $x^{(c)}$ and a “fine” view $x^{(f)}$, predicting the next $h$ fine-grained points. The coarse/fine granularity is fixed via the resolution ratio $r = \Delta_c / \Delta_f$ (e.g., $r = 60$, corresponding to 1-h and 1-min levels).
Preprocessing involves zero-mean unit-variance normalization of each stream, $\tilde{x} = (x - \mu)/\sigma$. Tokenization then splits each normalized stream into $M$ patches, for a joint total of $2M$ tokens.
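The coarse-view construction and per-stream normalization can be sketched as below. Mean aggregation over groups of $r$ fine points is an assumed derivation of the coarse view, and `build_multires_context` is a hypothetical helper name.

```python
import numpy as np

def build_multires_context(fine, r, patch_len):
    """Derive a coarse view by mean-aggregating r fine steps (assumed scheme),
    z-normalize each stream independently, and split into patches."""
    n = len(fine) // r
    coarse = fine[: n * r].reshape(n, r).mean(axis=1)

    def normalize(x):
        mu, sigma = x.mean(), x.std() + 1e-8  # guard against flat series
        return (x - mu) / sigma, mu, sigma

    def patch(x):
        m = len(x) // patch_len
        return x[: m * patch_len].reshape(m, patch_len)

    f_norm, f_mu, f_sigma = normalize(fine)
    c_norm, _, _ = normalize(coarse)
    # fine-stream statistics are kept so forecasts can be de-normalized later
    return patch(c_norm), patch(f_norm), (f_mu, f_sigma)
```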
TSM introduces two architectural components:
- Special Token (ST): a learnable vector inserted as a delimiter between the coarse and fine streams, controlling attention flow.
- Resolution Embeddings (RE): a learnable lookup added to each token to encode its resolution type, yielding the sequence order $[\text{coarse tokens} + e_c],\ \text{ST},\ [\text{fine tokens} + e_f]$.
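Assembling the joint token sequence from these two components might look as follows (a sketch under the assumption that the ST itself receives no resolution embedding, since it acts purely as a delimiter):

```python
import numpy as np

def assemble_tokens(coarse_tokens, fine_tokens, st, re_coarse, re_fine):
    """Assumed layout: [coarse + RE_coarse] [ST] [fine + RE_fine]."""
    seq = np.concatenate([
        coarse_tokens + re_coarse,  # resolution embedding added per token
        st[None, :],                # learnable special delimiter token
        fine_tokens + re_fine,
    ], axis=0)
    return seq  # shape (M_coarse + 1 + M_fine, d_model)
```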
During multiresolution autoregressive decoding, predicted fine-resolution outputs $\hat{y}$ are appended to the fine context, and each completed group of $r$ fine points is aggregated to update the coarse context for the next step. This allows TSM to incorporate its own predictions into both views for subsequent predictions.
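The decoding loop above can be sketched with a stand-in `model` callable. Mean aggregation of each completed group of $r$ fine points, and the step size per forward pass, are assumptions for illustration.

```python
import numpy as np

def autoregressive_decode(model, coarse, fine, r, horizon, step):
    """Multistep decoding: each step predicts `step` fine points, appends them
    to the fine context, and rolls completed groups of r fine points up into
    the coarse context via mean aggregation (assumed scheme)."""
    fine, coarse = list(fine), list(coarse)
    buf = []   # fine-resolution points not yet aggregated into the coarse view
    out = []
    while len(out) < horizon:
        pred = list(model(np.asarray(coarse), np.asarray(fine))[:step])
        out += pred
        fine += pred
        buf += pred
        while len(buf) >= r:
            coarse.append(float(np.mean(buf[:r])))  # update coarse context
            buf = buf[r:]
    return np.asarray(out[:horizon])
```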
2. Training Data Composition and Preprocessing
TSM was trained on a corpus of approximately 300 billion unique data points, composed as follows:
| Data Source | Percentage | Notes |
|---|---|---|
| 1-min observability series | 35% | ≈400M, 13 months |
| 5-min observability roll-ups | 16.5% | |
| GIFT-Eval public corpus | 29.5% | 4.5M series, 230B pts |
| Chronos public corpus | 4.5% | 0.9M series, 85B pts |
| Synthetic via KernelSynth | 14.5% | |
Preprocessing includes last-value extrapolation for short gaps and dropping excessively gappy series, differencing cumulative counters, and extracting sliding windows for context/horizon pairs with strict temporal splits to avoid “peeking.” Fine-resolution windows undergo further filtering: flat spots or insufficient unique values are dropped, as are those with high horizon/context deviation or low “learnability” (approximated via spectral entropy). Diversity is promoted through proportional source mixing, history-length balancing, and medoid-based deduplication via SimHash clustering.
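The “learnability” filter approximated via spectral entropy could be implemented along these lines. The exact normalization and any acceptance threshold are assumptions; the paper only states that spectral entropy is used as the approximation.

```python
import numpy as np

def spectral_entropy(x):
    """Normalized Shannon entropy of the power spectrum in [0, 1].
    Low values indicate a concentrated spectrum (more learnable structure);
    values near 1 indicate noise-like, hard-to-learn windows."""
    psd = np.abs(np.fft.rfft(x - x.mean())) ** 2
    psd = psd[1:]                      # drop the DC bin
    p = psd / psd.sum()
    p = p[p > 0]                       # avoid log(0)
    return float(-(p * np.log(p)).sum() / np.log(len(psd)))
```

A window would then be kept only if its entropy falls below some tuned threshold, alongside the flat-spot and deviation filters described above.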
Supervision targets both point forecasts (means) and quantiles. For each horizon point $t = 1, \dots, h$, with predicted mean $\hat{\mu}_t$ and quantiles $\hat{y}_t^{(q)}$ for $q \in Q$:
- Mean squared error loss: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{h} \sum_{t=1}^{h} \left(\hat{\mu}_t - y_t\right)^2$
- Quantile (pinball) loss: $\mathcal{L}_{Q} = \frac{1}{h\,|Q|} \sum_{t=1}^{h} \sum_{q \in Q} \max\!\left(q\,(y_t - \hat{y}_t^{(q)}),\ (q - 1)\,(y_t - \hat{y}_t^{(q)})\right)$
- Total training loss: $\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{Q}$
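The composite objective can be sketched as below, assuming equal weighting of the MSE and averaged pinball terms (the relative weighting is an assumption):

```python
import numpy as np

def tsm_loss(y, mu_hat, q_hat, quantiles):
    """MSE on the mean head plus averaged pinball loss over quantile heads.
    y, mu_hat: arrays of shape (h,); q_hat: one (h,) array per quantile."""
    mse = np.mean((mu_hat - y) ** 2)
    pinball = 0.0
    for q, yq in zip(quantiles, q_hat):
        e = y - yq
        # pinball loss: q*e when under-predicting, (q-1)*e when over-predicting
        pinball += np.mean(np.maximum(q * e, (q - 1) * e))
    return mse + pinball / len(quantiles)
```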
The model comprises 500 million parameters. Training employed parameter-specific optimizers (AdamW and Muon), cosine learning-rate annealing, gradient clipping, and large-batch distributed training (batch size 65,536 on 64×H200 GPUs). Early stopping occurred at the optimal validation loss, typically around epoch 5–10 of 20.
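The cosine learning-rate annealing mentioned above has a standard shape; a generic sketch follows (the optional linear warmup and minimum learning rate are assumptions, as the paper summary does not specify them):

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0, warmup=0):
    """Cosine annealing from lr_max to lr_min, with optional linear warmup."""
    if step < warmup:
        return lr_max * (step + 1) / warmup      # linear warmup phase
    t = (step - warmup) / max(1, total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```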
3. Zero-Shot Forecasting Mechanism
TSM is a zero-shot forecaster: for any new univariate series, the model can be applied without further fine-tuning. The process involves extracting the last context window at both coarse and fine resolutions and normalizing each by its corresponding mean and standard deviation; forecasts are produced as $\hat{y} = \sigma\, f_\theta(\tilde{x}^{(c)}, \tilde{x}^{(f)}) + \mu$, where $f_\theta$ is the trained TSM mapping and $\mu$, $\sigma$ are the fine-stream statistics. Multistep autoregression proceeds as described above; no recalibration or adaptation to the new series is performed at inference. A plausible implication is that TSM’s generalization largely derives from the breadth and balancing of its training distribution.
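The zero-shot usage pattern can be sketched end to end with a stand-in `model` callable. De-normalizing by the fine-window statistics and deriving the coarse view by mean aggregation are assumptions consistent with the description above.

```python
import numpy as np

def zero_shot_forecast(model, history, r, ctx_fine, ctx_coarse):
    """Extract the last fine/coarse context windows, z-normalize each stream
    by its own statistics, forecast, and de-normalize with the fine-window
    mean and standard deviation (assumed scheme)."""
    n = len(history) // r
    coarse_full = history[: n * r].reshape(n, r).mean(axis=1)
    fine = history[-ctx_fine:]
    coarse = coarse_full[-ctx_coarse:]
    mu_f, sd_f = fine.mean(), fine.std() + 1e-8
    mu_c, sd_c = coarse.mean(), coarse.std() + 1e-8
    pred = model((coarse - mu_c) / sd_c, (fine - mu_f) / sd_f)
    return pred * sd_f + mu_f   # back to the original data scale
```

No per-series adaptation occurs: the model parameters are frozen, and only the context statistics change between series.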
4. Quantitative and Qualitative Evaluation
4.1 Observability Benchmarks
TSM was evaluated on 1-min and 5-min observability series, benchmarked against a Naive baseline and TimesFM-2.5. For context length 512 (horizon not specified):
| Metric | TSM | TimesFM-2.5 |
|---|---|---|
| MSE | 0.8524 | 0.8838 |
| MAE | 0.4788 | 0.6265 |
| MASE | 0.4569 | 0.7290 |
| sMAPE | 0.7758 | 0.8297 |
| MSIS | 0.1207 | 0.1732 |
| CRPS | 0.4126 | 0.5089 |
These results confirm that TSM delivers superior performance across all tabulated metrics. Gains persist in 5-min data and at longer context lengths (e.g., 1024), where TSM continues to outperform or match baselines.
4.2 General-Purpose (GIFT-Eval) Benchmarks
On GIFT-Eval, with public “leaked” datasets removed, TSM (512+512 context) achieves:
- MSE: 0.5423 (TSM) vs. 0.5111 (TimesFM-2.5, 1024 context)
- MAE: 0.6980 vs. 0.6635
- MASE: 0.7365 vs. 0.6828
- sMAPE: 1.1053 vs. 0.9416
- MSIS: 0.5649 vs. 0.5230
- CRPS: 0.5508 vs. 0.5247
TSM is highly competitive on general-purpose benchmarks, showing only slight degradation relative to general-purpose specialist models, while clearly excelling in the observability domain.
4.3 Ablation Studies
Ablation experiments assessed architectural components at two scales (10B, 35B samples). Four variants—concatenation (CONCAT), resolution embeddings only (RE), ST only, RE+ST—were compared. On 10B, normalized MAE (observability, 1-min) results:
- CONCAT: 0.5144
- RE+ST: 0.5246
- ST only: 1.3361
- RE only: 0.5227
At larger scale (35B), architectures with both RE and ST converge faster and match or slightly exceed the alternatives. ST-only and RE-only formulations showed instability at scale.
5. Qualitative Analysis and Representative Scenarios
TSM’s multiresolution approach reveals enhanced extrapolation ability on long-context scenarios:
- Long-coarse context with low error (“long-low” quadrant): Access to extended coarse histories allows TSM to resolve seasonal or sawtooth temporal phenomena that are otherwise ambiguous from fine-grained context alone.
- Periodic series and multi-regime behavior: Longer history yields phase-locked forecasts and correct regime selection in cases where short-term and long-term trends conflict.
- Sawtooth and complex dynamics: Hour-scale structure, visible only in coarse context, enables correct frequency modeling.
- Limitations remain for cases with padding-heavy coarse context and abrupt transitions, as demonstrated by residual high error in corresponding case studies.
6. Limitations and Prospects for Extension
Several architectural and procedural limitations are identified:
- The fixed insertion point of the special token (ST) restricts positional flexibility; extensions to variable-length or hierarchical block insertions could be explored.
- TSM currently supports only two input resolutions; generalizing to multiple hierarchical resolutions would require multiple STs and REs, forming an area for future investigation.
- The autoregressive multistep decoding mechanism is susceptible to bias accumulation across long horizons; non-autoregressive or partially parallel decoding may mitigate drift.
- The quantile-regression output head provides coarse uncertainty estimation; adoption of more expressive probabilistic heads (e.g., mixture models) could improve uncertainty quantification.
In aggregate, TSM demonstrates that decoder-only multiresolution transformers, when trained on a carefully curated, large-scale, diverse corpus, can deliver state-of-the-art, zero-shot forecasts in observability and competitive accuracy in broad forecasting tasks. The architecture, preprocessing strategies, and ablation evidence provide concrete guidelines for adapting transformer-based time series foundation models (TSFMs) to settings with long-range dependencies and multiscale structure (Gou et al., 25 Nov 2025).