
Toto: Time Series Optimized Transformer for Observability (2407.07874v2)

Published 10 Jul 2024 in cs.LG and cs.AI

Abstract: This technical report describes the Time Series Optimized Transformer for Observability (Toto), a new state of the art foundation model for time series forecasting developed by Datadog. In addition to advancing the state of the art on generalized time series benchmarks in domains such as electricity and weather, this model is the first general-purpose time series forecasting foundation model to be specifically tuned for observability metrics. Toto was trained on a dataset of one trillion time series data points, the largest among all currently published time series foundation models. Alongside publicly available time series datasets, 75% of the data used to train Toto consists of fully anonymous numerical metric data points from the Datadog platform. In our experiments, Toto outperforms existing time series foundation models on observability data. It does this while also excelling at general-purpose forecasting tasks, achieving state-of-the-art zero-shot performance on multiple open benchmark datasets.

Summary

  • The paper introduces Toto’s novel proportional factorized space-time attention mechanism that efficiently captures both spatial correlations and temporal dependencies.
  • It employs a Student-T mixture model head to accurately model heavy-tailed distributions and outliers in complex observability metrics.
  • Empirical results on multiple benchmarks demonstrate that Toto significantly reduces forecast errors compared to state-of-the-art models.

Insights on Toto: Time Series Optimized Transformer for Observability

The paper "Toto: Time Series Optimized Transformer for Observability" introduces the Toto model, a cutting-edge transformer-based foundation model developed by Datadog specifically for time series forecasting. Toto distinguishes itself through its strong focus on observability data, an area with unique challenges including high time resolution, sparsity, extreme skewness, and nonstationarity. This essay will discuss essential aspects of Toto's design, its performance on various benchmarks, and potential implications for future AI research.

Contributions and Novel Mechanisms

The paper highlights several key contributions of the Toto model:

  1. Proportional Factorized Space-Time Attention: This attention mechanism handles multivariate time series efficiently by alternating between space-wise and time-wise attention blocks, with a configurable ratio between the two. The design lets Toto capture both spatial interactions across different metrics and temporal dependencies within each metric (see the first sketch after this list).
  2. Student-T Mixture Model Head: Toto uses a Student-T mixture model head, a probabilistic output layer that generalizes the traditional Gaussian mixture model. This lets Toto robustly capture time series with heavy tails and outliers, yielding more accurate probabilistic forecasts (see the second sketch after this list).
  3. Domain-Specific Training Data: A significant portion of the training data comprises fully anonymized observability metrics from the Datadog platform. This targeted training endows Toto with enhanced capabilities for forecasting observability metrics, which are often characterized by high cardinality, nonstationarity, and diverse distributions.
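
To make the alternating structure concrete, here is a minimal PyTorch sketch of a factorized space-time attention stack. The `AxialAttention` class, the `time_per_space` ratio parameter, and the block layout are illustrative assumptions; the paper specifies only that time-wise and space-wise attention blocks alternate in a configurable proportion.

```python
# A minimal sketch of factorized space-time attention, assuming a
# (batch, variates, time, dim) input layout. Class and parameter names
# are hypothetical; only the alternating time-/space-wise scheme with a
# configurable ratio comes from the paper.
import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    """Self-attention applied along one axis of a (batch, variate, time, dim) tensor."""

    def __init__(self, dim: int, num_heads: int, axis: str):
        super().__init__()
        assert axis in ("time", "space")
        self.axis = axis
        # the paper pairs attention with pre-norm RMSNorm; LayerNorm keeps this sketch short
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, v, t, d = x.shape
        if self.axis == "time":   # attend across time steps, separately per variate
            seq = x.reshape(b * v, t, d)
        else:                     # attend across variates, separately per time step
            seq = x.transpose(1, 2).reshape(b * t, v, d)
        h = self.norm(seq)
        h, _ = self.attn(h, h, h, need_weights=False)
        seq = seq + h             # residual connection
        if self.axis == "time":
            return seq.reshape(b, v, t, d)
        return seq.reshape(b, t, v, d).transpose(1, 2)


def build_blocks(dim: int, heads: int, depth: int, time_per_space: int = 3) -> nn.Sequential:
    """Interleave time-wise and space-wise blocks in a configurable proportion."""
    blocks = [
        AxialAttention(dim, heads, "space" if (i + 1) % (time_per_space + 1) == 0 else "time")
        for i in range(depth)
    ]
    return nn.Sequential(*blocks)


stack = build_blocks(dim=64, heads=4, depth=8)
x = torch.randn(2, 5, 32, 64)     # (batch, variates, time steps, embedding dim)
print(stack(x).shape)             # torch.Size([2, 5, 32, 64])
```

One plausible reading of the "proportional" design is that keeping most blocks time-wise reflects the dominance of temporal structure, while the occasional space-wise block mixes information across metrics at lower cost than full joint attention.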
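The second sketch illustrates a Student-T mixture output head using `torch.distributions`. The projection layout and the constraints on the degrees of freedom are assumptions; the paper specifies only that the output distribution is a mixture of Student-T components.

```python
# Sketch of a Student-T mixture head: the decoder state is projected to
# mixture weights, degrees of freedom, locations, and scales. The layer
# layout and parameter constraints are assumptions, not taken from the paper.
import torch
import torch.nn as nn
from torch.distributions import Categorical, MixtureSameFamily, StudentT


class StudentTMixtureHead(nn.Module):
    def __init__(self, dim: int, num_components: int = 4):
        super().__init__()
        # one slice per mixture parameter: weight logits, df, loc, scale
        self.proj = nn.Linear(dim, 4 * num_components)

    def forward(self, h: torch.Tensor) -> MixtureSameFamily:
        logits, df, loc, scale = self.proj(h).chunk(4, dim=-1)
        df = 2.0 + nn.functional.softplus(df)         # df > 2 keeps the variance finite
        scale = nn.functional.softplus(scale) + 1e-6  # scales must be positive
        return MixtureSameFamily(Categorical(logits=logits), StudentT(df, loc, scale))


head = StudentTMixtureHead(dim=64)
h = torch.randn(2, 32, 64)          # (batch, time, dim) decoder states
dist = head(h)
y = torch.randn(2, 32)              # observed next-step values
loss = -dist.log_prob(y).mean()     # negative log-likelihood training objective
print(loss.item(), dist.sample().shape)
```

Because Student-T tails decay polynomially rather than exponentially, a single extreme observation incurs a bounded log-likelihood penalty, which is what makes this head robust to the outliers common in observability data.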

Model Architecture and Mechanisms

Toto is a decoder-only transformer model that integrates modern design techniques from LLMs (these components are sketched after the list below):

  • Pre-normalization with RMSNorm is applied to stabilize training.
  • SwiGLU feed-forward layers are used to improve efficiency.
  • Patch embedding, a method adapted from Vision Transformers, converts fixed-length windows of the multivariate series into input tokens.
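
For readers unfamiliar with these components, below is a minimal sketch of RMSNorm, a SwiGLU feed-forward block, and non-overlapping patch embedding. The dimensions and the hidden-size convention are common LLM defaults, not values from the paper.

```python
# Minimal sketch of the sub-layers named above: RMSNorm, a SwiGLU
# feed-forward block, and non-overlapping patch embedding. Hidden sizes
# and the 2/3 expansion convention are common LLM defaults, not values
# taken from the paper.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS, no mean subtraction."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """Feed-forward block with a SiLU-gated linear unit."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))


dim = 64
x = torch.randn(2, 256)                        # (batch, time) univariate series

# Patch embedding: split the series into fixed-length patches, project to dim.
patch_len = 8
patches = x.unfold(1, patch_len, patch_len)    # (2, 32, 8) non-overlapping patches
tokens = nn.Linear(patch_len, dim)(patches)    # (2, 32, 64) input tokens

# Pre-normalization: normalize *before* the sub-layer, then add the residual.
norm, ffn = RMSNorm(dim), SwiGLU(dim, hidden=4 * dim * 2 // 3)
out = tokens + ffn(norm(tokens))               # one pre-norm feed-forward sub-layer
print(out.shape)                               # torch.Size([2, 32, 64])
```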

Experimental Results

Performance on LSF Benchmarks

Toto demonstrates outstanding performance across the Long Sequence Forecasting (LSF) benchmark datasets (ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather). Specifically:

  • ETTh1: Toto achieves an MAE of 0.389 and an MSE of 0.363, outperforming other zero-shot models, including the Moirai series and TimesFM.
  • Electricity: Toto excels with an MAE of 0.246 and an MSE of 0.157, leading a competitive field that includes models such as iTransformer and TimesNet.

On average, Toto achieves superior results across the six LSF datasets, with consistently lower MAE and MSE values compared to other state-of-the-art models in both zero-shot and full-shot scenarios.
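
For reference, the two metrics quoted above are the standard mean absolute error and mean squared error, shown in the NumPy sketch below (LSF benchmarks typically compute them on standardized data, which is not shown here).

```python
import numpy as np


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error: average magnitude of the forecast errors."""
    return float(np.mean(np.abs(y_true - y_pred)))


def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error: penalizes large errors quadratically."""
    return float(np.mean((y_true - y_pred) ** 2))


y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.3])
print(mae(y_true, y_pred))  # 0.2
print(mse(y_true, y_pred))  # ~0.0467
```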

Performance on Datadog Benchmark

In the Datadog-specific benchmark:

  • Toto achieves a symmetric mean absolute percentage error (sMAPE) of 0.672 and a symmetric median absolute percentage error (sMdAPE) of 0.318, outperforming models such as Chronos-T5 and Moirai (both metrics are defined in the sketch below).
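
Here is a sketch of these two metrics under one common convention; the factor of 2 and the epsilon guard are assumptions, and the paper may use a different scaling.

```python
import numpy as np


def smape_terms(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    # Elementwise symmetric absolute percentage errors. The factor of 2 and
    # the epsilon guard follow one common convention, not the paper.
    denom = np.abs(y_true) + np.abs(y_pred) + 1e-8
    return 2.0 * np.abs(y_true - y_pred) / denom


def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(smape_terms(y_true, y_pred)))


def smdape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Median instead of mean: robust to the extreme errors common in
    # heavy-tailed observability metrics.
    return float(np.median(smape_terms(y_true, y_pred)))


y_true = np.array([100.0, 0.0, 50.0])
y_pred = np.array([110.0, 5.0, 48.0])
print(smape(y_true, y_pred), smdape(y_true, y_pred))
```

The gap between the reported sMAPE (0.672) and sMdAPE (0.318) is itself informative: a median far below the mean indicates that a minority of very hard series dominates the average error.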

The diverse and complex nature of observability metrics in the Datadog dataset underscores the robustness and adaptability of Toto. The results suggest that training on relevant, domain-specific datasets significantly enhances the model's performance.

Implications and Future Directions

The success of Toto in handling the intricate characteristics of observability metrics marks a significant advancement in the field. Its design innovations, such as the proportional factorized space-time attention mechanism and the Student-T mixture model head, set a new standard for time series forecasting.

Future research could explore:

  • Multi-modal inputs: Integrating auxiliary data sources such as query metadata and captions could further refine forecast accuracy.
  • Autonomous troubleshooting agents: Enhancing Datadog's AI agents with time series-specific knowledge for better troubleshooting and incident response capabilities.
  • Conversational interfaces: Aligning time series models with LLMs to develop conversational agents for interpreting time series data.
  • Scaling and optimizations: Introducing new types of embeddings, refining attention mechanisms, and optimizing variate groupings to capture richer time series interactions.

Conclusion

Toto exemplifies state-of-the-art advancements in time series forecasting, particularly for observability metrics. The combination of innovative architectural choices and targeted training data enables Toto to deliver leading performance on both observability and general-purpose benchmarks. As the field progresses, the future directions outlined above will likely yield even more powerful and efficient models for time series analysis across domains.
