- The paper introduces a novel anomaly-attention mechanism that exploits association discrepancy to differentiate anomalies from regular time points.
- It integrates a learnable Gaussian kernel for prior-association and refines series-association via self-attention weights to improve detection accuracy.
- Empirical results on six benchmarks, including SMAP and PSM, show state-of-the-art precision, recall, and F1-scores across diverse anomaly types.
The paper introduces the Anomaly Transformer, a method for unsupervised time series anomaly detection that leverages the Transformer's strength in modeling temporal associations. The cornerstone of the method is the identification and use of association discrepancy as the criterion for distinguishing normal time points from anomalies.
Traditionally, unsupervised anomaly detection in time series has been hampered by the difficulty of deriving an informative, distinguishable criterion from complex temporal dynamics. Classical approaches such as the local outlier factor (LOF) or one-class SVM largely ignore temporal structure, limiting their effectiveness in real-world scenarios. Deep learning models have advanced the field through representation learning, notably reconstruction-based and autoregression-based paradigms. However, these rely mainly on pointwise representations and pointwise reconstruction or prediction errors, which offer limited contextual information and struggle to single out rare anomalies.
In contrast, the Anomaly Transformer employs a novel anomaly-attention mechanism designed to expose the difference between the associations formed by anomalous and normal time points. The paper argues that, because anomalies are rare, they struggle to build informative associations with the whole series and instead concentrate their associations on adjacent time points. This gap between a point's learned global associations and its adjacency-concentrated ones is formalized as the "association discrepancy," which serves as a new anomaly detection criterion.
Technically, the Anomaly Transformer replaces standard self-attention with anomaly-attention, which computes two association maps in each layer: a prior-association and a series-association. The prior-association is modeled by a learnable Gaussian kernel over relative temporal distance, capturing the adjacency bias expected of anomalies, while the series-association is taken from the learned self-attention weights and reflects each time point's broader association profile.
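To make the two branches concrete, the following is a minimal, single-head PyTorch-style sketch of anomaly-attention. The class name, tensor shapes, and the per-position learnable scale produced by the `sigma` projection are illustrative assumptions that follow the paper's description, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnomalyAttention(nn.Module):
    """Single-head sketch of anomaly-attention (illustrative, not the official code).

    For a length-L window it produces two row-stochastic association maps:
      * prior:  a Gaussian kernel over relative distance |i - j| with a learnable,
                per-position scale sigma, capturing the adjacency bias.
      * series: ordinary softmax self-attention weights, i.e. the learned global
                association profile of each time point.
    """

    def __init__(self, d_model: int, win_len: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.sigma = nn.Linear(d_model, 1)  # learnable per-position Gaussian scale
        idx = torch.arange(win_len)
        # fixed |i - j| distance matrix for the window
        self.register_buffer("dist", (idx[None, :] - idx[:, None]).abs().float())

    def forward(self, x: torch.Tensor):
        # x: (batch, L, d_model)
        B, L, D = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # series-association: standard scaled dot-product attention weights
        scores = q @ k.transpose(-2, -1) / D ** 0.5        # (B, L, L)
        series = F.softmax(scores, dim=-1)

        # prior-association: learnable Gaussian kernel over temporal distance
        sigma = F.softplus(self.sigma(x)) + 1e-5           # (B, L, 1), strictly positive
        gauss = torch.exp(-self.dist[None] ** 2 / (2 * sigma ** 2))
        prior = gauss / gauss.sum(dim=-1, keepdim=True)    # row-normalized distribution

        out = series @ v                                   # feeds the reconstruction branch
        return out, prior, series
```

Row-normalizing the Gaussian kernel keeps the prior-association a proper distribution over time points, making it directly comparable to the softmax series-association in the divergence described next.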
The model incorporates a minimax strategy to amplify the discriminative power of the association discrepancy, which is quantified as the symmetrized KL divergence between the prior- and series-associations. In the minimize phase the prior-association is driven toward the observed series-association, while in the maximize phase the series-association is pushed to enlarge the discrepancy by attending beyond adjacent points; this is easy for normal points but difficult for anomalies, whose reconstruction depends on their immediate neighborhood, so the gap between the two becomes more pronounced.
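The discrepancy and the two-phase objective can be sketched as follows. The function names, the weighting hyperparameter `lam`, and the exact placement of the stop-gradients are assumptions intended to mirror the described strategy (the minimize phase adapts the prior toward the observed series-association; the maximize phase pushes the series-association to enlarge the discrepancy), rather than a verbatim reproduction of the training loop.

```python
import torch
import torch.nn.functional as F


def sym_kl(p: torch.Tensor, s: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Symmetrized KL divergence between two row-stochastic association maps.

    p, s: (batch, L, L). Returns a per-point discrepancy of shape (batch, L):
    KL(p || s) + KL(s || p), summed over the association dimension.
    """
    kl_ps = (p * (torch.log(p + eps) - torch.log(s + eps))).sum(dim=-1)
    kl_sp = (s * (torch.log(s + eps) - torch.log(p + eps))).sum(dim=-1)
    return kl_ps + kl_sp


def minimax_losses(x, x_rec, prior, series, lam: float = 3.0):
    """Two objectives of the minimax association-discrepancy strategy (sketch).

    * minimize phase: reconstruction error plus lam * discrepancy, with the
      series-association detached so only the prior (its Gaussian scale) adapts
      toward the observed associations.
    * maximize phase: reconstruction error minus lam * discrepancy, with the
      prior detached so the series-association is pushed away from the
      adjacency-biased prior (easy for normal points, hard for anomalies).
    """
    rec = F.mse_loss(x_rec, x)
    loss_min = rec + lam * sym_kl(prior, series.detach()).mean()  # update prior branch
    loss_max = rec - lam * sym_kl(prior.detach(), series).mean()  # update series branch
    return loss_min, loss_max
```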
The empirical validation spans six standard benchmarks drawn from applications such as server monitoring, space exploration telemetry, and water treatment facilities. The Anomaly Transformer outperforms a broad set of prior methods, with consistent improvements in precision, recall, and F1-score. For instance, on SMAP and PSM it achieves F1-scores of 96.69% and 97.89%, respectively.
Furthermore, the Anomaly Transformer is robust across anomaly types, as validated on the NeurIPS-TS benchmark, where it detects point-global, point-contextual, pattern-shapelet, pattern-seasonal, and pattern-trend anomalies more accurately than the compared baselines.
In conclusion, the Anomaly Transformer's ability to model and exploit association discrepancy marks a significant advance in unsupervised time series anomaly detection. The approach sets a precedent for using the Transformer's capacity for temporal and relational modeling, with promising directions in improving computational efficiency and in theoretical analyses that connect the association discrepancy to classical paradigms such as autoregression, which would strengthen the model's interpretability and support deployment across broader domains.