Anomaly Detection using Autoencoders in High Performance Computing Systems (1811.05269v1)

Published 13 Nov 2018 in cs.LG and cs.AI

Abstract: Anomaly detection in supercomputers is a very difficult problem due to the large scale of the systems and the high number of components. The current state of the art for automated anomaly detection employs Machine Learning methods or statistical regression models in a supervised fashion, meaning that the detection tool is trained to distinguish among a fixed set of behaviour classes (healthy and unhealthy states). We propose a novel approach for anomaly detection in High Performance Computing systems based on a Machine (Deep) Learning technique, namely a type of neural network called autoencoder. The key idea is to train a set of autoencoders to learn the normal (healthy) behaviour of the supercomputer nodes and, after training, use them to identify abnormal conditions. This is different from previous approaches, which were based on learning the abnormal condition, for which there are much smaller datasets (since it is very hard to identify them to begin with). We test our approach on a real supercomputer equipped with a fine-grained, scalable monitoring infrastructure that can provide a large amount of data to characterize the system behaviour. The results are extremely promising: after the training phase to learn the normal system behaviour, our method is capable of detecting anomalies that have never been seen before with very good accuracy (values ranging between 88% and 96%).

Citations (175)

Summary

  • The paper proposes a semi-supervised anomaly detection method for High Performance Computing systems using autoencoders to overcome the limitations of traditional supervised methods that require extensive labeled anomaly data.
  • The method was validated on the D.A.V.I.D.E. supercomputer, achieving detection accuracy between 88% and 96% and demonstrating effectiveness against anomalies not seen during training.
  • This approach provides a scalable and adaptable way for system administrators to improve system resilience, reduce downtime, and better maintain operational fidelity in complex HPC environments.

Anomaly Detection Using Autoencoders in High Performance Computing Systems

This paper, authored by Andrea Borghesi et al., explores the challenge of anomaly detection in High Performance Computing (HPC) systems, an issue of increasing relevance given the growing complexity and scale of such systems. The authors propose an anomaly detection approach leveraging autoencoders, a type of neural network well suited to unsupervised learning tasks, to address the limitations of existing supervised machine learning methods.

Core Concepts and Methodology

Traditional anomaly detection in HPC relies heavily on supervised models, which require labeled data to differentiate between normal and anomalous states. This is constrained by the scarcity of labeled anomaly data, since healthy behavior dominates a supercomputer's operational time. Borghesi and colleagues instead adopt a semi-supervised strategy: autoencoders model the normal behavior of computing nodes, and anomalies are identified through elevated reconstruction error.

The methodology involves training autoencoders on normal system behavior, enabling the model to learn the typical distributions and correlations within the input features. Post-training, the model's inability to reconstruct anomalous data with the same precision as normal inputs is leveraged to flag anomalies. This approach does not require the generation of labeled datasets for anomalies, thereby addressing a significant challenge of supervised approaches.
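To make the detection logic concrete, the following sketch trains an under-complete autoencoder on healthy samples only and flags any input whose reconstruction error exceeds a threshold derived from the normal-data error distribution. It is not the authors' exact architecture or feature set: the feature count, layer sizes, placeholder data, and 97th-percentile threshold rule are illustrative assumptions.

    # Minimal autoencoder-based anomaly detector, sketched with Keras.
    # Feature count, layer sizes, and the percentile threshold are assumptions
    # made for illustration, not values taken from the paper.
    import numpy as np
    from tensorflow import keras

    n_features = 166                                    # per-node monitoring metrics (illustrative)
    normal_data = np.random.rand(10000, n_features).astype("float32")  # stand-in for healthy samples

    # Under-complete autoencoder: compress the input, then reconstruct it.
    autoencoder = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),           # encoder
        keras.layers.Dense(n_features, activation="linear"), # decoder
    ])
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(normal_data, normal_data, epochs=20, batch_size=64, verbose=0)

    def reconstruction_error(model, x):
        """Mean squared reconstruction error per sample."""
        x_hat = model.predict(x, verbose=0)
        return np.mean((x - x_hat) ** 2, axis=1)

    # Threshold taken from the error distribution on normal data; the exact
    # rule (here, a high percentile) is an assumption of this sketch.
    threshold = np.percentile(reconstruction_error(autoencoder, normal_data), 97)

    def is_anomalous(model, x):
        """Flag samples whose reconstruction error exceeds the threshold."""
        return reconstruction_error(model, x) > threshold

Because the model only ever sees healthy data during training, inputs it has never encountered tend to reconstruct poorly, which is what allows previously unseen anomaly types to be detected.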

Experimental Validation

The proposed method is empirically validated using data from the D.A.V.I.D.E. supercomputer, utilizing a custom monitoring infrastructure known as Examon. The experiments focused on anomalies arising from CPU frequency governor misconfigurations, including both power-saving and performance-centric anomalies, as these configurations significantly alter typical node performance metrics.

The results indicate high detection quality, with accuracy ranging from 88% to 96% across different experimental setups. Notably, the approach proved effective at recognizing anomalies not encountered during the training phase, underscoring the robustness of the semi-supervised methodology.
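As a hedged illustration of how such accuracy figures are computed once a threshold is fixed, the snippet below scores thresholded reconstruction errors against ground-truth labels. The error distributions are synthetic and only demonstrate the bookkeeping, not the paper's data or its exact evaluation protocol.

    # Scoring detection accuracy from per-sample reconstruction errors.
    # The synthetic error distributions below are purely illustrative.
    import numpy as np

    def detection_accuracy(errors, labels, threshold):
        """labels: 1 = anomalous sample, 0 = normal sample."""
        predictions = (errors > threshold).astype(int)
        return float(np.mean(predictions == labels))

    rng = np.random.default_rng(0)
    errors = np.concatenate([rng.normal(0.02, 0.01, 900),   # healthy samples: low error
                             rng.normal(0.10, 0.03, 100)])  # anomalous samples: higher error
    labels = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
    print(f"accuracy = {detection_accuracy(errors, labels, threshold=0.05):.2%}")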

Implications and Future Directions

This research presents substantive implications for the management of supercomputers and data centers. By utilizing autoencoders for anomaly detection, system administrators can better maintain operational fidelity, enhance system resilience, and potentially alleviate downtime-related financial impacts. Furthermore, the reduced reliance on labeled datasets makes this approach more scalable and adaptable to ongoing advancements in HPC systems.

The paper also hints at the broader potential applications of this methodology beyond the tested anomaly types, suggesting its applicability to a wide array of unexpected system states. Future work is directed towards extending the approach to classify the nature of detected anomalies, potentially incorporating more complex network architectures or hybrid models to fine-tune classification accuracy.

Conclusion

Using autoencoders for anomaly detection within HPC systems marks a meaningful step away from the constraints of supervised learning. Borghesi et al.'s work illustrates the practicality and effectiveness of semi-supervised learning in high-stakes environments, laying the groundwork for future advancements in real-time anomaly detection and classification in the field of supercomputing. As AI technology continues to evolve, methodologies such as this will likely become integral to the maintenance and optimization of next-generation computing infrastructures.