Argos: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models (2501.14170v1)

Published 24 Jan 2025 in cs.LG, cs.DC, and cs.MA

Abstract: Observability in cloud infrastructure is critical for service providers, driving the widespread adoption of anomaly detection systems for monitoring metrics. However, existing systems often struggle to simultaneously achieve explainability, reproducibility, and autonomy, which are three indispensable properties for production use. We introduce Argos, an agentic system for detecting time-series anomalies in cloud infrastructure by leveraging LLMs. Argos proposes to use explainable and reproducible anomaly rules as intermediate representation and employs LLMs to autonomously generate such rules. The system will efficiently train error-free and accuracy-guaranteed anomaly rules through multiple collaborative agents and deploy the trained rules for low-cost online anomaly detection. Through evaluation results, we demonstrate that Argos outperforms state-of-the-art methods, increasing $F_1$ scores by up to $9.5\%$ and $28.3\%$ on public anomaly detection datasets and an internal dataset collected from Microsoft, respectively.

Summary

The paper introduces a novel agent-based anomaly detection system that employs LLMs for autonomous rule generation on time-series data.
It leverages a multi-stage pipeline with Detection, Repair, and Review Agents to ensure explainable, reproducible, and accurate rule generation.
Evaluation on public and internal datasets shows F1 score improvements of up to 9.5% and 28.3%, respectively, over state-of-the-art methods.

Argos: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via LLMs

Introduction

The paper introduces Argos, an innovative agentic system for anomaly detection in time-series data within cloud infrastructures, employing LLMs for autonomous rule generation. Argos is designed to enhance anomaly detection systems by ensuring explainability, reproducibility, and autonomy, which are often not simultaneously achieved by existing approaches.

System Design and Architecture

Argos leverages a multi-stage design, comprising data preprocessing, rule training, and deployment phases. The key components of Argos are:

Data Preprocessor: Scales, indexes, and tokenizes input data for efficient processing within the context of time-series anomaly detection.
Training Engine: Implements an agent-based pipeline with Detection, Repair, and Review Agents, ensuring the generation of syntactically correct and accurate anomaly detection rules.

- Detection Agent: Proposes rules in Python based on input data.
- Repair Agent: Corrects syntax errors in proposed rules.
- Review Agent: Evaluates and iterates rules to improve accuracy.

Deployment Components: Include an Anomaly Detector and Aggregator, combining outputs from both base detectors and LLM-generated rules to ensure accuracy and resource efficiency.
Figure 1: The overall design of Argos.
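
The preprocessing step can be pictured with a small sketch. This is an illustration only, not the paper's implementation: the function name `preprocess`, the min-max scaling to integers, the `index,value` text format, and the chunking are all assumptions chosen to show what "scale, index, and tokenize" could look like before a series is handed to an LLM.

```python
def preprocess(values, scale=1000, chunk_size=4):
    """Hypothetical preprocessor: scale, index, and serialize a series as text chunks."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # a constant series would otherwise divide by zero
    # Scale to integers in [0, scale] so each value takes few characters in a prompt.
    scaled = [round((v - lo) / span * scale) for v in values]
    # Attach the position of every point so a rule can refer to it.
    lines = [f"{i},{s}" for i, s in enumerate(scaled)]
    # Split into fixed-size chunks that fit a limited context window.
    return ["\n".join(lines[i:i + chunk_size]) for i in range(0, len(lines), chunk_size)]
```

Integer scaling and chunking are common ways to keep prompts short; whether Argos uses these exact choices is not stated in this summary.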

Autonomous Rule Generation

Argos distinguishes itself through autonomous rule generation via LLMs. The Detection Agent generates executable Python code for anomaly detection rules, bridging the gap between domain-specific expertise and machine-generated logic. Existing LLM techniques are integrated to ensure rules that are both explainable and reproducible, while maintaining the adaptability of the system to varying anomaly patterns.
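
To make the idea of an anomaly rule concrete, the following is a hand-written example of the kind of Python rule such a system might emit. It is not taken from the paper: the function name `detect_anomalies`, the trailing-window statistic, and the parameters `window` and `k` are assumptions for illustration.

```python
from statistics import mean, pstdev

def detect_anomalies(values, window=5, k=3.0):
    """Hypothetical rule: flag points far from the mean of the preceding window."""
    flags = [False] * len(values)
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # A flat history makes any change a deviation.
            flags[i] = values[i] != mu
        else:
            flags[i] = abs(values[i] - mu) > k * sigma
    return flags
```

For the series `[1, 1, 1, 1, 1, 10, 1, 1]` this rule flags only index 5. A rule of this form is explainable because the condition can be read directly, and reproducible because it is deterministic code rather than a fresh LLM call per data point.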

Correctness and Accuracy

Argos iterates between the Repair Agent, which corrects rules that contain syntax errors, and the Review Agent, which evaluates rule accuracy, so that each rule is both executable and progressively more accurate. This feedback-driven refinement is inspired by backpropagation: errors and evaluation results flow back into the next revision of the rule.
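
The feedback loop can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's training procedure: `propose`, `repair`, and `score` are hypothetical callables standing in for the Detection Agent, the Repair Agent, and the accuracy measurement used by the Review Agent, and the acceptance policy (keep a rule only if it scores higher than the best so far) is an assumption.

```python
import ast

def syntax_error(rule_src):
    """Return a short error description, or None if the source parses."""
    try:
        ast.parse(rule_src)
        return None
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"

def train_rule(propose, repair, score, max_iters=5, max_repairs=3):
    """Hypothetical loop: propose a rule, repair it until it parses, keep it if it scores better."""
    best_src, best_score = None, float("-inf")
    feedback = ""
    for _ in range(max_iters):
        src = propose(best_src, feedback)       # stand-in for the Detection Agent
        err = syntax_error(src)
        attempts = 0
        while err is not None and attempts < max_repairs:
            src = repair(src, err)              # stand-in for the Repair Agent
            err = syntax_error(src)
            attempts += 1
        if err is not None:
            feedback = f"rule could not be repaired: {err}"
            continue
        s = score(src)                          # accuracy check used for review
        if s > best_score:
            best_src, best_score = src, s
            feedback = f"accepted, score {s:.3f}"
        else:
            feedback = f"rejected, score {s:.3f} did not beat {best_score:.3f}"
    return best_src, best_score
```

The sketch checks only that a rule parses; a real system would also need to catch runtime errors when the rule is executed on data.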

Model Fusion for Accuracy Guarantee

The model fusion strategy in Argos combines LLM-generated rules with existing well-tuned anomaly detectors. The goal is that the fused system performs at least as well as the existing detectors alone, and often better.
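
One simple way to obtain such a guarantee on held-out data is to choose among several combinations and always keep the base detector as a candidate. The sketch below illustrates that idea and is not the paper's fusion algorithm: the combiner names, `combine`, `select_fusion`, and the `metric` callable are assumptions.

```python
# Hypothetical ways to merge a base detector's flag with a rule's flag.
COMBINERS = {
    "base": lambda b, r: b,
    "rule": lambda b, r: r,
    "union": lambda b, r: b or r,
    "intersection": lambda b, r: b and r,
}

def combine(name, base_flags, rule_flags):
    """Apply the named combiner point by point."""
    op = COMBINERS[name]
    return [op(b, r) for b, r in zip(base_flags, rule_flags)]

def select_fusion(base_flags, rule_flags, labels, metric):
    """Pick the combiner with the best validation score, defaulting to the base detector."""
    best_name = "base"
    best_score = metric(combine("base", base_flags, rule_flags), labels)
    for name in COMBINERS:
        score = metric(combine(name, base_flags, rule_flags), labels)
        if score > best_score:  # strict comparison keeps the base detector on ties
            best_name, best_score = name, score
    return best_name, best_score
```

Because the base detector is always a candidate, the selected combination cannot score below it on the validation data. That guarantee does not automatically carry over to unseen data.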

Evaluation

Argos was evaluated on public datasets such as KPI and Yahoo, as well as an internal Microsoft dataset. The results show higher $F_1$ scores than state-of-the-art methods, with increases of up to $9.5\%$ on the public datasets and $28.3\%$ on the internal dataset. These evaluations underscore Argos' effectiveness in addressing the challenges of time-series anomaly detection.
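
For reference, the $F_1$ score is the harmonic mean of precision and recall. The sketch below computes the plain point-wise version. The paper's exact evaluation protocol (for example, whether detections are adjusted per anomaly segment) is not described in this summary, so treat this as the textbook definition only; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(pred, truth):
    """Point-wise F1 over two equal-length sequences of boolean flags."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # false alarms
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)  # missed anomalies
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With one true positive, one false alarm, and one missed anomaly, precision and recall are both 0.5, so the score is 0.5.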

Figure 2: Comparison of the correctness rate and average test F1 score of the Training Engine with only the Detection Agent versus full Training Engine with Repair and Review Agents.

Conclusion

Argos represents a substantial advancement in time-series anomaly detection, effectively addressing the triad of explainability, reproducibility, and autonomy. Through the autonomous generation of detection rules via LLMs, Argos provides an efficient, adaptable, and robust solution for anomaly detection in cloud infrastructures. The system's design ensures higher accuracy and efficiency, making it a valuable tool for enhancing the reliability of cloud services. Future directions may focus on expanding Argos’ applications to other domains and integrating more sophisticated model fusion techniques to further improve its performance.