- The survey shows that LLMs substantially enhance AIOps tasks by combining traditional system data with human-generated information for improved failure detection and diagnosis.
- It reviews prompt-based methods and fine-tuning techniques that extract structured insights from logs and other diverse data sources.
- It also identifies key challenges, such as high computational costs and integration complexity, pointing toward future optimizations in automated IT operations.
A Survey of AIOps in the Era of LLMs
Introduction
The emergence of LLMs, with their strong natural language processing capabilities, has drawn significant attention to their potential application in Artificial Intelligence for IT Operations (AIOps). This paper offers a comprehensive survey of how LLMs impact AIOps tasks, spanning failure detection, root cause analysis, and automated remediation.
The rise of LLMs has expanded the data sources used in AIOps. Traditional system-generated data, such as metrics, logs, and traces, are complemented by human-generated data like software documentation and incident reports. A notable advancement lies in processing traditional data sources: log parsing, for example, has been augmented with LLMs to produce structured representations of log data. Studies have demonstrated that LLMs can parse logs with high accuracy through methods such as prompt-based adaptive parsing, hierarchical candidate sampling, and in-context learning.
Figure 1: Log-based Failure Perception and Root Cause Analysis: The Common Workflow.
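To make the prompt-based parsing idea concrete, below is a minimal sketch of in-context log parsing, assuming a chat-completion client such as the OpenAI Python SDK; the demonstration logs, the model name, and the prompt wording are illustrative choices, not the exact method of any surveyed parser.

```python
# Minimal sketch of prompt-based log parsing with in-context learning.
# The demo logs, model name, and call pattern are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A few (raw log, template) demonstrations used as in-context examples.
DEMOS = [
    ("Connection from 10.0.0.5 closed after 120 ms",
     "Connection from <*> closed after <*> ms"),
    ("Disk /dev/sda1 usage at 91%",
     "Disk <*> usage at <*>%"),
]

def parse_log(raw_line: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM to abstract variable fields of a log line into a template."""
    examples = "\n".join(f"Log: {raw}\nTemplate: {tpl}" for raw, tpl in DEMOS)
    prompt = (
        "Replace the variable parts of the log with <*> placeholders.\n"
        f"{examples}\nLog: {raw_line}\nTemplate:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(parse_log("Connection from 192.168.1.7 closed after 45 ms"))
```

In practice, demonstration selection (for instance, the hierarchical candidate sampling mentioned above) determines which examples are included for each incoming line.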
Moreover, new data sources have been integrated, such as configuration and source code, enhancing failure diagnosis through deeper semantic understanding and automated code analysis. This expanded data usage not only enriches diagnostic capabilities but also leverages LLMs' pre-trained knowledge for efficient anomaly detection and failure analysis.
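As an illustration of how configuration or code changes can feed failure diagnosis, the snippet below packs a configuration diff and an error excerpt into a single diagnostic prompt; the diff format, field values, and prompt wording are assumptions made for this sketch.

```python
# Illustrative only: combining a configuration diff and an error log into one
# diagnostic prompt. The diff and error text are toy data.
def build_diagnosis_prompt(config_diff: str, error_snippet: str) -> str:
    return (
        "You are assisting with failure diagnosis.\n"
        "A deployment started failing after the configuration change below.\n\n"
        f"Configuration diff:\n{config_diff}\n\n"
        f"Observed error:\n{error_snippet}\n\n"
        "Explain the most likely root cause and point to the offending line."
    )

prompt = build_diagnosis_prompt(
    config_diff="- max_connections: 500\n+ max_connections: 50",
    error_snippet="ERROR: connection pool exhausted (50/50 in use)",
)
# The prompt can then be sent through the same chat-completion call
# used in the log-parsing sketch above.
```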
Evolving Tasks in AIOps with LLMs
LLMs have significantly transformed traditional AIOps tasks. While failure perception tasks such as detection and prediction continue to evolve, new opportunities have emerged in root cause analysis, where LLMs generate detailed root cause reports. These reports synthesize complex data into natural-language explanations, helping operators understand system failures more effectively than traditional methods, which often relied on predefined categorizations and simpler models.
Figure 2: Evolution of Root Cause Analysis with the Rise of LLMs.
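The sketch below illustrates this report-generation pattern under simple assumptions: heterogeneous evidence is collected into an incident record and flattened into a prompt that asks for a structured report. The Incident fields and the report sections are hypothetical, not a schema prescribed by the surveyed systems.

```python
# Sketch of LLM-driven root cause report generation: heterogeneous evidence is
# flattened into one prompt asking for a structured report.
from dataclasses import dataclass, field

@dataclass
class Incident:
    alert: str
    metric_anomalies: list[str] = field(default_factory=list)
    log_excerpts: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)

def build_rca_prompt(incident: Incident) -> str:
    evidence = "\n".join(
        ["Metric anomalies:"] + incident.metric_anomalies
        + ["Relevant logs:"] + incident.log_excerpts
        + ["Recent changes:"] + incident.recent_changes
    )
    return (
        f"Alert: {incident.alert}\n{evidence}\n\n"
        "Write a root cause report with three sections: "
        "Symptom, Probable Root Cause, and Supporting Evidence."
    )

prompt = build_rca_prompt(Incident(
    alert="High checkout latency",
    metric_anomalies=["p99 latency 4.2s (baseline 300ms)"],
    log_excerpts=["ERROR: connection pool exhausted"],
    recent_changes=["Lowered max_connections from 500 to 50"],
))
```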
In the domain of Assisted Remediation, LLMs facilitate automated script generation and command recommendation, streamlining the remediation process. LLMs can generate, validate, and even execute remediation scripts, significantly raising automation levels compared to earlier, manually intensive processes.
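A generate-then-validate loop of this kind might look like the following sketch, where the LLM call is abstracted behind a hypothetical call_llm function and validation is limited to a bash syntax check plus a human approval gate; real systems would add sandboxed dry runs and policy checks.

```python
# Sketch of a generate-then-validate remediation loop. call_llm is a
# hypothetical stand-in for any chat-completion client (str -> str).
import subprocess
import tempfile

def validate_script(script: str) -> bool:
    """Syntax-check the generated script with `bash -n` without executing it."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        path = f.name
    return subprocess.run(["bash", "-n", path]).returncode == 0

def remediate(diagnosis: str, call_llm) -> None:
    script = call_llm(
        f"Diagnosis: {diagnosis}\n"
        "Write a bash script that remediates this issue. Output only the script."
    )
    if not validate_script(script):
        print("Generated script failed syntax check; escalating to an operator.")
        return
    answer = input(f"Proposed remediation:\n{script}\nApprove execution? [y/N] ")
    if answer.strip().lower() == "y":
        subprocess.run(["bash", "-c", script])
```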
LLM-based Methods for AIOps
Various LLM-based methods have been developed, encompassing foundation models, fine-tuning approaches, and prompt-based methods. Foundation models provide pre-trained capabilities that can be adapted to specific AIOps tasks through full or parameter-efficient fine-tuning. Prompt-based approaches, which leverage in-context learning and task instruction prompting, enable LLMs to perform tasks without extensive retraining by using carefully constructed prompts to guide model responses.
Figure 3: Various Types of Auto Remediation Approaches.
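As one concrete instance of parameter-efficient fine-tuning, the sketch below wraps a causal language model with LoRA adapters using the Hugging Face peft library; the base checkpoint, target modules, and hyperparameters are placeholder assumptions, and the training loop on AIOps data is omitted.

```python
# Minimal parameter-efficient fine-tuning sketch using LoRA adapters (peft).
# Base model name, target modules, and hyperparameters are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "meta-llama/Llama-2-7b-hf"  # any causal LM checkpoint
model = AutoModelForCausalLM.from_pretrained(base_name)
tokenizer = AutoTokenizer.from_pretrained(base_name)

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable

# The wrapped model can then be trained on task-specific AIOps data
# (e.g., labeled logs) with a standard Trainer loop.
```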
Embedding-based methods transform AIOps data into semantic representations amenable to LLM interpretation, while knowledge-based approaches integrate historical data retrieval and tool augmentation to enhance model reasoning and decision-making.
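To illustrate the embedding-plus-retrieval pattern, the sketch below encodes past incident summaries with a sentence-transformer and retrieves the closest matches for a new incident by cosine similarity; the model name and the toy incident corpus are assumptions, and in practice the retrieved text would be appended to the diagnostic prompt.

```python
# Sketch of embedding-based retrieval of similar historical incidents,
# the building block behind knowledge-augmented prompting.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

history = [
    "Checkout latency spike caused by exhausted DB connection pool",
    "Login failures after expired TLS certificate on the auth gateway",
    "Payment timeouts traced to a misconfigured retry budget",
]
history_vecs = encoder.encode(history, normalize_embeddings=True)

def retrieve_similar(new_incident: str, k: int = 2) -> list[str]:
    """Return the k most similar past incidents by cosine similarity."""
    query = encoder.encode([new_incident], normalize_embeddings=True)[0]
    scores = history_vecs @ query  # cosine similarity on normalized vectors
    return [history[i] for i in np.argsort(-scores)[:k]]

# Retrieved incidents can then be appended to the diagnostic prompt so the
# LLM reasons with prior operational knowledge.
print(retrieve_similar("Orders API timing out, DB pool at 100% utilization"))
```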
Evaluating LLM-based AIOps
The integration of LLMs into AIOps has necessitated new evaluation methodologies, including newly adopted metrics for generation tasks and manual evaluation for interpretative tasks that demand human judgment. While traditional metrics persist for classification tasks, manual evaluation now plays a critical role in assessing generation quality and contextual relevance in LLM-driven solutions.
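For a concrete sense of this split, the snippet below pairs traditional classification metrics for failure detection with an automatic overlap metric (ROUGE-L) for generated root cause reports, using toy labels and texts; such scores complement rather than replace manual review of interpretative output.

```python
# Illustrative evaluation: classification metrics for failure detection plus an
# automatic generation metric (ROUGE-L) for root cause reports. Toy data only.
from sklearn.metrics import precision_recall_fscore_support
from rouge_score import rouge_scorer

# Failure detection: classic precision / recall / F1.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"detection P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

# Report generation: lexical overlap with a human-written reference report.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Connection pool exhaustion caused checkout latency."
generated = "Checkout latency was caused by an exhausted connection pool."
print(scorer.score(reference, generated)["rougeL"].fmeasure)
```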
Challenges and Future Directions
Despite remarkable progress, challenges remain, including high computational costs and the integration of more diverse data sources. Further research is needed to optimize LLMs' computational efficiency and to develop methods for incorporating trace data effectively. Additionally, enhancing LLMs' generalizability as software evolves and integrating them into existing AIOps toolchains are crucial for advancing automated IT operations management.
Conclusion
The application of LLMs in AIOps holds substantial promise for enhancing IT operations through improved anomaly detection, diagnostics, and automated remediation. Nonetheless, addressing existing challenges and optimizing the synergy between LLMs and traditional approaches will be pivotal for realizing their full potential in operational settings.