Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics (2008.06448v3)

Published 14 Aug 2020 in cs.SE

Abstract: Logs have been widely adopted in software system development and maintenance because of the rich runtime information they record. In recent years, the increase in software size and complexity has led to rapid growth in the volume of logs. To handle these large volumes of logs efficiently and effectively, a line of research focuses on developing intelligent and automated log analysis techniques. However, only a few of these techniques have reached successful deployment in industry due to the lack of public log datasets and open benchmarking upon them. To fill this significant gap and facilitate more research on AI-driven log analytics, we have collected and released loghub, a large collection of system log datasets. In particular, loghub provides 19 real-world log datasets collected from a wide range of software systems, including distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. In this paper, we summarize the statistics of these datasets, introduce some practical usage scenarios of the loghub datasets, and present our benchmarking results on loghub to benefit researchers and practitioners in this field. At the time of writing, the loghub datasets had been downloaded roughly 90,000 times in total by hundreds of organizations from both industry and academia. The loghub datasets are available at https://github.com/logpai/loghub.

Authors (5)
  1. Jieming Zhu (68 papers)
  2. Shilin He (25 papers)
  3. Pinjia He (47 papers)
  4. Jinyang Liu (51 papers)
  5. Michael R. Lyu (176 papers)
Citations (51)

Summary

Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics

This paper introduces Loghub, a comprehensive collection of system log datasets designed to advance research in AI-driven log analytics. The authors aim to address the notable gap in publicly available log datasets, which has hindered the deployment and evaluation of log analysis techniques in real-world scenarios. Loghub encompasses 19 diverse datasets, totaling over 77 GB, sourced from distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software.

Overview of Loghub

Loghub provides an extensive resource for researchers, offering both labeled and unlabeled datasets. It includes logs from widely used systems such as Hadoop, Spark, and OpenStack, among others. Six of these datasets are explicitly labeled for tasks such as anomaly detection and duplicate issue identification, making them immediately usable for supervised learning. The remaining unlabeled datasets support exploratory research in parsing, compression, and unsupervised anomaly detection, with varied structures and complexities that mirror real-world challenges.
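
To make the labeled-dataset scenario concrete, the sketch below groups raw HDFS-style log lines into per-block sequences and attaches ground-truth labels for supervised anomaly detection. The file names and label-file columns are illustrative assumptions about the dataset layout, not paths taken from the paper.

```python
import re
from collections import defaultdict

import pandas as pd

# Hypothetical paths: a raw HDFS log plus a label file mapping block IDs
# to Normal/Anomaly. The exact file and column names are assumptions.
LOG_FILE = "HDFS.log"
LABEL_FILE = "anomaly_label.csv"

# Group raw log lines into per-block sequences keyed by the "blk_..." ID
# that appears in HDFS messages.
sequences = defaultdict(list)
with open(LOG_FILE, errors="ignore") as f:
    for line in f:
        match = re.search(r"blk_-?\d+", line)
        if match:
            sequences[match.group(0)].append(line.strip())

# Attach ground-truth labels so the sequences can feed a supervised model.
labels = pd.read_csv(LABEL_FILE).set_index("BlockId")["Label"]
dataset = [
    (block_id, seq, labels.get(block_id, "Unlabeled"))
    for block_id, seq in sequences.items()
]
print(f"{len(dataset)} labeled log sequences")
```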

Practical Applications and Benchmarking

The paper outlines several key applications for Loghub:

  • Log Parsing: Transforming unstructured log messages into structured data remains a fundamental step in log analytics. Loghub's diverse datasets facilitate the evaluation of parsing accuracy across various systems (a minimal parsing sketch follows this list).
  • Log Compression: As log data volume increases, specialized compression algorithms that leverage log-specific structures offer enhanced efficiency over general methods. Loghub enables comparative studies of such approaches.
  • Anomaly Detection: A critical area for maintaining system reliability, anomaly detection benefits from the labeled datasets, which allow accurate benchmarking of both supervised and unsupervised models.
  • Duplicate Issue Identification: Efficient clustering of log sequences identifies redundant issues, streamlining system maintenance efforts.
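
As a concrete illustration of the parsing step, the toy sketch below masks obviously variable tokens to recover an event template. It is not Drain, IPLoM, or any parser benchmarked in the paper, only a minimal regex-based approximation of the same idea.

```python
import re

# Variable-looking tokens are masked with <*> to recover an event template.
# These patterns are toy heuristics; real parsers use far more robust strategies.
VARIABLE_PATTERNS = [
    (re.compile(r"blk_-?\d+"), "<*>"),                  # HDFS block IDs
    (re.compile(r"\d+\.\d+\.\d+\.\d+(:\d+)?"), "<*>"),  # IP[:port]
    (re.compile(r"0x[0-9a-fA-F]+"), "<*>"),             # hex values
    (re.compile(r"\b\d+\b"), "<*>"),                    # plain integers
]

def to_template(message: str) -> str:
    """Replace variable fields in a raw log message with <*> placeholders."""
    for pattern, placeholder in VARIABLE_PATTERNS:
        message = pattern.sub(placeholder, message)
    return message

raw = "Received block blk_-1608999687919862906 of size 91178 from /10.250.19.102"
print(to_template(raw))
# -> Received block <*> of size <*> from /<*>
```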

Benchmarking results reveal clear differences in the effectiveness of existing algorithms across these tasks. For log parsing, Drain and IPLoM stand out for their accuracy, although complex datasets remain challenging to parse. Logzip achieves superior compression ratios, and supervised approaches deliver higher precision in anomaly detection than their unsupervised counterparts.
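
For context on how parsing accuracy can be scored, the sketch below computes a grouping-style accuracy in which a message counts as correctly parsed only when its predicted template group contains exactly the same messages as its ground-truth group; the precise metric used in the paper's benchmark may differ in detail.

```python
from collections import defaultdict

def grouping_accuracy(predicted, ground_truth):
    """Share of messages whose predicted template group contains exactly
    the same messages as the ground-truth group (illustrative sketch)."""
    def build_groups(assignments):
        groups = defaultdict(set)
        for msg_id, template in enumerate(assignments):
            groups[template].add(msg_id)
        return groups

    pred_groups = build_groups(predicted)
    truth_of = {}  # msg_id -> frozenset of its ground-truth group members
    for members in build_groups(ground_truth).values():
        for msg_id in members:
            truth_of[msg_id] = frozenset(members)

    correct = sum(
        len(members)
        for members in pred_groups.values()
        if truth_of[next(iter(members))] == frozenset(members)
    )
    return correct / len(predicted)

# Toy check: the parser splits the last "B" message into its own group,
# so only the two "A" messages are counted as correctly parsed (2/5).
truth = ["A", "A", "B", "B", "B"]
pred  = ["T1", "T1", "T2", "T2", "T3"]
print(grouping_accuracy(pred, truth))  # 0.4
```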

Implications and Future Directions

The contribution of Loghub extends beyond immediate applications, propelling both theoretical and practical advancements in AI-driven log analytics. By bridging the gap between existing algorithms and real-world application needs, it serves as a benchmark for the development of improved methodologies.

Future research directions include expanding the repository with even more diverse log datasets and establishing a benchmarking leaderboard to comprehensively evaluate emerging log analysis models. In doing so, Loghub supports not only incremental improvements but also progress on unresolved questions, such as strengthening unsupervised anomaly detection and improving parsing for complex log structures.

In conclusion, Loghub represents an essential step forward in equipping researchers and practitioners with the tools needed to harness AI for robust and scalable log analytics, paving the way for more reliable and efficient system maintenance practices.
