OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models (2310.07637v5)

Published 11 Oct 2023 in cs.AI and cs.NI

Abstract: Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is essential for maintaining the orderly and stable operation of existing information systems. According to Gartner's prediction, the use of AI technology for automated IT operations has become a new trend. LLMs, which have exhibited remarkable capabilities in NLP-related tasks, are showing great potential in the field of AIOps, for example in root cause analysis of failures, generation of operations and maintenance scripts, and summarization of alert information. Nevertheless, the performance of current LLMs on Ops tasks has yet to be determined. In this paper, we present OpsEval, a comprehensive task-oriented Ops benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in various crucial scenarios at different ability levels. The benchmark includes 7184 multiple-choice questions and 1736 question-answering (QA) items in English and Chinese. By conducting a comprehensive performance evaluation of the current leading LLMs, we show how various LLM techniques affect Ops performance and discuss findings on topics including model quantization, QA evaluation, and hallucination. To ensure the credibility of our evaluation, we invited dozens of domain experts to manually review our questions. At the same time, we have open-sourced 20% of the test QA to assist researchers in preliminary evaluations of their OpsLLM models. The remaining 80% of the data is not disclosed, which guards against test-set leakage. Additionally, we have built an online leaderboard that is updated in real time and will continue to be updated, ensuring that any newly emerging LLMs are evaluated promptly. Both our dataset and leaderboard have been made public.
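To make the multiple-choice portion of such a benchmark concrete, below is a minimal sketch of how one might score an LLM on OpsEval-style questions. The item schema (question, options, answer fields), the ask_model stub, and the prompt wording are illustrative assumptions, not the authors' released tooling or data format; the paper's actual evaluation protocol may differ.

```python
import re
import random


def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an HTTP request to a hosted model).
    Here it returns a random option letter so the script runs end to end."""
    return random.choice("ABCD")


def build_prompt(item: dict) -> str:
    """Format one multiple-choice item as a zero-shot prompt (assumed schema)."""
    options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
    return (
        "Answer the following IT-operations question with a single option letter.\n\n"
        f"Question: {item['question']}\n{options}\nAnswer:"
    )


def extract_choice(reply: str) -> str:
    """Pull the first option letter out of a free-form model reply."""
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else ""


def evaluate(items: list[dict]) -> float:
    """Return accuracy over a list of multiple-choice items."""
    correct = sum(
        extract_choice(ask_model(build_prompt(item))) == item["answer"]
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # A toy item in the assumed schema; the released OpsEval files may be structured differently.
    sample = [{
        "question": "Which command shows current memory usage on a Linux host?",
        "options": {"A": "free -h", "B": "ls -l", "C": "chmod 755", "D": "ping"},
        "answer": "A",
    }]
    print(f"accuracy = {evaluate(sample):.2%}")
```

Replacing ask_model with a call to the model under test, and the toy list with the released 20% of the test data, would reproduce a basic accuracy-style evaluation; the QA (free-form) subset would instead need reference-based or judge-based scoring.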
