OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models (2310.07637v5)
Abstract: Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is essential for maintaining the orderly and stable operation of existing information systems. According to Gartner's prediction, the use of AI technology for automated IT operations has become a new trend. Large language models (LLMs), which have exhibited remarkable capabilities in NLP-related tasks, show great potential in AIOps, for example in root cause analysis of failures, generation of operations and maintenance scripts, and summarization of alert information. Nevertheless, the performance of current LLMs on Ops tasks has yet to be determined. In this paper, we present OpsEval, a comprehensive task-oriented Ops benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in various crucial scenarios at different ability levels. The benchmark includes 7,184 multiple-choice questions and 1,736 question-answering (QA) questions in English and Chinese. Through a comprehensive performance evaluation of the current leading LLMs, we show how various LLM techniques affect Ops performance and discuss findings on topics including model quantization, QA evaluation, and hallucination. To ensure the credibility of our evaluation, we invited dozens of domain experts to manually review our questions. We have open-sourced 20% of the test QA to help researchers run preliminary evaluations of their OpsLLM models; the remaining 80% of the data is withheld to avoid test-set leakage. Additionally, we maintain an online leaderboard that is updated in real time and will continue to be updated, so that newly emerging LLMs are evaluated promptly. Both our dataset and leaderboard have been made public.
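The released multiple-choice split lends itself to a simple accuracy-based evaluation loop. The sketch below shows how such scoring might look; the item schema (`question`, `options`, `answer`), the prompt template, and the single-letter answer-extraction regex are illustrative assumptions, not the schema or evaluation code actually used by OpsEval.

```python
import re
from typing import Callable

def extract_choice(completion: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a model completion.
    The regex is a simplifying assumption; real evaluations often need more
    robust answer parsing."""
    match = re.search(r"\b([A-D])\b", completion.strip())
    return match.group(1) if match else None

def score_multiple_choice(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """Return accuracy of `ask_model` over a list of multiple-choice items."""
    correct = 0
    for item in items:
        options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
        predicted = extract_choice(ask_model(prompt))
        correct += int(predicted == item["answer"])
    return correct / len(items)

if __name__ == "__main__":
    # Hypothetical example item, not taken from the OpsEval dataset.
    demo = [{
        "question": "Which tool is typically used to collect metrics for alerting?",
        "options": {"A": "Prometheus", "B": "Photoshop", "C": "Excel", "D": "Vim"},
        "answer": "A",
    }]
    # Any str -> str callable can stand in for the model under test.
    print(score_multiple_choice(demo, lambda prompt: "A"))
```

Because the model under test is just a callable from prompt to completion, the same loop can wrap an API-hosted model or a locally quantized one without changes.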
- 2023. QwenLM/Qwen-7B. https://github.com/QwenLM/Qwen-7B
- 2024a. baidu/ERNIE-Bot-4.0. https://cloud.baidu.com/doc/WENXINWORKSHOP/s/clntwmv7t
- 2024b. THUDM/ChatGLM3-6B. https://github.com/THUDM/ChatGLM3
- Benchmarking Foundation Models with Language-Model-as-an-Examiner. arXiv:2306.04181 [cs.CL]
- Baichuan. 2023. Baichuan 2: Open Large-scale Language Models. arXiv preprint arXiv:2309.10305 (2023). https://arxiv.org/abs/2309.10305
- A Survey on Evaluation of Large Language Models. arXiv:2307.03109 [cs.CL]
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 320–335.
- Hugo Touvron et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323 (2022).
- Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
- C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv e-prints (2023), arXiv–2305.
- Andrew Lerner. 2017. AIOps Platforms. Gartner.
- Huatuo-26M, a Large-scale Chinese Medical QA Dataset. arXiv e-prints (2023), arXiv–2305.
- Holistic Evaluation of Language Models. arXiv e-prints (2022), arXiv–2211.
- Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013
- Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs.CL]
- An Empirical Study of NetOps Capability of Pre-Trained Large Language Models. CoRR abs/2309.05557 (2023). https://doi.org/10.48550/arXiv.2309.05557
- OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. OpenAI Blog (2022). https://openai.com/blog/chatgpt/
- OpenAI. 2023a. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
- OpenAI. 2023b. GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card.pdf
- Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
- Instruction Tuning with GPT-4. arXiv preprint arXiv:2304.03277 (2023).
- Large Language Models Encode Clinical Knowledge. arXiv preprint arXiv:2212.13138 (2022).
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv e-prints (2022), arXiv–2206.
- Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- InternLM Team. 2023. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. https://github.com/InternLM/InternLM.
- CMB: A Comprehensive Medical Benchmark in Chinese. arXiv e-prints (2023), arXiv–2308.
- Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL]
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]
- Skywork: A More Open Bilingual Foundation Model. arXiv:2310.19341 [cs.CL]
- Hui Zeng. 2023. Measuring Massive Multitask Chinese Understanding. arXiv e-prints (2023), arXiv–2304.
- Evaluating the Generation Capabilities of Large Chinese Language Models. arXiv e-prints (2023), arXiv–2308.
- FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models. arXiv e-prints (2023), arXiv–2308.
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv e-prints (2023), arXiv–2304.