CS-Bench: A Comprehensive Benchmark for LLMs towards Computer Science Mastery
The rapid advancement of LLMs has sparked substantial interest across diverse fields, including their application and evaluation in computer science (CS). This paper presents CS-Bench, a bilingual (Chinese-English) benchmark meticulously designed to evaluate LLMs' proficiency in computer science. The benchmark covers a wide array of CS subfields and offers a comprehensive assessment of both knowledge and reasoning abilities in different linguistic contexts.
Design and Structure of CS-Bench
CS-Bench is built on several foundational principles to ensure a robust evaluation:
- Coverage of Key Domains: The benchmark spans four critical domains in the field of CS—Data Structure and Algorithm (DSA), Computer Organization (CO), Computer Network (CN), and Operating System (OS). These domains are further divided into 26 subfields.
- Diverse Task Formats: It includes multiple-choice questions (MC), assertion (true/false) questions, fill-in-the-blank (FITB) tasks, and open-ended questions to simulate real-world scenarios and assess LLMs' robustness to various formats.
- CS-specific Reasoning: CS-Bench distinguishes knowledge-type questions from reasoning-type questions; the latter evaluate the application of CS knowledge to logical and arithmetic reasoning.
- Bilingual Support: The benchmark is available in both Chinese and English, allowing LLMs to be evaluated in different language environments.
The data was collected from three diverse sources to ensure the richness and novelty of the dataset. The combined dataset comprises approximately 5,000 questions, supporting fine-grained assessment across domains, task formats, and question types.
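To make this structure concrete, here is a minimal sketch of how a single benchmark item could be represented. The field names and the example question are assumptions made for illustration, not CS-Bench's actual released schema.

```python
from dataclasses import dataclass

@dataclass
class CSBenchItem:
    """Illustrative layout for one item; field names are assumptions, not CS-Bench's actual schema."""
    domain: str     # "DSA", "CO", "CN", or "OS"
    subfield: str   # one of the 26 finer-grained subfields
    fmt: str        # "multiple-choice", "assertion", "fill-in-the-blank", or "open-ended"
    qtype: str      # "knowledge" or "reasoning"
    language: str   # "en" or "zh"
    question: str
    answer: str

# An invented example item for the Operating System domain.
item = CSBenchItem(
    domain="OS",
    subfield="Process Management",
    fmt="assertion",
    qtype="knowledge",
    language="en",
    question="A process in the blocked state is waiting for the CPU rather than for I/O.",
    answer="False",
)
print(item.domain, item.fmt, item.answer)
```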
Evaluation and Findings
The evaluation of over 30 mainstream LLMs on CS-Bench revealed several insights:
- Performance Differentiation: CS-Bench effectively differentiates LLMs' capabilities within the CS field. Even top-performing models like GPT-4o show significant room for improvement, indicating the benchmark's potential for pushing the boundaries of LLM development.
- Scale-Performance Relationship: CS performance grows roughly logarithmically with model scale. Scale-score fitting functions established on smaller models can therefore predict, and help guide the development of, larger models (a minimal fitting sketch follows this list).
- Reasons for Failure: The main cause of LLMs' failures is a lack of domain-specific CS knowledge; improving CS-specific reasoning requires more than strengthening general reasoning skills.
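As a rough illustration of the scale-score fitting idea, the sketch below fits a logarithmic curve to hypothetical (parameter count, score) pairs and extrapolates to a larger model. The values and the 200B target are invented for illustration, not results reported in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameters in billions, CS-Bench score) pairs for one model
# family; the values are invented for illustration.
scales = np.array([1.8, 7.0, 14.0, 72.0])
scores = np.array([38.5, 52.0, 58.5, 66.0])

def log_curve(n, a, b):
    # Logarithmic growth pattern: score ~ a * ln(params) + b
    return a * np.log(n) + b

(a, b), _ = curve_fit(log_curve, scales, scores)

# Extrapolate to a larger, hypothetical 200B-parameter model.
print(f"score ~ {a:.2f} * ln(params) + {b:.2f}; predicted at 200B: {log_curve(200.0, a, b):.1f}")
```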
In terms of task performance, LLMs show varying levels of proficiency:
- Task Formats: LLMs generally perform best on assertion questions, followed by multiple-choice, open-ended, and fill-in-the-blank tasks; the disparity is more pronounced in weaker models.
- Knowledge vs. Reasoning: LLMs perform better on knowledge-type questions than on reasoning-type questions, and the correlation between the two suggests that improving reasoning also requires strengthening domain-specific knowledge (a scoring sketch follows this list).
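The breakdown by task format and question type amounts to a small amount of bookkeeping over graded responses. The sketch below uses invented grading records, not actual CS-Bench evaluation data.

```python
from collections import defaultdict

# Invented grading records: (task format, question type, answered correctly?).
results = [
    ("assertion", "knowledge", True),
    ("multiple-choice", "knowledge", True),
    ("multiple-choice", "reasoning", False),
    ("fill-in-the-blank", "reasoning", False),
    ("open-ended", "knowledge", True),
    ("assertion", "reasoning", True),
]

def accuracy_by(index):
    # index 0 groups by task format, index 1 by question type.
    buckets = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for record in results:
        buckets[record[index]][0] += int(record[2])
        buckets[record[index]][1] += 1
    return {key: correct / total for key, (correct, total) in buckets.items()}

print("accuracy by format:", accuracy_by(0))
print("accuracy by type:  ", accuracy_by(1))
```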
Cross-Domain Capability Analysis
The benchmark also explores the relationship between LLMs' capabilities in CS and their proficiency in mathematics and coding:
- General Models: Scores on CS-Bench correlate highly with scores on benchmarks such as GSM8K (math) and HumanEval (code), indicating that strong performance in mathematics and coding often translates into better CS proficiency (see the correlation sketch after this list).
- Expert Models: Code-specialized and math-specialized models confirm that CS proficiency benefits from expertise in these domains, particularly in DSA and OS.
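One way to quantify this relationship is a simple Pearson correlation across models between CS-Bench scores and math/code benchmark scores. The values below are invented placeholders, not numbers reported in the paper.

```python
import numpy as np

# Invented per-model scores for four general-purpose models (placeholders).
cs_bench  = np.array([66.0, 58.5, 52.0, 45.5])
gsm8k     = np.array([90.0, 76.0, 62.0, 48.0])
humaneval = np.array([85.0, 70.0, 55.0, 40.0])

# Pearson correlation coefficients across the model set.
print("CS-Bench vs GSM8K:     r =", round(float(np.corrcoef(cs_bench, gsm8k)[0, 1]), 2))
print("CS-Bench vs HumanEval: r =", round(float(np.corrcoef(cs_bench, humaneval)[0, 1]), 2))
```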
Implications and Future Work
The implications of CS-Bench are profound:
- Theoretical: The benchmark provides a nuanced understanding of LLMs' strengths and weaknesses in CS, guiding future research towards improving domain-specific reasoning.
- Practical: CS-Bench can serve as a cornerstone for developing LLMs that are better suited for CS applications, enhancing their usability in education, industry, and scientific research.
Looking forward, CS-Bench sets the stage for subsequent research aimed at fine-grained capability enhancement, particularly in CS reasoning. It serves as a pivotal resource for the continuous assessment and development of LLMs, ensuring they are not only linguistically proficient but also technically adept.
Overall, CS-Bench highlights the intricate interplay between domain-specific knowledge and reasoning abilities in LLMs, providing a rigorous and comprehensive framework for evaluating and improving their performance in computer science.