CS-Bench: A Comprehensive Benchmark for LLMs towards Computer Science Mastery
The rapid advancement of LLMs has sparked substantial interest across diverse fields, including their application and evaluation in computer science (CS). This paper presents CS-Bench, a bilingual (Chinese-English) benchmark meticulously designed to evaluate LLMs' proficiency in computer science. The benchmark covers a wide array of CS subfields and offers a comprehensive assessment of both knowledge and reasoning abilities in different linguistic contexts.
Design and Structure of CS-Bench
CS-Bench is built on several foundational principles to ensure a robust evaluation:
- Coverage of Key Domains: The benchmark spans four critical domains in the field of CS—Data Structure and Algorithm (DSA), Computer Organization (CO), Computer Network (CN), and Operating System (OS). These domains are further divided into 26 subfields.
- Diverse Task Formats: It includes multiple-choice questions (MC), assertion (true/false) questions, fill-in-the-blank (FITB) tasks, and open-ended questions to simulate real-world scenarios and assess LLMs' robustness to various formats.
- CS-specific Reasoning: CS-Bench distinguishes knowledge-type questions from reasoning-type questions; the latter evaluate the application of CS knowledge to logical and arithmetic reasoning.
- Bilingual Support: The benchmark is available in both Chinese and English, allowing LLMs to be evaluated in different language environments.
The data was collected from three diverse sources to ensure the richness and novelty of the dataset. The combined dataset comprises approximately 5,000 questions, supporting fine-grained assessment across domains, task formats, and question types.
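To make this structure concrete, here is a minimal sketch of how a single benchmark item could be represented. The field names and the example question are assumptions made for illustration, not CS-Bench's actual released schema.

```python
from dataclasses import dataclass

@dataclass
class CSBenchItem:
    """Illustrative layout for one item; field names are assumptions, not CS-Bench's actual schema."""
    domain: str     # "DSA", "CO", "CN", or "OS"
    subfield: str   # one of the 26 finer-grained subfields
    fmt: str        # "multiple-choice", "assertion", "fill-in-the-blank", or "open-ended"
    qtype: str      # "knowledge" or "reasoning"
    language: str   # "en" or "zh"
    question: str
    answer: str

# An invented example item for the Operating System domain.
item = CSBenchItem(
    domain="OS",
    subfield="Process Management",
    fmt="assertion",
    qtype="knowledge",
    language="en",
    question="A process in the blocked state is waiting for the CPU rather than for I/O.",
    answer="False",
)
print(item.domain, item.fmt, item.answer)
```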
Evaluation and Findings
The evaluation of over 30 mainstream LLMs on CS-Bench revealed several insights:
- Performance Differentiation: CS-Bench effectively differentiates LLMs' capabilities within the CS field. Even top-performing models like GPT-4o show significant room for improvement, indicating the benchmark's potential for pushing the boundaries of LLM development.
- Scale-Performance Relationship: CS performance grows roughly logarithmically with model scale. Scale-score fitting functions established on smaller models can therefore predict, and help guide the development of, larger models (a minimal fitting sketch follows this list).
- Reasons for Failure: The main cause of LLMs' failures is a lack of domain-specific CS knowledge; improving CS-specific reasoning requires more than strengthening general reasoning skills.
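As a rough illustration of the scale-score fitting idea, the sketch below fits a logarithmic curve to hypothetical (parameter count, score) pairs and extrapolates to a larger model. The values and the 200B target are invented for illustration, not results reported in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameters in billions, CS-Bench score) pairs for one model
# family; the values are invented for illustration.
scales = np.array([1.8, 7.0, 14.0, 72.0])
scores = np.array([38.5, 52.0, 58.5, 66.0])

def log_curve(n, a, b):
    # Logarithmic growth pattern: score ~ a * ln(params) + b
    return a * np.log(n) + b

(a, b), _ = curve_fit(log_curve, scales, scores)

# Extrapolate to a larger, hypothetical 200B-parameter model.
print(f"score ~ {a:.2f} * ln(params) + {b:.2f}; predicted at 200B: {log_curve(200.0, a, b):.1f}")
```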
In terms of task performance, LLMs show varying levels of proficiency:
- Task Formats: LLMs generally perform best on assertion questions, followed by multiple-choice, open-ended, and fill-in-the-blank tasks; the disparity is more pronounced in weaker models.
- Knowledge vs. Reasoning: LLMs perform better on knowledge-type questions than on reasoning-type questions, and the correlation between the two suggests that improving reasoning also requires strengthening domain-specific knowledge (a scoring sketch follows this list).
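The breakdown by task format and question type amounts to a small amount of bookkeeping over graded responses. The sketch below uses invented grading records, not actual CS-Bench evaluation data.

```python
from collections import defaultdict

# Invented grading records: (task format, question type, answered correctly?).
results = [
    ("assertion", "knowledge", True),
    ("multiple-choice", "knowledge", True),
    ("multiple-choice", "reasoning", False),
    ("fill-in-the-blank", "reasoning", False),
    ("open-ended", "knowledge", True),
    ("assertion", "reasoning", True),
]

def accuracy_by(index):
    # index 0 groups by task format, index 1 by question type.
    buckets = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for record in results:
        buckets[record[index]][0] += int(record[2])
        buckets[record[index]][1] += 1
    return {key: correct / total for key, (correct, total) in buckets.items()}

print("accuracy by format:", accuracy_by(0))
print("accuracy by type:  ", accuracy_by(1))
```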
Cross-Domain Capability Analysis
The benchmark also explores the relationship between LLMs' capabilities in CS and their proficiency in mathematics and coding:
- General Models: Scores on CS-Bench correlate highly with scores on benchmarks such as GSM8K (math) and HumanEval (code), indicating that strong performance in mathematics and coding often translates into better CS proficiency (see the correlation sketch after this list).
- Expert Models: Code-specialized and math-specialized models confirm that CS proficiency benefits from expertise in these domains, particularly in DSA and OS.
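One way to quantify this relationship is a simple Pearson correlation across models between CS-Bench scores and math/code benchmark scores. The values below are invented placeholders, not numbers reported in the paper.

```python
import numpy as np

# Invented per-model scores for four general-purpose models (placeholders).
cs_bench  = np.array([66.0, 58.5, 52.0, 45.5])
gsm8k     = np.array([90.0, 76.0, 62.0, 48.0])
humaneval = np.array([85.0, 70.0, 55.0, 40.0])

# Pearson correlation coefficients across the model set.
print("CS-Bench vs GSM8K:     r =", round(float(np.corrcoef(cs_bench, gsm8k)[0, 1]), 2))
print("CS-Bench vs HumanEval: r =", round(float(np.corrcoef(cs_bench, humaneval)[0, 1]), 2))
```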
Implications and Future Work
The implications of CS-Bench are profound:
- Theoretical: The benchmark provides a nuanced understanding of LLMs' strengths and weaknesses in CS, guiding future research towards improving domain-specific reasoning.
- Practical: CS-Bench can serve as a cornerstone for developing LLMs that are better suited for CS applications, enhancing their usability in education, industry, and scientific research.
Looking forward, CS-Bench sets the stage for subsequent research aimed at fine-grained capability enhancement, particularly in CS reasoning. It serves as a pivotal resource for the continuous assessment and development of LLMs, ensuring they are not only linguistically proficient but also technically adept.
Overall, CS-Bench highlights the intricate interplay between domain-specific knowledge and reasoning abilities in LLMs, providing a rigorous and comprehensive framework for evaluating and improving their performance in computer science.