AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
The paper introduces AGIEval, a comprehensive benchmark tailored to evaluate the general capabilities of foundation models on human-level tasks. Foundation models such as GPT-4, ChatGPT, and Text-Davinci-003 exhibit substantial competence across many domains, but traditional benchmarks capture human-centric cognition poorly because they rely on artificial, task-specific datasets. AGIEval addresses this by deriving its tasks from standardized human exams, including college entrance exams, law school admission tests, and professional qualification tests, providing a more accurate gauge of progress toward "Artificial General Intelligence" (AGI).
Key Findings and Contributions
AGIEval encompasses a diverse set of tasks across multiple subjects, offering a robust evaluation of the linguistic and cognitive abilities of models. It includes bilingual tasks in English and Chinese, enabling an extensive assessment of capabilities across languages.
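To make the bilingual composition concrete, below is a minimal sketch of how per-task results might be aggregated by language. The task names, language labels, and result format are illustrative assumptions for this example, not AGIEval's official task list or data schema.

```python
# Illustrative sketch: macro-averaged accuracy per language over a bilingual
# task collection. Task names and the results format are assumptions for this
# example, not AGIEval's documented schema.
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (task name, language) pairs, in the spirit of the benchmark's English/Chinese split.
TASKS: List[Tuple[str, str]] = [
    ("sat-math", "en"),
    ("lsat-lr", "en"),
    ("logiqa-en", "en"),
    ("logiqa-zh", "zh"),
    ("jec-qa", "zh"),
]

def task_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Exact-match accuracy for a single task."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def per_language_macro_accuracy(
    results: Dict[str, Tuple[List[str], List[str]]]
) -> Dict[str, float]:
    """Average the per-task accuracies within each language group."""
    lang_of = dict(TASKS)
    by_lang: Dict[str, List[float]] = defaultdict(list)
    for task, (preds, gold) in results.items():
        by_lang[lang_of[task]].append(task_accuracy(preds, gold))
    return {lang: mean(scores) for lang, scores in by_lang.items()}
```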
- Performance of Models: GPT-4 demonstrates superior performance, surpassing average human results on exams such as the SAT and LSAT. The model attains remarkable accuracy: 95% on SAT Math and 92.5% on the English test of the Chinese college entrance exam (Gaokao). However, it still lags on tasks demanding complex reasoning or specialized domain knowledge.
- Evaluation Process: The evaluation covers both zero-shot and few-shot settings, and shows that the models' zero-shot performance approaches their few-shot performance, so adding demonstrations yields only modest gains (a prompt-construction sketch for these settings follows this list).
- Chain-of-Thought Prompting: The introduction of CoT prompting enhances the models' reasoning abilities, particularly in mathematical tasks, though its effectiveness varies by task, model, and language.
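To illustrate the evaluation settings above, here is a minimal sketch of how zero-shot, few-shot, and chain-of-thought prompts might be assembled for an AGIEval-style multiple-choice item. The field names and prompt wording are assumptions for this example, not the paper's exact templates.

```python
# Minimal sketch of zero-shot, few-shot, and chain-of-thought (CoT) prompt
# construction for an AGIEval-style multiple-choice item. Field names and
# prompt wording are illustrative assumptions, not the paper's templates.
from dataclasses import dataclass
from typing import List

@dataclass
class MCQItem:
    question: str
    options: List[str]      # e.g. ["(A) ...", "(B) ...", "(C) ...", "(D) ..."]
    passage: str = ""       # optional context, e.g. an LSAT reading passage
    answer: str = ""        # gold label such as "A", used only in demonstrations
    explanation: str = ""   # rationale text, used only in CoT demonstrations

def format_item(item: MCQItem) -> str:
    parts = []
    if item.passage:
        parts.append(f"Passage: {item.passage}")
    parts.append(f"Question: {item.question}")
    parts.append("Options:\n" + "\n".join(item.options))
    return "\n".join(parts)

def zero_shot_prompt(item: MCQItem) -> str:
    # The model sees only the test question plus an answer instruction.
    return format_item(item) + "\nAnswer with the letter of the correct option."

def few_shot_prompt(demos: List[MCQItem], item: MCQItem) -> str:
    # A handful of solved examples precede the test question.
    blocks = [format_item(d) + f"\nAnswer: {d.answer}" for d in demos]
    blocks.append(zero_shot_prompt(item))
    return "\n\n".join(blocks)

def cot_prompt(demos: List[MCQItem], item: MCQItem) -> str:
    # Demonstrations include a step-by-step rationale before the final answer,
    # encouraging the model to reason before committing to an option.
    blocks = [
        format_item(d)
        + f"\nLet's think step by step. {d.explanation}\nAnswer: {d.answer}"
        for d in demos
    ]
    blocks.append(format_item(item) + "\nLet's think step by step.")
    return "\n\n".join(blocks)
```

Separating item formatting from the prompting setting lets the same question be reused unchanged across zero-shot, few-shot, and CoT runs, which keeps comparisons between the settings clean.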
Implications for Future Research
The paper suggests several directions for improving AI models to better tackle human-centric tasks:
- Incorporation of Domain Knowledge: The integration of domain-specific knowledge, such as legal or scientific information, could enhance the models' capabilities in specialized areas.
- Multimodal and Multilingual Expansion: Expanding evaluation frameworks to encompass multimodal tasks and developing robust multilingual reasoning capabilities can extend the applicability of foundation models.
- Enhanced Reasoning Capabilities: Advancing models' logical reasoning and problem-solving skills, particularly for complex, multi-step tasks, remains a priority. This involves innovative training strategies and incorporation of symbolic reasoning elements.
- Robust Evaluation Metrics: Developing more nuanced automatic evaluation metrics would help capture the full spectrum of models' reasoning and decision-making abilities.
Conclusion
AGIEval provides a human-centric benchmark that delivers valuable insights into the strengths and limitations of foundation models such as GPT-4. By aligning evaluation tasks more closely with human cognition, the benchmark advances the pursuit of more reliable and effective AI systems. As foundation models continue to evolve, AGIEval can play an important role in steering future research and development toward closer alignment with human-like cognitive abilities.