Essay on "GPQA: A Graduate-Level Google-Proof QA Benchmark"
The paper "GPQA: A Graduate-Level Google-Proof QA Benchmark" by Rein et al. presents a sophisticated and rigorously constructed dataset of 448 multiple-choice questions across the domains of biology, physics, and chemistry. The dataset, GPQA, is specifically designed to be highly challenging and resistant to easy look-ups via internet resources, thus encapsulating a "Google-proof" nature.
Dataset Construction and Validation
The GPQA questions were written by domain experts who hold or are pursuing PhDs in their respective fields, ensuring that the questions are deep, precise, and probe niche, intricate facets of these scientific domains. The process comprised multiple stages to ensure quality and difficulty (a schematic sketch follows the list):
- Question Writing: Experts crafted questions aimed at being difficult for non-experts but answerable by other experts in their domain.
- Expert Validation: Each question was then checked by two additional experts in the same domain, a step critical for assessing whether the question was objective and genuinely difficult.
- Revision Loop: Writers revised their questions in response to the first expert validator's feedback, improving clarity, objectivity, and difficulty before the second expert validation.
- Non-Expert Validation: To test the "Google-proof" property, highly skilled non-experts (people who hold or are pursuing PhDs in other domains) attempted the questions with unrestricted web access but without AI assistants.
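To make the pipeline concrete, the sketch below models a question record carrying its expert and non-expert validation outcomes and applies an illustrative filter that keeps questions experts tend to answer correctly and skilled non-experts tend to miss. The field names and thresholds are placeholders; the paper's actual extended/main/diamond set definitions use more careful criteria.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Question:
    text: str
    options: List[str]             # the four answer choices
    correct_index: int             # index of the writer's intended answer
    expert_correct: List[bool]     # outcomes of the expert validations
    nonexpert_correct: List[bool]  # outcomes of the non-expert validations

def passes_filter(q: Question) -> bool:
    """Illustrative filter: keep questions that experts tend to get right
    and skilled non-experts tend to get wrong. The thresholds here are
    placeholders, not the paper's actual set-membership criteria."""
    expert_rate = sum(q.expert_correct) / len(q.expert_correct)
    nonexpert_rate = sum(q.nonexpert_correct) / len(q.nonexpert_correct)
    return expert_rate >= 0.5 and nonexpert_rate < 0.5
```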
Evaluation of Performance
Human experts achieved an accuracy of 65%, rising to 74% when discounting clear mistakes the experts identified in retrospect. In stark contrast, non-expert validators reached only 34% accuracy despite unrestricted web access and spending over 30 minutes per question on average. The strongest GPT-4-based baseline achieved just 39% accuracy, confirming the challenging nature of the dataset.
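The jump from 65% to 74% corresponds to removing validations that the experts themselves later flagged as clear mistakes. A minimal sketch of that bookkeeping, assuming a simple discounting scheme in which flagged answers are dropped from the denominator (the paper's exact accounting may differ):

```python
def expert_accuracy(num_correct: int, num_total: int, num_retracted: int = 0) -> float:
    """Accuracy over expert validations, optionally discounting answers the
    validators later identified as clear mistakes (dropped from the
    denominator). Illustrative of the 65% -> 74% adjustment only."""
    return num_correct / (num_total - num_retracted)

# Hypothetical counts chosen only to roughly reproduce the reported figures.
print(expert_accuracy(65, 100))      # 0.65
print(expert_accuracy(65, 100, 12))  # ~0.74
```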
Measurement of Objectivity and Difficulty
The researchers estimated question objectivity through post-hoc agreement among the expert validators, yielding a conservative objectivity estimate of 74%. Validators also reported their confidence in each answer, making it possible to assess how well calibrated they were on these complex questions. Non-expert validators were markedly worse calibrated, showing overconfidence relative to their much lower accuracy.
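As a rough illustration of how such calibration can be summarized, the helper below (hypothetical, not code from the paper) compares self-reported confidence with actual correctness and reports an overconfidence gap and Brier score.

```python
from statistics import mean

def calibration_summary(confidences, outcomes):
    """Compare self-reported confidences (0-1) with correctness (bools).
    A mean confidence well above accuracy indicates overconfidence.
    Hypothetical helper for illustration, not code from the paper."""
    hits = [1.0 if o else 0.0 for o in outcomes]
    accuracy = mean(hits)
    avg_conf = mean(confidences)
    brier = mean((c - h) ** 2 for c, h in zip(confidences, hits))
    return {
        "accuracy": accuracy,
        "mean_confidence": avg_conf,
        "overconfidence_gap": avg_conf - accuracy,
        "brier_score": brier,
    }

# Example: a validator who is often wrong but reports high confidence.
print(calibration_summary([0.9, 0.8, 0.85, 0.9], [True, False, False, False]))
```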
Baseline Evaluations with AI Systems
A detailed analysis of baseline AI system performance revealed notable insights:
- Closed-book Setting: Models such as GPT-3.5 and GPT-4, even with few-shot chain-of-thought prompting, scored far below human experts, underscoring the difficulty of the questions (a minimal prompt sketch follows this list).
- Open-book Setting: Even when GPT-4 was given access to internet search, its accuracy improved only marginally, highlighting how resistant the questions are to retrieval.
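For concreteness, here is a minimal sketch of how a few-shot chain-of-thought prompt for a four-way multiple-choice question might be assembled in the closed-book setting. The exemplar text, formatting, and function name are placeholders; the authors' actual prompts and answer-extraction logic differ.

```python
# Placeholder worked exemplar; the paper uses real domain exemplars.
FEW_SHOT_EXAMPLE = (
    "Question: <worked example question>\n"
    "Choices:\n(A) ...\n(B) ...\n(C) ...\n(D) ...\n"
    "Let's think step by step. <reasoning> The correct answer is (B).\n\n"
)

def build_cot_prompt(question: str, choices: list) -> str:
    """Assemble a few-shot chain-of-thought prompt for one question."""
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"({l}) {c}" for l, c in zip(letters, choices))
    return (
        FEW_SHOT_EXAMPLE
        + f"Question: {question}\nChoices:\n{options}\n"
        + "Let's think step by step."
    )
```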
Implications and Future Directions
The implications of this research are multifaceted:
- Scalable Oversight: GPQA provides a robust testbed for scalable oversight experiments, allowing the development of methods for human supervision of AI outputs in domains requiring high expertise.
- Ensuring Truthfulness: The dataset's difficulty ensures that questions are near the edge of current human expertise, making it invaluable for experiments aimed at extracting truthful information from advanced AI systems.
- Enhancing Human-AI Collaboration: By challenging both humans and AI systems, GPQA can foster development in protocols and systems that facilitate reliable human-AI collaboration in creating new scientific knowledge.
Limitations
Despite its strengths, the dataset has limitations. Its small size (448 main-set questions) makes it unsuitable for large-scale training, though it remains valuable for evaluation. The pool of question writers and validators may not be fully representative of each field, which could introduce biases, and because the questions sit at the edge of current human expertise rather than beyond it, the benchmark's usefulness for evaluating truly superhuman AI systems remains an open question.
Conclusion
GPQA stands out as a meticulously crafted benchmark for testing the limits of human and AI expertise in scientific domains. By anchoring its design in expert validation and rigorous difficulty criteria, the dataset provides a crucial resource for advancing scalable oversight and enhancing the fidelity of AI-generated insights in complex scientific endeavors. As the frontier of artificial intelligence advances, GPQA will remain a vital tool in ensuring the alignment and reliability of superhuman AI systems.