System 2 thinking in OpenAI's o1-preview model: Near-perfect performance on a mathematics exam (2410.07114v5)

Published 19 Sep 2024 in cs.CY, cs.AI, and cs.CL

Abstract: The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, LLMs were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI's benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch 'Mathematics B' final exam. It scored a near-perfect 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch students' average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model contamination (i.e., the knowledge cutoff for o1-preview and GPT-4o was after the exam was published online), we repeated the procedure with a new Mathematics B exam that was published after the cutoff date. The results again indicated that o1-preview performed strongly (97.8th percentile), which suggests that contamination was not a factor. We also show that there is some variability in the output of o1-preview, which means that sometimes there is 'luck' (the answer is correct) or 'bad luck' (the output has diverged into something that is incorrect). We demonstrate that the self-consistency approach, where repeated prompts are given and the most common answer is selected, is a useful strategy for identifying the correct answer. It is concluded that while OpenAI's new model series holds great potential, certain risks must be considered.

Summary

  • The paper demonstrates that OpenAI’s o1-preview achieved near-perfect scores (76/76 on the first attempt and 74/76 on repetition) on a challenging Dutch Mathematics B exam.
  • The study employs rigorous self-consistency techniques and chain-of-thought analysis to validate the model’s advanced analytical and computational reasoning.
  • The results highlight the model’s potential to transform academic testing and automated problem-solving, while also raising important ethical and safety considerations.

Analysis of System 2 Reasoning in OpenAI's o1-Preview Model for Mathematical Competence

The academic paper titled "System 2 Thinking in OpenAI’s o1-Preview Model: Near-Perfect Performance on a Mathematics Exam" scrutinizes the capabilities of OpenAI’s o1-preview model in mathematical reasoning. This analysis is contextualized within the dual-process theory of cognition, focusing on System 2 reasoning, which involves deliberate, analytical thought processes. The authors provide a comparative evaluation of the o1-preview and GPT-4o models against the demanding Dutch ‘Mathematics B’ final exam, offering quantitative results and a discussion of their practical implications.

Methodological Evaluation

In their methodological approach, the researchers use the Dutch VWO Mathematics B exam as a benchmark to assess the o1-preview model's capacity for System 2 reasoning. By drawing on this challenging assessment, the paper provides a focused examination of the model's performance in problem-solving and computation without access to the exam's figures. The test results, an impressive 76 out of 76 for o1-preview on its first attempt and 74 upon repetition, outperform the GPT-4o model (66 and 62 points) and far exceed the Dutch students' average of 40.63 points.

Despite the limitations arising from the missing visual inputs and the intrinsic variability of model outputs (due to the sampling temperature), these results underscore the enhanced reasoning capabilities of the o1-preview model, further corroborated by the self-consistency approach of repeated prompting and majority voting, sketched below. This methodological evidence confirms the advancement in System 2-like reasoning that differentiates o1-preview from its predecessors.
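
To make the self-consistency strategy concrete, the sketch below shows one common implementation: sample the same prompt several times and keep the modal answer. Here `ask_model` is a hypothetical stand-in for whatever call returns the model's final answer; it is illustrative, not the paper's actual code.

```python
from collections import Counter

def self_consistent_answer(ask_model, prompt, n_samples=5):
    """Sample the same prompt n_samples times and return the modal answer.

    ask_model: hypothetical callable that submits `prompt` to the model
    and returns its final answer as a normalized string.
    """
    answers = [ask_model(prompt) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples  # modal answer and its agreement rate
```

The agreement rate doubles as a rough confidence signal: in the 'bad luck' runs the abstract describes, repeated samples diverge and no single answer wins a clear majority.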

Results and Discussion

The paper reports that o1-preview scored in the 97.8th and 98.3rd percentiles on the two Mathematics B exams, firmly establishing its competence; a toy percentile calculation is sketched below. This performance is partly attributed to the model's internal chain-of-thought (CoT) reasoning process, distinguishing it from models lacking such capabilities. Furthermore, o1-mini, a smaller variant of the model, performed well but slightly below o1-preview. Notably, the second exam was published after the models' knowledge cutoff, which mitigates the concern that the first exam had leaked into the training data.
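
As a toy illustration of the percentile claim, assuming access to the full national score distribution (the function and example counts below are hypothetical, chosen only to reproduce the reported 97.8th percentile):

```python
def percentile(score, population_scores):
    """Percentage of the population scoring strictly below `score`."""
    below = sum(s < score for s in population_scores)
    return 100.0 * below / len(population_scores)

# Hypothetical check: outscoring 16,050 of the 16,414 Dutch candidates
# corresponds to 100 * 16050 / 16414 ≈ 97.8, i.e. the 97.8th percentile.
```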

A noteworthy aspect of this research is the use of reasoning tokens and computation time as proxies for the model's deliberative effort. Detailed analyses indicate that output variability can be managed through the self-consistency method, yielding more reliable problem-solving; a minimal measurement sketch follows.
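
As a minimal sketch of how such measurements can be collected, assuming the official OpenAI Python SDK (the logging logic is illustrative; the paper does not publish its harness):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def timed_completion(prompt, model="o1-preview"):
    """Submit one exam question; record latency and token usage."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    usage = response.usage
    # o1-series models report reasoning tokens separately; guard against
    # SDK versions where the detail field is absent.
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", None) if details else None
    return response.choices[0].message.content, elapsed, usage.completion_tokens, reasoning
```

Comparing the `reasoning` counts and `elapsed` times across questions gives the kind of deliberation-effort proxy the authors analyze.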

Implications and Future Outlook

With these empirical results in hand, the paper opens a pivotal discussion of the practical implications of System 2-like AI models. In academic and industrial settings, the capability to execute advanced reasoning tasks this effectively could fundamentally transform research methodologies, automated problem-solving, and educational instruction. However, these advancements also raise critical ethical and safety considerations concerning the potential misuse of sophisticated problem-solving AI, suggesting a dual-edged impact that requires proactive risk assessment and mitigation.

From a future research perspective, examining how this reasoning scales across varied data inputs and complex real-world tasks could offer deeper insights into the applicability and adaptability of System 2-like models. Moreover, improving inference speed and accuracy, potentially through enhanced infrastructure such as faster GPUs, is essential to catalyze broad adoption of System 2-like models in AI.

In conclusion, this paper rigorously evaluates the promising advances of OpenAI’s o1-preview model in System 2 reasoning. The implications extend beyond immediate applications, heralding potential shifts in the role and capabilities of computational cognition within AI research and practice. Future explorations will likely center on optimizing these models for wider use cases while safeguarding against risks of misuse.