This paper tracks the performance of GPT-[n] and o-[n] models on multimodal puzzles, specifically focusing on visual reasoning tasks. The authors evaluate the models on PuzzleVQA and AlgoPuzzleVQA datasets, which require both fine-grained visual perception and abstract or algorithmic reasoning. The goal is to assess how the reasoning capabilities of LLMs evolve with model iterations.
The paper poses three primary research questions:
- How well do current state-of-the-art models perform on visual reasoning tasks?
- What specific types of pattern recognition and reasoning pose the greatest challenges?
- How can different models' multimodal reasoning capabilities be systematically evaluated and compared?
The authors find an upward trend in reasoning capabilities across model iterations, with notable performance jumps between GPT series models and o1. However, the o1 model still faces challenges with multimodal puzzles that require abstract reasoning, and its performance on algorithmic puzzles remains limited. The superior performance of o1 comes at a steep computational cost, approximately 750 times that of GPT-4o, raising efficiency concerns.
The paper highlights the key findings:
- Performance steadily improves from GPT-4-Turbo to GPT-4o to o1. The transition from GPT-4o to o1 is a substantial advancement but comes with a 750x increase in inference cost.
- The o1 model, despite improved reasoning, still underperforms humans on PuzzleVQA.
- GPT-4-Turbo and GPT-4o encounter perception and inductive reasoning bottlenecks.
- The primary bottleneck for o1 is perception; providing ground-truth perception improves o1's reasoning, letting it outperform GPT-4-Turbo and GPT-4o by 18-20%.
- The o1 model struggles with reasoning involving visual shapes and sizes.
- As puzzle complexity increases, such as in AlgoPuzzleVQA or dual-concept puzzles in PuzzleVQA, all models show a performance decline.
Datasets
The paper employs two datasets: PuzzleVQA and AlgoPuzzleVQA. PuzzleVQA contains 2,000 test instances organized into 10 puzzle categories, four focusing on single-concept patterns (numbers, colors, sizes, and shapes) and six on dual-concept patterns combining two distinct concepts. AlgoPuzzleVQA includes 1,800 test instances across 18 distinct puzzles, combining visual and algorithmic categories, where each puzzle integrates at least one visual and one algorithmic category. Visual categories include color, position, shape/size, and text, while algorithmic categories comprise arithmetic, boolean logic, combinatorics, graphs, optimization, search, and sets.
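To make the dataset structure concrete, the sketch below models a single puzzle instance in Python; the class name, fields, and example values are illustrative assumptions rather than the datasets' published schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PuzzleInstance:
    """One hypothetical test instance; field names are illustrative, not the official schema."""
    image_path: str        # rendered puzzle image shown to the model
    question: str          # natural-language question about the pattern
    options: List[str]     # candidate answers used in the multiple-choice setting
    answer: str            # ground-truth answer
    concepts: List[str] = field(default_factory=list)  # e.g. ["numbers"] or ["colors", "shapes"]

# A single-concept and a dual-concept example in the spirit of PuzzleVQA's split:
single_concept = PuzzleInstance(
    image_path="puzzles/numbers_0001.png",
    question="What number should replace the question mark?",
    options=["3", "5", "7", "9"],
    answer="7",
    concepts=["numbers"],
)
dual_concept = PuzzleInstance(
    image_path="puzzles/colors_shapes_0042.png",
    question="Which colored shape completes the pattern?",
    options=["red circle", "blue square", "green triangle", "red square"],
    answer="blue square",
    concepts=["colors", "shapes"],
)
```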
Experimental Setup
The puzzles are presented to the models in both multiple-choice and open-ended formats. In the multiple-choice setup, zero-shot chain-of-thought (CoT) prompting is used for GPT-[n] models, while o-[n] models, which are trained to reason internally, are prompted without CoT. Accuracy on the final answer serves as the evaluation metric. For open-ended responses, GPT-4o acts as a judge, comparing each generated response with the ground-truth answer to decide whether it matches.
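As a minimal sketch of this setup, assuming the OpenAI chat completions API, the code below routes GPT-[n] queries through zero-shot CoT, prompts o-[n] models directly, and uses GPT-4o to judge open-ended answers; the prompt wording and judge instructions are assumptions, not the paper's exact templates.

```python
import base64
from openai import OpenAI

client = OpenAI()

COT_SUFFIX = "Let's think step by step."  # zero-shot CoT cue for GPT-[n] models (assumed wording)

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def ask(model: str, image_path: str, question: str, options=None, use_cot=False) -> str:
    """Query one puzzle; the prompt template is an assumption, not the paper's exact wording."""
    prompt = question
    if options:  # multiple-choice setting: append the candidate answers
        prompt += "\nOptions: " + ", ".join(options)
    if use_cot:  # GPT-[n] models get zero-shot CoT; o-[n] models reason internally, so skip it
        prompt += "\n" + COT_SUFFIX
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def judge_open_ended(response: str, ground_truth: str) -> bool:
    """GPT-4o-as-judge for the open-ended setting; the judge prompt is a simplified assumption."""
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Ground-truth answer: {ground_truth}\nModel response: {response}\n"
                   "Does the response give the same final answer? Reply yes or no."}],
    ).choices[0].message.content
    return verdict.strip().lower().startswith("yes")
```

Accuracy in either setting is then the fraction of instances whose final answer, parsed from the response or accepted by the judge, matches the ground truth.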
The models investigated include GPT-4-Turbo, GPT-4o, and o1. The paper notes that these models were selected for their advancements and significant contributions to the LLM field.
Results and Discussion
The results indicate that all models generally perform better in the multiple-choice setting than in the open-ended setting. The o1 model shows the largest drop between the two settings on AlgoPuzzleVQA. In PuzzleVQA, human performance reaches 91.4% accuracy in the multiple-choice setting, while the best model, o1, achieves 79.2%. Size and shape categories are the most challenging for most models, and performance declines further on dual-concept puzzles.
In AlgoPuzzleVQA, the performance of all models remains relatively low, with o1 achieving the highest score of 55.3% in the multiple-choice setting. There is a notable performance improvement with o1 compared to GPT-4-Turbo and GPT-4o. However, in the open-ended setting, a significant performance drop is observed, particularly on puzzles like Chain Link and Wood Slide, where performance is near 0%.
The multiple-choice format provides a helpful cue that narrows the answer space and steers models toward the correct answer. The paper identifies perception as the primary bottleneck across all models: supplying the puzzle's visual details in the input prompt improves results by 22% to 30% for every model. GPT-4-Turbo and GPT-4o additionally show weaknesses in inductive reasoning, whereas o1 demonstrates strong inductive reasoning and gains only moderately when given both visual perception and inductive reasoning guidance.
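This bottleneck analysis amounts to re-prompting with ground-truth perception, and optionally the inductive rule, placed before the question. The sketch below illustrates that setup; the captions, wording, and example values are hypothetical.

```python
from typing import Optional

def build_guided_prompt(question: str, perception_caption: str,
                        inductive_hint: Optional[str] = None) -> str:
    """Prepend ground-truth visual details (and optionally the underlying rule) to the question,
    isolating perception from reasoning. Wording and captions are illustrative assumptions."""
    parts = [f"Visual details of the puzzle: {perception_caption}"]
    if inductive_hint is not None:
        parts.append(f"Pattern rule: {inductive_hint}")
    parts.append(question)
    return "\n".join(parts)

# Hypothetical example:
prompt = build_guided_prompt(
    question="What number should replace the question mark?",
    perception_caption="A 3x3 grid of circles labeled 2, 4, 6 / 3, 6, 9 / 4, 8, ?",
    inductive_hint="Each row multiplies its first number by 1, 2, and 3.",
)
```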
Overall, the authors conclude that visual perception remains a key limitation across all models, and improvements in visual understanding are needed.