This paper tracks the performance of GPT-[n] and o-[n] models on multimodal puzzles, specifically focusing on visual reasoning tasks. The authors evaluate the models on PuzzleVQA and AlgoPuzzleVQA datasets, which require both fine-grained visual perception and abstract or algorithmic reasoning. The goal is to assess how the reasoning capabilities of LLMs evolve with model iterations.
The paper poses three primary research questions:
- How well do current state-of-the-art models perform on visual reasoning tasks?
- What specific types of pattern recognition and reasoning pose the greatest challenges?
- How can different models' multimodal reasoning capabilities be systematically evaluated and compared?
The authors find an upward trend in reasoning capabilities across model iterations, with notable performance jumps between GPT series models and o1. However, the o1 model still faces challenges with multimodal puzzles that require abstract reasoning, and its performance on algorithmic puzzles remains limited. The superior performance of o1 comes at a steep computational cost, approximately 750 times that of GPT-4o, raising efficiency concerns.
The paper highlights the key findings:
- Performance steadily improves from GPT-4-Turbo to GPT-4o to o1. The transition from GPT-4o to o1 is a substantial advancement but comes with a 750x increase in inference cost.
- The o1 model, despite improved reasoning, still underperforms humans on PuzzleVQA.
- GPT-4-Turbo and GPT-4o encounter perception and inductive reasoning bottlenecks.
- The primary bottleneck for o1 is perception; providing ground-truth perception improves o1's reasoning, letting it outperform GPT-4-Turbo and GPT-4o by 18-20%.
- The o1 model struggles with reasoning involving visual shapes and sizes.
- As puzzle complexity increases, such as in AlgoPuzzleVQA or dual-concept puzzles in PuzzleVQA, all models show a performance decline.
Datasets
The paper employs two datasets: PuzzleVQA and AlgoPuzzleVQA. PuzzleVQA contains 2,000 test instances organized into 10 puzzle categories, four focusing on single-concept patterns (numbers, colors, sizes, and shapes) and six on dual-concept patterns combining two distinct concepts. AlgoPuzzleVQA includes 1,800 test instances across 18 distinct puzzles, combining visual and algorithmic categories, where each puzzle integrates at least one visual and one algorithmic category. Visual categories include color, position, shape/size, and text, while algorithmic categories comprise arithmetic, boolean logic, combinatorics, graphs, optimization, search, and sets.
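To make the dataset structure concrete, the sketch below models a single puzzle instance in Python; the class name, fields, and example values are illustrative assumptions rather than the datasets' published schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PuzzleInstance:
    """One hypothetical test instance; field names are illustrative, not the official schema."""
    image_path: str        # rendered puzzle image shown to the model
    question: str          # natural-language question about the pattern
    options: List[str]     # candidate answers used in the multiple-choice setting
    answer: str            # ground-truth answer
    concepts: List[str] = field(default_factory=list)  # e.g. ["numbers"] or ["colors", "shapes"]

# A single-concept and a dual-concept example in the spirit of PuzzleVQA's split:
single_concept = PuzzleInstance(
    image_path="puzzles/numbers_0001.png",
    question="What number should replace the question mark?",
    options=["3", "5", "7", "9"],
    answer="7",
    concepts=["numbers"],
)
dual_concept = PuzzleInstance(
    image_path="puzzles/colors_shapes_0042.png",
    question="Which colored shape completes the pattern?",
    options=["red circle", "blue square", "green triangle", "red square"],
    answer="blue square",
    concepts=["colors", "shapes"],
)
```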
Experimental Setup
The puzzles are presented to the models in both multiple-choice and open-ended formats. In the multiple-choice setup, zero-shot chain-of-thought (CoT) prompting is used for GPT-[n] models, while o-[n] models, which are trained to reason internally, are prompted without CoT. Accuracy on the final answer serves as the evaluation metric. For open-ended responses, GPT-4o acts as a judge, comparing each generated response with the ground-truth answer to decide whether it matches.
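As a minimal sketch of this setup, assuming the OpenAI chat completions API, the code below routes GPT-[n] queries through zero-shot CoT, prompts o-[n] models directly, and uses GPT-4o to judge open-ended answers; the prompt wording and judge instructions are assumptions, not the paper's exact templates.

```python
import base64
from openai import OpenAI

client = OpenAI()

COT_SUFFIX = "Let's think step by step."  # zero-shot CoT cue for GPT-[n] models (assumed wording)

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def ask(model: str, image_path: str, question: str, options=None, use_cot=False) -> str:
    """Query one puzzle; the prompt template is an assumption, not the paper's exact wording."""
    prompt = question
    if options:  # multiple-choice setting: append the candidate answers
        prompt += "\nOptions: " + ", ".join(options)
    if use_cot:  # GPT-[n] models get zero-shot CoT; o-[n] models reason internally, so skip it
        prompt += "\n" + COT_SUFFIX
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def judge_open_ended(response: str, ground_truth: str) -> bool:
    """GPT-4o-as-judge for the open-ended setting; the judge prompt is a simplified assumption."""
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Ground-truth answer: {ground_truth}\nModel response: {response}\n"
                   "Does the response give the same final answer? Reply yes or no."}],
    ).choices[0].message.content
    return verdict.strip().lower().startswith("yes")
```

Accuracy in either setting is then the fraction of instances whose final answer, parsed from the response or accepted by the judge, matches the ground truth.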
The models investigated include GPT-4-Turbo, GPT-4o, and o1. The paper notes that these models were selected for their advancements and significant contributions to the LLM field.
Results and Discussion
The results indicate that all models generally perform better in the multiple-choice setting than in the open-ended setting. The o1 model shows the largest drop between the two settings on AlgoPuzzleVQA. In PuzzleVQA, human performance reaches 91.4% accuracy in the multiple-choice setting, while the best model, o1, achieves 79.2%. Size and shape categories are the most challenging for most models, and performance declines further on dual-concept puzzles.
In AlgoPuzzleVQA, the performance of all models remains relatively low, with o1 achieving the highest score of 55.3% in the multiple-choice setting. There is a notable performance improvement with o1 compared to GPT-4-Turbo and GPT-4o. However, in the open-ended setting, a significant performance drop is observed, particularly on puzzles like Chain Link and Wood Slide, where performance is near 0%.
The multiple-choice format provides a helpful cue that narrows the answer space and steers models toward the correct answer. The paper identifies perception as the primary bottleneck across all models: supplying the puzzle's visual details in the input prompt improves results by 22% to 30% for every model. GPT-4-Turbo and GPT-4o additionally show weaknesses in inductive reasoning, whereas o1 demonstrates strong inductive reasoning and gains only moderately when given both visual perception and inductive reasoning guidance.
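This bottleneck analysis amounts to re-prompting with ground-truth perception, and optionally the inductive rule, placed before the question. The sketch below illustrates that setup; the captions, wording, and example values are hypothetical.

```python
from typing import Optional

def build_guided_prompt(question: str, perception_caption: str,
                        inductive_hint: Optional[str] = None) -> str:
    """Prepend ground-truth visual details (and optionally the underlying rule) to the question,
    isolating perception from reasoning. Wording and captions are illustrative assumptions."""
    parts = [f"Visual details of the puzzle: {perception_caption}"]
    if inductive_hint is not None:
        parts.append(f"Pattern rule: {inductive_hint}")
    parts.append(question)
    return "\n".join(parts)

# Hypothetical example:
prompt = build_guided_prompt(
    question="What number should replace the question mark?",
    perception_caption="A 3x3 grid of circles labeled 2, 4, 6 / 3, 6, 9 / 4, 8, ?",
    inductive_hint="Each row multiplies its first number by 1, 2, and 3.",
)
```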
Overall, the authors conclude that visual perception remains a key limitation across all models, and improvements in visual understanding are needed.