PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts (2506.06211v1)

Published 6 Jun 2025 in cs.CL, cs.AI, and cs.CV

Abstract: Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.

Overview of "PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts"

The paper introduces "PuzzleWorld," a comprehensive benchmark designed to evaluate multimodal, open-ended reasoning capabilities in AI systems. PuzzleWorld is composed of 667 complex puzzlehunt-style problems extracted from Puzzled Pint, a series of monthly puzzle events, and aims to challenge AI models by requiring them to engage in multiple cognitive processes across diverse modalities. This benchmark diverges from traditional reasoning tests by featuring tasks that are deliberately open-ended, lacking clearly defined problem structures. Rather than merely completing tasks with explicit instructions, models must infer the task's nature from nuanced, multimodal cues and resolve them through iterative, creative reasoning—aptly reflecting real-world scenarios requiring exploratory data analysis and investigative problem-solving.

Dataset and Annotation

PuzzleWorld differentiates itself through the complexity and diversity of its tasks, which require solvers to combine textual, visual, and structured inputs while utilizing multiple reasoning skills, such as logic, wordplay, spatial reasoning, cryptic decoding, domain knowledge, and commonsense understanding. Each puzzle in the benchmark is annotated with the final solution, a detailed reasoning trace, and cognitive skill labels. These annotations provide a granular view of models' reasoning pathways and enable rigorous, fine-grained diagnostic analysis.

A particular strength of PuzzleWorld lies in its annotation schema, which decomposes problem-solving into a series of discrete reasoning steps. Such granularity enables researchers to gauge models' capabilities not only in delivering correct final solutions but also in demonstrating coherent stepwise problem-solving processes.
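To make this annotation structure concrete, the following is a minimal sketch of what a single annotated puzzle record might look like. The field names and example content are illustrative assumptions for exposition, not the released PuzzleWorld schema.

```python
# Hypothetical annotated puzzle record illustrating the annotation schema described
# above: multimodal inputs, cognitive skill labels, a stepwise reasoning trace, and
# a final answer. Field names are assumptions, not the dataset's actual format.
puzzle_example = {
    "id": "example-puzzle-001",                    # hypothetical identifier
    "modalities": ["text", "image"],               # inputs the solver must combine
    "skills": ["wordplay", "logic", "spatial"],    # cognitive skill labels
    "reasoning_trace": [                           # discrete annotated reasoning steps
        "Notice that each icon encodes a letter via its initial sound.",
        "Read the letters in grid order to obtain an intermediate phrase.",
        "Apply the phrase as an instruction to extract the final answer.",
    ],
    "final_answer": "EXAMPLE ANSWER",
}
```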

Evaluation of Current AI Models

The paper evaluates leading multimodal models such as o3 and GPT-4o on PuzzleWorld, and the results reveal significant challenges. Most contemporary models achieve only 1-2% final answer accuracy, with the best-performing model, o3, reaching just 14% final answer accuracy and 40% stepwise reasoning accuracy. This underscores the substantial gap between current AI capabilities and the complex, integrative reasoning that human solvers perform.

A distinct aspect of the benchmark is its dual focus on both final solution accuracy and intermediary reasoning step accuracy. Although the final accuracy rates illustrate the task difficulty, stepwise accuracy offers deeper insights into models' reasoning trajectories and potential areas for improvement.
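The two headline metrics can be related with a minimal sketch, assuming each prediction carries a predicted final answer and a per-step correctness judgement for the annotated reasoning steps. This is only the shape of the computation, not the paper's official scoring code; grading free-form reasoning steps in practice is more involved.

```python
# Minimal sketch of final-answer accuracy versus stepwise accuracy, under the
# assumption that each prediction dict has "pred_answer", "gold_answer", and a
# list "step_correct" of booleans, one per annotated reasoning step.
def evaluate(predictions):
    # Final answer accuracy: fraction of puzzles with an exactly correct answer.
    final_correct = sum(p["pred_answer"] == p["gold_answer"] for p in predictions)
    final_accuracy = final_correct / len(predictions)

    # Stepwise accuracy: fraction of annotated steps the model got right,
    # averaged over puzzles.
    step_scores = [
        sum(p["step_correct"]) / len(p["step_correct"]) for p in predictions
    ]
    stepwise_accuracy = sum(step_scores) / len(step_scores)
    return final_accuracy, stepwise_accuracy
```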

Error Analysis and Model Limitations

In-depth error analysis reveals several areas where models falter in PuzzleWorld. Common issues include myopic reasoning—where models become fixated on initial hypotheses and fail to flexibly backtrack when encountering dead ends—and limitations inherent to language-based inference, particularly when models face tasks where visual and spatial reasoning are paramount. Furthermore, sketching—or the ability to maintain and manipulate persistent visual representations during problem-solving—is identified as a significant capability gap in current models.

Implications for AI Development

PuzzleWorld presents new frontiers in AI research by highlighting deficiencies in multimodal reasoning that extend beyond the conventional scope of language processing and logic. It invites further exploration into models capable of sketching, adapting across various modalities, and dynamically crafting hypotheses. Researchers are encouraged to use the detailed reasoning annotations for fine-tuning models to enhance stepwise reasoning accuracy, as initial experiments using PuzzleWorld's reasoning traces have shown notable improvements.
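The contrast the paper draws between fine-tuning on reasoning traces and fine-tuning on final answers alone can be illustrated with a hedged sketch of how the two training conditions might be assembled from the annotations. The function and field names are hypothetical and do not reflect the authors' released code.

```python
# Hypothetical construction of the two fine-tuning conditions compared in the paper:
# supervision on full reasoning traces versus on final answers only. Field names
# ("puzzle_text", "reasoning_trace", "final_answer") are illustrative assumptions.
def build_finetuning_examples(puzzles, use_traces=True):
    examples = []
    for p in puzzles:
        prompt = p["puzzle_text"]  # in practice the multimodal inputs would be included
        if use_traces:
            # Supervise on the annotated step-by-step reasoning, then the answer.
            target = "\n".join(p["reasoning_trace"]) + f"\nAnswer: {p['final_answer']}"
        else:
            # Answer-only supervision, which the paper reports degrades performance.
            target = f"Answer: {p['final_answer']}"
        examples.append({"prompt": prompt, "completion": target})
    return examples
```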

Overall, PuzzleWorld serves as an invaluable resource for AI researchers seeking to develop more sophisticated reasoning systems, supporting advancements toward general-purpose, multimodal reasoning AI that can function effectively in unstructured, real-world problem-solving contexts. Future developments may include enhancing models to better simulate human-like reasoning, shifting the focus from merely increasing model accuracy to fostering adaptability and creative problem-solving.

Authors (12)
  1. Hengzhi Li (5 papers)
  2. Brendon Jiang (1 paper)
  3. Alexander Naehu (1 paper)
  4. Regan Song (1 paper)
  5. Justin Zhang (11 papers)
  6. Megan Tjandrasuwita (8 papers)
  7. Chanakya Ekbote (9 papers)
  8. Steven-Shine Chen (2 papers)
  9. Adithya Balachandran (4 papers)
  10. Wei Dai (230 papers)
  11. Rebecca Chang (1 paper)
  12. Paul Pu Liang (103 papers)