EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges (2502.08859v2)

Published 13 Feb 2025 in cs.AI and cs.CL

Abstract: As LLMs master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier LLMs. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity -- each typically requiring teams of skilled solvers hours to days to complete -- with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art LLMs achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity's Last Exam, unveiling models' shortcomings when challenged with problems requiring unstructured and lateral reasoning.

EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

The paper "EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges" introduces EnigmaEval, a dataset designed to evaluate the capacities of frontier LLMs in solving complex puzzles that require advanced reasoning capabilities. This benchmark seeks to address the diminishing evaluative potential of existing reasoning benchmarks as current models increasingly approach their performance ceilings. By sourcing puzzles from a diverse mix of puzzle-solving events, EnigmaEval aims to challenge the multimodal reasoning abilities of state-of-the-art LLMs in ways that differ significantly from traditional evaluation methods.

Dataset Composition and Methodology

EnigmaEval consists of 1184 puzzles drawn from a variety of puzzle-solving events and competitions, including PuzzledPint, the MIT Mystery Hunt, and the Labor Day Extravaganza, among others. The puzzles span diverse formats and demand flexible reasoning across logic, wordplay, mathematics, and cultural references. Each puzzle is presented in two forms: (1) as the original PDF or webpage screenshot, and (2) as a human transcription of text and images that preserves the semantic and structural content of the puzzle while removing extraneous formatting.
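For concreteness, a single item in such a benchmark might be represented roughly as follows. The field names and structure here are illustrative assumptions, not the paper's published schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PuzzleRecord:
    """One EnigmaEval-style item (illustrative fields, not the official schema)."""
    puzzle_id: str                        # unique identifier within the benchmark
    source_event: str                     # e.g. "MIT Mystery Hunt" or "PuzzledPint"
    raw_pages: List[str]                  # paths to original PDF pages or webpage screenshots
    transcription: Optional[str] = None   # human transcription interleaving text and image references
    transcription_images: List[str] = field(default_factory=list)  # images retained by the transcription
    answer: str = ""                      # unambiguous final answer used for verification
```

Keeping both the raw and the transcribed view of each puzzle makes it possible to separate document-parsing failures from genuine reasoning failures.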

The collection process involved scraping puzzles from publicly accessible archives while ensuring compliance with licensing and permissions. Puzzles that relied on audio, video, or interactive elements were excluded due to current model limitations. This curation keeps the benchmark focused on reasoning ability rather than on a model's capacity to handle unsupported media formats.

Evaluation of LLMs

Significant emphasis is placed on evaluating the multimodal reasoning capabilities of LLMs using EnigmaEval. Models are assessed both on their ability to parse complex documents in their raw formats and on their reasoning capacity once that parsing burden is removed via the human transcriptions. The analysis reveals that current state-of-the-art models perform poorly on these puzzles, with accuracy as low as 7.0% on the easier portion of the dataset and 0% on the harder challenges. These results underscore a critical gap in the models' ability to synthesize implicit clues and apply lateral reasoning, leaving substantial headroom for future research.
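Because each puzzle has a single unambiguous answer, scoring can be reduced to comparing a model's final answer string against the gold answer. The sketch below assumes exact matching after a simple normalization step and uses a hypothetical `query_model` callable standing in for an actual model API; the paper's precise grading rules may differ.

```python
import re

def normalize(answer: str) -> str:
    """Canonicalize an answer by uppercasing and stripping spaces/punctuation
    (a simplifying assumption about how matching is done)."""
    return re.sub(r"[^A-Z0-9]", "", answer.upper())

def exact_match_accuracy(puzzles, query_model) -> float:
    """Fraction of puzzles whose predicted answer matches the gold answer.

    `query_model` is a hypothetical callable that takes a PuzzleRecord
    (in raw or transcribed form) and returns the model's final answer string.
    """
    if not puzzles:
        return 0.0
    correct = sum(
        normalize(query_model(p)) == normalize(p.answer) for p in puzzles
    )
    return correct / len(puzzles)
```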

Implications and Future Directions

This benchmark serves as a potent reminder of the significant progress yet to be made in developing models capable of true creative and flexible reasoning. By evaluating models' abilities on diverse and unstructured multimodal challenges, EnigmaEval provides a new platform for understanding and addressing the deficiencies in LLM reasoning capabilities. This paradigm ensures sustained relevance in measuring progress in AI reasoning and highlights the need for continued research in complex problem-solving and cross-domain knowledge integration.

Considering future developments, EnigmaEval opens numerous avenues for advancing LLMs. Improvements in optical character recognition (OCR), tighter integration of multimodal understanding, and strategies for recognizing patterns across disparate knowledge domains are all potential areas of exploration. The benchmark itself could also be expanded with more complex puzzle types or with dynamic and interactive puzzle elements.

Conclusion

EnigmaEval is a critical contribution to the suite of tools available for evaluating the cognitive frontiers of LLMs. Through its challenging and varied puzzle composition, it exposes current limitations in model reasoning abilities, providing a clear directive for future AI advancements. The benchmark not only raises the bar for evaluating existing models but also guides future research endeavors towards more nuanced and sophisticated model capabilities in reasoning, creativity, and flexible thinking.

Authors (10)
  1. Clinton J. Wang (7 papers)
  2. Dean Lee (104 papers)
  3. Cristina Menghini (13 papers)
  4. Johannes Mols (2 papers)
  5. Jack Doughty (1 paper)
  6. Adam Khoja (7 papers)
  7. Jayson Lynch (61 papers)
  8. Sean Hendryx (12 papers)
  9. Summer Yue (12 papers)
  10. Dan Hendrycks (63 papers)