REBUS: A Robust Evaluation Benchmark of Understanding Symbols (2401.05604v2)

Published 11 Jan 2024 in cs.CL, cs.AI, cs.CV, and cs.CY

Abstract: We propose a new benchmark evaluating the performance of multimodal LLMs on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To achieve good performance on the benchmark task of identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that GPT-4o significantly outperforms all other models, with the remaining proprietary models in turn outperforming all other evaluated models. However, even the best model has a final accuracy of only 42%, which drops to just 7% on hard puzzles, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal LLMs.

Introduction

Innovations in AI have given rise to multimodal LLMs (MLLMs) capable of processing both text and visual inputs. However, there is a substantial need for benchmarks that assess the diverse and complex reasoning abilities of these models. One interesting application of MLLMs is deciphering rebus puzzles: challenges that combine visual clues with wordplay and that demand a range of cognitive abilities to solve.

The REBUS Benchmark

To investigate MLLMs' abilities in this complex domain, the authors developed the REBUS benchmark, a dataset of 333 rebus puzzles spanning 13 categories. Solving these puzzles requires models to engage in visual recognition, hypothesis testing, multi-step reasoning, and a general understanding of human cognition. The authors found that even the most advanced proprietary models evaluated, GPT-4V and Gemini Pro, display relatively modest success, with accuracy rates of 24% and 13.2%, respectively. The results indicate there is considerable room for improvement in these systems, especially on challenges that humans typically solve with greater ease.

Methodology and Findings

The evaluation covered numerous MLLMs, both open-source and proprietary, presenting the puzzles under zero-shot conditions to assess the models' innate problem-solving skills. Performance for open-source models was far lower still, rarely surpassing 2% accuracy. Notably, while some models could produce answers within the correct category, they often failed to solve the puzzle exactly or to provide a clear rationale for their solution, illustrating a gap in both knowledge and reasoning.
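
To make the zero-shot protocol concrete, the sketch below shows one way such an evaluation could be scored. It is a minimal illustration under assumptions, not the authors' actual harness: the query_model stub, the puzzles.json file, and its field names (image, answer, category) are hypothetical placeholders.

```python
"""Minimal zero-shot scoring sketch in the spirit of the REBUS setup.

Assumptions: query_model is a stand-in for a real multimodal LLM client,
and puzzles.json holds hypothetical records like
{"image": "...", "answer": "...", "category": "..."}.
"""
import json
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and surrounding whitespace before comparing."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()


def query_model(image_path: str, prompt: str) -> str:
    """Placeholder: replace with a call to an actual multimodal LLM API."""
    return ""


def evaluate(puzzles: list[dict]) -> dict:
    """Score exact-match accuracy overall and per category, zero-shot."""
    prompt = ("This image is a rebus puzzle. "
              "Reply with only the word or phrase it encodes.")
    correct = 0
    per_category: dict[str, list[int]] = {}
    for puzzle in puzzles:
        guess = query_model(puzzle["image"], prompt)
        hit = normalize(guess) == normalize(puzzle["answer"])
        correct += hit
        counts = per_category.setdefault(puzzle["category"], [0, 0])
        counts[0] += hit
        counts[1] += 1
    return {
        "accuracy": correct / len(puzzles) if puzzles else 0.0,
        "per_category": {c: hits / total for c, (hits, total) in per_category.items()},
    }


if __name__ == "__main__":
    with open("puzzles.json") as f:
        print(json.dumps(evaluate(json.load(f)), indent=2))
```

Exact match after light normalization mirrors the strictness of the reported accuracies, while the per-category breakdown captures the observation that models sometimes land in the right category without actually solving the puzzle.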

Implications and Future Directions

This work reveals that present-day MLLMs, despite their sophistication, still struggle with tasks requiring human-like flexibility and depth of understanding. The REBUS evaluations expose models' overconfidence in their problem-solving, their inability to revise incorrect approaches, and shortcomings in their deductive processes. Moving forward, innovations that mimic how humans approach problems, such as taking multiple perspectives or employing iterative search strategies, might pave the way for more competent multimodal reasoning in AI. As researchers explore such avenues, the REBUS dataset serves as a useful tool for measuring progress toward more cognitively adept multimodal LLMs.

Authors (10)
  1. Andrew Gritsevskiy (8 papers)
  2. Arjun Panickssery (5 papers)
  3. Aaron Kirtland (7 papers)
  4. Derik Kauffman (2 papers)
  5. Hans Gundlach (3 papers)
  6. Irina Gritsevskaya (1 paper)
  7. Joe Cavanagh (2 papers)
  8. Jonathan Chiang (1 paper)
  9. Lydia La Roux (1 paper)
  10. Michelle Hung (1 paper)