REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Published 11 Jan 2024 in cs.CL, cs.AI, cs.CV, and cs.CY | (2401.05604v2)

Abstract: We propose a new benchmark evaluating the performance of multimodal LLMs on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To achieve good performance on the benchmark of identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that GPT-4o significantly outperforms all other models, followed by proprietary models outperforming all other evaluated models. However, even the best model has a final accuracy of only 42%, which goes down to just 7% on hard puzzles, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle, and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal LLMs.

Summary

  • The paper demonstrates that REBUS, a novel benchmark of 333 rebus puzzles, challenges MLLMs in multi-step reasoning and visual recognition.
  • The evaluation shows that even the best-performing model, GPT-4o, reaches only 42% accuracy (7% on hard puzzles), with GPT-4V at 24% and Gemini Pro at 13.2%, revealing significant performance gaps.
  • The study underscores the need for iterative, human-like problem-solving strategies to enhance model adaptability in handling symbolic puzzles.

Introduction

Innovations in AI have given rise to multimodal LLMs (MLLMs) capable of processing both text and visual inputs. However, there's a substantial need for benchmarks to assess the diverse and complex reasoning abilities of these models. An interesting application of MLLMs is their potential to decipher rebus puzzles—challenges that combine visual clues with wordplay, and which necessitate a range of cognitive abilities for successful resolution.

The REBUS Benchmark

To investigate MLLMs' abilities in this complex domain, the researchers developed the REBUS benchmark, consisting of 333 original rebus puzzles spanning 13 categories. Solving the puzzles requires models to combine visual recognition with hypothesis testing, multi-step reasoning, and a general understanding of human cognition. Even proprietary models achieve only modest success: the best performer, GPT-4o, reaches 42% accuracy (dropping to 7% on hard puzzles), while GPT-4V and Gemini Pro score 24% and 13.2%, respectively. The results indicate considerable room for improvement, especially on challenges that humans typically solve with far greater ease.
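
To make the task concrete, the sketch below shows roughly what a single benchmark entry could look like. The field names, file path, and example answer are assumptions for illustration only; the released dataset may use a different schema.

    # Illustrative sketch of one REBUS-style entry. Field names and the
    # example values are assumed for clarity, not the paper's released schema.
    puzzle = {
        "image_path": "puzzles/0042.png",  # hypothetical path to the rebus image
        "category": "Movies",              # one of the 13 categories (movies, composers, major cities, food, ...)
        "answer": "The Godfather",         # the clued word or phrase (made-up example)
    }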

Methodology and Findings

The evaluation covered numerous MLLMs, both open-source and proprietary, all tested on the puzzles in a zero-shot setting to assess their innate problem-solving skills. Open-source models fared far worse, rarely surpassing 2% accuracy. Notably, while some models could produce answers within the correct category, they often failed to solve the puzzle itself or to provide a clear rationale for their solution, illustrating gaps in both knowledge and reasoning.
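
As a rough illustration of the zero-shot protocol described above, the sketch below queries a model once per puzzle and scores it by lenient exact match against the clued answer. The query_model callable, the prompt text, and the normalization rule are assumptions, not the paper's actual evaluation harness.

    def normalize(text: str) -> str:
        """Lowercase and drop punctuation so formatting differences are not counted as errors."""
        return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

    def evaluate(puzzles, query_model) -> float:
        """Zero-shot accuracy: one prompt per puzzle, no examples or feedback.

        query_model(image_path, prompt) is a hypothetical callable that sends
        the rebus image plus a text prompt to an MLLM and returns its answer.
        """
        prompt = "This image is a rebus puzzle. What word or phrase does it represent?"
        correct = 0
        for p in puzzles:
            prediction = query_model(p["image_path"], prompt)
            if normalize(prediction) == normalize(p["answer"]):
                correct += 1
        return correct / len(puzzles)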

Implications and Future Directions

This work reveals that present-day MLLMs, despite their sophistication, still struggle with tasks requiring human-like flexibility and depth of understanding. REBUS exposes models' overconfidence in their solutions, their inability to revise incorrect approaches, and shortcomings in their deductive processes. Moving forward, strategies that mimic how humans approach such puzzles, for instance by taking multiple perspectives or searching iteratively over candidate answers, may pave the way for more competent multimodal reasoning. As researchers explore these avenues, the REBUS dataset serves as a tool for measuring progress toward more cognitively adept models.
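
As one way to picture the iterative-search direction mentioned above, a solver could alternate between proposing an answer and critiquing whether that answer accounts for every element of the puzzle. This is a speculative sketch built on assumed propose and critique model calls, not a method implemented in the paper.

    def iterative_solve(image_path, propose, critique, max_rounds: int = 3):
        """Speculative propose-and-critique loop for a rebus puzzle.

        propose(image_path, feedback) and critique(image_path, guess) are
        hypothetical MLLM calls: the first returns a candidate answer given the
        prior critique, the second returns (is_consistent, explanation) judging
        whether the guess explains every visual element of the puzzle.
        """
        feedback = ""
        guess = None
        for _ in range(max_rounds):
            guess = propose(image_path, feedback)
            ok, explanation = critique(image_path, guess)
            if ok:
                return guess          # every puzzle element is accounted for
            feedback = explanation    # retry, using the critique as guidance
        return guess                  # best effort after max_rounds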
