REBUS: A Robust Evaluation Benchmark of Understanding Symbols (2401.05604v2)

Published 11 Jan 2024 in cs.CL, cs.AI, cs.CV, and cs.CY

Abstract: We propose a new benchmark evaluating the performance of multimodal LLMs on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To achieve good performance on the benchmark task of identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that GPT-4o significantly outperforms all other models, with the remaining proprietary models in turn outperforming all other evaluated models. However, even the best model has a final accuracy of only 42%, which drops to just 7% on hard puzzles, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal LLMs.

Introduction

Innovations in AI have given rise to multimodal LLMs (MLLMs) capable of processing both text and visual inputs. However, there is a substantial need for benchmarks that assess the diverse and complex reasoning abilities of these models. One interesting application of MLLMs is deciphering rebus puzzles: challenges that combine visual clues with wordplay and that demand a range of cognitive abilities to solve.

The REBUS Benchmark

To investigate MLLMs' abilities in this complex domain, the authors developed the REBUS benchmark, a dataset of 333 rebus puzzles spanning 13 categories. Solving these puzzles requires models to engage in visual recognition, hypothesis testing, multi-step reasoning, and a general understanding of human cognition. The authors found that even the most advanced proprietary models evaluated, GPT-4V and Gemini Pro, display relatively modest success, with accuracy rates of 24% and 13.2%, respectively. The results indicate there is considerable room for improvement in these systems, especially on challenges that humans typically solve with greater ease.

Methodology and Findings

The evaluation covered numerous MLLMs, both open-source and proprietary, presenting the puzzles under zero-shot conditions to assess the models' innate problem-solving skills. Performance for open-source models was far lower still, rarely surpassing 2% accuracy. Notably, while some models could produce answers within the correct category, they often failed to solve the puzzle exactly or to provide a clear rationale for their solution, illustrating a gap in both knowledge and reasoning.
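
To make the zero-shot protocol concrete, the sketch below shows one way such an evaluation could be scored. It is a minimal illustration under assumptions, not the authors' actual harness: the query_model stub, the puzzles.json file, and its field names (image, answer, category) are hypothetical placeholders.

```python
"""Minimal zero-shot scoring sketch in the spirit of the REBUS setup.

Assumptions: query_model is a stand-in for a real multimodal LLM client,
and puzzles.json holds hypothetical records like
{"image": "...", "answer": "...", "category": "..."}.
"""
import json
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and surrounding whitespace before comparing."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()


def query_model(image_path: str, prompt: str) -> str:
    """Placeholder: replace with a call to an actual multimodal LLM API."""
    return ""


def evaluate(puzzles: list[dict]) -> dict:
    """Score exact-match accuracy overall and per category, zero-shot."""
    prompt = ("This image is a rebus puzzle. "
              "Reply with only the word or phrase it encodes.")
    correct = 0
    per_category: dict[str, list[int]] = {}
    for puzzle in puzzles:
        guess = query_model(puzzle["image"], prompt)
        hit = normalize(guess) == normalize(puzzle["answer"])
        correct += hit
        counts = per_category.setdefault(puzzle["category"], [0, 0])
        counts[0] += hit
        counts[1] += 1
    return {
        "accuracy": correct / len(puzzles) if puzzles else 0.0,
        "per_category": {c: hits / total for c, (hits, total) in per_category.items()},
    }


if __name__ == "__main__":
    with open("puzzles.json") as f:
        print(json.dumps(evaluate(json.load(f)), indent=2))
```

Exact match after light normalization mirrors the strictness of the reported accuracies, while the per-category breakdown captures the observation that models sometimes land in the right category without actually solving the puzzle.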

Implications and Future Directions

This work reveals that present-day MLLMs, despite their sophistication, still struggle with tasks requiring human-like flexibility and depth of understanding. The REBUS evaluations expose models' overconfidence in their problem-solving, their inability to revise incorrect approaches, and shortcomings in their deductive processes. Moving forward, innovations that mimic how humans approach problems, such as taking multiple perspectives or employing iterative search strategies, might pave the way for more competent multimodal reasoning in AI. As researchers explore such avenues, the REBUS dataset serves as a useful tool for measuring progress toward more cognitively adept multimodal LLMs.

Authors (10)
  1. Andrew Gritsevskiy (8 papers)
  2. Arjun Panickssery (5 papers)
  3. Aaron Kirtland (7 papers)
  4. Derik Kauffman (2 papers)
  5. Hans Gundlach (3 papers)
  6. Irina Gritsevskaya (1 paper)
  7. Joe Cavanagh (2 papers)
  8. Jonathan Chiang (1 paper)
  9. Lydia La Roux (1 paper)
  10. Michelle Hung (1 paper)