Introduction
Innovations in AI have given rise to multimodal LLMs (MLLMs) capable of processing both text and visual inputs. However, there is a substantial need for benchmarks that assess the diverse and complex reasoning abilities of these models. One interesting application of MLLMs is deciphering rebus puzzles: challenges that combine visual clues with wordplay and that demand a range of cognitive abilities to solve.
The REBUS Benchmark
To investigate MLLMs' abilities in this complex domain, researchers have developed the REBUS benchmark, a collection of 333 rebus puzzles spanning various categories. Solving these puzzles requires models to engage in visual recognition, hypothesis testing, multi-step reasoning, and a general understanding of human cognition. The authors found that even the most advanced proprietary models, namely GPT-4V and Gemini Pro, achieve only modest success, with accuracy rates of 24% and 13.2%, respectively. The results indicate considerable room for improvement in these systems, especially when they confront new challenges that humans typically solve with greater ease.
Methodology and Findings
The evaluation covered numerous MLLMs, both open-source and proprietary, presenting the puzzles under zero-shot conditions to assess the models' innate problem-solving skills. Performance for open-source models was significantly lower still, rarely surpassing 2% accuracy. Notably, while some models could produce answers within the correct category, they often failed to solve the puzzle itself or to provide a clear rationale for their answer, illustrating a gap in both knowledge and reasoning.
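To make the zero-shot protocol concrete, here is a minimal sketch of an exact-match scoring loop. The `query_model` stub, the JSON puzzle format with `image`, `answer`, and `category` fields, and the prompt wording are illustrative assumptions, not the authors' actual evaluation harness.

```python
import json

def query_model(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a vision-capable chat model.
    Replace with the actual client call for the model under evaluation."""
    raise NotImplementedError

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so 'Ice Cream!' matches 'ice cream'."""
    return "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace()).strip()

def evaluate(puzzles_path: str) -> float:
    """Zero-shot evaluation: one prompt per puzzle, no worked examples, exact-match scoring."""
    with open(puzzles_path) as f:
        # Assumed record format: [{"image": ..., "answer": ..., "category": ...}, ...]
        puzzles = json.load(f)

    prompt = (
        "This image is a rebus puzzle. Reason step by step about the visual "
        "clues and wordplay, then state your final answer on the last line."
    )
    correct = 0
    for puzzle in puzzles:
        prediction = query_model(puzzle["image"], prompt)
        final_line = prediction.strip().splitlines()[-1]  # take the model's last line as its answer
        if normalize(final_line) == normalize(puzzle["answer"]):
            correct += 1
    return correct / len(puzzles)
```

Under this kind of setup, accuracy is simply the fraction of puzzles whose normalized final answer matches the reference solution exactly.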
Implications and Future Directions
This work reveals that present-day MLLMs, despite their sophistication, still struggle with tasks requiring human-like flexibility and depth of understanding. The REBUS results expose models' overconfidence in their solutions, their inability to revise incorrect approaches, and shortcomings in their deductive processes. Moving forward, techniques that mimic how humans approach such problems, for example by entertaining multiple interpretations or employing iterative search strategies, might pave the way for more competent multimodal reasoning in AI. As researchers explore these avenues, the REBUS dataset serves as a useful tool for measuring progress and illuminating the path toward more cognitively adept LLMs.
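As one illustration of the iterative strategies gestured at above, the sketch below has a model propose a hypothesis and then repeatedly critique and revise it before committing to an answer. This is a hypothetical self-revision loop built on the same `query_model` placeholder as in the earlier sketch; it is not a method from the paper.

```python
def solve_with_revision(image_path: str, query_model, max_rounds: int = 3) -> str:
    """Speculative sketch: propose a candidate answer, then repeatedly ask the model
    whether every visual clue supports it, adopting any revised hypothesis it offers."""
    candidate = query_model(
        image_path,
        "This is a rebus puzzle. Propose an answer and explain which clue maps to which part of it.",
    )
    for _ in range(max_rounds):
        critique = query_model(
            image_path,
            f"Candidate solution:\n{candidate}\n\n"
            "Does every element of the image support this answer? "
            "If yes, reply 'ACCEPT'. If not, propose a revised answer with a full explanation.",
        )
        if critique.strip().upper().startswith("ACCEPT"):
            break
        candidate = critique  # adopt the revised hypothesis and check it again
    return candidate
```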