REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Abstract: We propose a new benchmark evaluating the performance of multimodal LLMs on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To perform well on the benchmark, which requires identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that GPT-4o significantly outperforms all other models, and that proprietary models in general outperform all other evaluated models. However, even the best model achieves a final accuracy of only 42%, which drops to just 7% on hard puzzles, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal LLMs.
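The headline metric is accuracy at identifying the clued word or phrase, reported both overall and on a hard subset. As a rough illustration only (not the authors' released evaluation harness), a minimal scoring loop might look like the sketch below; the example fields (`image`, `answer`, `is_hard`) and the `model_guess` callable are hypothetical placeholders.

```python
# Minimal sketch of scoring a model on a REBUS-style benchmark.
# Assumptions: each example has an image path, a gold answer string, and a
# hard/easy flag; `model_guess` wraps whichever multimodal LLM is being tested.
# These names are illustrative, not the paper's released code.

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'The Godfather.' matches 'the godfather'."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def evaluate(examples, model_guess):
    """Return (overall_accuracy, hard_accuracy) over a list of puzzle examples."""
    correct = hard_correct = hard_total = 0
    for ex in examples:
        guess = model_guess(ex["image"])  # model's answer to the rebus image
        hit = normalize(guess) == normalize(ex["answer"])
        correct += hit
        if ex["is_hard"]:
            hard_total += 1
            hard_correct += hit
    overall = correct / len(examples)
    hard = hard_correct / hard_total if hard_total else float("nan")
    return overall, hard
```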