IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations (2404.01266v3)
Abstract: Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose $\textbf{IsoBench}$, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple $\textbf{isomorphic representations}$ of inputs, such as visual, textual, and mathematical presentations. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse. Finally, we present two prompting techniques, $\textit{IsoCombination}$ and $\textit{IsoScratchPad}$, which improve model performance by considering combinations of, and translations between, different input representations.
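To make the two prompting techniques concrete, below is a minimal sketch of how IsoCombination and IsoScratchPad could be realized in code. It assumes a hypothetical `query_model(text=..., image=...)` wrapper around a multimodal foundation-model API; the function names and prompt wording are illustrative, not the paper's exact implementation.

```python
from typing import Optional


def query_model(text: str, image: Optional[str] = None) -> str:
    """Placeholder for a call to a multimodal foundation model
    (e.g., an API client for GPT-4 Turbo, Claude 3, or Gemini).
    Swap in a real client here; this stub is an assumption."""
    raise NotImplementedError


def iso_combination(question: str, text_repr: str, image_path: str) -> str:
    """IsoCombination: present multiple isomorphic representations of the
    same problem together, so the model can cross-check the image against
    its textual form."""
    prompt = (
        f"{question}\n\n"
        f"The attached image and the following description encode the "
        f"same problem:\n{text_repr}"
    )
    return query_model(text=prompt, image=image_path)


def iso_scratchpad(question: str, image_path: str) -> str:
    """IsoScratchPad: first translate the visual input into a textual
    representation (the scratchpad step), then answer from text alone,
    where models tend to perform better."""
    transcription = query_model(
        text="Transcribe the problem shown in this image as precise text.",
        image=image_path,
    )
    return query_model(
        text=f"{question}\n\nProblem description:\n{transcription}"
    )
```

The design intuition follows the paper's finding: since the same models score substantially higher on textual inputs, either supplementing the image with an isomorphic text form (IsoCombination) or explicitly converting image to text before solving (IsoScratchPad) can recover much of the gap.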