Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose $\textbf{IsoBench}$, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple $\textbf{isomorphic representations}$ of inputs, such as visual, textual, and mathematical presentations. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse. Finally, we present two prompting techniques, $\textit{IsoCombination}$ and $\textit{IsoScratchPad}$, which improve model performance by considering combinations of, and translations between, different input representations.
IsoBench is a benchmark tool designed for evaluating the performance of multimodal foundation models across different tasks requiring text, image, or combined understanding.
It covers four domains: mathematics, games, algorithms, and science, emphasizing isomorphic representations to test models' capabilities in handling equivalent inputs in various formats.
Findings show a general preference for textual over visual representations among the models, highlighting challenges in vision model integration, input format sensitivity, and multimodal fusion techniques.
IsoBench introduces strategies like IsoCombination and IsoScratchPad to address these gaps, with IsoCombination particularly improving performance by combining multiple representations into a single input.
IsoBench is a benchmark designed to systematically evaluate the capabilities of multimodal foundation models across a diverse range of tasks that require understanding texts, images, or combinations thereof. This benchmark spans four domains: mathematics, science, algorithms, and games. Unique to IsoBench is its emphasis on isomorphic representations, where the same problem is presented in different modalities, including both visual and textual formats. By doing so, IsoBench provides a granular assessment of how well these models handle semantically equivalent inputs in distinct representations, revealing preferences or biases toward specific modalities.
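To make the idea of isomorphic representations concrete, the following is a minimal sketch built on a toy graph-connectivity question. The task, the two textual encodings (adjacency matrix and edge list), and all function names are illustrative assumptions rather than IsoBench's actual data format; a visual rendering of the same graph would be a third representation and is omitted here.

```python
# Hypothetical illustration: one graph problem, two isomorphic textual encodings.
# The encodings and helper names are assumptions, not IsoBench's data format.

def parse_adjacency_matrix(text):
    """Parse a whitespace-separated 0/1 matrix into (undirected edge set, node count)."""
    rows = [[int(x) for x in line.split()] for line in text.strip().splitlines()]
    edges = {
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value and i < j
    }
    return edges, len(rows)


def parse_edge_list(text, num_nodes):
    """Parse lines of the form 'u-v' into (undirected edge set, node count)."""
    edges = set()
    for line in text.strip().splitlines():
        u, v = (int(x) for x in line.split("-"))
        edges.add((min(u, v), max(u, v)))
    return edges, num_nodes


def connected(edges, num_nodes, source, target):
    """Depth-first search: is there a path from source to target?"""
    neighbours = {n: set() for n in range(num_nodes)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return target in seen


matrix_text = """
0 1 0 0
1 0 1 0
0 1 0 0
0 0 0 0
"""

edge_text = """
0-1
1-2
"""

graph_from_matrix = parse_adjacency_matrix(matrix_text)
graph_from_edges = parse_edge_list(edge_text, num_nodes=4)
```

Because both encodings describe the same graph, any correct solver must return the same answer for either one. A benchmark built this way can attribute a difference in model accuracy to the representation rather than to the problem.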
IsoBench comprises four major domains, each testing different aspects of model capabilities: mathematics, science, algorithms, and games.
Across the evaluated multimodal foundation models, a consistent preference for textual representations over visual ones was observed, in contrast to the human tendency to benefit from visual information. This discrepancy raises questions about the current multimodal fusion mechanisms in these models and their ability to leverage visual inputs effectively. The findings from IsoBench highlight challenges in three areas: vision model integration, input format sensitivity, and multimodal fusion techniques.
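The text-versus-image preference is reported as a difference in accuracy, measured in percentage points. The sketch below shows one way such a gap could be computed from per-example results. The record format and function names are assumptions, and the toy records are made up for illustration; they are not IsoBench results.

```python
from collections import defaultdict


def accuracy_by_modality(records):
    """Return accuracy (in percent) per modality.

    `records` is an iterable of (modality, is_correct) pairs; this format
    is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for modality, is_correct in records:
        totals[modality] += 1
        correct[modality] += int(is_correct)
    return {m: 100.0 * correct[m] / totals[m] for m in totals}


def modality_gap(records, reference="text", other="image"):
    """Accuracy of `reference` minus accuracy of `other`, in percentage points."""
    acc = accuracy_by_modality(records)
    return acc[reference] - acc[other]


# Made-up outcomes for four problems, each posed once as text and once as an image.
toy_records = [
    ("text", True), ("text", True), ("text", True), ("text", False),
    ("image", True), ("image", False), ("image", False), ("image", True),
]

gap = modality_gap(toy_records)  # positive values mean text outperforms image
```

A positive gap under this convention corresponds to the kind of result quoted in the abstract, where a model is a given number of points worse on images than on text for the same problems.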
To mitigate the performance discrepancies observed between input modalities, two strategies were introduced: IsoCombination (IsoCB) and IsoScratchPad (IsoSP). IsoCB combines multiple isomorphic representations into a single input, aiming to give the model a richer set of information. IsoSP, on the other hand, uses a two-step process: the model first translates a visual input into text, and the resulting higher-performing text representation is then used for the downstream task. Both strategies showed promising improvements, especially IsoCB, which significantly reduced the performance gap in certain tasks.
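The two strategies can be sketched as thin wrappers around a model call. Everything below is an assumption made for illustration: the `Model` callable signature, the prompt wording, and the stub model stand in for a real multimodal API and are not the prompts or code used in the paper.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface: (text prompt, optional image bytes) -> text answer.
Model = Callable[[str, Optional[bytes]], str]


def iso_combination(model: Model, question: str, text_repr: str, image: bytes) -> str:
    """IsoCB sketch: send several isomorphic representations in one input."""
    prompt = (
        f"{question}\n\n"
        f"Textual representation:\n{text_repr}\n\n"
        "The attached image shows the same problem."
    )
    return model(prompt, image)


def iso_scratchpad(model: Model, question: str, image: bytes) -> str:
    """IsoSP sketch: translate the image to text, then answer from the text alone."""
    # Step 1: the model transcribes the visual input into a textual representation.
    translation = model("Translate the problem shown in the image into text.", image)
    # Step 2: the model answers using only the translated text, with no image.
    return model(f"{question}\n\nTextual representation:\n{translation}", None)


# Stub model so the sketch runs without any external service.
calls: List[Tuple[str, bool]] = []


def stub_model(prompt: str, image: Optional[bytes] = None) -> str:
    calls.append((prompt, image is not None))
    if prompt.startswith("Translate"):
        return "edges: 0-1, 1-2"
    return "yes"


question = "Are nodes 0 and 2 connected?"
answer_cb = iso_combination(stub_model, question, "edges: 0-1, 1-2", b"fake-image-bytes")
answer_sp = iso_scratchpad(stub_model, question, b"fake-image-bytes")
```

In this sketch IsoCB makes a single call that carries both representations, while IsoSP makes two calls and the second never sees the image. The quality of the final IsoSP answer therefore depends on how faithful the intermediate translation is.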
The findings from IsoBench underscore the need for advances in the representations and fusion techniques used by multimodal foundation models to more effectively process and integrate information across modalities. The observed preference for textual inputs points to potential biases in current models, possibly stemming from imbalances in pre-training data or limitations in the models' architectural design.
Future research should focus on developing more sophisticated multimodal fusion mechanisms that can capitalize on the unique advantages of each modality. Additionally, expanding the diversity of tasks and representations in benchmarks like IsoBench will be crucial for comprehensively assessing and improving the capabilities of multimodal foundation models.
In summary, IsoBench brings to light critical challenges in current multimodal foundation models and proposes avenues for research to enhance their understanding and reasoning capabilities across diverse input modalities. With continued development and evaluation, we can move closer to models that truly comprehend and reason with the richness of human communication.