
Abstract

Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose IsoBench, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple isomorphic representations of inputs, such as visual, textual, and mathematical presentations. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse. Finally, we present two prompting techniques, IsoCombination and IsoScratchPad, which improve model performance by considering combinations of, and translations between, different input representations.

IsoBench covers mathematical functions, science, graph algorithms, and chess, with varied tasks and isomorphic image/text representations.

Overview

  • IsoBench is a benchmark for evaluating the performance of multimodal foundation models on tasks that require understanding text, images, or both.

  • It covers four domains: mathematics, games, algorithms, and science, emphasizing isomorphic representations to test models' capabilities in handling equivalent inputs in various formats.

  • Findings show a general preference for textual over visual representations among the models, highlighting challenges in vision model integration, input format sensitivity, and multimodal fusion techniques.

  • IsoBench introduces strategies like IsoCombination and IsoScratchPad to address these gaps, with IsoCombination particularly improving performance by combining multiple representations into a single input.

Evaluating Multimodal Foundation Models with IsoBench: Insights and Challenges

Introduction to IsoBench

IsoBench is a benchmark designed to systematically evaluate the capabilities of multimodal foundation models across a diverse range of tasks that require understanding text, images, or combinations thereof. This benchmark spans four domains: mathematics, science, algorithms, and games. Unique to IsoBench is its emphasis on isomorphic representations, where the same problem is presented in different modalities, including both visual and textual formats. By doing so, IsoBench provides a granular assessment of how well these models handle semantically equivalent inputs in distinct representations, revealing preferences or biases toward specific modalities.
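To make the notion of isomorphic representations concrete, the sketch below shows what a single IsoBench-style problem might look like when stored with one visual and several textual encodings. The field names, file name, and example values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class IsoBenchExample:
    """One problem stored with several semantically equivalent representations.

    Field names are illustrative assumptions, not the dataset's actual schema.
    """
    domain: str                 # e.g. "algorithms"
    task: str                   # e.g. "graph_connectivity"
    question: str               # natural-language question shared by all representations
    image_path: str             # visual representation, e.g. a rendered graph drawing
    text_representations: dict = field(default_factory=dict)  # name -> textual encoding
    label: str = ""             # gold answer


# A hypothetical graph-connectivity instance: the image and both text encodings
# describe exactly the same graph (edges 0-1 and 1-2, node 3 isolated).
example = IsoBenchExample(
    domain="algorithms",
    task="graph_connectivity",
    question="Are nodes 0 and 3 connected in this graph?",
    image_path="graph_example.png",  # hypothetical file name
    text_representations={
        "adjacency_matrix": "0 1 0 0\n1 0 1 0\n0 1 0 0\n0 0 0 0",
        "adjacency_list": "0: [1]\n1: [0, 2]\n2: [1]\n3: []",
    },
    label="no",
)
```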

Domains and Tasks

IsoBench comprises four major domains, each testing different aspects of model capabilities (a sketch of per-representation scoring follows the list):

  1. Mathematics: Focusing on continuous mathematics and plot understanding, tasks include classifying function properties and identifying breakpoints in piecewise functions.
  2. Games: Chess puzzles and winner identification tasks test strategic reasoning and understanding of complex game states.
  3. Algorithms: Graph algorithms such as connectivity, maximum flow, and isomorphism challenge the models' algorithmic reasoning skills.
  4. Science: Chemistry and physics questions assess the models' understanding of scientific concepts and their ability to interpret diagrams and visual information.
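
Because every problem carries multiple equivalent inputs, accuracy can be tallied separately for each representation, which is what yields the fine-grained, modality-level feedback described above. The loop below is a minimal sketch of such an evaluation, assuming records shaped like the IsoBenchExample sketch earlier and a placeholder query_model callable standing in for an actual foundation-model API; it is not IsoBench's official harness.

```python
from collections import defaultdict


def evaluate_per_representation(examples, query_model):
    """Score the same problems once per input representation.

    `query_model(prompt, image_path=None)` is a placeholder for a real model call.
    """
    correct = defaultdict(int)
    total = defaultdict(int)

    for ex in examples:
        # Visual form: the question accompanied by the rendered image.
        pred = query_model(ex.question, image_path=ex.image_path)
        correct["image"] += int(pred.strip().lower() == ex.label)
        total["image"] += 1

        # Each textual form: the question plus one text encoding of the same problem.
        for name, text in ex.text_representations.items():
            pred = query_model(f"{text}\n\n{ex.question}")
            correct[name] += int(pred.strip().lower() == ex.label)
            total[name] += 1

    # Per-representation accuracy makes modality gaps on identical problems visible.
    return {name: correct[name] / total[name] for name in total}
```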

Key Observations and Findings

Across the evaluated multimodal foundation models, a consistent preference for textual representations over visual ones was observed, in contrast to humans, who often benefit from visual presentations of the same information. This discrepancy raises questions about the multimodal fusion mechanisms in current models and their ability to leverage visual inputs effectively. The findings from IsoBench highlight several limitations and challenges:

  • Vision Model Shortcomings: Visual recognition errors and a lack of capability in utilizing low-level visual features for reasoning suggest that the vision components may not be optimally integrated or trained.
  • Input Format Sensitivity: Models display varying performance across different textual representations, indicating potential biases or overfitting to specific formats encountered during training (two textual encodings of the same graph are sketched after this list).
  • Multimodal Fusion Gaps: The observed performance gaps between visual and textual representations suggest that current fusion techniques may not effectively leverage the complementary strengths of different modalities.
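
As a concrete illustration of format sensitivity, the snippet below builds one small graph and prints two textual encodings of it: the content is identical, only the surface form differs. The graph and the exact encodings are illustrative choices, not the formats used in the benchmark.

```python
import networkx as nx

# The same graph rendered as two different textual encodings:
# identical content, different surface form.
G = nx.Graph([(0, 1), (1, 2)])
G.add_node(3)  # isolated node

matrix = nx.to_numpy_array(G, nodelist=sorted(G.nodes())).astype(int)
matrix_text = "\n".join(" ".join(str(v) for v in row) for row in matrix)
edge_list_text = "\n".join(f"{u} -- {v}" for u, v in G.edges())

print("Adjacency matrix:\n" + matrix_text)
print("\nEdge list:\n" + edge_list_text)
```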

Addressing the Gaps: IsoCombination and IsoScratchPad

To mitigate the performance discrepancies observed between input modalities, two strategies were introduced: IsoCombination (IsoCB) and IsoScratchPad (IsoSP). IsoCB combines multiple isomorphic representations into a single input, aiming to provide models with a richer set of information. IsoSP, on the other hand, employs a two-step process in which a model first translates a visual input into text, then leverages the higher-performing text representation for the downstream task. Both strategies showed promising improvements, especially IsoCB, which substantially reduced the performance gap on certain tasks.
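The sketch below mirrors these two strategies under the same assumptions as earlier: query_model is a placeholder for a real multimodal-model call, and the prompt wording is an illustrative guess rather than the prompts used in the paper.

```python
def iso_combination(query_model, question, image_path, text_representations):
    """IsoCombination-style query (sketch): several isomorphic representations
    of the same problem are packed into a single prompt."""
    combined = "\n\n".join(
        f"[{name}]\n{text}" for name, text in text_representations.items()
    )
    # The image can be attached alongside the combined text for models that accept both.
    return query_model(f"{combined}\n\n{question}", image_path=image_path)


def iso_scratchpad(query_model, question, image_path):
    """IsoScratchPad-style pipeline (sketch): translate the image into text first,
    then answer from that higher-performing textual form."""
    description = query_model(
        "Describe this image as a precise textual representation of the problem.",
        image_path=image_path,
    )
    return query_model(f"{description}\n\n{question}")
```

Note that, in this sketch, iso_scratchpad makes two model calls, trading extra latency for the chance to answer from the textual form that models handle best.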

Implications and Future Directions

The findings from IsoBench underscore the need for advances in the representations and fusion techniques used by multimodal foundation models to more effectively process and integrate information across modalities. The observed preference for textual inputs points to potential biases in current models, possibly stemming from imbalances in pre-training data or limitations in the models' architectural design.

Future research should focus on developing more sophisticated multimodal fusion mechanisms that can capitalize on the unique advantages of each modality. Additionally, expanding the diversity of tasks and representations in benchmarks like IsoBench will be crucial for comprehensively assessing and improving the capabilities of multimodal foundation models.

In summary, IsoBench brings to light critical challenges in current multimodal foundation models and proposes avenues for research to enhance their understanding and reasoning capabilities across diverse input modalities. With continued development and evaluation, we can move closer to models that truly comprehend and reason with the richness of human communication.
