Addressing Modality Imbalance in Vision-Language Models Through Synthetic Tasks
In contemporary discussions of vision-language models (VLMs), "modality imbalance" has drawn significant attention, particularly when VLMs are applied to visual reasoning tasks. The research paper "Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?" addresses this concern methodically, proposing a synthetic framework and associated tasks to scrutinize and reduce the performance discrepancies VLMs exhibit when reasoning across modalities.
The paper's starting point is that VLMs, while adept at tasks such as visual question answering (VQA) and image captioning, are often far less effective at multi-step reasoning in visual contexts. The authors frame this as modality imbalance: the same model reasons noticeably better when information is conveyed as text than when equivalent information is conveyed as an image.
Synthetic Framework and Tasks
The authors propose a suite of tasks aimed at systematic evaluation of algorithmic visual reasoning (AVR). These include:
- Continual Table Readout (CTR): The model reads numbers sequentially from a starting cell to an ending cell in a grid, with the input provided either as an image or as LaTeX text. The task tests a VLM's ability to comprehend and navigate tabular data (a minimal generator sketch follows this list).
- Grid Navigation (GN): In this task, VLMs navigate a graphical grid from a start to a destination point while collecting specified objects and avoiding obstacles. This requires spatial reasoning capabilities similar to pathfinding.
- Abstract Reasoning (AR): Mimicking human IQ tests, this task involves identifying patterns among geometric shapes with attributes like color and size across different panels. It requires mapping abstract relations and logical reasoning.
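To make the task setup concrete, here is a minimal Python sketch of how a Continual Table Readout instance could be generated. It assumes a row-major reading order between the start and end cells; the paper's actual generator, path conventions, and value ranges are not reproduced here, and the function name is hypothetical.

```python
import random

def make_table_readout_instance(n_rows, n_cols, seed=0):
    """Generate one illustrative Continual Table Readout instance."""
    rng = random.Random(seed)
    grid = [[rng.randint(0, 99) for _ in range(n_cols)] for _ in range(n_rows)]

    # Pick start/end positions in flattened row-major order, start before end.
    start_flat, end_flat = sorted(rng.sample(range(n_rows * n_cols), 2))
    start, end = divmod(start_flat, n_cols), divmod(end_flat, n_cols)

    # Ground-truth answer: every cell value read from start to end, inclusive.
    answer = [grid[i // n_cols][i % n_cols] for i in range(start_flat, end_flat + 1)]
    return {"grid": grid, "start": start, "end": end, "answer": answer}
```

A SIMPLE instance might use a small grid and a short readout span, while a HARD instance scales both up; the concrete sizes used in the paper are not assumed here.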
Each task is designed with a SIMPLE variant and a HARD variant, so that a model trained only on SIMPLE instances can be tested on whether it generalizes to HARD ones. Crucially, the framework provides matched text-only and image-only renderings of the same underlying instances, allowing modality imbalance to be quantified directly (see the rendering sketch below).
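Because the SIMPLE/HARD split is meant to isolate difficulty while the text/image split isolates modality, it helps to render the same underlying instance in both modalities. The sketch below does this for a table: a LaTeX tabular string for the text variant and a matplotlib-drawn picture for the image variant. The difficulty settings and rendering details are illustrative assumptions, not the paper's exact configuration.

```python
import matplotlib.pyplot as plt

# Illustrative difficulty settings; the paper's actual SIMPLE/HARD splits
# (grid sizes, path lengths, etc.) are not reproduced here.
DIFFICULTY = {"SIMPLE": (4, 4), "HARD": (10, 10)}

def render_as_latex(grid):
    """Text-modality rendering: a LaTeX tabular with the same cell contents."""
    rows = [" & ".join(str(v) for v in row) + r" \\" for row in grid]
    return ("\\begin{tabular}{" + "c" * len(grid[0]) + "}\n"
            + "\n".join(rows) + "\n\\end{tabular}")

def render_as_image(grid, path="table.png"):
    """Image-modality rendering: draw the same grid as a picture."""
    fig, ax = plt.subplots()
    ax.axis("off")
    ax.table(cellText=[[str(v) for v in row] for row in grid], loc="center")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
    return path
```

Since both renderings carry identical cell contents, any accuracy difference between the text and image variants can be attributed to the input modality rather than the underlying problem.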
Key Findings and Approaches
Generalization Insights: A central observation of the paper is that VLMs trained on SIMPLE tasks suffer a significant drop in performance when generalizing to HARD tasks, and the drop is especially steep when the tasks are presented as images rather than text. This points to a modality gap rooted in differences in how visual and textual inputs are processed; a sketch of one way to quantify that gap appears below.
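One simple way to put a number on this is to evaluate the same model on all four (modality, difficulty) combinations and compare accuracies. The sketch below assumes a per-example result record with hypothetical field names; the paper's exact metrics and reporting are not reproduced here.

```python
from collections import defaultdict

def modality_gap(results):
    """results: iterable of dicts such as
    {"modality": "text" or "image", "difficulty": "SIMPLE" or "HARD", "correct": bool}.
    Returns per-(modality, difficulty) accuracy and the text-minus-image gap on HARD tasks.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        key = (r["modality"], r["difficulty"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    acc = {k: hits[k] / totals[k] for k in totals}
    gap_hard = acc.get(("text", "HARD"), 0.0) - acc.get(("image", "HARD"), 0.0)
    return acc, gap_hard
```

A large positive gap on HARD tasks indicates that the model generalizes far better when the input is text than when the equivalent input is an image.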
Mitigation Strategies: The researchers propose several training strategies incorporating mixed supervision (both text and image data) to bridge the modality gap:
- Image Reasoning via Text Conversion: The model is trained to first convert the image into a textual representation and then reason over that text, leveraging the stronger reasoning LLMs exhibit on textual inputs.
- Mix Supervision: By integrating text inputs, image inputs, and image-to-text conversion tasks in training, Mix Supervision aims to create cross-modality synergy. The paper reports that this approach significantly improves VLM performance on HARD tasks presented as images (a data-mixture sketch follows this list).
- Alignment-focused Training: Motivated by observations about gradient alignment during training, the authors suggest an initial alignment phase that focuses solely on SIMPLE tasks, bringing text and image reasoning into agreement before tackling HARD tasks.
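As a rough illustration of how such a training mixture might be assembled, the sketch below draws each example from one of three formats: text-only reasoning, image-only reasoning, and image-to-text transcription. The field names, format tags, and mixing ratios are assumptions for illustration, not the recipe reported in the paper.

```python
import random

def build_mixed_supervision_set(instances, ratios=(0.4, 0.4, 0.2), seed=0):
    """Assemble a training mixture from the same underlying instances:
      - text_only:      text input  -> reasoning and answer
      - image_only:     image input -> reasoning and answer
      - image_to_text:  image input -> text transcription of the input
    Each instance is assumed to carry "text", "image", and "solution" fields.
    """
    rng = random.Random(seed)
    formats = ["text_only", "image_only", "image_to_text"]
    mixture = []
    for inst in instances:
        fmt = rng.choices(formats, weights=ratios, k=1)[0]
        if fmt == "text_only":
            example = {"input": inst["text"], "target": inst["solution"]}
        elif fmt == "image_only":
            example = {"input": inst["image"], "target": inst["solution"]}
        else:  # image_to_text: teach the model to transcribe the image into text
            example = {"input": inst["image"], "target": inst["text"]}
        example["format"] = fmt
        mixture.append(example)
    rng.shuffle(mixture)
    return mixture
```

The transcription examples are what let the model route image inputs through its stronger text-reasoning pathway, while the ratio between formats remains a tunable design choice.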
Implications and Future Directions
This research underscores the value of synthetic, structured tasks for dissecting the capabilities and limitations of VLMs. Addressing modality imbalance has substantial implications for the robustness of multimodal AI systems in real-world applications, from autonomous systems that require intricate spatial reasoning to visual data integration in complex analytics.
Future research directions include developing methods to internalize reasoning within VLMs, improving inference-time efficiency so that models need not rely on exhaustive input-to-text conversion, and expanding evaluation to more varied datasets that better reflect realistic environments. By iterating between theoretical insight and empirical method development, advances in VLM training could substantially change how AI systems comprehend and integrate inputs across modalities.